
Unsolved



July 10th, 2016 10:00

Poor write performance?

I have created a 7-node ScaleIO system using 2.0.1, but I am seeing fairly poor write performance on the volumes. Here is a little detail on the SDS nodes' configuration:

Dual Intel Xeon L5420

16GB RAM

Running CentOS 7.2

2x 4TB 7200 RPM SATA

1x 3TB 7200 RPM SATA

QDR Infiniband Controller running IPoIB in connected mode

Network results seem decent at almost full 10G:

iperf results:

iperf -c 172.18.0.41

------------------------------------------------------------

Client connecting to 172.18.0.41, TCP port 5001

TCP window size: 2.50 MByte (default)

------------------------------------------------------------

[  3] local 172.18.0.42 port 35748 connected with 172.18.0.41 port 5001

[ ID] Interval       Transfer     Bandwidth

[  3]  0.0-10.0 sec  11.6 GBytes  9.97 Gbits/sec

start_sds_network_test results:

SDS with IP 172.18.0.41 (port 7072) returned information on 9 SDSs

    SDS a3787adf00000000 172.18.0.44 bandwidth 914.3 MB (936228 KB) per-second

    SDS a378a1ef00000002 172.18.0.47 bandwidth 1.1 GB (1101 MB) per-second

    SDS a3787ae000000003 172.18.0.45 bandwidth 994.2 MB (1018034 KB) per-second

    SDS a37853d100000008 172.18.0.43 bandwidth 882.8 MB (903944 KB) per-second

    SDS a3787ae10000000a 172.18.0.46 bandwidth 1013.9 MB (1038194 KB) per-second

    SDS a378a1f30000000e 172.18.0.42 bandwidth 882.8 MB (903944 KB) per-second

However, inside a VM running CrystalDiskMark against a mapped volume:

(CrystalDiskMark results screenshot attached: cdm.PNG)

Reads seem to be maxing out at the numbers from the SDS network test and iperf, but the writes are performing terribly. I would expect to see something like 200-400 MB/s at a minimum. I realize the drives are 7200 RPM spinning rust, but running the same benchmark against a single drive of this type attached directly to a machine gives higher write performance than this.
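For anyone wanting to reproduce the single-drive baseline outside of Windows/CrystalDiskMark, something like this fio run should give a comparable sequential-write number (the device name is just a placeholder, and writing to a raw device is destructive, so only point it at an empty disk or switch to a test file):

# sequential 1M writes with direct I/O, bypassing the page cache
# /dev/sdX is a placeholder for the drive under test; use --filename=/mnt/testfile --size=10G for a non-destructive run
fio --name=seq-write-baseline --filename=/dev/sdX --rw=write --bs=1M --ioengine=libaio --direct=1 --iodepth=16 --numjobs=1 --runtime=60 --time_based --group_reporting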

Any suggestions/recommendations?


July 11th, 2016 11:00

Hi Jeff,

Can you check the messages files on the SDSs for any disk errors? Also, are all the devices healthy in the ScaleIO GUI?

Are you using some kind of RAID controller, or are the disks just attached to the onboard SATA controller?

Can you show us the 'scli --query_all' output as well?
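Something quick along these lines on each SDS should surface any obvious disk problems (smartctl is from the smartmontools package, and /dev/sdb is just an example device):

# look for I/O, SCSI or ATA errors in the system log
grep -iE 'i/o error|blk_update_request|ata[0-9]|scsi' /var/log/messages | tail -n 50

# overall SMART health plus the usual warning attributes
smartctl -H /dev/sdb
smartctl -A /dev/sdb | grep -iE 'reallocated|pending|uncorrectable'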

Thanks,

Pawel

January 9th, 2017 23:00

Hey Jeff,

I haven't done much tuning with spinning disks, but I have worked with InfiniBand (QDR and FDR). I have found two things that affect write performance significantly.

The first is the type of storage card.

Are you using HBAs or RAID cards? (Please give the manufacturer and model.)

If RAID, do you have the battery backup packs connected with the write cache on, or are the controllers in pass-through?

The second is latency between the nodes.

  • Can you run the built-in latency tests in ScaleIO and post the results? (Please run them on each node.)
    • Anything over 100 μs on InfiniBand is pretty bad. I have gotten 30 μs after tuning a QDR fabric, with 26 Gbit/s throughput per port.
    • Mellanox has a good set of tools you can use in addition to iperf and the built-in ScaleIO tests (see the sketch after this list).
  • What firmware version are you running on your HCAs?
  • Are your HCAs Dell/HP-branded cards or true Mellanox? If they are Dell/HP, the firmware is a pain to upgrade.
  • Are you using a managed InfiniBand switch? If not, how is your separate subnet manager configured?
    • If you are using unmanaged switches, it might be worthwhile to eBay them and invest in managed ones.
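To give an idea of what I mean by the Mellanox tools, the perftest package has raw RDMA latency tests that take IPoIB and TCP out of the picture entirely; between two nodes it looks something like this (the IP is just the one from this thread):

# node A: start the latency test server
ib_write_lat

# node B: run the client against node A
ib_write_lat 172.18.0.41

# ibping from the infiniband-diags package is also handy for a quick end-to-end link sanity check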

Below are some settings I have found to be very effective at increasing performance.

  1. IPoIB can only handle a 4K MTU. You have to account for the 4-byte header, so set it to 4092, NOT 4096.
  2. Increase the I/O buffers on the nodes
    • scli --set_num_of_io_buffers --sds_ip X.X.X.X --num_of_io_buffers 10
  3. Increase the txqueuelen to 10000 on each IPoIB NIC (double-check me on all these commands for your CentOS version)
    • echo 'ifconfig eth1 txqueuelen 10000' >> /etc/rc.local
  4. The following are kernel tweaks (run 'sysctl -p' afterwards to apply the new values)
    • echo 'kernel.shmmax=16000000000' >> /etc/sysctl.conf
    • echo 'net.ipv4.tcp_low_latency=0' >> /etc/sysctl.conf
    • echo 'net.ipv4.tcp_slow_start_after_idle=0' >> /etc/sysctl.conf
    • echo 'net.ipv4.tcp_timestamps=1' >> /etc/sysctl.conf
    • echo 'net.core.wmem_max=100000000' >> /etc/sysctl.conf
    • echo 'net.core.rmem_max=100000000' >> /etc/sysctl.conf
    • echo 'net.core.wmem_default=20000000' >> /etc/sysctl.conf
    • echo 'net.core.rmem_default=20000000' >> /etc/sysctl.conf
  5. Increase the TX/RX ring buffer sizes on all IPoIB NICs
    • ethtool -G eth1 rx 4096 tx 4096
  6. If you are hooking the nodes to an ESXi cluster, increase the queue depth on all ScaleIO volumes to 256
    • esxcli storage core device set -d eui.XXXXXXXXXXXXXXXXX -O 256
  7. Increase the queue depth on the ESXi HCAs
    • esxcli system module parameters set -p "lpfc0_lun_queue_depth=254 lpfc1_lun_queue_depth=254" -m lpfc
  8. If you really want to tune your SDSs, you should look into NUMA tuning: basically lining up each HCA's PCIe slot with its corresponding memory and CPU channels (see the sketch after this list). I am not sure how CentOS handles CPU-to-PCIe assignment, but Windows completely su*ks at it; Windows nodes do round-robin by default with their RSS profiles. (Hey Microsoft, it would be nice to set the RSS profile to ClosestStatic by default. Hint hint. :-)
  9. Turn off C-states in the BIOS, and if there is a high-performance mode, set it to the highest values. Some servers have a "green" low-power mode.
  10. I haven't messed with the read/write cache feature in ScaleIO, but I have heard nothing but praise. Load up your nodes with some more RAM and flip it on (32-64 GB should do the trick).
  11. Unless you need to, don't throttle your nodes for rebuild/rebalance. I'm not sure how spinning disks will fare, but all-flash nodes can handle this without a significant drop in performance.
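On the NUMA point (item 8), CentOS at least makes it easy to see which NUMA node each HCA hangs off of, so you can keep the SDS processes and interrupts on the local socket. A rough sketch (ib0 and the node number are assumptions for your hardware):

# which NUMA node the HCA's PCIe slot is attached to (-1 means single node / unknown)
cat /sys/class/net/ib0/device/numa_node

# CPU and memory layout per NUMA node (numactl package)
numactl --hardware
lscpu | grep -i numa

# example: pin a process to NUMA node 0's CPUs and memory
numactl --cpunodebind=0 --membind=0 <some_command>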

Hope this helps; hit me up if you need any additional info.

Tristan
