May 8th, 2017 08:00
ScaleIO Random Write Performance Issue
ScaleIO: v2.0.1.2
OS: Ubuntu 14.04
Configuration: Three storage nodes running both SDS and MDM (Master, Slave, and Tie-Breaker); eight separate SDC clients
Performance Profile: High
Test: fio 2.1.15 against eight raw volumes, one per client; fio queue depth for all random IO workloads is 512 per client.
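For reference, the 256KB random write job looks roughly like the fio job file below (a sketch only; the job name, device path, and runtime are placeholders, not my exact job file):

[global]
ioengine=libaio
direct=1
rw=randwrite
bs=256k
iodepth=512
time_based
runtime=300
group_reporting

[scaleio-randwrite]
; placeholder SDC device path; ScaleIO volumes show up as /dev/sciniX on the clients
filename=/dev/scinia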
Issue:
1. All fio results, except 256KB and 1MB random writes, are as expected for the available network bandwidth.
2. For 256KB and 1MB random writes, performance starts at the expected throughput (7.2GB/s) but then drops to 2.8GB/s.
3. For reference, 256KB and 1MB sequential write performance starts and finishes at the expected throughput (7.2GB/s).
4. For reference, 4KB random write performance tops out at 280K IOPS.
Question:
1. Does ScaleIO implement some kind of write region locking that would limit performance for higher random write IO queue depths?
Thanks,
CCMC
------------------------ Updated 5-11-2017 ----------------------------------
ScaleIO 2.0 Cluster Configuration CSV file
I also performed 256KB random write testing with different queue depths. If I keep the total queue depth at 15, throughput remains stable at about 2.2GB/s. A queue depth of 15 results in 30 outstanding writes across the 3 SDS nodes, one to each of the 30 NVMe SSDs in the single storage pool. If I increase the total queue depth to 32 (32 * 2 = 64 actual queue depth for 256KB random writes), throughput is initially 4.0GB/s but then drops to 2.4GB/s, similar behavior to my initial 256KB random write testing with queue depths of 1024 or larger.
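For reference, a queue depth sweep like the one above can be scripted along these lines (a sketch only; the device path, runtime, and output file names are placeholders):

# sweep the client queue depth for 256KB random writes (illustrative)
for qd in 1 2 4 8 15 32 64 128 256 512; do
  fio --name=randwrite-256k --filename=/dev/scinia --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=256k --iodepth=$qd --time_based --runtime=120 \
      --group_reporting --output=randwrite-qd${qd}.log
done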
Thanks,
CCMC
----------------------- 5-13-2017 ---------------------------------
Below is a chart showing the ScaleIO 256KB random write performance issue I'm investigating.
Please note that performance scales and remains stable until queue depth reaches 32, after which it appears to be limited by some mechanism in ScaleIO.
I suspect something like region/strip write locking.
Note that 256KB sequential write performance (throughput) scales to the network limit, whereas random writes do not.
Thanks,
CCMC
----------------------------------------------------------- Update 5-18-2017 ------------------------------------
Network Topology:
* ScaleIO cluster server nodes - 100Gb/s dual port NIC
* ScaleIO client server nodes - 40Gb/s dual port NIC
* One Mellanox 100Gb/s and one Mellanox 40Gb/s network switch (Note: this is not a multi-layer or leaf-spine switch configuration)
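For completeness, raw per-link bandwidth can be sanity-checked independently of ScaleIO with an iperf3 run along these lines (host name, stream count, and duration are placeholders):

# on a storage (SDS) node
iperf3 -s
# on a client (SDC) node: 4 parallel streams for 60 seconds
iperf3 -c sds-node-1 -P 4 -t 60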
Thanks,
CCMC
--------------------------------- 5-31-2017 ---------------------------------------------
I believe I've identified the problem: a slow device on one of the nodes. After several minutes under the IO workload, I observe the queue depth on one device jumping significantly higher than on any other device in the cluster. The IO backlog on the slow device eventually causes the queue depths of other devices on the same node to intermittently jump up and then return to normal, but the slowdown always starts with the same device and node. The device appears to be slow enough to impact overall performance, but not slow enough to trigger ScaleIO's oscillating-failure handling.
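For anyone chasing something similar, per-device queue depth is easy to watch with iostat (a sketch; device names are placeholders, and the column is avgqu-sz on older sysstat versions, aqu-sz on newer ones):

# extended per-device statistics every 2 seconds; watch the queue size column
iostat -x 2
# or limit the output to the suspect NVMe devices
iostat -x nvme0n1 nvme1n1 2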
Thanks,
CCMC
DanAharoni
May 19th, 2017 09:00
Hi,
Any chance we can take a look at this system? If yes, please write me an email.
I can't tell from this description what the exact config is. It seems like you have 3 nodes with 10x NVMe drives each (what type?), but I am not sure what the network bandwidth is. It is also not clear to me whether you are using some sort of file system with buffering. It is best to use raw devices for tests, and also make sure you use the direct IO option in fio to prevent any buffering, because buffering can cause something similar to what you are seeing.
SIO does not have any special locks. You should be able to get the same bandwidth with random writes as you are getting with sequential writes.
280K writes/sec is very nice :-)
Dan
DanAharoni
May 25th, 2017 18:00
Hi,
Before I address the performance, I would like to note that although 40Gb is very nice bandwidth, it is not a good idea to use only one port per node, because every time someone pulls a network cable the node will have to go through a rebuild. It is therefore highly recommended to use at least 2 ports per node.
In terms of performance, there is no obvious reason you are getting less write bandwidth with random writes vs. sequential writes; they should be the same.
It is hard to know why without more details. It might be the way the IOs are generated, it might be a file system you are using, or it might be some OS setting (please follow the fine tuning doc carefully).
The only thing I can testify to is that we can max out the network in our testing, even with 40Gb ports.
thanks,
Dan
DanAharoni
June 4th, 2017 10:00
Yes, this would certainly explain such an issue.
Impressive performance :-)
Dan