Terrible disk latency on Compellent SC2020
Unsolved
deBruin
March 21st, 2018 03:00
Hi,
We have a Compellent SC2020 storage array. The storage has two controllers, each with 4 iSCSI network ports (2x4 in total). Both controllers are connected to two stacked Aruba/HP E3810 switches dedicated to iSCSI traffic. On the other side there are 3 Dell PowerEdge 630 ESXi servers with 4 network adapters each for iSCSI, running VMware 6.0 with 44 virtual machines.
On the storage there are 8x 1TB volumes created in balanced RAID 10 (11x 10K disks), and we have one 3TB volume created on 7 SSDs, also in balanced RAID 10.
The problem is that we see terrible latency on the volumes. During normal working hours we see latency spikes of several seconds, but we don't see any stress on any ESXi host. Of course the VMs come to a halt when these spikes happen. The strange thing is that we don't see any latency on the disks themselves, but if we monitor the volumes or the controllers, latency is between 10 ms and 7000 ms. If we monitor the latency in VMware we see the same bad results.
We checked everything in VMware and applied all the best practices, but nothing really makes a difference. In VMware we use Round Robin as the multipathing (MPIO) policy and we see the load is distributed across all 4 iSCSI network adapters. We triple-checked the iSCSI switches.
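For reference, this is roughly how we verify the path policy and per-path traffic from the ESXi shell (just a sketch; naa.xxxx is a placeholder for the actual device ID, and output differs per build):

  # Show the multipathing plugin and path selection policy per device
  esxcli storage nmp device list

  # List all paths for one Compellent volume (placeholder device ID)
  esxcli storage core path list --device=naa.xxxx

  # Live per-adapter and per-device throughput/latency (press d for the
  # disk adapter view, u for the disk device view)
  esxtop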
[Attached screenshots: controller latency, disk latency, volume latency]
Any help would be much appreciated
Regards
Arnold
DELL-Sam L
Moderator
March 21st, 2018 14:00
Hello deBruin,
Are you seeing any errors on your switches at all or is it just on the hosts & on your SC2020?
Please let us know if you have any other questions.
deBruin
March 22nd, 2018 06:00
Hi,
We don't see any problems on the iSCSI switches: no latency issues there, and throughput is fine. All 4 ports run between 100 Mb/s and 200 Mb/s during normal working hours. Latency is between 100 and 200 ms, with spikes of 2 to 8 seconds, and VMware logs errors about losing connections, most of the time only on the top controller.
What we have tried so far: changed the Round Robin IOPS threshold in VMware from the default 1000 to 1, disabled the DelayedAck setting, moved volumes to the other controller, replaced both iSCSI switches, and changed Storage I/O Control in VMware.
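For reference, the Round Robin IOPS change was applied per device from the ESXi shell, roughly like this (a sketch only; naa.xxxx is a placeholder for the Compellent volume's device ID):

  # Switch paths after every single I/O instead of every 1000 I/Os
  esxcli storage nmp psp roundrobin deviceconfig set --device=naa.xxxx --type=iops --iops=1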
DELL-Sam L
Moderator
March 26th, 2018 09:00
Hello deBruin,
Have you contacted support about this issue before, or do you have an open service request for it? I ask because we would need to review the logs from your SCv2020 as well as your hosts to see where the bottleneck is happening.
Please let us know if you have any other questions.
deBruin
April 4th, 2018 13:00
Hi,
Yes, we currently have an open ticket: SR# :
SusanaD
April 14th, 2018 04:00
This problem can come from VMware; you need to contact their support center (E20-357, E20-555, E20-559).
deBruin
May 11th, 2018 03:00
Hi,
I don't see how this is a VMware problem. I am willing to look at all possible causes, but we followed all of Dell's best practices, and Dell support also looked at the VMware configuration. The servers are all Dell. We are also experiencing this problem with 2 HP ProLiant servers which we connected for testing only.
One thing I find strange is that when the IOPS load gets higher, the latency gets lower.
So I think there are only 2 possible causes: 1) something is wrong with the array, or 2) something is wrong with the Aruba switches.
thomasfa18
May 18th, 2018 07:00
I'm also having weird latency issues doing a vMotion between two separate Compellent Storage Centers; all of our traffic is on 16 Gbps FC.
What is weirder is that the latency is only observed on a single port of one destination controller (you can drill down into your fault domains on the charting page). This port will spike to 1.5 seconds but hovers between 200 ms and 600 ms; for comparison, the other ports sit in the sub-10 ms range.
The fibre switches do not report any errors either...
DMPOL
September 7th, 2018 04:00
Hi,
Have you found a solution for this issue? We have similar problems and have been struggling with this for the second week now. Our infrastructure is almost identical to yours.
Cheers,
deBruin
September 19th, 2018 07:00
We never got this fixed. Dell support advised us to upgrade the Storage Center firmware, which we did, but that didn't change anything. All the other settings from the best practices had no effect, so we eventually gave up. I am also not sure it is a real problem; we considered the possibility that the statistics are simply wrong. Say the OS sends a 128 KB I/O to the storage: VMware splits it into 2x 64 KB. Instead of measuring the latency of each 64 KB I/O separately, we think the array reports the latency of the whole transfer. If a single 64 KB I/O has a latency of 5 ms, which is fine, the storage would show the 2x 64 KB as 10 ms of latency, so it reports a problem that isn't really a problem. Again, this is just a theory. We tried to support it by using IOmeter with different I/O sizes and then measuring the latency.
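The same kind of comparison could also be scripted with fio inside a Linux test VM whose virtual disk lives on the affected volume (a sketch only; /dev/sdb is a placeholder for that test disk). If the array-reported latency scales with the block size while the guest-measured latency stays flat, that would support the theory above.

  # 64 KB random reads, one outstanding I/O, report completion latency
  fio --name=lat64k --filename=/dev/sdb --direct=1 --rw=randread --bs=64k --iodepth=1 --runtime=60 --time_based --group_reporting

  # Repeat with 128 KB blocks and compare the reported latencies
  fio --name=lat128k --filename=/dev/sdb --direct=1 --rw=randread --bs=128k --iodepth=1 --runtime=60 --time_based --group_reporting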
Another thing we noticed is that in normal operation we never max out the 2x4 1 Gb iSCSI network adapters. However, if you do a storage migration (moving a VMDK to a different volume) in VMware, things are different: in that situation it is possible to max out the network adapters, and when that happens all other iSCSI traffic freezes. This is why we looked at the latency problem in the first place. We eventually fixed that by limiting the vMotion bandwidth on the virtual switch in VMware so that it could not eat up all the bandwidth. Doing a VMDK move during peak working hours is probably a bad idea anyway. :p
So basically we still see very high latency (between 0 and 200 ms) when the workload is really low (<100 IOPS). All loads above roughly 500 IOPS have a latency between 0 and 10 ms. The latency spikes only on read IOPS; write IOPS are fine.
Everybody will say this is typical DelayedAck behaviour, but we checked that multiple times.
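For what it's worth, the way we re-check DelayedAck is by dumping the iSCSI database on each host (the commonly documented check; exact output differs per ESXi build):

  # Look for the delayed ACK flag on the adapter and each target (0/false = disabled)
  vmkiscsid --dump-db | grep Delayed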
Good luck.
Bullockman
September 17th, 2019 06:00
I think Dell knows about these issues. I have a couple of 2080s and they have lots of disks and should give plenty of performance. Every Dell support tech we talk to has nothing good to say about this series of storage. I think they just dumb'd down these boxes to much and should take responsibility for a poor product. Last tech I spoke with said these arrays were for testing only and not for really for production. Oh well 150k down the drain. I am now looking for a replacement.