Unsolved
Lorribot · 12 Posts · September 19th, 2012 03:00
High Latency issues
We have a number of EqualLogic PS6000 and PS6500 units spread across four different sites.
The issue we are seeing is some massive latency spikes being seen by our VMs and physical servers.
When I say massive, these are over 1,000 milliseconds and can exceed 2,500.
After some traffic monitoring we found that the controllers were basically stopping all network traffic for around 11 seconds; they were working but not passing any iSCSI traffic.
This affects sites with between 1 and 7 units.
The site that is most badly affected is also the site with the fewest VMs (10), compared to other sites with over 150. This site does, however, receive the most replication traffic; the sending site is the second worst affected but does not receive any replication traffic. Yesterday we had 109 events of more than 1000ms.
I have implemented the delayed ACK and LRO best practices but this has, if anything, made things worse, as there are 129 events so far today.
All 4 sites see these problems, though the other two to a lesser extent. Dell support are offering the usual helpful "upgrade to the latest firmware", but this problem has been with us from the start, back when the firmware was on version 4, so upgrading to the latest version is unlikely to fix anything; in fact, painful experience would suggest it is more likely to introduce a different problem.
To see the spikes we have implemented a monitoring task in VMware that sends an email on seeing disk latency of more than 50ms (warning) and 1000ms (critical); we never get any warnings, only criticals.
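As a cross-check outside vCenter, esxtop in batch mode will capture the spikes from the host itself; a minimal sketch (the interval, iteration count and output path are just examples):
# esxtop -b -d 5 -n 120 > /tmp/latency_capture.csv
That gives about ten minutes of samples; the DAVG/cmd and KAVG/cmd columns in the resulting CSV show where the latency is being added.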
The iSCSI traffic passes through a Cisco switch stack, the VM hosts are a range of HP blades, the physical servers are DL380 and DL580 servers, and it all runs on 1Gb links.
Has anyone else seen this before?
DELL-Joe S · 7 Technologist · 729 Posts · September 19th, 2012 09:00
I know you mentioned that you have set up delayed ACK and LRO; first double-check these again (it would be very rare for these settings, if set properly, to degrade performance).
General:
All switches and host network interfaces within the iSCSI infrastructure must have flow control enabled for optimal performance. The array has this enabled so you don’t need to do anything on the array side.
Do not use the default VLAN (typically VLAN1) for iSCSI, create a new VLAN.
1. Delayed ACK DISABLED
Disabling Delayed Ack in ESX 4.0, 4.1, and 5.x
1. Log in to the vSphere Client and select the host.
2. Navigate to the Configuration tab.
3. Select Storage Adapters.
4. Select the iSCSI vmhba to be modified.
5. Click Properties.
6. Modify the delayed Ack setting using the option that best matches your site's needs.
Choose one of the below options, I, II or III, then move on to step 7 after making the changes:
Option I:
Modify the delayed Ack setting on a discovery address (recommended).
A. On a discovery address, select the Dynamic Discovery tab.
B. Select the Server Address tab.
C. Click Settings.
D. Click Advanced.
Option II:
Modify the delayed Ack setting on a specific target.
A. Select the Static Discovery tab.
B. Select the target.
C. Click Settings.
D. Click Advanced.
Option III:
Modify the delayed Ack setting globally.
A. Select the General tab.
B. Click Advanced.
(Note: if setting globally you can also use vmkiscsi-tool
# vmkiscsi-tool vmhba41 -W -a delayed_ack=0)
7. In the Advanced Settings dialog box, scroll down to the delayed Ack setting.
8. Uncheck Inherit From parent. (Does not apply for Global modification of delayed Ack)
9. Uncheck DelayedAck.
10. Reboot the ESX host.
Re-enabling Delayed ACK in ESX 4.0, 4.1, and 5.x
1. Log in to the vSphere Client and select the host.
2. Navigate to the Advanced Settings page as described in the preceding task "Disabling Delayed Ack in ESX 4.0, 4.1, and 5.x"
3. Check Inherit From parent.
4. Check DelayedAck.
5. Reboot the ESX host.
Checking the Current Setting of Delayed ACK in ESX 4.0, 4.1, and 5.x
1. Log in to the vSphere Client and select the host.
2. Navigate to the Advanced Settings page as described in the preceding task "Disabling Delayed Ack in ESX 4.0, 4.1, and 5.x."
3. Observe the setting for DelayedAck.
If the DelayedAck setting is checked, this option is enabled.
If you perform this check after you change the delayed ACK setting but before you reboot the host, the result shows the new setting rather than the setting currently in effect.
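If you would rather check from the console than the vSphere Client, on ESXi 5.x something like this should show the current flag (a sketch; on 4.x the vmkiscsi-tool output above is the equivalent place to look):
# vmkiscsid --dump-db | grep -i delayed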
2. Large Receive Offload DISABLED
To check the current LRO value:
# esxcfg-advcfg -g /Net/TcpipDefLROEnabled
To set the LRO value to zero (disabled):
# esxcfg-advcfg -s 0 /Net/TcpipDefLROEnabled
Note: a server reboot is required
Use the following link for changing LRO in the guest network:
docwiki.cisco.com/.../Disable_LRO
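Inside a Linux guest the equivalent is usually just an ethtool toggle; a sketch, assuming eth0 is the interface on the iSCSI network:
# ethtool -k eth0 | grep large-receive-offload
# ethtool -K eth0 lro off
The first command shows the current setting; the second disables LRO for that interface.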
3. Make sure you are using either VMware Round Robin (with IOs per path changed to 3, see the sketch after this list), or the EqualLogic MEM 1.1.0
4. If you have multiple VMDKs (or RDMs) in the VM, each of them (up to 4) needs its own Virtual SCSI adapter.
5. Update to latest build of ESX, to get latest NIC drivers.
6. You have too few volumes for the number of VMs.
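For item 3, on ESX/ESXi 4.x the per-device change looks roughly like this (the naa ID below is a truncated placeholder; repeat for each EqualLogic volume):
# esxcli nmp device setpolicy --device naa.6090a0... --psp VMW_PSP_RR
# esxcli nmp roundrobin setconfig --device naa.6090a0... --type "iops" --iops 3
The first line sets Round Robin as the path selection policy; the second rotates paths every 3 I/Os instead of the default 1000.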
-joe
Lorribot · 12 Posts · September 20th, 2012 02:00
Have rechecked everything; LRO and Delayed ACK are both disabled, and Round Robin is the method used.
We don't use MEM; we have the Host Integration Tools for VMware instead to handle the multipathing.
Not sure what you mean by number 4. The servers only have two physical adapters, so what would be the point of putting in more than two virtual adapters?
The system with the most timeouts has the fewest number of servers but is the biggest receiver of replication traffic; the second highest number of alerts is on the biggest sender of replication traffic. The alerts can happen when no replication is happening.
VMware version is 4.1, build 800380, which is as up to date as possible.
There are 10 servers running on 3 hosts, and each server has no more than 3 VMDK files. Occasionally we bring up other servers to test DR, but there would be very little disk load on these. The backend is 2 x PS6500. I would have thought it would cope without any issues whatsoever; it is generally way over-specced.
We have identified that the controllers on the EQ systems will do nothing for an 11-second period; during this time they are active on the network but not processing any iSCSI traffic. Whether it is VMware or the EQ that is stopping, I know not.
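To catch these pauses, timestamped pings from a Linux box on the iSCSI VLAN against the group IP work well enough; a rough sketch (10.10.21.10 is a placeholder for the group address):
# ping -i 1 10.10.21.10 | while read line; do echo "$(date '+%H:%M:%S') $line"; done
Gaps of 10+ seconds in the timestamps should line up with the stalls.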
Lorribot · 12 Posts · September 21st, 2012 03:00
I have no idea why I thought the HIT for VMware would manage the multipathing or that it was a replacement for MEM. Must have read something somewhere.
Having said that, having it in the environment has stopped one problem we were getting whereby the hosts would report loss of connections or paths, so it must be doing something related to managing the hosts and paths; whether it is just allowing the EqualLogics to talk to vCenter or something more, I know not.
I guess I will need to investigate the MEM thing; I don't seem to remember any mention of it when the environment was first created, only that ASM/VE was not worth using.
I was getting a little confused between iSCSI and SCSI. Most of our servers have data drives on a second SCSI adapter, and the vast majority of these are Paravirtual controllers for the VMDK disks. I will do some testing to see if multiple adapters can improve anything.
I am not a big fan of all these extra bits you need to add on to the hosts. Trying to keep them all up to date can be onerous, and it also restricts the timing of upgrades as you wait for support for new versions. One of the reasons I disliked Fibre Channel so much was that you had to spend a week just working out what versions of firmware and drivers you needed everywhere to get it all working properly before changing anything, then another week planning downtime, and another week to carry it all out.
Simplicity is the key to happiness.
Lorribot · 12 Posts · September 21st, 2012 05:00
I can't believe how unnecessarily complicated and wordy Dell make their MEM documentation. It is extremely confusing. Their insistence on jumping between different versions of ESXi, instead of sticking to one at a time, just adds complications you don't need.
They are almost as bad as Symantec.
If you already have a vSwitch set up and configured, it seems you only need to install the module using Update Manager and it sets itself up. 30 pages to say that is just stupid. Unless of course I missed something in all those words...
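For what it's worth, outside Update Manager the offline-bundle route on ESXi 5.x is a one-liner, followed by a host reboot (the bundle path and filename here are stand-ins for whatever ships in the MEM download; on ESX 4.1 the bundle comes with a setup.pl installer instead):
# esxcli software vib install --depot /tmp/dell-eql-mem-esx5-1.1.0.zip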
Lorribot · 12 Posts · September 24th, 2012 02:00
The VLAN is 21; it is not configured on the vSwitch but is configured on all the physical switches. This is a separate iSCSI network with no other VLANs connected.
The Cisco ports are configured as:
interface GigabitEthernet1/0/1
switchport access vlan 21
switchport mode access
flowcontrol receive desired
spanning-tree portfast
Lorribot · 12 Posts · September 24th, 2012 02:00
Flow Control is enabled
VLAN is 0 (or that is what is showing in the vSwitch config)
Delayed ACK DISABLED
Large Receive Offload DISABLED
EqualLogic MEM 1.1.0 INSTALLED
ESXi updated to latest
We are still getting the latency problems. The changes have made absolutely no difference.
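For anyone following along, that checklist can be re-verified from the console in one pass; a sketch for ESX 4.1:
# esxcfg-advcfg -g /Net/TcpipDefLROEnabled
# esxcli nmp device list | grep -i "path selection policy"
The first should come back as 0; with MEM installed, the second should show the Dell EqualLogic routed policy (rather than VMW_PSP_RR) against every EqualLogic volume.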
Lorribot · 12 Posts · September 25th, 2012 01:00
Last case number was 862752158, which may still be open, a previous one was 853280190
austinrocker · 10 Posts · September 27th, 2012 17:00
I've heard there are similar latency issues with PowerVault arrays too.
bealdrid2 · 1 Rookie · 117 Posts · June 16th, 2015 11:00
I hate to resurrect this from 3 years ago, but we are having nearly this exact same issue, albeit with PS6510E arrays. Every so often the controllers on one array seem to just "check out" for a few seconds. All VM/guest volume I/O just stops. The FS7610 NAS controllers complain about access to the storage arrays at the same time. Did you ever come to any resolution on this?
I have had a case open with Dell on this for almost a couple of months now, and while they've been prompt in requesting info/diags and providing feedback, we just haven't gotten very far, unfortunately. We've tried tons of stuff: switch firmware, significantly offloading the arrays in question, temporarily connecting to different switches, cables, best practice documents, etc.
I have other arrays (6210E/S) that never do this and that run similar type workloads.
Screenshot from SAN HQ Live view; I just so happened to catch it during one of the events:
Lorribot · 12 Posts · June 16th, 2015 13:00
Never found a cure. We have PS6500s on 7.15 and we still get alerts on fairly lightly used systems; the 6500s seem most afflicted. We did see similar spikes, and situations where the EQ just paused for a short period, but never got to the bottom of it.
We have moved to EMC VNX (on FCoE) for our main storage but, to be honest, I am not a fan; they could learn something from the functionality and ease of use of the EqualLogic.
bealdrid2 · 1 Rookie · 117 Posts · June 18th, 2015 10:00
Lorribot,
Thank you for coming back to a 3-year-old thread and reading my post. Hate to hear you never got it solved, and I hope I have better luck. I'll make certain to update you on the resolution here, since, just like in your case, this has been going on for a very long time. I just uploaded (for what seems like the 5th time) EQL diags, VM support archives, and switch tech supports.
bealdrid · 1 Rookie · 56 Posts · April 21st, 2016 09:00
Lorribot, are you still around? You've been the only other person I've found online who has run into what looks like the same issue as me. Not sure if you are still using the EqualLogics in this topic, but if you recall, by any chance were you making use of the Template Volumes and Thin Clone feature set? It's about the only lead we have so far. Everyone is pretty stumped, and my case has been open over a year now with Dell and has been escalated up to engineering and development.
Lorribot · 12 Posts · April 22nd, 2016 02:00
We never got to the bottom of the latency issues and just ignored them.
Ultimately I believe the subsystem performance on the SATA boxes is not sufficient. They are fine for things like file servers, but anything else is pushing it, especially if you start to load them with a significant number of servers.
We still see them; the SAS-based units seem to be better, but the SATA ones just can't cope with the workloads. Newer boxes with beefier controllers may be better, but at the end of the day it is what it is: a cheap, highly featured, low-performance SAN solution.
We have moved to VNX for most systems and just use the EqualLogics for small branch operations.
I like the interface and ease of management on the EQ but performance is dire. I am hopeful that Dell can make the EMC kit much nicer to use now they have bought them.