August 30th, 2013 21:00

Mirrorview using iSCSI - Initial Synchronization Very Slow

I'm using Mirrorview between two Clariion storage units, both CX4-120s.  The system has 1 Gbps links between them, with very low latency (1 ms or less).  I have deployed the "A" and "B" Ethernet fabrics, as recommended by EMC.

I have set up two LUNs for replication, one owned by SPA, one by SPB.  Both LUNs perform their initial synchronization VERY slowly.  On a 1 TB LUN, for instance, I see only about 57 Mbps on the Ethernet switch.  A Wireshark capture of the traffic between the Clariions shows the arrays sending one 'ACK' for every single data packet.  They don't appear to be using proper TCP windowing at all.  I would think that arrays doing large data transfers like this would use proper windowing, and even the RFC 1323 extensions that permit window scaling.
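(For anyone who wants to sanity-check their own trace the same way, here is a rough Python/scapy sketch of the counting I eyeballed in Wireshark.  The capture filename is a placeholder, and I'm assuming the standard iSCSI port 3260; adjust both for your setup.)

    # Count data segments vs. pure ACKs in a capture (a sketch, not gospel).
    from scapy.all import rdpcap, TCP

    ISCSI_PORT = 3260  # standard iSCSI port; your MirrorView ports may differ

    data_pkts = 0  # segments actually carrying replication payload
    pure_acks = 0  # zero-length segments whose only job is the ACK flag

    for pkt in rdpcap("mirrorview_sync.pcap"):  # hypothetical capture file
        if TCP not in pkt:
            continue
        tcp = pkt[TCP]
        if ISCSI_PORT not in (tcp.sport, tcp.dport):
            continue
        if len(tcp.payload) > 0:
            data_pkts += 1
        elif tcp.flags.A:
            pure_acks += 1

    # Healthy bulk TCP usually shows one ACK per two or more data segments
    # (delayed ACKs); a ratio near 1.0 is the behavior described above.
    print(f"data packets: {data_pkts}, pure ACKs: {pure_acks}")
    print(f"ACKs per data packet: {pure_acks / max(data_pkts, 1):.2f}")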

Having an 'ACK' for every single data packet is definitely causing performance issues, and this initial synchronization is taking WAY too long (days).  An initial synchronization that long poses reliability and maintainability problems for me; it simply cannot take that long.
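To put rough numbers on it (these are illustrative assumptions, not measurements; in particular the ~0.2 ms effective round trip per segment is a guess): with only one segment outstanding per ACK, throughput is capped at one MSS per round trip, which lands right around what I'm seeing.

    # Back-of-the-envelope: one segment in flight per ACK caps throughput
    # at MSS / RTT, no matter how fast the link is.  All figures assumed.
    MSS = 1460        # payload bytes per standard Ethernet segment
    RTT = 0.0002      # guessed effective round trip per segment, seconds
    LUN_SIZE = 1e12   # the 1 TB LUN being synchronized

    throughput_bps = MSS * 8 / RTT
    print(f"throughput ceiling: {throughput_bps / 1e6:.0f} Mbps")  # ~58 Mbps

    sync_seconds = LUN_SIZE / (MSS / RTT)
    print(f"1 TB initial sync: {sync_seconds / 86400:.1f} days")   # ~1.6 days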

I've tried many things to get the arrays moving faster.  I have tried:

  • destroying and re-creating the mirror
  • deleting and re-creating the iSCSI paths between the arrays, including deleting and re-creating the iSCSI login information (credentials)
  • exhaustive examination of every network port and setting in the entire path
  • tested the network's ability to transfer data between two Windows hosts, and got full-speed transfers with ROBOCOPY, no problems (a quick socket test, sketched after this list, shows the same thing)
  • rebooting network gear in the path
  • rebooting the SPs on the array (on both sides)
  • complete power cycle of the arrays (on both sides)
  • opened a support case with EMC (not showing signs of promise)
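(For the host-to-host test, a throwaway script like the sketch below does the same job as ROBOCOPY; the port and transfer size are arbitrary placeholders.  Run it with "server" on one box and "client <host>" on the other.)

    # Minimal TCP throughput check between two hosts.
    import socket, sys, time

    PORT = 5001                 # arbitrary test port
    CHUNK = 64 * 1024           # 64 KB writes
    TOTAL = 1 * 1024**3         # move 1 GB

    def server():
        with socket.create_server(("", PORT)) as srv:
            conn, addr = srv.accept()
            with conn:
                received = 0
                start = time.time()
                while True:
                    buf = conn.recv(CHUNK)
                    if not buf:
                        break
                    received += len(buf)
                elapsed = time.time() - start
                print(f"{received * 8 / elapsed / 1e6:.0f} Mbps from {addr[0]}")

    def client(host):
        payload = b"\x00" * CHUNK
        sent = 0
        with socket.create_connection((host, PORT)) as sock:
            while sent < TOTAL:
                sock.sendall(payload)
                sent += len(payload)

    if __name__ == "__main__":
        if sys.argv[1] == "server":
            server()
        else:
            client(sys.argv[1])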

Does anyone have any pointers?  Has anyone done this with iSCSI?  Has it worked well?  I expected to see a lot more throughput on this solution.  I have also spent some time looking through as many NAVI-CLI commands and options as possible (admittedly, not all of them), in hopes I could fine-tune the TCP settings on the arrays, but have found nothing useful.

It seems like basic poor TCP behavior, which is disappointing.

Any help would be appreciated.


September 1st, 2013 22:00

MirrorView/S or MirrorView/A? What's the version of FLARE code?

You could consider upgrading your CX4 to the latest version, 04.30.000.5.525.  Since you're using iSCSI, you may refer to the following KBs as well:

emc156408 "Problems with iSCSI hosts when iSCSI ports share the same subnet": https://support.emc.com/kb/41172

emc238702 "iSCSI Mirrorview link issue": https://support.emc.com/kb/70133

emc245445 "Facts, limitations, and recommended settings when using CLARiiON iSCSI": https://support.emc.com/kb/71615


September 2nd, 2013 00:00

Hi Ammunist,

Could you please private message me the EMC SR#?

Thanks!


September 12th, 2013 09:00

Thought I would post an update, in case others run into situations like this.  iSCSI is a very particular kind of network animal.  We have firewalls in the path, and that is what is causing the pain.  Here is a sanitized chunk of the email I sent to other folks in my organization to explain...

So, in the process of troubleshooting this issue, I did this:

  • Upgraded network device software
  • Set ports to 1000/full everywhere
  • Tried putting traffic on different network paths
  • Tried turning off various network configuration settings
  • Opened a case with network equipment manufacturer
  • Tried adjusting the TCP MSS on network equipment slightly up and down (1460 and 1200)
  • Tried using jumbo frames (some network gear in the path doesn't support it, so this didn't work)
  • Tried replacing cables at both sites
  • Tried tuning trunk between switches
  • Tried switching SPs
  • Tried fiddling with cache settings

Nothing made anything better.  The replication continues to run at ~70 Mbps.

EMC says everything is working fine.

Network equipment manufacturer says everything is working fine.

The issue here, I believe, is that the EMC Mirrorview product is very inefficient with how it does TCP (though EMC has good reasons for doing it this way).  It sends an ACK packet for every single data packet, so as to perform truly synchronous replication.  The firewall tracks each of these (part of being a 'stateful' firewall).  That adds a puny, puny bit of time (we're talking thousandths or ten-thousandths of a second) to each ACK being sent and received.  The firewall is really, really fast at doing this.  Normally (with other bulk-data-transfer operations), this is not a big deal, as there is only one ACK per many, many data packets when TCP optimizations are in use.  However, with one ACK for every single data packet, that puny, puny amount of time delays the next bunch of data packets being sent, and therefore reduces overall throughput quite a bit.  It is a problem of quantity.  There are many millions of data packets, and many millions of ACKs, and if you add up all of the puny overhead time for state tracking, the result is reduced throughput.
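Back-of-the-envelope (same illustrative 1 TB / 1460-byte-segment assumptions as earlier, with a quarter to half a millisecond of tracking overhead per ACK, the firewall figures I get to below), the summed state-tracking time alone runs to days:

    # The "quantity problem" in numbers.  All figures are assumptions.
    LUN_SIZE = 1e12              # 1 TB initial sync
    MSS = 1460                   # bytes per data segment
    acks = LUN_SIZE / MSS        # one ACK per data packet: ~685 million ACKs

    for fw_latency in (0.00025, 0.0005):  # seconds added per tracked ACK
        added = acks * fw_latency
        print(f"{fw_latency * 1000:.2f} ms per ACK -> "
              f"{added / 3600:.0f} hours of summed firewall overhead")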

When purchasing datacenter switches, vendors are always fighting over port latency.  This is exactly the issue we're having here.  For protocols like iSCSI, which want near-zero latency on packet reception and transmission, you generally look for switches with the fastest (smallest) port latency.  Latency in datacenter switching is measured in microseconds (µs), that is, millionths of a second.  Modern switches are rated at something like 1.8 µs for normal packets.  Firewalls are a whole other league.  They are security devices, not datacenter switching devices.  They are not just pushing packets along; they are correlating them, doing many security checks, ACL rules, encryption, reverse route verification, etc.  Latency in them is generally measured in milliseconds (ms), which are thousandths of a second.  The firewalls we are using have a latency somewhere between 0.25 ms and 0.5 ms (depending on load).  As firewalls go, that is very fast (consider all the junk they are doing with every single ACK packet).  As datacenter switches go, that is pretty slow.  TCP has optimizations to work around this added latency: it sends mountains of data between each packet the firewall might find 'interesting' and have to track, so the firewall's per-packet overhead is paid far less often.  However, those optimizations increase the amount of data in flight over the wire without any delivery confirmation.  That works great for asynchronous, bulk data moves (like Windows servers copying files).  Not so good for synchronous replication (Mirrorview/S).
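A last sketch of the windowing math (same assumed 0.2 ms round trip as before; the window sizes are just representative values): throughput is bounded by window / RTT, up to the link rate, which is why the Windows ROBOCOPY test flew while Mirrorview crawls.

    # Throughput bound for a given amount of unacknowledged data in flight.
    LINK = 1e9     # 1 Gbps link
    RTT = 0.0002   # assumed round trip including firewall overhead, seconds

    for label, window in [("one segment in flight", 1460),
                          ("default 64 KB window", 65535),
                          ("RFC 1323 scaled 1 MB window", 1048576)]:
        bps = min(LINK, window * 8 / RTT)
        print(f"{label}: {bps / 1e6:.0f} Mbps")

    # Even the default 64 KB window saturates a 1 Gbps link at this RTT --
    # but only if the sender keeps a full window in flight between ACKs.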
