
RecoverPoint Case Study: How to Troubleshoot a Remote IP Replication High Load Issue

Introduction

This article presents a case study of how to troubleshoot a remote IP replication high load issue.


Detailed Information

EMC RecoverPoint is an enterprise-scale solution designed to protect application data on heterogeneous SAN-attached servers and storage arrays. RecoverPoint runs on an out-of-band appliance and combines industry-leading continuous data protection technology with a bandwidth efficient, no-data-loss replication technology.


EMC RecoverPoint provides local and remote data protection, enabling reliable replication of data over any distance; that is, locally within the same Site, and remotely to another Site. Specifically, RecoverPoint protects and supports replication of data that applications are writing over Fibre Channel to local SAN-attached storage. RecoverPoint uses an existing Fibre Channel infrastructure to integrate seamlessly with existing host applications and data storage subsystems. For long distances, RecoverPoint uses an existing IP network to send replication data over a WAN.

RecoverPoint uses write splitters that monitor writes and ensure that a copy of every write to a protected volume is tracked and sent to the local RecoverPoint appliance. RecoverPoint supports four types of write splitters: fabric-based, host-based, array-based, and VPLEX.

[Image: RP_1.png]

RecoverPoint CRR (Continuous Remote Replication) provides bidirectional, heterogeneous block-level replication across any distance using asynchronous, synchronous, and snapshot technologies over FC and IP networks.

[Image: RP_2.png]

A consistency group consists of one or more replication sets. Each replication set consists of a production volume and the replica volumes to which it is replicating. The consistency group ensures that updates to the replicas are always consistent and in correct write order; that is, the replicas can always be used to continue working or to restore the production source, in case it is damaged.
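
As a mental model only (not a RecoverPoint API), the relationship between a consistency group, its replication sets, and their volumes can be sketched as a simple data structure; the volume names below are purely illustrative, and only the group name is taken from the logs later in this article:

from dataclasses import dataclass, field
from typing import List

@dataclass
class ReplicationSet:
    # One production volume and the replica volume(s) it replicates to.
    production_volume: str
    replica_volumes: List[str]

@dataclass
class ConsistencyGroup:
    # All replication sets in a group are kept write-order consistent together.
    name: str
    replication_sets: List[ReplicationSet] = field(default_factory=list)

cg = ConsistencyGroup("WFO-04_CG", [
    ReplicationSet("prod_lun_1", ["dr_lun_1"]),
    ReplicationSet("prod_lun_2", ["dr_lun_2"]),
])
print(cg.name, "contains", len(cg.replication_sets), "replication sets")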

Replication High Load

When certain resources are constrained, replication can enter a high load state. If high load occurs too frequently, the customer experience degrades. A common high load condition causes replication to enter "temporary high load". The following four scenarios are typical causes of "temporary high load":

1. In rare cases, sustained overloaded write I/O workloads.

2. Replication volumes and journal volumes are too slow.

3. Insufficient WAN bandwidth.

4. Improper compression settings.

Overloaded write I/O workloads and improper compression settings are easy to detect, but slow replication volumes and journal volumes are harder to identify. The examples below show how to troubleshoot insufficient WAN bandwidth and slow journal volumes.

WAN bandwidth issue detection:

First, determine the point in time of the high load event; it can be found in the CLI folder of the RPA logs.

High load event:

Time:             Wed Aug  7 10:41:24 2013

  Topic:            GROUP

  Scope:            DETAILED

  Level:            WARNING

  Event ID:         4019

  Site:             7DaysInn_KXC

  Links:            [WFO-04_CG, WFO-04_KXC -> WFO-04_RMZ]

  Summary:          Group in high load -- transfer to be paused temporarily  Service Request info:N/A
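
Such events can also be pulled out of an exported event log with a short script. The following is a minimal sketch, assuming the CLI events have been saved to a plain-text file; the file name rpa_cli_events.txt and the field layout are assumptions based on the excerpt above, not a documented RecoverPoint export format:

EVENT_ID_HIGH_LOAD = "4019"

def find_high_load_events(path="rpa_cli_events.txt"):
    # Collect (time, summary) pairs for every "Group in high load" event (ID 4019).
    events, current = [], {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("Time:"):
                current = {"time": line.split("Time:", 1)[1].strip()}
            elif line.startswith("Event ID:"):
                current["event_id"] = line.split("Event ID:", 1)[1].strip()
            elif line.startswith("Summary:"):
                current["summary"] = line.split("Summary:", 1)[1].strip()
                if current.get("event_id") == EVENT_ID_HIGH_LOAD:
                    events.append(current)
    return events

for ev in find_high_load_events():
    print(ev["time"], "-", ev["summary"])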

Next, check the incoming write rate leading up to that point in time; this metric can be found in the performance logs. In this case, before the event occurred, the average throughput was 10-20 MB/s:

2013/08/07 02:32:07.910 GMT    2013/08/07 02:33:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    Incoming writes rate for link                                      15.2306 Megabytes/sec

2013/08/07 02:33:07.910 GMT    2013/08/07 02:34:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    Incoming writes rate for link                                      12.8331 Megabytes/sec

2013/08/07 02:34:07.910 GMT    2013/08/07 02:35:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    Incoming writes rate for link                                      5.81344 Megabytes/sec

2013/08/07 02:35:07.910 GMT    2013/08/07 02:36:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    Incoming writes rate for link                                      17.8956 Megabytes/sec

2013/08/07 02:36:07.910 GMT    2013/08/07 02:37:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    Incoming writes rate for link                                      20.43     Megabytes/sec

2013/08/07 02:37:07.910 GMT    2013/08/07 02:38:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    Incoming writes rate for link                                      19.557   Megabytes/sec

2013/08/07 02:38:07.910 GMT    2013/08/07 02:39:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    Incoming writes rate for link                                      21.2274 Megabytes/sec

2013/08/07 02:39:07.910 GMT    2013/08/07 02:40:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    Incoming writes rate for link                                      11.2537 Megabytes/sec

2013/08/07 02:40:07.910 GMT    2013/08/07 02:41:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    Incoming writes rate for link                                      16.1114 Megabytes/sec

2013/08/07 02:41:07.910 GMT    2013/08/07 02:42:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    Incoming writes rate for link                                      17.0141 Megabytes/sec
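
Averaging these samples by hand is tedious. The sketch below extracts the "Incoming writes rate for link" values from an exported performance log and summarizes them; the file name rpa_performance.log is an assumption, and the parsing relies only on the metric name and the trailing "Megabytes/sec" value seen in the excerpt:

import re

RATE_RE = re.compile(r"Incoming writes rate for link\s+([\d.]+)\s*Megabytes/sec")

def incoming_write_rates(path="rpa_performance.log"):
    # Return every per-minute incoming write rate sample, in Megabytes/sec.
    rates = []
    with open(path) as f:
        for line in f:
            m = RATE_RE.search(line)
            if m:
                rates.append(float(m.group(1)))
    return rates

rates = incoming_write_rates()
if rates:
    print(f"samples: {len(rates)}, average: {sum(rates)/len(rates):.2f} MB/s, "
          f"peak: {max(rates):.2f} MB/s")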

This caused the replication lag to grow from roughly 100 MB to 7 GB:

2013/08/07 02:31:07.910 GMT    2013/08/07 02:32:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    RPO - lag in data between replicas during transfer after init          106.659                Megabytes

2013/08/07 02:32:07.910 GMT    2013/08/07 02:33:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    RPO - lag in data between replicas during transfer after init          673.536                Megabytes

2013/08/07 02:33:07.910 GMT    2013/08/07 02:34:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    RPO - lag in data between replicas during transfer after init          1405.1                Megabytes

2013/08/07 02:34:07.910 GMT    2013/08/07 02:35:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    RPO - lag in data between replicas during transfer after init          1745.29                Megabytes

2013/08/07 02:35:07.910 GMT    2013/08/07 02:36:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    RPO - lag in data between replicas during transfer after init          2469.8                Megabytes

2013/08/07 02:36:07.910 GMT    2013/08/07 02:37:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    RPO - lag in data between replicas during transfer after init          3581.55                Megabytes

2013/08/07 02:37:07.910 GMT    2013/08/07 02:38:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    RPO - lag in data between replicas during transfer after init          4447.67                Megabytes

2013/08/07 02:38:07.910 GMT    2013/08/07 02:39:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    RPO - lag in data between replicas during transfer after init          5613.29                Megabytes

2013/08/07 02:39:07.910 GMT    2013/08/07 02:40:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    RPO - lag in data between replicas during transfer after init          6738.1                Megabytes

2013/08/07 02:40:07.910 GMT    2013/08/07 02:41:07.910 GMT    long term            7DaysInn_KXC   RPA1     WFO-04_CG                WFO-04_KXC     WFO-04_RMZ                    RPO - lag in data between replicas during transfer after init          7099.98                Megabytes
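
The growth rate of that lag can be quantified the same way. The sketch below reads the "RPO - lag in data between replicas" samples and reports how quickly the backlog grows; the file name and the one-minute sampling interval are assumptions based on the excerpt above:

import re

LAG_RE = re.compile(r"RPO - lag in data between replicas.*?([\d.]+)\s*Megabytes")

def lag_samples(path="rpa_performance.log"):
    # Return the per-minute lag samples, in Megabytes.
    with open(path) as f:
        return [float(m.group(1)) for line in f if (m := LAG_RE.search(line))]

lags = lag_samples()
if len(lags) >= 2:
    growth = lags[-1] - lags[0]
    # Samples in the excerpt are one minute apart, so growth per interval ~ MB/min.
    print(f"lag grew by {growth:.0f} MB over {len(lags) - 1} one-minute intervals "
          f"(~{growth / (len(lags) - 1):.1f} MB/min)")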

Based on these two pieces of evidence, the high load issue might be caused by:

Insufficient WAN bandwidth, which prevents the data from being transferred in time.

Slow storage, which causes high write I/O response times and queuing.

At this point, if the customer can confirm one of these two bottlenecks in their environment, the cause is located. If not, additional data is required to narrow down the root cause. Further analysis shows that the average WAN throughput during the peak period was only about 3 Mbps:

2013/08/06 20:04:07.910 GMT    2013/08/06 20:05:07.910 GMT    long term            7DaysInn_KXC                   WAN throughput from site              1.50716 Megabits/sec    

2013/08/06 20:05:07.910 GMT    2013/08/06 20:06:07.910 GMT    long term            7DaysInn_KXC                   WAN throughput from site              3.18644 Megabits/sec    

2013/08/06 20:06:07.910 GMT    2013/08/06 20:07:07.910 GMT    long term            7DaysInn_KXC                   WAN throughput from site              3.35739 Megabits/sec    

2013/08/06 20:07:07.910 GMT    2013/08/06 20:08:07.910 GMT    long term            7DaysInn_KXC                   WAN throughput from site              3.32743 Megabits/sec    

2013/08/06 20:08:07.910 GMT    2013/08/06 20:09:07.910 GMT    long term            7DaysInn_KXC                   WAN throughput from site              5.50845 Megabits/sec    

2013/08/06 20:09:07.910 GMT    2013/08/06 20:10:07.910 GMT    long term            7DaysInn_KXC                   WAN throughput from site              7.36306 Megabits/sec    

2013/08/06 20:10:07.910 GMT    2013/08/06 20:11:07.910 GMT    long term            7DaysInn_KXC                   WAN throughput from site              6.83884 Megabits/sec    

2013/08/06 20:11:07.910 GMT    2013/08/06 20:12:07.910 GMT    long term            7DaysInn_KXC                   WAN throughput from site              6.1095   Megabits/sec    

2013/08/06 20:12:07.910 GMT    2013/08/06 20:13:07.910 GMT    long term            7DaysInn_KXC                   WAN throughput from site              3.18375 Megabits/sec    

2013/08/06 20:13:07.910 GMT    2013/08/06 20:14:07.910 GMT    long term            7DaysInn_KXC                   WAN throughput from site              0.375969              Megabits/sec  

At this point, we can confirm a lack of usable WAN bandwidth: because of the customer's network configuration, only about 3 Mbps of the 40 Mbps link was actually utilized.
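
As a sanity check, the incoming write rate and the achieved WAN throughput can be put side by side once both are in the same unit (the performance log reports writes in Megabytes/sec but WAN throughput in Megabits/sec). The numbers below are rounded values from the excerpts above; RecoverPoint compresses data before sending it over the WAN, so the byte-for-byte comparison is only a rough indicator, and the decisive signal here is how far the achieved throughput falls below the provisioned 40 Mbps link:

# Rounded values from the log excerpts above.
incoming_mb_per_s = 15.0          # average incoming write rate, Megabytes/sec
wan_mbit_per_s = 3.0              # average achieved WAN throughput, Megabits/sec
link_capacity_mbit_per_s = 40.0   # provisioned WAN link, Megabits/sec

# 1 Megabyte/sec = 8 Megabits/sec.
incoming_mbit_per_s = incoming_mb_per_s * 8

print(f"incoming writes: {incoming_mbit_per_s:.0f} Mbps, "
      f"WAN achieved: {wan_mbit_per_s:.0f} Mbps, "
      f"link capacity: {link_capacity_mbit_per_s:.0f} Mbps")
if wan_mbit_per_s < 0.5 * link_capacity_mbit_per_s:
    print("achieved WAN throughput is far below the provisioned link -> "
          "check the network path and configuration")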

Journal volume slowness detection:

Again, first determine the point in time of the high load event from the CLI folder:

Time:             Wed Dec 18 00:45:19 2013

  Topic:            GROUP

  Scope:            DETAILED

  Level:            WARNING

  Event ID:         4019

  Site:             Chai_Wan

  Links:            [IOCM, IOCM-DC -> IOCM-DR]

  Summary:          Group in high load -- transfer to be paused temporarily  Service Request info:N/A

Then check the journal volume distributor metrics in the performance log. Two metrics are relevant:

Distributor phase 2 thread load

Distributor phase 2 effective speed

2013/12/17 16:36:05.869 GMT    2013/12/17 16:37:05.869 GMT                    IOCM    IOCM-DR                                             Distributor phase 2 thread load        87.272   % of time

2013/12/17 16:37:05.869 GMT    2013/12/17 16:38:05.869 GMT                    IOCM    IOCM-DR                                             Distributor phase 2 thread load        97.4623 % of time

2013/12/17 16:38:05.869 GMT    2013/12/17 16:39:05.869 GMT                    IOCM    IOCM-DR                                             Distributor phase 2 thread load        95.9147 % of time

2013/12/17 16:39:05.869 GMT    2013/12/17 16:40:05.869 GMT                    IOCM    IOCM-DR                                             Distributor phase 2 thread load        90.7287 % of time

2013/12/17 16:40:05.869 GMT    2013/12/17 16:41:05.869 GMT                    IOCM    IOCM-DR                                             Distributor phase 2 thread load        87.7186 % of time

2013/12/17 16:41:05.869 GMT    2013/12/17 16:42:05.869 GMT                    IOCM    IOCM-DR                                             Distributor phase 2 thread load        91.9028 % of time

2013/12/17 16:42:05.869 GMT    2013/12/17 16:43:05.869 GMT                    IOCM    IOCM-DR                                             Distributor phase 2 thread load        91.6016 % of time

2013/12/17 16:43:05.869 GMT    2013/12/17 16:44:05.869 GMT                    IOCM    IOCM-DR                                             Distributor phase 2 thread load        92.5544 % of time

2013/12/17 16:44:05.869 GMT    2013/12/17 16:45:05.869 GMT                    IOCM    IOCM-DR                                             Distributor phase 2 thread load        96.4623 % of time

2013/12/17 16:45:05.869 GMT    2013/12/17 16:46:05.869 GMT                    IOCM    IOCM-DR                                             Distributor phase 2 thread load        85.6245 % of time

2013/12/17 16:46:05.869 GMT    2013/12/17 16:47:05.869 GMT                    IOCM    IOCM-DR                                             Distributor phase 2 thread load        87.484   % of time

2013/12/17 16:47:05.869 GMT    2013/12/17 16:48:05.869 GMT                    IOCM    IOCM-DR                                             Distributor phase 2 thread load        94.5462 % of time

2013/12/17 16:38:05.869 GMT    2013/12/17 16:39:05.869 GMT    RPA1     IOCM    IOCM-DR                                             Distributor phase 2 effective speed               3.63361 Megabytes/sec

2013/12/17 16:39:05.869 GMT    2013/12/17 16:40:05.869 GMT    RPA1     IOCM    IOCM-DR                                             Distributor phase 2 effective speed               5.08614 Megabytes/sec

2013/12/17 16:40:05.869 GMT    2013/12/17 16:41:05.869 GMT    RPA1     IOCM    IOCM-DR                                             Distributor phase 2 effective speed               5.44455 Megabytes/sec

2013/12/17 16:41:05.869 GMT    2013/12/17 16:42:05.869 GMT    RPA1     IOCM    IOCM-DR                                             Distributor phase 2 effective speed               6.10622 Megabytes/sec

2013/12/17 16:42:05.869 GMT    2013/12/17 16:43:05.869 GMT    RPA1     IOCM    IOCM-DR                                             Distributor phase 2 effective speed               5.67813 Megabytes/sec

2013/12/17 16:43:05.869 GMT    2013/12/17 16:44:05.869 GMT    RPA1     IOCM    IOCM-DR                                             Distributor phase 2 effective speed               4.28963 Megabytes/sec

2013/12/17 16:44:05.869 GMT    2013/12/17 16:45:05.869 GMT    RPA1     IOCM    IOCM-DR                                             Distributor phase 2 effective speed               3.59579 Megabytes/sec

2013/12/17 16:45:05.869 GMT    2013/12/17 16:46:05.869 GMT    RPA1     IOCM    IOCM-DR                                             Distributor phase 2 effective speed               3.03741 Megabytes/sec

2013/12/17 16:46:05.869 GMT    2013/12/17 16:47:05.869 GMT    RPA1     IOCM    IOCM-DR                                             Distributor phase 2 effective speed               2.95798 Megabytes/sec

2013/12/17 16:47:05.869 GMT    2013/12/17 16:48:05.869 GMT    RPA1     IOCM    IOCM-DR                                             Distributor phase 2 effective speed               4.20823 Megabytes/sec

2013/12/17 16:48:05.869 GMT    2013/12/17 16:49:05.869 GMT    RPA1     IOCM    IOCM-DR                                             Distributor phase 2 effective speed               3.99813 Megabytes/sec

2013/12/17 16:49:05.869 GMT    2013/12/17 16:50:05.869 GMT    RPA1     IOCM    IOCM-DR                                             Distributor phase 2 effective speed               4.75266 Megabytes/sec

2013/12/17 16:50:05.869 GMT    2013/12/17 16:51:05.869 GMT    RPA1     IOCM    IOCM-DR                                             Distributor phase 2 effective speed               7.01075 Megabytes/sec

From the performance log we can see that the Distributor phase 2 thread is highly loaded, but the Distributor phase 2 effective speed is low. This means the journal volume distribution speed is low; possible reasons are a SAN problem or a storage performance bottleneck.
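
This pattern (a busy distributor thread that moves little data) can be flagged automatically. The sketch below pairs the two metrics per interval and reports windows where the phase 2 thread load is high but the effective speed is low; the thresholds and the parallel-list input format are illustrative assumptions, with the sample values copied from the first few intervals above:

def flag_slow_distribution(thread_load_pct, effective_speed_mb_s,
                           load_threshold=85.0, speed_threshold=10.0):
    # Flag intervals where the distributor is busy nearly all of the time
    # yet distributes very little data -- a sign of slow journal/replica storage.
    flagged = []
    for i, (load, speed) in enumerate(zip(thread_load_pct, effective_speed_mb_s)):
        if load >= load_threshold and speed <= speed_threshold:
            flagged.append((i, load, speed))
    return flagged

# Sample values from the performance log excerpt above (first five intervals).
loads = [87.3, 97.5, 95.9, 90.7, 87.7]   # Distributor phase 2 thread load, % of time
speeds = [3.6, 5.1, 5.4, 6.1, 5.7]       # Distributor phase 2 effective speed, MB/s
for idx, load, speed in flag_slow_distribution(loads, speeds):
    print(f"interval {idx}: thread load {load:.1f}% but only {speed:.1f} MB/s distributed")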

The above examples show how to detect the root cause of two kinds of RecoverPoint high load issues, but high load analysis covers much more than this. For example, insufficient bandwidth may have various causes on the network side, beyond RecoverPoint itself. Likewise, low journal volume speed may be caused by connectivity problems as well as slow storage, which require further analysis outside of RecoverPoint and are not covered in this article. Note that the causes of performance issues vary and can extend to other devices in the storage stack.

https://emc.my.salesforce.com/ka07000000052Rx

                                                         



Author: Fenglin Li

             

iEMC APJ

Please click here for all content shared by us.

