RecoverPoint Case Study: How to Troubleshoot a Remote IP Replication High Load Issue
Introduction
This article presents a case study of how to troubleshoot a remote IP replication high load issue.
Detailed Information
EMC RecoverPoint is an enterprise-scale solution designed to protect application data on heterogeneous SAN-attached servers and storage arrays. RecoverPoint runs on an out-of-band appliance and combines industry-leading continuous data protection technology with a bandwidth efficient, no-data-loss replication technology.
EMC RecoverPoint provides local and remote data protection, enabling reliable replication of data over any distance; that is, locally within the same Site, and remotely to another Site. Specifically, RecoverPoint protects and supports replication of data that applications are writing over Fibre Channel to local SAN-attached storage. RecoverPoint uses an existing Fibre Channel infrastructure to integrate seamlessly with existing host applications and data storage subsystems. For long distances, RecoverPoint uses an existing IP network to send replication data over a WAN.
RecoverPoint utilizes write splitters that monitor writes and ensure that a copy of all writes to a protected volume are tracked and sent to the local RecoverPoint appliance. RecoverPoint supports four different types of write-splitters. They are fabric-based write splitter, host-based write splitter, array-based write splitter and VPLEX write splitter.
RecoverPoint CRR (Continuous Remote Replication) provides bidirectional, heterogeneous block-level replication across any distance using asynchronous, synchronous, and snapshot technologies over FC and IP networks.
A consistency group consists of one or more replication sets. Each replication set consists of a production volume and the replica volumes to which it is replicating. The consistency group ensures that updates to the replicas are always consistent and in correct write order; that is, the replicas can always be used to continue working or to restore the production source, in case it is damaged.
Replication High Load
When one or more resources are constrained, replication can enter a high load state. If high load occurs too frequently, the customer experience suffers. Most high load conditions put replication into "temporary high load". The following four scenarios are the typical causes of "temporary high load":
1. In rare cases, sustained periods of overloaded write I/O workloads.
2. Replica volumes and journal volumes that are too slow.
3. Insufficient WAN bandwidth.
4. Improper compression settings.
Overloaded write I/O workloads and improper compression settings are easy to detect, but a slow replica or journal volume is a hidden issue, and insufficient WAN bandwidth can be equally hard to spot. The troubleshooting examples below cover these two cases.
WAN bandwidth issue detection:
First, determine the point in time of the high load event; it can be found in the CLI folder of the RPA logs.
High load event:
Time: Wed Aug 7 10:41:24 2013
Topic: GROUP
Scope: DETAILED
Level: WARNING
Event ID: 4019
Site: 7DaysInn_KXC
Links: [WFO-04_CG, WFO-04_KXC -> WFO-04_RMZ]
Summary: Group in high load -- transfer to be paused temporarily Service Request info:N/A
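If the collected logs are available on a workstation, the event records can be pulled out with a short script. The following is only a sketch: the file name events.log and the record layout (one event ending with a Summary line, as shown above) are assumptions about how the logs were collected.

# Minimal sketch: extract "high load" events (Event ID 4019) from a collected
# RPA event log. The file name and record layout are assumptions based on the
# example event shown above.
block = []
with open("events.log") as f:
    for line in f:
        block.append(line.rstrip())
        if line.startswith("Summary:"):                 # last line of one event record
            if any("Event ID: 4019" in item for item in block):
                print("\n".join(block) + "\n")
            block = []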
Next, check the incoming write rate leading up to that point in time; this metric is recorded in the performance logs. In this case, the average throughput before the event was 10-20 MB/s:
2013/08/07 02:32:07.910 GMT 2013/08/07 02:33:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ Incoming writes rate for link 15.2306 Megabytes/sec
2013/08/07 02:33:07.910 GMT 2013/08/07 02:34:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ Incoming writes rate for link 12.8331 Megabytes/sec
2013/08/07 02:34:07.910 GMT 2013/08/07 02:35:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ Incoming writes rate for link 5.81344 Megabytes/sec
2013/08/07 02:35:07.910 GMT 2013/08/07 02:36:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ Incoming writes rate for link 17.8956 Megabytes/sec
2013/08/07 02:36:07.910 GMT 2013/08/07 02:37:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ Incoming writes rate for link 20.43 Megabytes/sec
2013/08/07 02:37:07.910 GMT 2013/08/07 02:38:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ Incoming writes rate for link 19.557 Megabytes/sec
2013/08/07 02:38:07.910 GMT 2013/08/07 02:39:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ Incoming writes rate for link 21.2274 Megabytes/sec
2013/08/07 02:39:07.910 GMT 2013/08/07 02:40:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ Incoming writes rate for link 11.2537 Megabytes/sec
2013/08/07 02:40:07.910 GMT 2013/08/07 02:41:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ Incoming writes rate for link 16.1114 Megabytes/sec
2013/08/07 02:41:07.910 GMT 2013/08/07 02:42:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ Incoming writes rate for link 17.0141 Megabytes/sec
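To see the average over a window like this without reading line by line, the exported statistics can be summarized with a few lines of Python. This is a minimal sketch; the file name perf.log is an assumption, and it expects lines formatted like the samples above.

# Summarize the "Incoming writes rate for link" samples from an exported
# performance log. File name and line format are assumptions.
rates = []
with open("perf.log") as f:
    for line in f:
        if "Incoming writes rate for link" in line:
            # Tolerate a missing space before the unit, then take the value token.
            tokens = line.replace("Megabytes/sec", " Megabytes/sec").split()
            rates.append(float(tokens[tokens.index("Megabytes/sec") - 1]))
if rates:
    print("samples: %d  avg: %.2f MB/s  max: %.2f MB/s"
          % (len(rates), sum(rates) / len(rates), max(rates)))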
This caused the lag between the replicas to grow from roughly 100 MB to about 7 GB:
2013/08/07 02:31:07.910 GMT 2013/08/07 02:32:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ RPO - lag in data between replicas during transfer after init 106.659 Megabytes
2013/08/07 02:32:07.910 GMT 2013/08/07 02:33:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ RPO - lag in data between replicas during transfer after init 673.536 Megabytes
2013/08/07 02:33:07.910 GMT 2013/08/07 02:34:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ RPO - lag in data between replicas during transfer after init 1405.1 Megabytes
2013/08/07 02:34:07.910 GMT 2013/08/07 02:35:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ RPO - lag in data between replicas during transfer after init 1745.29 Megabytes
2013/08/07 02:35:07.910 GMT 2013/08/07 02:36:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ RPO - lag in data between replicas during transfer after init 2469.8 Megabytes
2013/08/07 02:36:07.910 GMT 2013/08/07 02:37:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ RPO - lag in data between replicas during transfer after init 3581.55 Megabytes
2013/08/07 02:37:07.910 GMT 2013/08/07 02:38:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ RPO - lag in data between replicas during transfer after init 4447.67 Megabytes
2013/08/07 02:38:07.910 GMT 2013/08/07 02:39:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ RPO - lag in data between replicas during transfer after init 5613.29 Megabytes
2013/08/07 02:39:07.910 GMT 2013/08/07 02:40:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ RPO - lag in data between replicas during transfer after init 6738.1 Megabytes
2013/08/07 02:40:07.910 GMT 2013/08/07 02:41:07.910 GMT long term 7DaysInn_KXC RPA1 WFO-04_CG WFO-04_KXC WFO-04_RMZ RPO - lag in data between replicas during transfer after init 7099.98 Megabytes
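A quick cross-check of these numbers shows why the group went into high load: the lag is growing at almost the same rate as the incoming writes, so very little data is actually leaving the site. A rough calculation on the first and last samples above:

# Rough cross-check on the RPO lag samples above.
lag_start_mb, lag_end_mb = 106.659, 7099.98   # first and last lag samples
minutes = 9                                   # 02:31 to 02:40
growth_mb_per_s = (lag_end_mb - lag_start_mb) / (minutes * 60)
print("lag growth: %.1f MB/s" % growth_mb_per_s)   # about 13 MB/s
# The incoming write rate averages roughly 15 MB/s, so nearly all incoming
# data is piling up in the journal instead of being transferred.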
Based on the two pieces of evidence above, the high load condition might be caused by:
Insufficient WAN bandwidth, so that the data cannot be transferred in time.
Slow storage, which causes high write I/O response times and queuing.
At this point, if the customer can confirm one of these two bottlenecks in their environment, the cause is located. If not, additional data is required to narrow down the root cause. Further analysis shows that the average WAN throughput during peak time was only about 3 Mbps:
2013/08/06 20:04:07.910 GMT 2013/08/06 20:05:07.910 GMT long term 7DaysInn_KXC WAN throughput from site 1.50716 Megabits/sec
2013/08/06 20:05:07.910 GMT 2013/08/06 20:06:07.910 GMT long term 7DaysInn_KXC WAN throughput from site 3.18644 Megabits/sec
2013/08/06 20:06:07.910 GMT 2013/08/06 20:07:07.910 GMT long term 7DaysInn_KXC WAN throughput from site 3.35739 Megabits/sec
2013/08/06 20:07:07.910 GMT 2013/08/06 20:08:07.910 GMT long term 7DaysInn_KXC WAN throughput from site 3.32743 Megabits/sec
2013/08/06 20:08:07.910 GMT 2013/08/06 20:09:07.910 GMT long term 7DaysInn_KXC WAN throughput from site 5.50845 Megabits/sec
2013/08/06 20:09:07.910 GMT 2013/08/06 20:10:07.910 GMT long term 7DaysInn_KXC WAN throughput from site 7.36306 Megabits/sec
2013/08/06 20:10:07.910 GMT 2013/08/06 20:11:07.910 GMT long term 7DaysInn_KXC WAN throughput from site 6.83884 Megabits/sec
2013/08/06 20:11:07.910 GMT 2013/08/06 20:12:07.910 GMT long term 7DaysInn_KXC WAN throughput from site 6.1095 Megabits/sec
2013/08/06 20:12:07.910 GMT 2013/08/06 20:13:07.910 GMT long term 7DaysInn_KXC WAN throughput from site 3.18375 Megabits/sec
2013/08/06 20:13:07.910 GMT 2013/08/06 20:14:07.910 GMT long term 7DaysInn_KXC WAN throughput from site 0.375969 Megabits/sec
We can now confirm that the problem is insufficient WAN bandwidth: because of the customer's network configuration, only about 3 Mbps of the 40 Mbps link was actually being utilized.
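Aligning the units makes the mismatch obvious. The figures below are rounded values taken from the logs above:

# Back-of-the-envelope comparison using rounded figures from the logs above.
incoming_mb_per_s = 15        # incoming write rate, Megabytes/sec
wan_mbit_per_s = 3            # observed WAN throughput, Megabits/sec
link_mbit_per_s = 40          # provisioned WAN link, Megabits/sec
required_mbit_per_s = incoming_mb_per_s * 8   # ~120 Mbps before compression
print("required (uncompressed): %d Mbps" % required_mbit_per_s)
print("link capacity:           %d Mbps" % link_mbit_per_s)
print("actually achieved:       %d Mbps" % wan_mbit_per_s)
# The achieved throughput is far below both the provisioned link and the
# incoming workload, which is why the lag keeps growing.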
Slow journal volume issue detection:
Again, first determine the point in time of the high load event from the CLI folder:
Time: Wed Dec 18 00:45:19 2013
Topic: GROUP
Scope: DETAILED
Level: WARNING
Event ID: 4019
Site: Chai_Wan
Links: [IOCM, IOCM-DC -> IOCM-DR]
Summary: Group in high load -- transfer to be paused temporarily Service Request info:N/A
Then check the journal volume distributor statistics in the performance log. There are two relevant metrics:
Distributor phase 2 thread load
Distributor phase 2 effective speed
2013/12/17 16:36:05.869 GMT 2013/12/17 16:37:05.869 GMT IOCM IOCM-DR Distributor phase 2 thread load 87.272 % of time
2013/12/17 16:37:05.869 GMT 2013/12/17 16:38:05.869 GMT IOCM IOCM-DR Distributor phase 2 thread load 97.4623 % of time
2013/12/17 16:38:05.869 GMT 2013/12/17 16:39:05.869 GMT IOCM IOCM-DR Distributor phase 2 thread load 95.9147 % of time
2013/12/17 16:39:05.869 GMT 2013/12/17 16:40:05.869 GMT IOCM IOCM-DR Distributor phase 2 thread load 90.7287 % of time
2013/12/17 16:40:05.869 GMT 2013/12/17 16:41:05.869 GMT IOCM IOCM-DR Distributor phase 2 thread load 87.7186 % of time
2013/12/17 16:41:05.869 GMT 2013/12/17 16:42:05.869 GMT IOCM IOCM-DR Distributor phase 2 thread load 91.9028 % of time
2013/12/17 16:42:05.869 GMT 2013/12/17 16:43:05.869 GMT IOCM IOCM-DR Distributor phase 2 thread load 91.6016 % of time
2013/12/17 16:43:05.869 GMT 2013/12/17 16:44:05.869 GMT IOCM IOCM-DR Distributor phase 2 thread load 92.5544 % of time
2013/12/17 16:44:05.869 GMT 2013/12/17 16:45:05.869 GMT IOCM IOCM-DR Distributor phase 2 thread load 96.4623 % of time
2013/12/17 16:45:05.869 GMT 2013/12/17 16:46:05.869 GMT IOCM IOCM-DR Distributor phase 2 thread load 85.6245 % of time
2013/12/17 16:46:05.869 GMT 2013/12/17 16:47:05.869 GMT IOCM IOCM-DR Distributor phase 2 thread load 87.484 % of time
2013/12/17 16:47:05.869 GMT 2013/12/17 16:48:05.869 GMT IOCM IOCM-DR Distributor phase 2 thread load 94.5462 % of time
2013/12/17 16:38:05.869 GMT 2013/12/17 16:39:05.869 GMT RPA1 IOCM IOCM-DR Distributor phase 2 effective speed 3.63361 Megabytes/sec
2013/12/17 16:39:05.869 GMT 2013/12/17 16:40:05.869 GMT RPA1 IOCM IOCM-DR Distributor phase 2 effective speed 5.08614 Megabytes/sec
2013/12/17 16:40:05.869 GMT 2013/12/17 16:41:05.869 GMT RPA1 IOCM IOCM-DR Distributor phase 2 effective speed 5.44455 Megabytes/sec
2013/12/17 16:41:05.869 GMT 2013/12/17 16:42:05.869 GMT RPA1 IOCM IOCM-DR Distributor phase 2 effective speed 6.10622 Megabytes/sec
2013/12/17 16:42:05.869 GMT 2013/12/17 16:43:05.869 GMT RPA1 IOCM IOCM-DR Distributor phase 2 effective speed 5.67813 Megabytes/sec
2013/12/17 16:43:05.869 GMT 2013/12/17 16:44:05.869 GMT RPA1 IOCM IOCM-DR Distributor phase 2 effective speed 4.28963 Megabytes/sec
2013/12/17 16:44:05.869 GMT 2013/12/17 16:45:05.869 GMT RPA1 IOCM IOCM-DR Distributor phase 2 effective speed 3.59579 Megabytes/sec
2013/12/17 16:45:05.869 GMT 2013/12/17 16:46:05.869 GMT RPA1 IOCM IOCM-DR Distributor phase 2 effective speed 3.03741 Megabytes/sec
2013/12/17 16:46:05.869 GMT 2013/12/17 16:47:05.869 GMT RPA1 IOCM IOCM-DR Distributor phase 2 effective speed 2.95798 Megabytes/sec
2013/12/17 16:47:05.869 GMT 2013/12/17 16:48:05.869 GMT RPA1 IOCM IOCM-DR Distributor phase 2 effective speed 4.20823 Megabytes/sec
2013/12/17 16:48:05.869 GMT 2013/12/17 16:49:05.869 GMT RPA1 IOCM IOCM-DR Distributor phase 2 effective speed 3.99813 Megabytes/sec
2013/12/17 16:49:05.869 GMT 2013/12/17 16:50:05.869 GMT RPA1 IOCM IOCM-DR Distributor phase 2 effective speed 4.75266 Megabytes/sec
2013/12/17 16:50:05.869 GMT 2013/12/17 16:51:05.869 GMT RPA1 IOCM IOCM-DR Distributor phase 2 effective speed 7.01075 Megabytes/sec
From the performance log we can see that the Distributor phase 2 thread is highly loaded while the Distributor phase 2 effective speed stays low. That means data is being distributed from the journal volume to the replica slowly. The likely cause is a SAN problem or a storage performance bottleneck.
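As a rule of thumb, these two counters should be read together: a distributor thread that is busy nearly all of the time while moving only a few MB/s points at the journal or replica storage rather than at the RPA. The sketch below pairs the two metrics per interval and flags that pattern; the file name perf.log and the thresholds are illustrative assumptions, not RecoverPoint defaults.

# Flag intervals where the distributor is almost always busy yet moves little
# data -- a typical signature of slow journal/replica storage. The file name
# "perf.log" and the thresholds below are illustrative assumptions.
LOAD_THRESHOLD = 85.0     # percent of time the phase 2 thread is busy
SPEED_THRESHOLD = 10.0    # Megabytes/sec considered "slow" for this setup

loads, speeds = {}, {}
with open("perf.log") as f:
    for line in f:
        key = line[:23]                       # start timestamp of the interval
        if "Distributor phase 2 thread load" in line:
            loads[key] = float(line.split()[-4])
        elif "Distributor phase 2 effective speed" in line:
            speeds[key] = float(line.split()[-2])

for key in sorted(loads.keys() & speeds.keys()):
    if loads[key] > LOAD_THRESHOLD and speeds[key] < SPEED_THRESHOLD:
        print("%s  load %.1f%%  speed %.2f MB/s  -> suspect slow storage"
              % (key, loads[key], speeds[key]))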
The examples above show how to find the root cause of two kinds of RecoverPoint high load issues, but high load analysis goes well beyond them. For example, insufficient bandwidth can have many causes outside RecoverPoint, in the networking domain. Similarly, a slow journal volume may be caused by connectivity problems as well as slow storage, both of which need further analysis outside RecoverPoint and are not covered in this article. Note that the causes of performance problems vary and can extend to the other devices in the storage stack.
https://emc.my.salesforce.com/ka07000000052Rx
Author: Fenglin Li
iEMC APJ
Please click here for all contents shared by us.