October 11th, 2017 12:00

CG always in initialising state?

I'm using EMC RP in conjunction with SRM to create our offsite DR capability.

Our primary site has a vSphere environment of around 20 LUNs/vSphere data stores.

Each LUN is set up in a replica set with a LUN on the remote (DR) site.

All LUNs are part of a single CG. The total size of all LUNs at the primary site is around 50 TB.

We have 3 x journal volumes @ 40 GB each on the DR site.

The LUNs are set up for synchronous replication unless latency of >5 ms is observed, at which point they convert to async.
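For context, the journal capacity relative to the protected capacity here works out tiny; as a quick sketch (plain arithmetic, nothing EMC-specific):

```python
# Quick sanity check of journal sizing in this setup (illustrative arithmetic
# only; figures are taken from the description above).
protected_tb = 50.0                # total size of all LUNs in the CG
journal_tb = 3 * 40.0 / 1024.0     # 3 x 40 GB journal volumes on the DR site

ratio = journal_tb / protected_tb
print(f"Journal is {ratio:.3%} of protected capacity")  # well under 1%
```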

We have recently (in the past two weeks) found ourselves in a position where the CG is constantly in a state of initialisation, and we are stumped as to why. The setup was working, and initialising relatively quickly, upon initial deployment. Over time the CG has grown; this is the only variable to have changed.

The platform was installed by EMC with the single CG model on initial procurement and was validated and working fine.

We have another test CG (much smaller, only 2 x 2 TB LUNs) set up in the same way, and it is functioning as expected.

Any ideas on how to troubleshoot or resolve this?

Also, what will the impact on our DR capability be while our live CG is constantly in an initialising state?

Any help would be greatly appreciated,

Jock

2 Intern • 1.1K Posts

October 12th, 2017 01:00

Hi Jock,

I have seen this many times, so I am going to make a presumption. It could be that the number of disks (spindles) allocated to the journal is insufficient to meet the performance demands of the write throughput plus the five-phase distribution to the journal and replica. Some customers decide to use a small amount of journal disk space because they are replicating synchronously and don't necessarily require a long protection window (PiT rollback history). However, this overlooks the requirements of the distribution process.

So when the journal is slow because it has too few disks and therefore too little performance, the system will first move to three-phase distribution and then pause replication to let itself catch up. You can see this as journal lag on the replica copy. Ultimately this results in initialization behavior, because the pause invokes a marking process in the Prod copy journal: the metadata of the incoming writes is marked so that the data can be read and replicated once the system has caught up.

Regards,

Rich

5 Posts

October 12th, 2017 09:00

Really appreciate the reply, Rich.

Can you advise how we can validate that journal performance is impaired? Are there metrics/thresholds/rules of thumb that we can look at?

Would you advise distributing our problematic CG (CG01) over the two RPAs?

I pulled a couple of screenshots/reports; I would really appreciate any feedback you can provide on them.


Thanks again

Attachments: RPA1_Traffic.PNG, RPA2_Traffic.PNG, Group_LAG.PNG

Action may be necessary.

The environment is stable, although groups are not evenly distributed, and this may affect future performance.

To distribute groups evenly across all RPAs, apply the recommended load balancing or manually modify the preferred RPA of each group.

***                                Recommended Load Distribution                                  ***

=====================================================================================================

|         Current          |       Recommended        |   Avg. Throughput    | Avg. Incoming Writes |

|     Preferred RPA        |     Preferred RPA        |       in MB/s        |      per Second      |

=====================================================================================================

| ES-ERP-CG01                                                                                       |

| RPA1                     |      - No Change -       | 13.8032              | 500.026              |

-----------------------------------------------------------------------------------------------------

| ES-ERP-CG03                                                                                       |

| RPA2                     |      - No Change -       | 0.00865437           | 5.82894              |

-----------------------------------------------------------------------------------------------------

| ES-ERP-CG02                                                                                       |

| RPA2                     |      - No Change -       | 0                    | 1.65053              |

-----------------------------------------------------------------------------------------------------

Group ES-ERP-CG01 is not defined as a distributed consistency group, but according to the load balancing analysis it should be. Consider defining ES-ERP-CG01 as a distributed consistency group, and run this command again in 7 days. Note that doing so will cause all journal history to be lost.

Note: The load balancing analysis results have indicated that a new RPA should be added at the cluster to which you are connected. It is recommended that you add the new RPA, set it as the preferred RPA of the group with the highest load, and run the balance_load command again after 7 days.

  *** Traffic per RPA before Application of the Recommendation     ***

  ====================================================================

  |  RPA Name           |   Avg. Throughput   | Avg. Incoming Writes |

  |                     |      in MB/s        |      per Second      |

  ====================================================================

  | RPA1                | 13.8032             | 500.026              |

  --------------------------------------------------------------------

  | RPA2                | 0.00865437          | 7.47947              |

  --------------------------------------------------------------------

  **************************************************

  * RPA2 is transferring the least amount of data. *

  * Average throughput in MB/s         : 0.0086543 *

  * Average incoming writes per second : 7.47947   *

  **************************************************

5 Posts

October 12th, 2017 12:00

Another point to note: these issues started to manifest after 2 x journal pools were created within a mixed FAST-enabled pool (for the impacted CG in question).

I'm starting to read that mixing journal pools with FAST can cause serious performance issues. Is that right?

2 Intern • 1.1K Posts

October 13th, 2017 04:00

Oh yeah! I have a long list of do's and don'ts for the journal, a best-practice breakdown if you like, and I really must put them in a Technical Note or whitepaper. Anyway, wherever possible, try to keep the journal separate. The large sequential write and read I/O pattern of journal distribution does not lend itself to the FAST algorithms, so there will be little benefit, and the data volumes may take performance precedence over the journal volumes. Ideally, you want a dedicated and homogeneous pool for the journal, with enough disks to sustain the peak write throughput, particularly when replicating synchronously.
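As a rough illustration of that sizing idea (the per-disk throughput figure and the 3x distribution factor are assumptions for the sketch, not vendor specifications):

```python
import math

def journal_spindles(peak_write_mb_s, per_disk_mb_s, distribution_factor=3.0):
    """Back-of-envelope estimate of disks needed in a dedicated journal pool.

    distribution_factor loosely accounts for the journal being written to and
    then read back during distribution (an assumption; tune to your setup).
    """
    required_mb_s = peak_write_mb_s * distribution_factor
    return math.ceil(required_mb_s / per_disk_mb_s)

# e.g. ~186 MB/s peak incoming writes on SAS disks at ~100 MB/s sequential each
print(journal_spindles(186, 100))  # -> 6
```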

I would recommend running detect_bottlenecks, using option 4 for the specific time window of the init, and capturing the output in the PuTTY log.

Using distributed consistency groups (DCGs) may actually make the issue worse, depending on your journal capability, because they make it possible to write to the journal even faster!

Feel free to contact me offline if it helps.

Regards,

Rich

5 Posts

October 13th, 2017 13:00

Bottleneck found on group ES-ERP-CG01 on RPA: RPA1 at cluster: em-rp grid: 0

        Remote storage is too slow to handle incoming data rate and regulate the distribution process.

        For normal distribution mode:

          Journal should handle IO of rate                   : 138.286 Megabytes/sec

          Target replication volumes should handle IO of rate: 92.1904 Megabytes/sec

        For fast forward distribution mode:

          Journal should handle                              : 92.1904 Megabytes/sec

          Target replication volumes should handle           : 46.0952 Megabytes/sec
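One way to read these figures: each distribution mode has a minimum rate the journal and target volumes must sustain, and you can compare an estimated journal capability against them. A sketch (the journal capability value below is hypothetical, not taken from the report):

```python
# Required rates from the detect_bottlenecks output above, in MB/s.
required_mb_s = {
    "normal": {"journal": 138.286, "replica": 92.1904},
    "fast_forward": {"journal": 92.1904, "replica": 46.0952},
}

journal_capability_mb_s = 120.0  # placeholder: measure your own journal pool

for mode, need in required_mb_s.items():
    ok = journal_capability_mb_s >= need["journal"]
    print(f"{mode}: journal needs {need['journal']} MB/s "
          f"-> {'OK' if ok else 'too slow'}")
```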

5 Posts

October 13th, 2017 13:00

Hey Rich,

Thanks for your advice...

I have implemented the following changes:

  • Increased the journal to 1 TB (25 x 40 GB journal volumes, which is the maximum limit). This only gives me a maximum of 2% of my overall CG size (50 TB).

  • Do you think it would be wise to break my CG down into approx. 5 TB groups so that I can meet the "rule of thumb" of 20% journal volume size?

  • I have also removed the journals which were in the mixed pool, and now have all journals in a dedicated SAS-based RAID pool.

  • I have disabled TLE and enabled async during initialisation, as I have read that this is best practice also.
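The arithmetic behind the first two bullets, as a quick sketch (the 20% figure is just the rule of thumb mentioned above, not a hard requirement):

```python
# Journal sizing arithmetic for the changes listed above (illustrative only).
journal_volumes = 25
journal_volume_gb = 40
cg_size_tb = 50.0

journal_tb = journal_volumes * journal_volume_gb / 1000.0  # 1.0 TB total
pct_of_cg = journal_tb / cg_size_tb * 100                  # ~2% of the CG

# With the journal capped at ~1 TB per group, a group hitting the 20% rule
# of thumb can protect at most:
max_group_tb = journal_tb / 0.20                           # ~5 TB per group

print(f"journal = {journal_tb} TB ({pct_of_cg:.0f}% of CG); "
      f"max group size at 20% journal = {max_group_tb:.0f} TB")
```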

I've re-run the bottleneck report; output as follows:

====================================================================================================================================

Statistics were found between the times: 2017/10/12 15:31:09.197 GMT ----> 2017/10/13 20:30:53.817 GMT

====================================================================================================================================

System overview of the link on group: ES-ERP-CG01 from site: es-rp to site: em-rp on box: RPA1

Incoming writes rate for link                                      : 14.9308 Megabytes/sec

                                                          Max Value: 186.439 Megabytes/sec

Incoming IOs rate for link                                         : 550.799 IOs/sec

                                                          Max Value: 2545.57 IOs/sec

Total Output rate for link during transfer                         : 7.70773 Megabytes/sec

                                                          Max Value: 18.1582 Megabytes/sec

Initialization output rate for link during init                    : 7.50884 Megabytes/sec

                                                          Max Value: 18.1525 Megabytes/sec

Compression CPU utilization                                        : 6.04526 %

                                                          Max Value: 14.622 %

Percentage time in transfer                                        : 98.8773 % of time

Percentage time of initialization                                  : 98.8771 % of time

Compression ratio                                                  : 1.01065

Deduplication not used                                            

Percentage time of highload                                        : 0 % of time

--------------------------------------------------------------------------------

System overview of the link on group: ES-ERP-CG03 from site: es-rp to site: em-rp on box: RPA2

Incoming writes rate for link                                      : 0.00808709 Megabytes/sec

                                                          Max Value: 0.0766433 Megabytes/sec

Incoming IOs rate for link                                         : 5.73515 IOs/sec

                                                          Max Value: 8.34461 IOs/sec

Total Output rate for link during transfer                         : 3.47637e-06 Megabytes/sec

                                                          Max Value: 0.000189621 Megabytes/sec

Compression CPU utilization                                        : 0 %

                                                          Max Value: 0 %

Percentage time in transfer                                        : 99.1688 % of time

Percentage time of initialization                                  : 0 % of time

Compression ratio                                                  : 0

Deduplication not used                                            

Percentage time of highload                                        : 0 % of time

--------------------------------------------------------------------------------

System overview of the link on group: ES-ERP-CG02 from site: es-rp to site: em-rp on box: RPA2

Incoming writes rate for link                                      : 0 Megabytes/sec

                                                          Max Value: 0 Megabytes/sec

Incoming IOs rate for link                                         : 1.63707 IOs/sec

                                                          Max Value: 1.71881 IOs/sec

Total Output rate for link during transfer                         : 0 Megabytes/sec

                                                          Max Value: 0 Megabytes/sec

Compression CPU utilization                                        : 0 %

                                                          Max Value: 0 %

Percentage time in transfer                                        : 99.1688 % of time

Percentage time of initialization                                  : 0 % of time

Compression not used                                              

Deduplication not used                                            

Percentage time of highload                                        : 0 % of time

--------------------------------------------------------------------------------

System overview of RPA: RPA1 on site: em-rp

WAN throughput from this RPA to es-rp                              : 140.45 Megabits/sec

                                                          Max Value: 802.185 Megabits/sec

Total incoming writes rate for RPA                                 : 0 Megabytes/sec

                                                          Max Value: 0 Megabytes/sec

Incoming IOs rate for RPA                                          : 0 IOs/sec

                                                          Max Value: 0 IOs/sec

Total incoming writes rate for RPA while replicating               : 0 Megabytes/sec

                                                          Max Value: 0 Megabytes/sec

Incoming IOs rate for RPA while replicating                        : 0 IOs/sec

                                                          Max Value: 0 IOs/sec

Initialization output rate for RPA (average over all period)       : 0 Megabytes/sec

                                                          Max Value: 0 Megabytes/sec

Compression CPU utilization                                        : 0 %

                                                          Max Value: 0 %

Replication process CPU utilization                                : 6.64473 %

                                                          Max Value: 13.3746 %

Compression not used                                              

Deduplication not used                                            

--------------------------------------------------------------------------------

System overview of RPA: RPA2 on site: em-rp

WAN throughput from this RPA to es-rp                              : 0.115333 Megabits/sec

                                                          Max Value: 0.675494 Megabits/sec

Total incoming writes rate for RPA                                 : 0 Megabytes/sec

                                                          Max Value: 0 Megabytes/sec

Incoming IOs rate for RPA                                          : 0 IOs/sec

                                                          Max Value: 0 IOs/sec

Total incoming writes rate for RPA while replicating               : 0 Megabytes/sec

                                                          Max Value: 0 Megabytes/sec

Incoming IOs rate for RPA while replicating                        : 0 IOs/sec

                                                          Max Value: 0 IOs/sec

Initialization output rate for RPA (average over all period)       : 0 Megabytes/sec

                                                          Max Value: 0 Megabytes/sec

Compression CPU utilization                                        : 0 %

                                                          Max Value: 0 %

Replication process CPU utilization                                : 0.0815534 %

                                                          Max Value: 0.119915 %

Compression not used                                              

Deduplication not used                                            

--------------------------------------------------------------------------------

System overview of RPA: RPA1 on site: es-rp

Total incoming writes rate for RPA                                 : 14.9308 Megabytes/sec

                                                          Max Value: 186.439 Megabytes/sec

Incoming IOs rate for RPA                                          : 550.799 IOs/sec

                                                          Max Value: 2545.57 IOs/sec

Total incoming writes rate for RPA while replicating               : 14.8917 Megabytes/sec

                                                          Max Value: 186.416 Megabytes/sec

Incoming IOs rate for RPA while replicating                        : 600.702 IOs/sec

                                                          Max Value: 2596.12 IOs/sec

Initialization output rate for RPA (average over all period)       : 7.42452 Megabytes/sec

                                                          Max Value: 18.1525 Megabytes/sec

Compression CPU utilization                                        : 5.9774 %

                                                          Max Value: 14.622 %

Replication process CPU utilization                                : 7.33356 %

                                                          Max Value: 13.3847 %

Compression ratio                                                  : 1.01065

Deduplication not used                                            

--------------------------------------------------------------------------------

System overview of RPA: RPA2 on site: es-rp

Total incoming writes rate for RPA                                 : 0.00808709 Megabytes/sec

                                                          Max Value: 0.0766433 Megabytes/sec

Incoming IOs rate for RPA                                          : 7.37222 IOs/sec

                                                          Max Value: 9.99359 IOs/sec

Total incoming writes rate for RPA while replicating               : 0.00682015 Megabytes/sec

                                                          Max Value: 0.0753656 Megabytes/sec

Incoming IOs rate for RPA while replicating                        : 9.96691 IOs/sec

                                                          Max Value: 13.2874 IOs/sec

Initialization output rate for RPA (average over all period)       : 0 Megabytes/sec

                                                          Max Value: 0 Megabytes/sec

Compression CPU utilization                                        : 0 %

                                                          Max Value: 0 %

Replication process CPU utilization                                : 0.0881831 %

                                                          Max Value: 0.10116 %

Compression ratio                                                  : 0

Deduplication not used                                            

--------------------------------------------------------------------------------

System overview of site: em-rp

WAN throughput from this cluster to es-rp                          : 140.566 Megabits/sec

                                                          Max Value: 802.308 Megabits/sec

Total incoming writes rate for cluster                                : 0 Megabytes/sec

                                                          Max Value: 0 Megabytes/sec

Incoming IOs rate for cluster                                         : 0 IOs/sec

                                                          Max Value: 0 IOs/sec

Total incoming writes rate for cluster while replicating              : 0 Megabytes/sec

                                                          Max Value: 0 Megabytes/sec

Incoming IOs rate for cluster while replicating                       : 0 IOs/sec

                                                          Max Value: 0 IOs/sec

Initialization output rate for cluster (average over all period)      : 0 Megabytes/sec

                                                          Max Value: 0 Megabytes/sec

Compression CPU utilization                                        : 0 %

                                                          Max Value: 0 %

Compression not used                                              

Deduplication not used                                            

--------------------------------------------------------------------------------

System overview of site: es-rp

Total incoming writes rate for cluster                                : 14.9388 Megabytes/sec

                                                          Max Value: 186.447 Megabytes/sec

Incoming IOs rate for cluster                                         : 558.171 IOs/sec

                                                          Max Value: 2552.9 IOs/sec

Total incoming writes rate for cluster while replicating              : 14.8985 Megabytes/sec

                                                          Max Value: 186.424 Megabytes/sec

Incoming IOs rate for cluster while replicating                       : 610.669 IOs/sec

                                                          Max Value: 2605.99 IOs/sec

Initialization output rate for cluster (average over all period)      : 7.42452 Megabytes/sec

                                                          Max Value: 18.1525 Megabytes/sec

Compression CPU utilization                                        : 5.9774 %

                                                          Max Value: 14.622 %

Compression ratio                                                  : 1.01065

Deduplication not used                                            

--------------------------------------------------------------------------------

2 Intern • 1.1K Posts

October 16th, 2017 15:00

So what does the journal pool currently consist of, in terms of disk type and numbers?
