Unsolved
10 Posts
0
July 25th, 2013 07:00
Dual Drive Ownership - Multiple Raid Groups or a Single Raid Group?
Hello All!
We are in the process of reconfiguring our SSD-backed CX4-120 (a single shelf with 15 disks) and ran across the following note in the EMC Clariion Best Practices for Performance and Availability regarding dual drive ownership (screenshot, because the PDF doesn't seem to allow copy/paste).
The key points for me are:
- Drives are dual-ported and can accept IO from both SPs at the same time.
- Dual ownership may result in less predictable drive behavior (higher response times) than single ownership because of deeper drive queue lengths.
- Single SP ownership of a drive is preferred.
- I am assuming all of the above points are valid for SSD as well.
The catch for us is that we are not adding another shelf of disks and creating a second RAID group; that would make things very easy. Instead, we are considering carving the existing 15-disk SSD shelf into two RAID groups rather than creating a single 15-disk RAID group (all LUNs belonging to a RAID group would be owned by one SP). That would cut the IOPS/throughput available to a single SP/RAID group in half, but the total available throughput would stay the same.
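To put rough numbers on that reasoning, here is a back-of-the-envelope sketch in Python. The per-drive IOPS figure is just a placeholder I'm assuming for illustration, not a vendor or measured number.

# Rough comparison of one 15-disk RAID group vs. two smaller groups,
# each owned by a different SP. PER_DRIVE_IOPS is an assumed placeholder.
PER_DRIVE_IOPS = 3500                                  # illustrative per-SSD IOPS only
single_rg = 15 * PER_DRIVE_IOPS                        # everything behind one SP
split_rg = [7 * PER_DRIVE_IOPS, 8 * PER_DRIVE_IOPS]    # e.g. a 7-disk and an 8-disk group
print(f"Single RG behind one SP: {single_rg} IOPS")
print(f"Two RGs, one per SP: {split_rg[0]} + {split_rg[1]} = {sum(split_rg)} IOPS aggregate")
# The aggregate is unchanged; what changes is how much of it sits behind each SP.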
We tried the single 15-disk RAID group, weren't completely satisfied with the performance, and then came across the Clariion Performance and Availability guide. That is why we are now considering two RAID groups instead of one.
In this particular scenario, would carving up the SSD disks into two separate Raid Groups make any difference one way or another?
Thanks!
RICKSHAW1
10 Posts
0
July 25th, 2013 10:00
Response time. The latency wasn't as low as we would have liked it to be.
Storagesavvy
474 Posts
0
July 25th, 2013 10:00
What was the issue with performance in your first layout? In other words, what performance problem are you trying to solve: response time? IOPS? Bandwidth? SP utilization imbalance?
Each layout has various pros and cons, and depending on the issue you are trying to solve, you could make performance worse than what you had already achieved.
zhouzengchao
4 Operator
•
1.4K Posts
1
July 26th, 2013 01:00
A single RAID group of 15 SSDs can support a large number of IOPS. If that still can't satisfy your response-time needs, you should take a look at the application profile, which may not be IOPS-intensive but rather bandwidth-intensive. I don't think cutting the RG into two RAID groups would solve the problem.
You should look at the queue length and utilization of your SSD drives (not at the LUN level); higher response time is mostly caused by deeper queue lengths, which means your SSDs might be overloaded. Make sure the 15 SSDs can support your whole application's IOPS, otherwise any tuning or re-layout will make no difference.
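One way to sanity-check that, as a rough sketch only (the per-drive ceiling, the RAID-5 write penalty of 4, and the workload split below are all assumptions to replace with your own numbers):

# Rough check: can 15 SSDs carry the whole application workload?
PER_DRIVE_IOPS = 3500            # assumed per-SSD ceiling, illustrative only
WRITE_PENALTY = 4                # assumed RAID-5 penalty: 1 host write -> ~4 backend IOs
host_read_iops, host_write_iops = 3000, 1250   # example split, replace with real stats
backend_iops = host_read_iops + WRITE_PENALTY * host_write_iops
per_drive_load = backend_iops / 15
print(f"Backend IOPS: {backend_iops}, per drive: {per_drive_load:.0f} (assumed ceiling {PER_DRIVE_IOPS})")
# If the per-drive load approaches the ceiling, the drives are saturated and a re-layout alone won't help.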
Storagesavvy
474 Posts
1
July 26th, 2013 09:00
Keep in mind that most of the recommendations you will see come out of the nuances of spinning disks. As you deepen the queue on a rotating drive you increase latency, because the head is constantly moving around to find the data you are trying to read/write. For flash there is no such rotational latency, as there are no moving parts, so many of the old best practices make little difference. Assuming you have only one LUN, breaking it up into multiple LUNs and striping the data at the host could help lower latency. The multiple LUNs can live in the same RG, but should be evenly distributed across both SPs.
As Steve mentions, the SSD has a finite amount of IOPS available, which varies by IO size, etc. If your IO size is large (256KB, etc.) then you will have higher response times compared to small IO. Definitely check the queue lengths, IOPS, and response time at the individual drive level and see what that looks like. If, for example, the host sees latency of 20ms but the disk latency is 1ms, you probably need to add more IO queues between the host and the storage. You do that by adding SAN paths (with PowerPath) and/or increasing the number of LUNs being used.
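To illustrate that last point, here is a rough sketch using Little's Law (outstanding IOs = IOPS x response time); the IOPS figure and the 20ms / 1ms latencies are just the hypothetical example above, not your measurements.

# Little's Law applied to the hypothetical 20 ms host vs. 1 ms disk example above.
iops = 4000                                # assumed host IOPS for illustration
host_latency_s, disk_latency_s = 0.020, 0.001
in_flight_host = iops * host_latency_s     # ~80 IOs outstanding end-to-end
in_flight_disk = iops * disk_latency_s     # ~4 IOs actually busy on the drives
print(f"Outstanding at host: {in_flight_host:.0f}, at the drives: {in_flight_disk:.0f}")
# A big gap means IOs are queuing between host and array, which is exactly where
# extra LUNs and extra SAN paths add usable queue slots.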
RICKSHAW1
10 Posts
0
July 29th, 2013 12:00
Thanks Richard and Steve. Very helpful responses. Part of our challenge is that we don't have an Analyzer license, so seeing disk queue lengths, response times, SP utilization, etc. on the Clariion requires opening a case with EMC to have them decrypt and analyze the NAZ file. Not sure why the decision was made to forgo an Analyzer license; having EMC analyze the NAZ file takes way too long.
The workload in question (MS SQL) is both IOPS- and bandwidth-intensive relative to other applications in our environment. We have observed this particular instance of MS SQL doing 210+ MB/sec and 4250 ops/sec (mostly reads). If my calculations are correct, that works out to roughly a 50 KB average block size, so we are probably dealing with quite a few 64 KB blocks.
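The arithmetic behind that estimate, just restating the observed numbers:

# Average IO size from observed throughput and IOPS (decimal MB/KB).
throughput_mb_s = 210          # observed MB/sec
iops = 4250                    # observed ops/sec
avg_io_kb = throughput_mb_s * 1000 / iops
print(f"Average IO size: ~{avg_io_kb:.0f} KB")   # ~49 KB, so plenty of 64 KB IOs in the mix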
Based on the sizing numbers provided by our EMC representative, it looks like we are exceeding the throughput maximums for our Clariion. He didn't give us numbers at a 64 KB block size; however, we have almost doubled what he said the array can handle at an 8 KB block size and a 70r/30w ratio. Our goal now is just to squeeze as much performance as possible out of the Clariion, which is why the discussion about single versus dual drive ownership and one or two RAID groups came up.
Thanks for pointing out the possibility of deep queue lengths. I'll definitely look into that a bit more. Also, if you two are aware of any alternatives to Analyzer when it comes to monitoring Clariion performance, I'd love to hear about them.
Thanks again!
kelleg
4.5K Posts
1
July 29th, 2013 13:00
You also might want to consider the single backend bus on the CX4-120 as the bottleneck. This is a 4Gb bus, or 400MB/s (more like 320MB/s with the FC overhead). As each SSD can potentially handle 100MB/s, four SSDs can overload the backend bus.
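Putting rough numbers on that (illustrative only; the per-SSD 100MB/s is the ballpark figure above, not a measurement from your array):

# Back-of-the-envelope check of the single 4Gb backend bus.
bus_effective_mb_s = 320      # ~320 MB/s usable after FC overhead (figure above)
per_ssd_mb_s = 100            # assumed large-block throughput per SSD (ballpark above)
ssds_to_saturate = bus_effective_mb_s / per_ssd_mb_s
print(f"About {ssds_to_saturate:.1f} SSDs at {per_ssd_mb_s} MB/s each can fill the ~{bus_effective_mb_s} MB/s bus")
# With 15 SSDs on one bus, bandwidth-heavy workloads hit the bus long before the drives.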
There are also certain IO loads that are not as well suited to SSD as others. Take a look at the attached document on page 17.
Glen
1 Attachment
Unified Flash Drive Technology Technical Notes 2010-10.pdf
RICKSHAW1
10 Posts
0
July 30th, 2013 08:00
Ah...we never did take the backend bus into consideration. Thanks for the heads up on that. I will be adding that to my list of items to consider when analyzing performance going forward.
I don't think we are pushing 320 MB/s from the workload the Clariion is hosting, though I will admit that it is difficult to be certain without an Analyzer license. To make things more interesting, we are virtualizing the Clariion behind a NetApp V-Series array, so the write IO profile we see on the MS SQL server ends up in NetApp's write cache. As I understand it, NetApp's write cache then "optimizes" the writes and flushes its data to the Clariion LUNs. Due to the NetApp virtualization layer, I do not believe we can use the IO profile we see between the SQL server and the NetApp as an indicator of the IO profile between the NetApp and the Clariion. There are some metrics we can gather from the NetApp initiator ports, but it certainly would be nice to see stats on the Clariion array itself.
Thank you very much for the Unified Flash Drive Technology Technical Notes PDF. I didn't know such a document existed, and it really answered a lot of questions I had about SSD. A very concise and informative read.
Storagesavvy
474 Posts
1
July 30th, 2013 09:00
To get a little help, do one or both of the following...
1.) Go to support.emc.com, search for VNX Monitoring and Reporting, and download it. Install it on a Windows server and configure it according to the directions (Navisphere CLI is also required). It should take no more than 10 minutes to install and start collecting data from the array. VNX M+R requires a license, but a 90-day eval license is included in the download. This may help you get the data you need.
2.) In Unisphere, configure performance data collection and run it for a day at a 5-minute archive interval (300 seconds). Download the .NAZ file and PM me for an upload link, or use a service like Dropbox to get it to me. I'll take a look.
Adding the V-Series in front of the array completely changes the IO pattern the array will see from what the host is issuing, since ONTAP handles the host IO and then issues IOs to the backend array based on cache hits, write coalescing, etc.
TimH
100 Posts
0
July 30th, 2013 20:00
Rickshaw,
Has this question been answered? Please update and/or mark it answered if so.
Thank you.
Tim
RICKSHAW1
10 Posts
0
July 31st, 2013 13:00
Richard - thanks for the information regarding the VNX Monitoring and Reporting tool. I am definitely going to check it out. I'll give that a shot first and let you know if I need to take you up on your offer to analyze the NAZ file.
RICKSHAW1
10 Posts
0
August 1st, 2013 08:00
Tim - no problem. The question hasn't quite been answered directly, but the responses have been invaluable in helping me analyze performance and determine our best option to squeeze the most performance out of the array for this particular workload.
I believe the answer may be "it depends" so I'll be sure to update this post with what worked best for us and mark the question as answered at that time.
TimH
100 Posts
0
September 6th, 2013 10:00
So what was the outcome?
kelleg
4.5K Posts
0
September 20th, 2013 14:00
Was your question answered correctly? If so, please remember to mark your question Answered when you get the correct answer and award points to the person providing the answer. This helps others searching for a similar issue.
glen