March 27th, 2015 08:00
VMAX20K Performance Issue--maybe.
Customer states his batch job, which normally runs for 12hrs 30mins, is running 2hrs 15mins longer on the VMAX20K.
The latest batch job ran for 14hrs 15mins. I collected a number of performance statistics, some of which are listed below. The host has a single 1TB, sixteen-member thin meta under FAST VP control with a 100,100,100 policy. The device has 689GB in the FC tier and 187GB in the SATA tier. I've never seen any allocated space move up to the FLASH tier. The host is zoned to 4 FAs and PowerPath is configured.
Reads
  Metric           Min     Max      Average
  IOPs             11.8    1162.9   239.2
  MB/s             0.1     5.1      1.1
  I/O size (KB)    0.8     43       6.7
  RT (ms)          0.3     5.6      1.3
  Hit %            74.9    100      92
Writes
  Metric           Min     Max      Average
  IOPs             1.3     832.3    141
  MB/s             0.1     4.9      0.75
  I/O size (KB)    0.9     36.8     9.6
  RT (ms)          0.4     3.6      0.9
  Hit %            98.4    100      96.4
Prefetch
  Prefetch Tracks/sec    0.1 to 70.1, average 17
  Prefetch Tracks/Used   0
As near as I can see, the entire device lives in cache, and response times are excellent. The back-end doesn't appear to be overloaded, although I'm not sure what penalty, if any, prefetching brings if no prefetched tracks are used. The front-end does have spikes where it reaches 70% busy and above, but I would think the low response times and high cache hits trump how busy the FAs are. Apart from the fact that the IOs are small and the reads are 60% random, I'm having a hard time understanding why the perceived performance is so poor.
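One rough way to put the numbers above in context is to multiply average IOPS by average response time, which gives the I/O time accumulated per second of runtime. The sketch below is only an estimate: it assumes the averages hold for the whole 14hr 15min run and that the host issues one I/O at a time (the worst case for the array). Neither assumption is confirmed by the data collected.

```python
# Averages taken from the read and write statistics in this post.
READ_IOPS, READ_RT_MS = 239.2, 1.3
WRITE_IOPS, WRITE_RT_MS = 141.0, 0.9
RUN_HOURS = 14.25  # latest batch run: 14hrs 15mins


def io_seconds_per_second(iops, rt_ms):
    """I/O time accumulated per wall-clock second (IOPS x response time)."""
    return iops * rt_ms / 1000.0


# Combined read and write I/O time per second of runtime.
share = (io_seconds_per_second(READ_IOPS, READ_RT_MS)
         + io_seconds_per_second(WRITE_IOPS, WRITE_RT_MS))

# With strictly serial I/O this is the time spent waiting on the array.
# With overlapping I/O the wall-clock wait is lower, so treat it as an upper bound.
array_hours = share * RUN_HOURS

print(f"I/O time per second of runtime: {share:.3f} s")
print(f"Upper bound on hours waiting on the array: {array_hours:.2f}")
```

Whatever remains of the batch window after this bound is time spent somewhere other than waiting on array I/O.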
PedalHarder
March 29th, 2015 15:00
Unfortunately, there are so many areas to investigate based on your description.
First, do you know that it is storage that has contributed to the blow out?
Were there any changes between the two batch runs? Such as...
- code upgrade
- DB system changes
- Batch scheduling changes
- A new host using the FA ports (are the FA queue metrics and CPU% about the same on each run?)
Was a regular activity NOT performed such as weekly reorg?
Has the number of records in the DB increased suddenly? You can take the IOPS in each interval over the elapsed time of the batch to check this, or confirm with the DBA.
Has the host hit a CPU or memory constraint?
Is there a fabric bottleneck or slow drain device? Check the port error counters to see if there is a buffer credit or fabric infrastructure issue.
I suggest taking a close look: an interval-by-interval comparison of the response times, reads, and writes between the two runs. If there is a 17% difference, then more investigation is required in the array to see why. Otherwise, it is likely the difference is due to a factor outside the array.
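As a rough illustration of the two checks above (total I/O count per run, and the change in response time between runs), here is a minimal Python sketch. The sample values, the 5-minute interval, and all function names are made up for illustration; real inputs would be the per-interval IOPS and response times exported from the performance tool for each run.

```python
INTERVAL_SECONDS = 300   # assumed 5-minute collection interval
THRESHOLD_PCT = 17.0     # difference that would point back at the array


def total_ios(iops_samples, interval_seconds=INTERVAL_SECONDS):
    """Approximate number of I/Os in a run: sum of IOPS x interval length."""
    return sum(iops * interval_seconds for iops in iops_samples)


def weighted_rt(iops_samples, rt_samples):
    """IOPS-weighted mean response time (ms) across all intervals."""
    weighted = sum(i * r for i, r in zip(iops_samples, rt_samples))
    return weighted / sum(iops_samples)


def pct_change(baseline, current):
    """Percentage change of current relative to baseline."""
    return (current - baseline) / baseline * 100.0


def needs_array_investigation(base_iops, base_rt, slow_iops, slow_rt):
    """True if response time grew by at least THRESHOLD_PCT between runs."""
    change = pct_change(weighted_rt(base_iops, base_rt),
                        weighted_rt(slow_iops, slow_rt))
    return change >= THRESHOLD_PCT


# Hypothetical samples, NOT taken from the array in this thread.
baseline_iops, baseline_rt = [200.0, 400.0, 300.0], [1.0, 1.4, 1.2]
slow_iops, slow_rt = [220.0, 440.0, 330.0], [1.0, 1.5, 1.2]

growth = pct_change(total_ios(baseline_iops), total_ios(slow_iops))
print(f"I/O count growth between runs: {growth:.1f}%")
print("Investigate array:",
      needs_array_investigation(baseline_iops, baseline_rt, slow_iops, slow_rt))
```

If the I/O count has grown but response times have not, the longer runtime is more likely explained by extra work (for example more records) than by the array.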
dtaraian
March 30th, 2015 05:00
Thanks for the reply, Jason. I suppose at this point I'm just trying to reconcile what I see as great response by the array with the user's perceived performance. We've since moved the application to another array, and although it's running better there, the user is still not realizing the 12hr 30min targeted time. That being said, given the low response times and high cache hit-to-miss ratio, what would you have focused on to get a clearer understanding of what was going on in the VMAX?
rawstorage
March 30th, 2015 06:00
dtaraian,
If you're looking at an Oracle DB, you may want to have a look at Database Storage Analyzer (DSA). This is part of the latest Unisphere 8.x suite and, as far as I know, doesn't require any additional licensing. It could be that the perceived performance problem is further up the stack. DSA would help you see this if there is a big difference between the performance the array is seeing and what the application is seeing. You should also be able to see where the wait time is being spent while the job is running and work out a possible action plan.
dtaraian
March 30th, 2015 07:00
Since the DB was relocated to a different array, and most of the performance stats have aged out of Unisphere, I'm viewing this now as more of a training/learning exercise. So my question comes down to this--given the array stats above (which look good), what else should I have looked at to single out the array as the problem?
Thanks.
PedalHarder
March 30th, 2015 17:00
Can you confirm that the array response times reconcile with the host recorded response times?
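A minimal sketch of that reconciliation, assuming host-side latencies are available for the same time window (for example from the await columns of `iostat -x` on a Linux host). The array values are the averages posted earlier in the thread; the host values and the function name are hypothetical placeholders.

```python
# Array-side averages (ms) from the statistics posted earlier in this thread.
ARRAY_RT_MS = {"read": 1.3, "write": 0.9}

# Hypothetical host-side averages (ms); replace with real host measurements.
host_rt_ms = {"read": 6.0, "write": 1.1}


def unexplained_latency(host_ms, array_ms):
    """Latency per operation type that the array does not account for.

    A large gap points at host queuing, the HBA, or the fabric
    rather than the array itself.
    """
    return {op: host_ms[op] - array_ms[op] for op in array_ms}


gaps = unexplained_latency(host_rt_ms, ARRAY_RT_MS)
for op, gap in gaps.items():
    print(f"{op}: host {host_rt_ms[op]} ms, array {ARRAY_RT_MS[op]} ms, "
          f"gap {gap:.1f} ms")
```

If the two sets of numbers roughly agree, the array figures can be trusted as the latency the application actually experiences.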
dtaraian
April 10th, 2015 10:00
Hi Jason, this was part of the larger problem. The application folks weren't willing to run any perf monitoring on their side, so I was in the position of having to prove it wasn't a storage problem. So it goes back to my original question--outside of the metrics above is there something else I should have looked at on the array?
Thanks.
PedalHarder
April 10th, 2015 11:00
It's tough when you don't have a customer that wants to collaborate, or the evidence to compare the host and array stats. IMHO, without the data you can't tell if you are dealing with an actual performance issue or the perception of a performance issue. What you do have in this case is very good evidence of the array performing well. The response time metrics you have speak for the performance of the array, and you should have confidence that, from an array perspective, there is no performance issue. I hope you get the opportunity to go back to the customer and talk about the best way to handle a reported performance issue in the future. An evidence-based approach is the only way!
Remember also that performance can be recorded at the application/DB layer as well as the platform/OS layer. If these are different teams, maybe you can work around your roadblocks.