November 28th, 2012 15:00
Application slowness issue
Hi All
I am facing an issue analyzing a problem with one of the 1000+ VMs on our CX4-960. We run ESX 4.0 with 1000+ VMs in multiple clusters. The VM in question is RHEL 4.0 on ESX 4.0, connected to a CX4-960 running 4.30.xx.xx.525, and it is mounted on 6 different LUNs from different RAID groups on the array. The VM has an Oracle DB running on it. The problem statement: they see high IO wait only from 07:00 to 08:00 GMT, which is when two different activities happen: (1) users log in and (2) Oracle scheduled jobs run. The rest of the day it runs perfectly fine with no issues. The server team says CPU and memory on the server are under control, and the only bottleneck they see is IO wait. So I collected NAR files for a couple of days, and this is what I see.
Of the 6 LUNs, two are RAID 1/0. One is on a 4+4 RAID group, while the other is a metaLUN spread across two different 4+4 RAID groups (i.e. 16 spindles). I analyzed all 6 LUNs and found only one suspect: a 350 GB metaLUN residing on two 8-spindle RAID 1/0 RAID groups.
This is a striped metaLUN with a 175 GB component carved on each of the two 4+4 RAID 1/0 groups (FC 15k rpm drives).
Below is what I see in the NAR files:
LUN utilization during the slowness period is 100%
LUN total throughput is 160 IOPS
The read:write ratio is roughly 80:80 IOPS
Read bandwidth is barely 1 MB/s, but writes come in big chunks of up to 8-10 MB/s with IO sizes over 128 KB
The LUN does not show a single SP forced flush
SP utilization during this period is barely 45% (both SPs, with SP cache staying within the 60-80 watermarks); no flushing is occurring
LUN queue length is between 8 and 14, the same as ABQL
LUN response time is around 300 ms, but I don't think this by itself is the problem since the IO is being serviced by cache
I drilled down to the RAID groups/disks and found them very lightly loaded:
Disk utilization in both RAID groups is under 10%
Disk queue length in both RAID groups is 0.50 to 0.80
Disk response time ranges from 15 ms to 69 ms (this, I feel, is an issue)
Each disk in both RAID groups is transferring at 1-2 MB/s
IOPS per drive is not above 40 (a rough sanity check of these numbers is sketched below this list)
Write IO size on each drive appears to be above 96 KB
Read and write bandwidth are both below 1 MB/s per drive
ABQL is between 8 and 12
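As a rough back-of-the-envelope check on the numbers above (a sketch only, assuming the usual RAID 1/0 write penalty of 2 and an even spread across the 16 spindles behind the metaLUN):

# Back-of-the-envelope check of the NAR figures above (illustrative only).
# Assumes a RAID 1/0 write penalty of 2 (each host write lands on both mirrored
# disks) and an even spread of IO across the metaLUN's spindles.
read_iops = 80                 # host reads/s seen on the LUN
write_iops = 80                # host writes/s seen on the LUN
raid10_write_penalty = 2
spindles = 16                  # two 4+4 RAID 1/0 groups behind the striped metaLUN

backend_iops = read_iops + write_iops * raid10_write_penalty
print(f"back-end IOPS: {backend_iops}")                    # 240
print(f"per-disk IOPS: {backend_iops / spindles:.0f}")     # ~15, well within what a 15k FC drive can handle

So from a pure spindle-load point of view the drives are nowhere near saturation, which matches the sub-10% disk utilization.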
I verified that all the other LUNs in the two RAID groups are very quiet during the problem period, so it basically comes down to this LUN and its component LUN in the other RAID group.
How do I judge whether this LUN really has issues getting its IOs processed? My disk response time is high, yet I don't see a single spike in my SP cache. Write IO sizes are bigger, so bandwidth is higher on the LUN, but the same figures are low on the drives.
Is there any other parameter that could give me a clue here? The problem has been appearing for 3 days now and it is getting on my nerves.
I am not very knowledgeable about Linux. The server owner provided the host IO output when asked, and I was only able to drill down to devices sdc/sdc1. I have attached the output in text format to this discussion.
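For anyone who wants to sanity-check it, something like the quick sketch below could be used to pull the await and %util figures for sdc out of the capture. This is only a sketch: it assumes the file contains standard "iostat -x" extended-statistics blocks with a Device header row, and the filename is just a placeholder for the attachment.

# Quick sketch: summarize await and %util for one device from an "iostat -x"
# capture. Assumes the standard layout with a "Device:" header row; the attached
# output may be formatted differently, and the filename below is a placeholder.
def summarize_device(path, device="sdc"):
    header = None
    awaits, utils = [], []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if parts[0].startswith("Device"):
                header = parts                      # column names, e.g. ... await svctm %util
            elif header and parts[0] == device:
                row = dict(zip(header, parts))
                awaits.append(float(row["await"]))
                utils.append(float(row["%util"]))
    if awaits:
        print(f"{device}: avg await {sum(awaits) / len(awaits):.1f} ms, "
              f"peak await {max(awaits):.1f} ms, peak %util {max(utils):.1f}")

summarize_device("iostat_capture.txt")              # placeholder filename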
This LUN is in a MirrorView/A relationship.
Any advice on this at the earliest would be highly appreciated.
thanks
Firoz
Storagesavvy
November 29th, 2012 10:00
Firoz,
I took a look at the NAR and it looks like it begins at about 7:15am, 15 minutes into the 7-8am time frame you are concerned about. Can you confirm that? Assuming that's true, here's what I see.
From 7-8am there is a batch job, backup, or some other activity occurring on Meta 180 that drives it up to about 10 MB/s of writes, with write IO sizes around 200-300 KB. Response time climbs to over 300 ms, which is likely contributing to the IO waits on the host. Looking at the write cache stats, these writes appear to be random, i.e. to have very little locality: the FAST Cache hit percentage is low during that time and the SP write cache rehit ratio is near zero.
The queue lengths on LUN 180, its component LUNs, and the underlying disks are all reasonable, which points to a bottleneck elsewhere. I suspect two possible causes right now:
1.) Copy-on-first-write (COFW) activity related to MV/A during the 7-8am window is causing the high response times. It looks like MV/A updates run about every 30 minutes, and in this NAR one update finishes around 7:20 and another starts around 7:45. During an active update, any write to a block that is part of the update requires a COFW to the Reserve LUN Pool. Since these writes are large, a single 256KB write requires 4 chunks to be transferred to the RLP, which actually translates to 16 IOs since the RLP is RAID5 (see the rough arithmetic sketched after this list). The RLP LUN (621) shows as 100% busy during the high response time period. Also, since the writes have low locality of reference, large amounts of RLP copying are likely during that time.
a.) Moving the RLP LUNs to RAID10 could help
As a test, see if you can disable the automatic MV/A updates for an evening and let the 7-8am period run with no MV/A update. If there is no active update, there will be no COFW activity and the Reserve LUN Pool will be out of the picture. If the response time improves, you have a place to focus.
2.) It's also possible that the host needs more disk IO queues. If the process they are running is highly multi-threaded, you could be bottlenecking on the host IO queue side. The way to solve this is to present more than one array LUN to the host. Currently, even though the metaLUN has two components across 16 disks, the host sees just one LUN and one queue. If you did away with the meta and instead created two LUNs, one in each RG, assigned one to SPA and the other to SPB, and then had the host stripe across those LUNs with whatever volume manager or filesystem it has (ASM, VxFS, ZFS, etc.), you would double the number of IO queues at the host, balance the load across the two SPs, and the second LUN would get its own RLP LUN as well. If the IO queue is the bottleneck, this would improve performance even though you are using the same two 4+4 RGs.
BIG CAVEAT: If you use two LUNs instead of a single metaLUN, you MUST put the two LUNs into a MirrorView consistency group or you won't be able to recover the data on the target.
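To put rough numbers on the COFW overhead from point 1 (an illustrative calculation only, using the 64KB chunk size implied by "256KB = 4 chunks" and a worst-case RAID5 small-write penalty of 4; actual overhead depends on RLP caching and alignment):

# Illustrative COFW overhead during an active MV/A update. Assumes 64KB RLP
# chunks (implied by "256KB write = 4 chunks") and a worst-case RAID5 small-write
# penalty of 4 back-end IOs per chunk; real overhead varies with RLP caching.
host_write_kb = 256
chunk_kb = 64
raid5_write_penalty = 4        # read data + read parity + write data + write parity

chunks_copied = host_write_kb // chunk_kb                   # 4 chunks copied to the RLP
rlp_disk_ios = chunks_copied * raid5_write_penalty          # ~16 back-end IOs on the RLP disks
print(f"one {host_write_kb}KB host write during an update -> "
      f"{chunks_copied} COFW chunks -> ~{rlp_disk_ios} RLP disk IOs")

Multiply that by a stream of large, random writes and it's easy to see how the RLP LUN ends up 100% busy while the source LUN's response time climbs.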
Separately from the metaLUN performance, you are also seeing Queue Fulls pretty much all day on some of the front-end ports. Ports 5 and 6 on both SPs are the worst, with B5 and B6 the standouts. It would be worth spending some time checking HBA execution throttles and the number of initiators per port, to see if you can adjust the throttles down and relieve the port queuing. Each front-end port has an effective queue depth of 1600, so if many hosts share the same port and they all have high execution throttles, the port queue can fill up. Queue Fulls result in periodic slowdowns on the host side.
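As a rough way to reason about the throttles (the initiator count and throttle value below are made-up placeholders, not numbers from your array; plug in the real counts per port):

# Rough front-end port queue math. The initiator count and per-HBA execution
# throttle are hypothetical placeholders; substitute the real values for each port.
port_queue_depth = 1600        # effective queue depth per CX4 front-end port
initiators_on_port = 60        # e.g. HBAs logged in to port B5 (placeholder)
execution_throttle = 64        # per-HBA execution throttle (placeholder)

worst_case = initiators_on_port * execution_throttle
print(f"worst-case outstanding IOs on the port: {worst_case}")
if worst_case > port_queue_depth:
    suggested = port_queue_depth // initiators_on_port
    print(f"exceeds the port queue of {port_queue_depth}; a throttle of about "
          f"{suggested} per HBA would keep the worst case within it")

If the worst case comfortably exceeds 1600, that lines up with the Queue Fulls you are seeing on B5/B6.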
Hope that helps a bit.
Richard
Storagesavvy
November 28th, 2012 16:00
Since the busy LUN is in an MV/A relationship, you should also check the performance of the Reserve LUN Pool disks. Heavy, large writes (which you seem to be seeing) can cause contention in the RLP, which would slow down the production LUN. Typically this only occurs while an MV/A update is active, since MV/A only maintains snapshot tracking during an update.
Any chance you can provide the whole NAR file?
firozg
November 28th, 2012 16:00
Hi Richard
Thanks for pointing out a very important part of the whole analysis. I have attached the NAR files to this post.
A few more details: the metaLUN is number 180, in RAID groups 76 and 78 (base LUN 180 and its component LUN 181). The problem occurs between 07:00 and 07:45 GMT.
Let me know if you need anything else.
The RLPs are in dedicated RAID groups: 62, 63, 64, 65, 100 and 101.
thanks
Firoz
firozg
November 29th, 2012 13:00
Hi Richard
Thanks, and I appreciate the research and the time you spent on the analysis. There have been a few developments later today. I got in touch with an Oracle expert for a second opinion on why the redo logs are being written in such large sizes. Following that, I got hold of the AWR report of the DB captured during the problem window, i.e. 07:00 to 08:00 am. A friend of mine with over 10 years of experience on Oracle systems helped me with his findings on the report. I have copied the gist of what he suggested below, as it points toward a known bug in the system.
I have also attached the AWR report to this thread in case any DBA wants to have a look and add their thoughts.
The AWR output summary is below:
As per the attached AWR report, your pain area is the event "cursor: pin S wait on X" in the Top 5 Timed Foreground Events section, which you can see in the first screenshot below. Because of it, a single execution of the SQL ID for an Oracle default job is taking 1865 seconds, which is very high. This is an Oracle default job and the issue is not related to storage. The DBA team should open an SR with Oracle; it could be a bug that has been fixed in version 11.2.0.2.2, or better still, upgrade to 11.2.0.3.
You can also check the rest of the details for this query to see where it is spending its time; the best suggestion is that an SR be opened with Oracle.
Second, you can check the parameter below with the Oracle DBA team; if it is set to true, changing it to false may resolve your issue.
The "cursor: pin S wait on X" wait event may indicate a mutex problem.
From 10.2.0.2 onward, Oracle uses mutexes instead of library cache latches when "_kks_use_mutex_pin"=true.
So, change that parameter value from true to false:
SQL>alter system set "_kks_use_mutex_pin"=false scope=spfile;
SQL>shutdown immediate
SQL>startup
[https://metalink2.oracle.com/metalink/plsql/f?p=130:14:8905679782306617791::::p14_database_id,p14_docid,p14_show_header,p14_show_help,p14_black_frame,p14_font:NOT,564719.1,1,1,1,helvetica]
This is the query which is hitting performance:
==================================================================================
As per his suggestions this looks more like a DBA matter, but I just wanted a second opinion on it.
In the meantime, I have changed the MirrorView/A update to manual to see if it makes any difference.
regards
Firoz
firozg
November 29th, 2012 14:00
Hi Richard
I have reverted the change I made to the MirrorView/A sync and set it back to automatic, but in the process I realized that RLP LUN 621 had its read/write cache disabled. I remember disabling cache on some RLP LUNs long ago to reduce their cache usage, which at the time was causing forced flushes.
I have enabled read/write cache on RLP LUN 621 and will monitor it through the 07:00-08:00 window on 30th Nov.
I have also asked the DBA team to check with Oracle on the points my friend noted in the AWR report.
I will update you tomorrow morning if enabling the cache makes any difference.
thanks again
Firoz
Storagesavvy
November 29th, 2012 14:00
Ahh, I noticed the cache check boxes were unchecked, but when I look at NAR files offline I don't always trust those things, so it's good that you checked. Let us know how it affects things.
As with any application, if the app is doing something that drives high load unnecessarily for the application's purpose, definitely try to resolve that before re-architecting any backend storage.
Richard J Anderson
firozg
December 16th, 2012 11:00
Hello Richard
We have been observing the application's performance for the past two weeks since I enabled the cache on the RLP LUN, and it looks like that was indeed the problem. I ran NAR files again and this time the LUN response time came down drastically, from 300 ms to 10-15 ms. Thank you very much for pointing out the very important area that I had missed originally.
I will close this discussion, as the problem has not recurred in the past two weeks.
thanks again
Storagesavvy
December 17th, 2012 10:00
Great to hear! Thanks for the follow up.
Richard J Anderson