October 31st, 2013 09:00
slow NFS response - high average call time
Recently a number of NFS clients began logging messages such as "nfs: [kern.notice] NFS server x.x.x.x not responding still trying". The back-end statistics show no heavy load, no queuing, and negligible SP load, and the load on the Celerra, including network traffic in and out, is the same as on any other day. The network team doesn't see any pause frames or errors.
Next I will start capturing packets, but I wondered whether anyone has seen this before, or whether someone can explain how "average call time" is defined. All we know is that the clients' slowdowns coincide with the spikes shown in the Celerra monitor. I've looked through the various NFSv3 statistics and see nothing out of the ordinary; nothing correlates with the call time spikes.
Thanks
umichklewis
October 31st, 2013 13:00
Are your clients possibly using LACP or EtherChannel? Sometimes a simple misconfiguration causes odd behavior that shows up as lengthy call times. While not directly your issue, I had a problem with HP link aggregation on the host NIC and iSCSI on the Celerra. Small-block I/O was fine, but any streaming I/O showed 120 ms average wait times and a "permanent" queue of 1 I/O. We disabled LACP on the host and have never had a problem since.
Let us know what you find!
Karl
Rainer_EMC
November 2nd, 2013 03:00
Most of these cases are due to the network.
Of course, the network admins always say everything is fine.
downhill2
November 4th, 2013 10:00
I also believe the issue is network related; I just need to start capturing packets when the problem recurs, since it isn't practical to save all the traffic coming into the machine over the LAN. The Nexus admin has a hunch that this is a frame size mismatch, but the MTUs on the client and the NAS interface match, so I doubt that is it. I will update as we learn more.
Thanks
bergec
November 4th, 2013 11:00
Check the MTU size (is there a VPN?) as well as speed/duplex (these should be the same on the switch and the Data Mover).
Also check the retransmit rate: on the Data Mover side, run "server_netstat server_X -p tcp -s" twice, a few minutes apart, and calculate the retransmit percentage.
Claude
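A minimal sketch of the retransmit-percentage arithmetic, in Python. It assumes you have read two counters from each `server_netstat ... -p tcp -s` snapshot, total TCP segments sent and segments retransmitted (which exact output lines to read is an assumption here), and that the counters were not reset or wrapped between the two runs. The function name `retransmit_percent` and the sample values are hypothetical.

```python
def retransmit_percent(sent_before: int, retrans_before: int,
                       sent_after: int, retrans_after: int) -> float:
    """Percentage of TCP segments retransmitted between two counter snapshots.

    Assumes both snapshots come from the same Data Mover and that the
    cumulative counters were not reset or wrapped in between.
    """
    sent = sent_after - sent_before            # segments sent during the interval
    retrans = retrans_after - retrans_before   # segments retransmitted during the interval
    if sent <= 0 or retrans < 0:
        raise ValueError("counters must increase between snapshots")
    return 100.0 * retrans / sent


if __name__ == "__main__":
    # Hypothetical counter values, for illustration only:
    # 100 retransmits out of 300,000 segments sent in the interval.
    pct = retransmit_percent(1_000_000, 500, 1_300_000, 600)
    print(f"{pct:.4f}%")  # prints 0.0333%
```

The sketch works on deltas rather than raw totals because the netstat counters are cumulative, so a single snapshot mixes in traffic from long before the slowdown.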
downhill2
November 4th, 2013 11:00
OK, the MTU on the NAS and switch ports is 9000. The clients (for the most part) are as well. A few still need to be changed, but oddly enough they are not logging any NFS timeout messages. The retransmit rate for my interval appears to be 0.0338%.
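Matching interface settings don't prove that jumbo frames pass end to end. One common check is a ping with the don't-fragment bit set and a payload that exactly fills the MTU. The Python sketch below does that arithmetic for IPv4 (a 20-byte header, assuming no IP options, plus the 8-byte ICMP header) and flags hosts whose configured MTU differs from the expected 9000. The helper names `max_icmp_payload` and `mtu_mismatches` and the host inventory are hypothetical, and ping flags differ by OS (on Linux, for example, `ping -M do -s 8972 <host>`).

```python
from __future__ import annotations

IPV4_HEADER_BYTES = 20  # assumes no IP options
ICMP_HEADER_BYTES = 8


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu < overhead:
        raise ValueError("MTU too small for an IPv4 ICMP echo")
    return mtu - overhead


def mtu_mismatches(host_mtus: dict[str, int], expected: int = 9000) -> list[str]:
    """Return the hosts whose configured MTU differs from `expected`, sorted."""
    return sorted(host for host, mtu in host_mtus.items() if mtu != expected)


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    hosts = {"nas-dm2": 9000, "client-a": 9000, "client-b": 1500}
    print(max_icmp_payload(9000))  # 8972
    print(mtu_mismatches(hosts))   # ['client-b']
```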