July 17th, 2012 11:00

Microsoft HPC having problems with drive mappings going to CIFS

I am posting this here on the off chance someone may have an idea or can suggest something to check.

We have an Actuarial modeling application called MoSes. It is built on a Microsoft HPC 2008 R2 cluster with a head node and nine compute nodes.

MoSes runs from a Terminal Server that has an HPC integration layer module installed, which allows the end user to submit work from the Terminal Server to the HPC cluster.

The MoSes application is set to use a CIFS share on an NS-480 (Celerra) as its working drive.
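For reference, the working drive is a normal Windows drive mapping to a CIFS server on the Celerra, along the lines of the command below (the server and share names are made up for illustration):

net use M: \\celerra-cifs01\moses_work /persistent:yes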

What we have found is that lately, MoSes has been failing when it tries to run a job with certain parameters set. MoSes tech support has been scratching their heads looking at it, but they did find one interesting thing: if they point MoSes to use a share created on a Windows server as its working drive, it works. When they repoint it back to the CIFS share, it stops working. This only happens with specific parameters set; at other times it can read/write to the CIFS share without issue.

From what we have been able to see, there are no error logs, permission problems, or anything else we can use to start troubleshooting. It just fails when going to the CIFS share, leaving nothing else to go on.

Does anyone have any knowledge of the above, or can you recommend something to look at or try?

Regards.

26 Posts

July 24th, 2012 06:00

To give an update: this problem turned out to be an issue with CAVA (Celerra Anti-Virus Agent).

After sending the packet trace to EMC, their engineers analyzed it and determined the SMB2 client was getting a "STATUS_ACCESS_DENIED" message. This is the comment written by the engineer:

The STATUS_ACCESS_DENIED error should only be generated if the account that is logged in does not have access to the file attributes due to the ACLs. It would appear that in this instance the STATUS_ACCESS_DENIED errors are probably being incorrectly generated by the NAS code.

As already suggested, the 'waitTimeout' should be set to a large value (e.g., 1000 seconds) for this customer to produce essentially equivalent processing. The viruschecker.conf file does not indicate that the 'waitTimeout' parameter was set to 1000 seconds.
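For anyone who runs into the same thing, the fix comes down to one parameter in the Data Mover's viruschecker.conf. A rough sketch of the workflow (the waitTimeout name and the 1000-second value come from the engineer's comment above; confirm the exact line format and units for your DART release with EMC support):

/nas/bin/server_file server_2 -get viruschecker.conf viruschecker.conf

Add or adjust the line below in the local copy, push the file back with server_file -put, and then have CAVA re-read its configuration per the EMC CAVA documentation:

waitTimeout=1000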

9 Legend

20.4K Posts

July 17th, 2012 11:00

I would run a tcpdump capture while the system is trying to write to the CIFS share. You can open it in Wireshark and see if anything strange stands out, then open a ticket with support and attach the capture. Try to filter the capture so it only captures traffic from that specific host, e.g.:

/nas/sbin/server_tcpdump server_2  -start fsn1 -s 400 -host 10.23.4.4 -w /emclogs/intapp1.pcap
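Once the capture is open in Wireshark, one quick thing to check (just a suggestion, not specific to Celerra) is to filter for SMB2 responses that came back with a non-success status, for example:

smb2.nt_status != 0x00000000

Any errors the Data Mover is returning to the client will stand out immediately.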

To capture external network traffic from the Data Mover, do the following:

  1. If /nas/sbin/server_tcpdump does not exist, run the following command to create a link for the server_tcpdump command as a root user:

     #ln -s /nas/bin/server_mgr /nas/sbin/server_tcpdump

     The syntax for the command is as follows:

     usage: server_tcpdump { <movername> | ALL }
               -start <device> [-promisc] -w <outfile>
                      [-host <host>] [-s <snaplen>] [-max <maxsize>]
             | -stop <device>
             | -display

     * The "device" is the interface on the Data Mover from which you wish to capture traffic. In these examples a trunk device called trk1 is used. The device name should be the one assigned the Data Mover IP address you wish to monitor; use the server_ifconfig server_x -a command to determine which device name to use.

     * The "outfile" is the name of the file the captured data will be written to. It must be a file on a file system mounted on the Data Mover the capture is run from, so either use a file on an existing customer file system or create a temporary file system to hold the capture file.

     * The "host" can be specified by IP address only; name resolution will not be used.

     * The "snaplen" is the amount of data from each packet that will be captured, in (decimal) bytes. It is used to limit the amount of data captured; the default capture size is 96 bytes per packet.

     * The "max" is the maximum size of the capture file. After the "max" size is reached, a second file with "-1" appended to the original name is created, and data wraps around and overwrites these files if the trace is not stopped. Make sure the "max" value is set large enough to capture the needed data, but not so large as to fill up the file system.

     Example:

     /nas/sbin/server_tcpdump server_2 -start trk1  -w /dm2/tcpdump.cap

     

  2. Monitor the progress of the capture by using the following command:

     /nas/sbin/server_tcpdump server_x -display

     

  3. Stop the capture by using the following command:

     /nas/sbin/server_tcpdump server_x -stop trk1

     

  4. The Linux Control Station can be used to display the capture file, or it can be viewed in more detail with Wireshark, which is available free from www.wireshark.org. To view the capture file using the Control Station, issue the following command as root:

     /usr/sbin/tcpdump -r /nas/rootfs/slot_2/dm2/tcpdump.cap   |more
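     If the capture is large, a standard tcpdump filter expression can be appended to narrow the output, for example to traffic between the client and the CIFS port (10.23.4.4 is the example client IP used earlier; port 445 is the CIFS/SMB port):

     /usr/sbin/tcpdump -r /nas/rootfs/slot_2/dm2/tcpdump.cap 'host 10.23.4.4 and port 445'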

Some caveats about using tcpdump on Celerra:

   * Exercise caution about the amount of data being captured and the available space in the file system where the capture file is being stored. Closely monitor this process, especially if the file system being used is a production file system.

   * The server_tcpdump command supports running simultaneous captures on two interfaces per Data Mover. You must start them separately, and they must be saved to different capture files.

   * If you unmount a file system to which a capture is writing, the capture will be put into an error state and must be cleaned up manually. The error state is visible in the server_tcpdump -display output.

When using this utility for Windows troubleshooting, it is recommended to use the -s 400 option so that complete SMB header information is captured. If a specific client can be identified, use the -host option with that client's IP address (e.g., -host 168.158.xx.xx).
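Putting that together, an end-to-end run against a single client would look something like this (the interface, output path, and client IP are the example values used above and should be adjusted to your environment):

/nas/sbin/server_tcpdump server_2 -start trk1 -s 400 -host 10.23.4.4 -w /dm2/tcpdump.cap

/nas/sbin/server_tcpdump server_2 -display

/nas/sbin/server_tcpdump server_2 -stop trk1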

275 Posts

July 17th, 2012 11:00

Any information on the parameters that cause the job to fail?

Have you run a trace on the Windows server or opened a case?

Claude

26 Posts

July 17th, 2012 12:00

One interesting thing I just found out: two days before this problem started occurring, the NAS DART code was upgraded from 6.0.40.8 to 6.0.60.2.

As for the parameters that change within the MoSes application, it is an application parameter that specifies how the application divides up work and hands it to the compute nodes. If you happen to know MoSes, it is setting a job to run with 1 iteration instead of 1000.

26 Posts

July 17th, 2012 12:00

Also, I will be opening a case with EMC to troubleshoot further.

9 Legend

20.4K Posts

July 24th, 2012 06:00

Why did it cause an issue only when certain parameters were used in your application?

26 Posts

July 24th, 2012 06:00

I can't answer that. I assume the application accesses data differently with different parameters, which causes this to happen.
