January 31st, 2017 08:00
SDS failure on disk removal
Hello
I am testing a ScaleIO environment in our lab running five ESXi hosts, with the devices presented as RDMs to the SVMs.
So far it has worked as expected, but I have run into an issue while running failure scenarios.
Namely, if I remove a drive from a slot to simulate a drive failure, the SDS freezes and keeps retrying access to the drive instead of marking it as failed and moving on. The whole SDS becomes unresponsive to storage I/O until I remove the failed RDM device from the node, at which point it comes back alive.
The ESX host does discover the RDM device failure, so the handling problem seems to be on the ScaleIO side.
Currently the disk failure brings the whole SDS down. I can see in /var/log/messages that it keeps retrying the device:
Jan 31 12:04:38 ScaleIO-SUBNET1-105 kernel: [519859.731734] sd 1:0:0:0: timing out command, waited 12s
Jan 31 12:04:38 ScaleIO-SUBNET1-105 kernel: [519859.731738] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Jan 31 12:04:38 ScaleIO-SUBNET1-105 kernel: [519859.731741] sd 1:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 40 00 00 02 00
Jan 31 12:04:38 ScaleIO-SUBNET1-105 kernel: [519859.731746] end_request: I/O error, dev sdb, sector 64
Jan 31 12:04:50 ScaleIO-SUBNET1-105 kernel: [519871.723247] sd 1:0:0:0: timing out command, waited 12s
Jan 31 12:04:50 ScaleIO-SUBNET1-105 kernel: [519871.723252] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Jan 31 12:04:50 ScaleIO-SUBNET1-105 kernel: [519871.723255] sd 1:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 40 00 00 02 00
Jan 31 12:04:50 ScaleIO-SUBNET1-105 kernel: [519871.723260] end_request: I/O error, dev sdb, sector 64
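The 12-second value in those retries looks like the per-device SCSI command timeout, which is exposed through sysfs inside the SVM. I don't know whether ScaleIO sets or honours this value, so treat this as a guess:

cat /sys/block/sdb/device/timeout    # command timeout in seconds for the pulled disk

Lowering it would only shorten each retry, though; it doesn't stop the OS from retrying altogether.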
I am running sds-2.0.12000.122 on ESXi 6.0 U2a.
Has anyone run into similar situations, or any indication of where I could configure the system to discard the device after so many retries?
Or is there a way to have VMware automatically remove an RDM whose backing device is in PDL (Permanent Device Loss)?
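For the VMware side, the only candidate I have found so far is the Disk.AutoremoveOnPDL advanced option on the ESXi host, which removes devices that have entered PDL state. Whether it will actually detach an RDM that is still mapped to a powered-on SVM is an assumption I have not verified:

esxcli system settings advanced list -o /Disk/AutoremoveOnPDL    # check the current value
esxcli system settings advanced set -o /Disk/AutoremoveOnPDL -i 1    # enable auto-removal on PDL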
BaDMaN1
February 13th, 2017 05:00
Hi,
I've had similar experiences. What version are you running?
What hardware is this running on?
I'm running on Dell R730xd servers and have been asked by support to update the RAID controller firmware and install the latest VMware driver.
RHasleton1
February 13th, 2017 07:00
Hi MikkM,
Check the state of the device in the OS itself. If the OS still believes the disk is up and running, then ScaleIO may also still be trying to use it.
You can check it by running "cat /sys/block/<device>/device/state". If it shows "running", the OS has not handled the disk failure properly. If it shows "offline", the OS has done its job of setting the disk offline so it sees no further use.
If ScaleIO is still trying to use the disk and it really is gone, my guess is the command above will show "running"; in that case the issue is between the ESX host and the SVM not handling the disk failure properly. You can also set the disk offline manually with this command:
echo offline > /sys/block/<device>/device/state
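For example, with the sdb device from your log excerpt (assuming it kept the same name after the pull), the sequence would look like this:

cat /sys/block/sdb/device/state                 # "running" means the OS has not failed the disk
echo offline > /sys/block/sdb/device/state      # stop the OS from issuing further I/O to it
cat /sys/block/sdb/device/state                 # should now report "offline"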
Let us know what the state is and if any of the above info helps.
Thanks,
Rick