
1 Rookie • 29 Posts
December 18th, 2021 11:00

Dell R730 loses hard disk

Hi,

I have several Dell R730 servers that are "losing" hard disks. By that I mean the OS reports "disk reading error" and the controller puts the disks into a failed state. If I reboot, the controller reports foreign configuration data and I import it.

But if I pull the disks and run a SMART test, the disks appear to be perfectly healthy.
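For reference, this is roughly the check I run on each pulled disk. A minimal sketch, assuming smartmontools is installed; the device path is only an example:

# Query overall SMART health and start a short self-test with smartctl.
# Assumes smartmontools is installed; /dev/sdb is an example device path.
import subprocess

DEVICE = "/dev/sdb"  # replace with the disk under test

def run(cmd):
    # Run a command and return its combined output as text.
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout + result.stderr

# Overall SMART health assessment (PASSED/FAILED).
print(run(["smartctl", "-H", DEVICE]))

# Start a short offline self-test; the result appears later in the
# self-test log shown by "smartctl -a".
print(run(["smartctl", "-t", "short", DEVICE]))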

So it seems either the controller or the drive cage is misbehaving.

So the question is: which part do I need to repair?

Thanks,

Mario

1 Rookie • 29 Posts
April 9th, 2022 01:00

Replacing hardware did not help, and neither did installing all the updates.

So I went into the BIOS and changed a few options more or less at random:

- enabled logical processor idling

- disabled the SD card readers

Now the server is perfectly stable, and it has been a month.

I have other servers with the SD card readers enabled and they do not lose disks.

Perhaps saving the BIOS settings cleared out some stale data.

Mario

Moderator • 2.8K Posts
December 19th, 2021 23:00

Hello,
You can update the firmware first. You can find it here: https://dell.to/3pehRWZ

If you have downtime for the server, you can update the iDRAC, BIOS, and PERC firmware.
Then you can look at the SEL log to see whether there are drive-related errors or warnings such as predictive failure. If you have a backup, you can also try replacing parts to see whether the problem is with the hardware. If you have known good parts, you can work out which component is at fault by cross-testing, that is, by replacing the suspect parts with known good ones.
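If command-line access is easier, here is a rough sketch of pulling the SEL and staging a firmware package through the iDRAC with remote racadm. The address, credentials, and package filename below are placeholders, and exact syntax can vary by iDRAC version:

# Dump the SEL and stage a Dell Update Package through the iDRAC using
# remote racadm. Host, credentials, and the DUP filename are placeholders.
import subprocess

IDRAC = ["racadm", "-r", "192.168.0.120", "-u", "root", "-p", "calvin"]

def racadm(*args):
    # Run a remote racadm subcommand and return its text output.
    return subprocess.run(IDRAC + list(args), capture_output=True, text=True).stdout

# Print drive-related SEL entries (drive events typically mention "Drive"
# or carry PDR* message IDs).
for line in racadm("getsel").splitlines():
    if "Drive" in line or "PDR" in line:
        print(line)

# Stage a firmware Dell Update Package on the iDRAC (example filename).
print(racadm("update", "-f", "PERC_Firmware_Update.EXE"))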

Hope that helps!

1 Rookie • 29 Posts
December 20th, 2021 00:00

Thanks for the fast reply!

Yes, I have already updated the firmware and checked the disk health and predictive-failure status.

But my problem is that I have several old R730 servers and they "lose" disks at random.

Do I need to replace all the controllers? Or is it a drive cage problem?

Moderator • 2.8K Posts
December 20th, 2021 02:00

I think you are suspecting the HDD carriers. That is also a possibility if the drives do not seat in the backplane properly. I would check this with onsite troubleshooting; you can start by checking the latches.

1 Rookie • 29 Posts
December 21st, 2021 11:00

Is it possible that the "drive carrier" is the same thing as the "drive backplane"?

Thanks,

Mario

Moderator • 9.5K Posts
December 21st, 2021 12:00

Mgiammarco,

 

No, as they are two separate things. The carrier houses the hard drive, and the backplane is where the drive connects to the server.

 

Also, I would suggest updating the server completely to start, and could you confirm the specific error you're seeing?

 

 

3 Posts

December 22nd, 2021 04:00

Hello,

Apologies for the interjection, but I am having the same issue with an R710, and it has been driving me batty for about a month now. It is just the last 3.5" slot. The drive fails; two minutes later it is operating normally. Rinse and repeat every 10 minutes. Ubad, Ugood, online, rebuild, consistency check, all complete. I get home the next day: yellow LCD, yellow blink, same thing. Another oddity: each time it "fails" out of the array and comes back as foreign, it flip-flops between being seen as enclosure 0 slot 5 and just slot 5 with no enclosure. SMART self-tests, both short and extended, pass repeatedly. The full test in the SupportAssist bootable diagnostics passes. No errors in syslog. I have taken the drive out (there is no interface on these carriers), pulled the backplane, cleaned everything, and checked the cable routes and power supplies, all to no avail. The only idea I have left is to break the array and use the manufacturer's standalone diagnostic tools on another system, but that takes more than a few days on a 10 TB drive. Ideas? (And yes, all firmware has been updated, though I have never seen any for this backplane.)

Moderator • 3.8K Posts
December 22nd, 2021 09:00

Hello,

I'm starting this answer with this article that shows how to troubleshoot drives:

https://dell.to/3H4IzHr

Please check it and go through all the suggested steps.

In your case, @alksj460, if the diagnostic tests outside the OS pass, then there is probably a logical issue on the array.

Which controller is installed? You can run a consistency check from the BIOS of the controller or via OpenManage Server Administrator.
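If it helps, a minimal sketch of checking this from the OS with the OMSA command-line tools; controller=0 and vdisk=0 below are example IDs only, so confirm yours with omreport first:

# List controller/virtual-disk state and start a consistency check with
# the OMSA CLI. Assumes OMSA is installed; the IDs are examples.
import subprocess

def run(cmd):
    # Run an OMSA CLI command and print whatever it returns.
    out = subprocess.run(cmd, capture_output=True, text=True)
    print(out.stdout or out.stderr)

# Controller and virtual-disk status as OMSA reports it.
run(["omreport", "storage", "controller"])
run(["omreport", "storage", "vdisk", "controller=0"])

# Start a consistency check on the first virtual disk.
run(["omconfig", "storage", "vdisk", "action=checkconsistency",
     "controller=0", "vdisk=0"])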

You said that all firmware is updated; did you also check the hard drive firmware?

Thanks

Marco

 

 

3 Posts

December 23rd, 2021 04:00

It has a PERC H700 with 512 MB cache, battery backed. I have run the consistency checks and the rebuilds through both the boot-interrupt BIOS interface and the PERC CLI utility. As for the drive firmware, no, there have been no updates put out by the vendor. The drive in question is one of four in a RAID 5: two are matched part numbers, and the fourth is a newer-model RMA replacement from this past spring, at least a few months prior to the issue. Only slot 5 has the issue. I have NOT tried swapping two drives in the array around, because I do not remember whether it can rebuild from that without data loss. (This is a home media and whole-home backup server, so offloading 26 TB is going to take quite a while, and required an R730xd purchase.) And no data loss is the kicker! Even with this drive supposedly failing numerous times, sometimes filling the SEL in an hour or two, when I set it to good and reinsert it into the VD, it's consistent: patrol reads pass, smartctl under Linux passes all tests, Dell diagnostics pass, and the data is all there with no read or write degradation. No errors in the syslog; the SEL just says it failed, then operating normally. I have yet to get OMSA to run on this, and the iDRAC is not logging anything more descriptive. The PERC log has a tad more info, but it is also like trying to decipher a phone number out of the history of the universe. It is logging SENSE errors, though I believe these were during boot, mainly b/47/03 and 6/29/00. The b/47/03 looks to be either the cause of or the response to an internal device reset, though I am unsure as to the requestor. Log snippet below.

12/06/21 21:41:09: iopiEvent: EVENT_SAS_DEVICE_STATUS_CHANGE
12/06/21 21:41:09: DM_HandleDevStatusChgEvent: devHandle=x000a SASAdd=4433221102000000 TaskTag=xffff ASC=x00 ASCQ=x00 IOCLogInfo x31120303 IOCStatus x8000 ReasonCode x08 - INTERNAL_DEVICE_RESET
12/06/21 21:41:09: EVT#507548-12/06/21 21:41:09: 113=Unexpected sense: PD 05(e0xff/s5) Path 4433221102000000, CDB: 8e 00 00 00 00 04 8c 3f df d5 00 00 00 08 00 00, Sense: b/47/03
12/06/21 21:41:09: Raw Sense for PD 5: 70 00 0b 00 00 00 00 0a 00 00 00 00 47 03 00 00 00 00
12/06/21 21:41:09: DevId [5] Reduce Queue Depth recursive retry: maxQDepth 1 : maxDepthChanged 1 : curQDepth 1

12/06/21 21:41:09: iopiEvent: EVENT_SAS_DEVICE_STATUS_CHANGE
12/06/21 21:41:09: DM_HandleDevStatusChgEvent: devHandle=x000a SASAdd=4433221102000000 TaskTag=xffff ASC=x00 ASCQ=x00 IOCLogInfo x00000000 IOCStatus x0000 ReasonCode x0e - INTERNAL_DEVICE_RESET complete

12/06/21 21:41:09: iopiEvent: EVENT_SAS_DISCOVERY
12/06/21 21:41:09: DM_HandleDiscEvent: Discovery started on Port 0

12/06/21 21:41:09: iopiEvent: MPI2_EVENT_SAS_TOPOLOGY_CHANGE_LIST
12/06/21 21:41:09: DM_HandleTopologyChgEvnt: PhysicalPort=0 NumberOfPhys=x08 NumEntries=x01 StartPhy=x2
12/06/21 21:41:09: ExpStatus=x00 PhysicalPort=0 EnclosureHandle=x0001 Expander devHandle=x0000
12/06/21 21:41:09: Phy changed - phy 02 devHandle 000a linkRate aa curLinkRate a
12/06/21 21:41:10: iopiDiscoveryComplete SubSystem 2 Count 26 InitState 1

12/06/21 21:41:10: iopiEvent: EVENT_SAS_DISCOVERY
12/06/21 21:41:10: DM_HandleDiscEvent: Discovery Completed on Port 0

12/06/21 21:41:10: Disc-prog= 0....resetProg=0 aenCount=0 transit=0
12/06/21 21:41:10: EVT#507549-12/06/21 21:41:10: 113=Unexpected sense: PD 05(e0xff/s5) Path 4433221102000000, CDB: 00 00 00 00 00 00, Sense: 6/29/00
12/06/21 21:41:10: Raw Sense for PD 5: 70 00 06 00 00 00 00 0a 00 00 00 00 29 00 00 00 00 00
12/06/21 21:41:10: DM_REC: Timeout TUR complete DevId 5 rdm 807b8000 sts 2
12/06/21 21:41:10: MPT_REC: Protocol ERROR rdm 807b8000, devId 0005, devHandle 000a, Cmd 03, IoReply c053c238, iocSts 0047, IOCLogInfo 00000000
12/06/21 21:41:10: DM_REC: Aborting recovery SM DevID[5], pRdm 807b8000 RDM FLags=1800004 DevFlags f1400005

12/06/21 21:41:10: iopiEvent: EVENT_SAS_DEVICE_STATUS_CHANGE
12/06/21 21:41:10: DM_HandleDevStatusChgEvent: devHandle=x000a SASAdd=4433221102000000 TaskTag=xffff ASC=x00 ASCQ=x00 IOCLogInfo x31120303 IOCStatus x8000 ReasonCode x08 - INTERNAL_DEVICE_RESET
12/06/21 21:41:10: EVT#507550-12/06/21 21:41:10: 113=Unexpected sense: PD 05(e0xff/s5) Path 4433221102000000, CDB: 8e 00 00 00 00 04 8c 3f df d5 00 00 00 08 00 00, Sense: b/47/03
12/06/21 21:41:10: Raw Sense for PD 5: 70 00 0b 00 00 00 00 0a 00 00 00 00 47 03 00 00 00 00
12/06/21 21:41:10: DevId [5] Reduce Queue Depth recursive retry: maxQDepth 1 : maxDepthChanged 1 : curQDepth 1

12/06/21 21:41:10: iopiEvent: EVENT_SAS_DEVICE_STATUS_CHANGE
12/06/21 21:41:10: DM_HandleDevStatusChgEvent: devHandle=x000a SASAdd=4433221102000000 TaskTag=xffff ASC=x00 ASCQ=x00 IOCLogInfo x00000000 IOCStatus x0000 ReasonCode x0e - INTERNAL_DEVICE_RESET complete
12/06/21 21:41:10: EVT#507551-12/06/21 21:41:10: 113=Unexpected sense: PD 05(e0xff/s5) Path 4433221102000000, CDB: 00 00 00 00 00 00, Sense: 6/29/00
12/06/21 21:41:10: Raw Sense for PD 5: 70 00 06 00 00 00 00 0a 00 00 00 00 29 00 00 00 00 00
12/06/21 21:41:10: DM_REC: Timeout TUR complete DevId 5 rdm 807b8000 sts 2

12/06/21 21:41:11: iopiEvent: EVENT_SAS_DEVICE_STATUS_CHANGE
12/06/21 21:41:11: DM_HandleDevStatusChgEvent: devHandle=x000a SASAdd=4433221102000000 TaskTag=xffff ASC=x00 ASCQ=x00 IOCLogInfo x31120303 IOCStatus x8000 ReasonCode x08 - INTERNAL_DEVICE_RESET
12/06/21 21:41:11: EVT#507552-12/06/21 21:41:11: 113=Unexpected sense: PD 05(e0xff/s5) Path 4433221102000000, CDB: 8e 00 00 00 00 04 8c 3f df d5 00 00 00 08 00 00, Sense: b/47/03
12/06/21 21:41:11: Raw Sense for PD 5: 70 00 0b 00 00 00 00 0a 00 00 00 00 47 03 00 00 00 00
12/06/21 21:41:11: DevId [5] Reduce Queue Depth recursive retry: maxQDepth 1 : maxDepthChanged 1 : curQDepth 1

12/06/21 21:41:11: iopiEvent: EVENT_SAS_DEVICE_STATUS_CHANGE
12/06/21 21:41:11: DM_HandleDevStatusChgEvent: devHandle=x000a SASAdd=4433221102000000 TaskTag=xffff ASC=x00 ASCQ=x00 IOCLogInfo x00000000 IOCStatus x0000 ReasonCode x0e - INTERNAL_DEVICE_RESET complete

12/06/21 21:41:11: iopiEvent: EVENT_SAS_DISCOVERY
12/06/21 21:41:11: DM_HandleDiscEvent: Discovery started on Port 0

12/06/21 21:41:11: iopiEvent: MPI2_EVENT_SAS_TOPOLOGY_CHANGE_LIST
12/06/21 21:41:11: DM_HandleTopologyChgEvnt: PhysicalPort=0 NumberOfPhys=x08 NumEntries=x01 StartPhy=x2
12/06/21 21:41:11: ExpStatus=x00 PhysicalPort=0 EnclosureHandle=x0001 Expander devHandle=x0000
12/06/21 21:41:11: Phy changed - phy 02 devHandle 000a linkRate aa curLinkRate a
12/06/21 21:41:11: iopiDiscoveryComplete SubSystem 2 Count 27 InitState 1

12/06/21 21:41:11: iopiEvent: EVENT_SAS_DISCOVERY
12/06/21 21:41:11: DM_HandleDiscEvent: Discovery Completed on Port 0

12/06/21 21:41:11: Disc-prog= 0....resetProg=0 aenCount=0 transit=0
12/06/21 21:41:11: EVT#507553-12/06/21 21:41:11: 113=Unexpected sense: PD 05(e0xff/s5) Path 4433221102000000, CDB: 8e 00 00 00 00 04 8c 3f df d5 00 00 00 08 00 00, Sense: 6/29/00
12/06/21 21:41:11: Raw Sense for PD 5: 70 00 06 00 00 00 00 0a 00 00 00 00 29 00 00 00 00 00
12/06/21 21:41:11: Recovered Rdm 807b8000 Pd 5 Sense 6 asc 29 ascq 0
12/06/21 21:41:11: updateSectionsCallback: Error, pd=05, err=2
12/06/21 21:41:12: DevId [5] Restore Queue Depth to 40
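For what it's worth, those sense triples map to standard SCSI codes: b/47/03 is ABORTED COMMAND / INFORMATION UNIT iuCRC ERROR DETECTED, a link-level CRC problem between the controller and the drive (which points more toward cabling or backplane contact than the media), and 6/29/00 is UNIT ATTENTION / POWER ON, RESET, OR BUS DEVICE RESET OCCURRED, matching the internal device resets in the same log. A tiny decoder covering just these two codes:

# Map the sense triples seen in the log above to their standard SCSI meanings.
SENSE_KEYS = {0x6: "UNIT ATTENTION", 0xB: "ABORTED COMMAND"}
ASC_ASCQ = {
    (0x47, 0x03): "INFORMATION UNIT iuCRC ERROR DETECTED",
    (0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED",
}

def decode(triple):
    # "b/47/03" -> sense key 0xB, ASC 0x47, ASCQ 0x03
    key, asc, ascq = (int(x, 16) for x in triple.split("/"))
    return "%s: %s" % (SENSE_KEYS.get(key, hex(key)),
                       ASC_ASCQ.get((asc, ascq), "%#x/%#x" % (asc, ascq)))

print(decode("b/47/03"))  # link-level CRC error between controller and drive
print(decode("6/29/00"))  # unit attention following the internal device reset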

 

Moderator • 2.8K Posts
December 24th, 2021 00:00

Hi,

It can be caused by the firmware of the controller not being up to date. As far as I can see, there is no failed drive or predictive-failure drive. So it would be good to check the firmware first. If multiple VDs were created from the PERC BIOS interface and drives were removed, sometimes the PERC can get confused and send an invalid command because it cannot access them. Sense code warnings may also occur when non-certified drives are used. I would also like to share this wiki page for sense codes, which I think will be useful: https://dell.to/3qk2PhK
A firmware update of the hard drive may also be useful. You will see the same warning when you look through OMSA. You can also clear the SEL logs via iDRAC or OMSA, because SEL entries that are not cleared will keep being listed as reminders.
I would actually recommend iDRAC, BIOS, and PERC controller updates when you can get downtime. You can also check the drive firmware; you can access the firmware from here by entering your Service Tag: https://dell.to/3mxiKYJ
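If you prefer the command line, a small sketch of exporting and then clearing the SEL with remote racadm; the host and credentials are placeholders, and keep the export since clearing cannot be undone:

# Save a copy of the SEL, then clear it, via remote racadm.
# Host and credentials are placeholders for your iDRAC.
import subprocess

IDRAC = ["racadm", "-r", "192.168.0.120", "-u", "root", "-p", "calvin"]

# Export the current SEL to a local file before clearing it.
sel = subprocess.run(IDRAC + ["getsel"], capture_output=True, text=True).stdout
with open("sel_backup.txt", "w") as f:
    f.write(sel)

# Clear the System Event Log (not reversible).
subprocess.run(IDRAC + ["clrsel"])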

If you see the same unexpected sense warning, then please reseat the hard drive and the PERC cables.

Hope that helps!

1 Rookie • 29 Posts
December 24th, 2021 01:00

In my case, a disk gets marked as failed, and even the OS sees read errors.

I remove the disk, put it in another PC, and run tests (SMART and read tests), and they all come back OK.

Then I put the disk back in the R730 and it fails again. But if I put the disk in another slot, it does not fail anymore.

1 Rookie • 29 Posts
December 24th, 2021 02:00

Thank you, but it is probably difficult for me to explain because I am not a native English speaker:

- I manage SEVERAL Dell R730 servers

- it happens on more than one server and in more than one slot

- the firmware is updated

- sometimes swapping slots solves the problem, sometimes not

Should I try changing the controller, the drive cage, or something else?

Thanks again,

Mario

 

Moderator • 2.8K Posts
December 24th, 2021 02:00

Thanks for the update. It seems related to the slot having an issue.

Moderator • 2.8K Posts
December 24th, 2021 03:00

Hi, I would like you to take a look at the article that Marco shared before: https://dell.to/3JgyQ2H In this article, you can also choose the language on the right side. From the previous posts, I thought the same slot was giving the error. But if it sometimes gives an error on the same slot and sometimes it doesn't, then I wouldn't suspect the slot (backplane). I don't want to mislead you. What is the exact error you are seeing in the system logs in iDRAC or OMSA? Can you try unplugging the HDD cables and plugging them back in?

I'm not familiar with third-party test software, but if I saw a failed disk in the system logs, I would assume the hard drive has failed. In most cases, if there is a hardware issue, I would first replace the hard drives; then, if the issue continues, I would suspect the SAS chain parts (it could be the backplane, RAID controller, SAS cables, or backplane cables). If you have known good parts, you can swap the suspect part with a known good one to see which part is causing the problem.

3 Posts

January 1st, 2022 03:00

Mine is fixed, at least for the time being. I wound up taking the system hard down for the holidays. I pulled the drive and ran more tests that passed. I cleaned the system, moved around some slack in the backplane-to-PERC wiring, vacuumed the dust, and cleaned and regreased the heatsinks. After that it fired right up, with no gripes about failed, failing, or degraded drives, or even wanting a rebuild. Just the normal RAID battery fail, good, fail, good routine I have come to expect after it has been unplugged for a few days.

 

 
