1 Rookie
4 Posts
January 26th, 2023 21:00
Unable to monitor my drives' health
Hi,
About 3 years ago, I bought myself a great server (R820; 32 cores; 256 GB RAM) for my personal use. I bought it used from a place where they mix and match hardware from used servers to create custom ones. Now, my problem is that I am unable to monitor the health of my drives, either the virtual disk (Optimal, Degraded, Offline, ...) or the physical ones (Good, Contains SMART Errors, Offline, ...). I did not put much effort into it so far because the server was at my place and my data was on a tower server.
Now, I am about to deploy it in colocation in a data center, so network monitoring will be much more important. What is more, while getting ready for the move, I did a few attended reboots and noticed that the virtual drive was degraded. Also, 2 drives had SMART errors. Still, everything was flashing green... I replaced the worse of the two drives and 2 more are on their way. All of them were bought new at the same time I bought the server, but online from different providers.
So the details are:
R820 (but it has changed a lot since it shipped...)
BIOS version 2.7.0
Lifecycle Controller 2.65.65.65
iDrac Enterprise 7
ESXi 6.7 U3 (Build 17700523) ; customized by DellEMC
PERC H700 (firmware 12.10.3-0001)
Dell 64 Bit uEFI Diagnostics version 4247A1
Dell OS Driver Pack 15.10.03
All of that can be seen from the Drac's web interface. I have full admin control over everything. ESXi is managed from its Web UI and I can access its shell as root. I also have full access to the iDrac and can interact with it from its Web UI or the OpenManage iOS app on my iPhone.
In the iDrac, error RAC0501 is displayed in the Physical Disks Overview box, on the first page shown when I click on Storage.
When I click on Physical Disks in the menu below "Storage", I see RAC0503: There are no out-of-band capable controllers to be displayed. Check if the host system is powered off or shutdown.
Same when I click Virtual Disks one level below.
Same for Controllers and Enclosures.
When the Lifecycle controller tested the hardware, each and every drive returned error (EFI invalid parameters) 2000-0151.
All 8 drives are there (Seagate Savvio 900G), configured in a single RAID-10 logical drive. Disk 1 was faulty and is now replaced. Drive 6 shows SMART errors when I interrupt the boot and open the PERC's menu. That PERC menu sees all the drives without problem.
So, the PERC sees everything and can do it all.
The iDrac sees the PERC but cannot do much with it (it returns its model and firmware and that's it).
I already did a reset on the iDrac, without any progress. I tried to remove and re-insert the PERC. No gain either. I also installed the iDrac Service Module (3.4.1) in ESXi without getting more information that way either.
Any idea what I should do to probe my server for the detailed status of its virtual and physical drives? I do not mind whether it is from ESXi, SNMP, OpenManage or whatever else would do it. I just wish to have a way to monitor these drives.
Thanks in advance, and should I have missed some important details, do not hesitate to ask.
DELL-Charles R
Moderator
4.4K Posts
January 27th, 2023 08:00
Hello Heracles31,
It does look like there are compatibility issues.
In addition to the H700 and non-Dell drives that Erman mentions, I also see that ESXi 6.7 is not validated for the R820.
Your PowerEdge R820 supports these operating systems:
Up to VMware ESXi 6.5
https://dell.to/3wTeeZD
You can try installing a version of OpenManage Server Administrator to give you management of the drives, but note that I don't see a version that supports both the R820 and ESXi 6.7.
Dell EMC OpenManage Server Administrator vSphere Installation Bundle (VIB) for ESXi 6.5 U3, v9.4.0 (ESXi 6.7 not listed as supported OS)
https://dell.to/3je7KkD
Dell EMC OpenManage Server Administrator vSphere Installation Bundle (VIB) for ESXi 6.7 U3, v10.3.0.0 (R820 is not listed as compatible system)
https://dell.to/3Hc2upx
Maybe someone in the community has used this combination of hardware and OS and can share their experience.
Heracles31
1 Rookie
4 Posts
January 27th, 2023 14:00
It worked!
Thanks again! The OpenManage OVF for ESXi 6.7 U3 managed to probe the iDrac and confirmed that the virtual disk's state is Online as of now. It also gathered a lot of info about the physical drives, although I am not able to see that drive 6 has SMART errors right now. Still, being able to confirm that the virtual disk is Online is enough for me.
Indeed, this server is black magic, but although these components were not tested / approved in this combo, they are tolerant and well designed enough to still run well enough for me.
I will keep playing around with this and will try to get that info from ESXi and more.
A million thanks, and I really like Dell servers. They are a charm to work with!
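For anyone who wants to turn a status read-out like this into an alert, here is a minimal Python sketch. The state names (Optimal, Online, Good, Degraded, Offline, Contains SMART errors) are the ones mentioned in this thread; the "Name : State" report format, the function names and the sample text are assumptions made for illustration, not the actual output of OpenManage or the iDrac.

```python
# States treated as healthy; anything else (Degraded, Offline,
# "Contains SMART errors", or an unknown value) is reported as a problem.
HEALTHY_STATES = {"optimal", "online", "good"}


def parse_report(text):
    """Parse hypothetical 'Name : State' lines into a dict, skipping malformed lines."""
    states = {}
    for line in text.splitlines():
        name, sep, state = line.partition(":")
        if sep and name.strip() and state.strip():
            states[name.strip()] = state.strip()
    return states


def find_problems(states):
    """Return only the entries whose state is not in HEALTHY_STATES."""
    return {
        name: state
        for name, state in states.items()
        if state.lower() not in HEALTHY_STATES
    }


# Illustrative sample only; not real tool output.
SAMPLE_REPORT = """\
Virtual Disk 0 : Degraded
Physical Disk 1 : Good
Physical Disk 6 : Contains SMART errors
"""

problems = find_problems(parse_report(SAMPLE_REPORT))
```

Treating every unrecognised state as a problem is a deliberate choice here: a monitoring check that stays silent on a value it does not know is the "everything was flashing green" situation all over again.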
DELL-Erman O
Moderator
2.8K Posts
January 27th, 2023 02:00
Hi, two things draw my attention. First, the PERC H700 doesn't seem to be compatible with the R820. Second, I'm not sure whether the hard drives are Dell-certified. Some users do run non-compatible parts in their servers, but it is risky because the part's firmware can cause issues. When I read your post, I suspected that the PERC and the HDD firmware couldn't communicate properly.
I can suggest a few steps:
2. Disconnect all power cables and network cables from the server.
3. Hold down the power button continuously for at least 10 seconds.
4. Reconnect the power cables and network cables to the system.
5. Wait about 2 minutes before powering on the server to give the iDRAC time to initialize.
6. Power the system on.
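Instead of timing the 2-minute wait by hand, one option is to poll the iDRAC's web port until it answers. The Python sketch below assumes the iDRAC web interface listens on TCP port 443 and that an accepted connection is a reasonable sign it has finished initializing; the function name and the host name used in the note after it are placeholders.

```python
import socket
import time


def wait_for_port(host, port=443, timeout_s=180.0, interval_s=5.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            pass  # Not reachable yet (refused, timed out, unresolved name...).
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, `wait_for_port("idrac.example.lan")` (a placeholder address) would return `True` as soon as the port accepts a connection, or `False` after roughly 3 minutes without an answer.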
PERC Series Legacy Support Matrix
Heracles31
1 Rookie
4 Posts
January 27th, 2023 07:00
Hi,
First, thanks a lot for looking at my case. I had already tried the iDrac reset (from the WebUI instead of RACADM) without success, but I had not drained the flea power then. I did as you suggested (SSH and RACADM) and drained the flea power, but unfortunately with the same result: no out-of-band controller available.
During that time, my second drive with SMART errors failed and the virtual disk ended up degraded again. I removed it and did a few things before putting back the other end-of-life drive, which re-silvered despite its own failure. During that time, I looked in the OpenManage iOS app and the server showed as unhealthy / critical because of that missing drive. Before that, there was not a word about the degraded virtual disk.
You said that the communication problem looks to be between the PERC and its drives. If that is the case, why would the PERC have any problem reporting the status of its virtual drive? Also, why can the PERC communicate with these drives correctly when I am in its own menu during boot, yet not pass the info on to the iDrac? I thought that the problem was more between the iDrac and the PERC.
Still, you clearly know more about this than I do... Any other idea to try, or is it a lost cause?
In all cases, thanks again!
DELL-Charles R
Moderator
4.4K Posts
January 27th, 2023 10:00
Hello Heracles31,
Yes, that is the case for the service tag. That is personal information we don't like to expose on the web. The reason is that someone could get your service tag and try to contact Dell with it.
Heracles31
1 Rookie
4 Posts
January 27th, 2023 10:00
Hi,
Thanks for your input. I understand that this server has been built by dark magic and that, as such, I have ended up in a kind of no man's land. I will try the ESXi 6.7 U3 OpenManage kit, but should that one be unable to probe the drives, I will give up. Later on, I will at least look at replacing the PERC with a supported one, or get myself a completely new server, ensuring that every piece is compatible by design.
Thanks again for your support. I am still happy with my great Dell servers (T110, T130, T330 and R820)!
(Last question: I see that you removed the service tag I listed in my original post. Is that such sensitive information that it must not be published? What is so sensitive about it? Thanks...)