Help diagnosing random reboots on new (to me) hardware?

Question

I recently purchased a second-hand Dell PowerEdge R6525 to replace an R710. The hardware is part of a Proxmox cluster and, in theory, is working great. That said, I am getting random reboots multiple times a week, and I'm having a lot of trouble understanding the root cause. Proxmox is 8.4.1, but I've been having this issue since 8.2.*. On the host, a random reboot is logged, and it looks essentially like the power cord being yanked; it is NOT an OS-initiated reboot or crash:

May 03 11:44:11 sr66-prox-03 systemd[1]: user@0.service: Deactivated successfully.
May 03 11:44:11 sr66-prox-03 systemd[1]: Stopped user@0.service - User Manager for UID 0.
May 03 11:44:11 sr66-prox-03 systemd[1]: Stopping user-runtime-dir@0.service - User Runtime Directory /run/user/0...
May 03 11:44:11 sr66-prox-03 systemd[1]: run-user-0.mount: Deactivated successfully.
May 03 11:44:11 sr66-prox-03 systemd[1]: user-runtime-dir@0.service: Deactivated successfully.
May 03 11:44:11 sr66-prox-03 systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.
May 03 11:44:11 sr66-prox-03 systemd[1]: Removed slice user-0.slice - User Slice of UID 0.
May 03 11:44:11 sr66-prox-03 systemd[1]: user-0.slice: Consumed 2min 14.403s CPU time.
May 03 11:44:36 sr66-prox-03 snmpd[4018]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
May 03 11:45:01 sr66-prox-03 CRON[6331]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
May 03 11:45:01 sr66-prox-03 CRON[6332]: (root) CMD (for vm in $(/usr/sbin/qm list | awk '{print $1}' | grep -Eo '[0-9]{1,3}'); do if [ $(/usr/sbin/qm guest cmd $vm info 2>&1 | grep -e "not running" | wc -l) -eq 1 ]; then if [ $(/usr/sbin/qm config $vm | grep lock | wc -l) -eq 0 ]; then /usr/sbin/qm reset $vm; fi; fi; done)
May 03 11:45:09 sr66-prox-03 CRON[6331]: pam_unix(cron:session): session closed for user root
May 03 11:45:36 sr66-prox-03 snmpd[4018]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
May 03 11:45:50 sr66-prox-03 chronyd[4053]: Selected source 141.11.228.173 (2.debian.pool.ntp.org)
-- Reboot --
May 03 11:50:02 sr66-prox-03 kernel: Linux version 6.8.12-9-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-9 (2025-03-16T19:18Z) ()
May 03 11:50:02 sr66-prox-03 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-9-pve root=UUID=4fbd2c0b-dcd7-44d9-9139-495d8f107f19 ro quiet
May 03 11:50:02 sr66-prox-03 kernel: KERNEL supported cpus:
May 03 11:50:02 sr66-prox-03 kernel: Intel GenuineIntel
May 03 11:50:02 sr66-prox-03 kernel: AMD AuthenticAMD
May 03 11:50:02 sr66-prox-03 kernel: Hygon HygonGenuine
May 03 11:50:02 sr66-prox-03 kernel: Centaur CentaurHauls
May 03 11:50:02 sr66-prox-03 kernel: zhaoxin Shanghai
May 03 11:50:02 sr66-prox-03 kernel: BIOS-provided physical RAM map:
May 03 11:50:02 sr66-prox-03 kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000008efff] usable
May 03 11:50:02 sr66-prox-03 kernel: BIOS-e820: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS
May 03 11:50:02 sr66-prox-03 kernel: BIOS-e820: [mem 0x0000000000090000-0x000000000009ffff] usable
May 03 11:50:02 sr66-prox-03 kernel: BIOS-e820: [mem 0x0000000000100000-0x000000004d501fff] usable
May 03 11:50:02 sr66-prox-03 kernel: BIOS-e820: [mem 0x000000004d502000-0x000000005550afff] reserved
May 03 11:50:02 sr66-prox-03 kernel: BIOS-e820: [mem 0x000000005550b000-0x000000005a1cefff] usable
May 03 11:50:02 sr66-prox-03 kernel: BIOS-e820: [mem 0x000000005a1cf000-0x000000005a3cefff] reserved
May 03 11:50:02 sr66-prox-03 kernel: BIOS-e820: [mem 0x000000005a3cf000-0x0000000067acefff] usable
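
For anyone who wants to check the same thing on their own host, here is a minimal Python sketch that scans a saved journal excerpt like the one above for the "-- Reboot --" marker, reports the gap between the last entry before it and the first entry after it, and flags whether any orderly-shutdown message appeared just beforehand. The marker strings, the `find_reboots` name, the 20-line lookback, and the default year are all assumptions for illustration (the short journal format carries no year), so treat a "not clean" result as a hint rather than proof.

```python
import re
from datetime import datetime

# Heuristic strings that usually show up in the journal during an orderly
# shutdown or reboot. This list is an assumption and is not exhaustive.
CLEAN_SHUTDOWN_MARKERS = (
    "systemd-shutdown",
    "Reached target shutdown.target",
    "Reached target reboot.target",
    "System is rebooting",
    "System is powering down",
)

# Matches the "May 03 11:45:50 " prefix of journalctl's short output format.
TIMESTAMP = re.compile(r"^([A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2}) ")


def parse_ts(line, year):
    """Return the timestamp at the start of a journal line, or None."""
    match = TIMESTAMP.match(line)
    if not match:
        return None
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")


def find_reboots(lines, year=2025, lookback=20):
    """Describe every '-- Reboot --' marker found in a list of journal lines."""
    results = []
    for i, line in enumerate(lines):
        if line.strip() != "-- Reboot --":
            continue
        before = lines[max(0, i - lookback):i]
        after = lines[i + 1:]
        # Last timestamped entry before the marker, first one after it.
        last_ts = next(
            (t for t in (parse_ts(l, year) for l in reversed(before)) if t), None
        )
        first_ts = next((t for t in (parse_ts(l, year) for l in after) if t), None)
        clean = any(m in l for l in before for m in CLEAN_SHUTDOWN_MARKERS)
        gap = (first_ts - last_ts).total_seconds() if last_ts and first_ts else None
        results.append(
            {
                "last_entry": last_ts,
                "first_boot_entry": first_ts,
                "gap_seconds": gap,
                "clean_shutdown": clean,
            }
        )
    return results
```

Feeding it the lines around the marker above gives a gap of a bit over four minutes and no shutdown messages, which matches the "power cord yanked" picture.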

Below are the iDRAC logs corresponding to the reboot. There are no logs today until a "disk 1 in backplane 1 of raid controller in sl 8 was reset" event happens, then a system CPU reset happens, then the host iSM loses communication... then the host hard reboots, though the iDRAC doesn't show that. Immediately after the host stops logging, I get another backplane log.

The backplane log is pretty confident that it is normal, but I can't help noticing it every time the reboot happens; it also happens independently of the reboots (see the log on 5/2, with no reboot). I'm stumped, and it's very much becoming an issue, with 2-3 events happening a week.

I need some advice on how to further diagnose the issue. I've requested help from the Proxmox side here, but it seems the OS is not the root cause, so we're now looking for hardware issues and haven't managed to correlate anything properly:

https://forum.proxmox.com/threads/tips-for-diagnosing-the-cause-of-a-host-reboot.157806/

DELL-Marco B · Answer

Sure! Here are some quick steps to diagnose random reboots on your PowerEdge server:

Look for errors around reboot times, ensure everything is up to date.
Run Diagnostics: Use Dell SupportAssist or built-in diagnostics.
Check Hardware Connections: Reseat components like memory and CPUs.
Monitor Temperatures: Ensure components aren't overheating.
Power Supply: Test with a known good power supply.
Minimum to POST: Reduce to essential hardware and add back components one by one.

Thanks

surfrock66-Personal · Answer

These are very general and I've done most of them already:

Look for errors around reboot times, ensure everything is up to date.

- Logs provided, I'm not seeing any definitive errors. There's some correlation with a disk reset, but that also happens when there isn't a reboot and the message says it's a normal operation, so it feels like a distraction.

Run Diagnostics: Use Dell SupportAssist or built-in diagnostics.

- I will run this, though I'm not sure how it works on a Proxmox host.

Check Hardware Connections: Reseat components like memory and CPUs.
- Completed, no change

Monitor Temperatures: Ensure components aren't overheating.

- I monitor the systems with Zabbix, and they are well within range. They're in a dedicated server space with an air conditioner.

Power Supply: Test with a known good power supply.
- Completed. It has dual power supplies; both tested fine and were swapped out. No change.

Minimum to POST: Reduce to essential hardware and add back components one by one.
- Completed as far as possible; the only hardware is CPU, RAM, and a fiber card for 10G fiber. The OS can't boot without the fiber card.

Are there any other hardware clues here I can get? I'm going to pursue SupportAssist, but I would think the iDRAC logs would be enough for a clue. Should I pursue replacing the backplane, in case it's connected to the disk issue, even though I suspect it's a red herring?

DELL-Joey C · Answer

Hi,

We're not very familiar with Proxmox, so our troubleshooting might be limited. In most of the Proxmox posts I've replied to, the RAID controller had been flashed to IT mode; is that the case here? I don't know much about that. Also, was any of the R710 hardware installed in the R6525, or was just the data imported?

For the PDR87 error, the physical device is being reset. Perhaps you are right and it could be the backplane. Have you tried swapping the drive to other slots, or trying another spare drive?

Have you also confirmed that the server is running the latest firmware?

surfrock66-Personal · Answer

No, it has a PERC H345 Front, which can be configured for passthrough to support ZFS, so we aren't doing anything weird here.

I've ordered a new backplane for $150, and I'll try that. I did try swapping the drives around, but haven't noticed a trend.

The iDRAC auto-updates the firmware on the first Saturday of the month.

I am not sure how to run SupportAssist, as it shows as incompatible with Debian-based distros, and I'm not sure what other diagnostics to run.

Dell-Martin S · Answer

Hi,

You've already done a thorough hardware sanity check, which reduces the likelihood of simple issues like poorly seated components, power, or basic temperature concerns. The correlation between disk resets, CPU resets, and reboots suggests a possible hardware fault or firmware/hardware interaction, but pinpointing it is tricky.

Next steps for diagnosis:
1. Check for Hardware Faults via iDRAC
Firmware updates: Keep iDRAC firmware current; known issues are sometimes resolved this way.
iDRAC logs: Double-check for any critical hardware warnings or errors not evident in your current logs, including power supplies, CPUs, memory, and storage components.
Run thorough diagnostics:
Use Dell's Lifecycle Controller Diagnostics, accessible via iDRAC or during POST (F10), for thorough hardware testing.
Even if these tools are designed primarily for Windows, they often boot from diagnostics USBs or via iDRAC remote virtual media. You can create a Dell diagnostic ISO and run full hardware tests outside the OS.
2. Monitor for Hardware Errors & Correlations
Enable remote syslog forwarding from the iDRAC, or check iDRAC telemetry, to log hardware events over time.
Correlate these logs with your reboot times, especially before and after specific events like disk resets or backplane logs.
3. Disable or Bypass Certain Hardware Components
Remove the fiber NIC temporarily if possible or disable it, to rule out NIC-related issues in the reboot cycle.
If the server has IPMI or additional hardware management, monitor those metrics for anomalies.
4. Memory and CPU Testing
Run memtest86+ (bootable from USB) outside of Proxmox to test RAM thoroughly.
Consider swapping to known-good, compatible RAM if feasible.
CPU stress tests (e.g., stress-ng, or running a CPU burn-in tool) can help identify hardware faults, but these require a live OS.
5. Firmware & BIOS
Even though firmware updates are scheduled, double-check and manually verify the latest firmware for BIOS, iDRAC, PERC controller, and other components. Sometimes, new firmware releases fix rare hardware instability issues.
6. Check Power and Environment
Use a power quality monitor (e.g., a Kill A Watt or a more advanced device) to verify stable power delivery. None of the logs suggest power issues, but it's worth confirming if you haven't yet.
7. Further Hardware Replacement
Backplane: Considering you've ordered a new backplane, that’s a good move. Since disk resets can sometimes indicate backplane issues, replacing it might help.
Power Supplies: Swapping for known-good units, if not already done.
Motherboard/CPU: As a last resort, testing with a different CPU/motherboard or running the current setup in a different environment may clarify whether a hardware failure is the cause.
Additional notes:
If you can’t run Dell diagnostics directly (since they prefer their utility), creating a bootable diagnostics USB from Dell's official support page is often the easiest way.
Keep an eye on iDRAC logs for specific errors just before or after reboots.
Consider disabling all non-essential hardware one by one (e.g., NICs, PCIe cards) to see if stability improves.
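
To make the correlation step in item 2 concrete, here is a minimal Python sketch. It counts, for each event ID, how many occurrences fall within a time window of any reboot and how many occur in isolation. The `correlate` function, the 10-minute window, and the sample timestamps are hypothetical placeholders for illustration; real entries would come from an export of the iDRAC log and from the host journal, and PDR87 is simply the message ID mentioned earlier in this thread.

```python
from datetime import datetime, timedelta


def correlate(reboots, events, window=timedelta(minutes=10)):
    """Count, per event ID, occurrences near a reboot versus isolated ones.

    reboots: list of datetime objects (approximate reboot times)
    events:  list of (datetime, event_id) tuples from the hardware log
    """
    summary = {}
    for ts, event_id in events:
        # An event is "near" if it is within `window` of any known reboot.
        near = any(abs(ts - r) <= window for r in reboots)
        counts = summary.setdefault(event_id, {"near_reboot": 0, "isolated": 0})
        counts["near_reboot" if near else "isolated"] += 1
    return summary


# Placeholder data: the times below are made up to show the shape of the
# input and are not taken from any real iDRAC export.
reboots = [datetime(2025, 5, 3, 11, 46)]
events = [
    (datetime(2025, 5, 2, 9, 0), "PDR87"),    # reset on a day with no reboot
    (datetime(2025, 5, 3, 11, 45), "PDR87"),  # reset shortly before the reboot
    (datetime(2025, 5, 3, 11, 50), "PDR87"),  # reset shortly after the reboot
]

if __name__ == "__main__":
    for event_id, counts in correlate(reboots, events).items():
        print(event_id, counts)
```

If an event ID shows up near reboots far more often than in isolation over a few weeks of data, it is worth chasing; if the split is roughly even, it is more likely background noise.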

surfrock66-Personal · Answer

Rough update: I replaced the backplane and got another reboot. My monitoring system caught the node reboot at 5:35, and these were the logs from the iDRAC around that time; the reboot happens at the red arrow:

So, I get that disk backplane reset a lot, but the log message is pretty adamant that it's a normal part of operations, and I'm not clear which disk it even refers to, as I've swapped them all out. Between swapping disks and swapping the backplane, I'm not sure what else to check in that hardware chain, though there is a strong correlation: it happens right before a reboot.

I'm pretty frustrated; I don't have the resources to get new power supplies or MB/CPUs right now. The UPS shows clean power and all memtests were fine. Firmware updates are confirmed up to date.

DELL-Joey C · Answer

Hi,

You mentioned swapping the drives around and not noticing any trend. Does that mean it only occurs on slots 0 and 1? Out of curiosity, are the drives Dell drives? Is this server in production? Are you able to test this without booting to the OS, say by leaving it in the BIOS for a period, to check whether it's a hardware issue that causes the reboot?
