1 Rookie

 • 

20 Posts

February 2nd, 2025 05:58

T630 powering off (even iDRAC)

I have been given an old T630 which has spent a bit of time in a light industrial environment, with some exposure to dust and heat. It works OK except that it intermittently turns itself off, and when it does nothing is recorded in the iDRAC logs.

The first time the problem happened I was using it and the backlight of the front panel LCD display started flickering and then it went off. Powering it off and on again didn't help. A week or so later I turned it on again and it was fine. Last night I was working on it and it was fine at about 2AM, but then at around 4AM it turned itself off and wasn't even accessible from iDRAC. Both PSUs had the green lights indicating that they were fine but the power button did nothing. At the time it was running BOINC on all CPU cores and the room was maybe a bit over 25C. But prior to the outage it didn't have the fans running at full speed so it wasn't really hot.

Based on other advice here I turned it off at the wall and held the power button for over 10 seconds, but that didn't fix it. I then left it off at the wall for more than 5 minutes, and when I turned it on again it worked correctly.

I think this is an ongoing issue and likely to recur and it may be correlated to heat. The system had previously run with 18 HDDs and CPU intensive work in situations that were hotter, so 4 SSDs and BOINC shouldn't have been a problem for it.

Is there a board that takes power from the 2 PSUs which could have broken? If so how do I determine that it's broken?

Moderator

 • 

5.1K Posts

February 3rd, 2025 00:39

Hello, thanks for choosing Dell.

The first thing I'd recommend is stripping the system down to the minimum configuration needed to POST. That's what we do when iDRAC can't be accessed.

Minimum configuration to POST

The components listed below are the minimum configuration to POST:

● One processor (CPU) in socket processor 1 

● One memory module (DIMM) in socket A1 

● One power supply unit 

● System board 

● Control panel

Let us know how it goes.

 

Respectfully,

3 Apprentice

 • 

1K Posts

February 3rd, 2025 15:49

I would agree with you that something is getting hot and shutting off the system. Let me ask a few questions:
1. Have you tried using compressed air to blow out the system board and power supplies since you received the system?
2. Assuming the unit has 2 power supplies, try running it on one to see if the problem follows a specific PSU.
3. This third one is a bit of a stretch on my part, but does this unit have one or two CPUs? If it has 2, you could try removing one CPU to see how it runs. It's possible that the thermal paste on the CPU(s) is going bad and not conducting heat to the heat sink, so the system is turning itself off. I'm not sure if the iDRAC would still be accessible if the CPU stopped working due to heat.

Rey
#Iwork4Dell

1 Rookie

 • 

20 Posts

February 4th, 2025 00:17

Young: I configured it with a single DIMM, no PCIe cards, both PSUs, and only 4 SSDs (none of which were installed when the problem first occurred) and it happened after about 8 hours running. I had a script running the "sensors" program to report temperatures every 30 seconds and here's the output showing that it was running at a reasonable temperature:

i350bb-pci-0100
Adapter: PCI adapter
loc1:         +47.0°C  (high = +120.0°C, crit = +110.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +49.0°C  (high = +75.0°C, crit = +85.0°C)
Core 0:        +47.0°C  (high = +75.0°C, crit = +85.0°C)
Core 1:        +45.0°C  (high = +75.0°C, crit = +85.0°C)
Core 2:        +46.0°C  (high = +75.0°C, crit = +85.0°C)
Core 3:        +44.0°C  (high = +75.0°C, crit = +85.0°C)
Core 4:        +45.0°C  (high = +75.0°C, crit = +85.0°C)
Core 5:        +43.0°C  (high = +75.0°C, crit = +85.0°C)
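For anyone wanting to repeat this kind of logging, here is a minimal Python sketch of a script that polls "sensors" every 30 seconds and appends the readings to a file. It is not the script used above. The function names, the log path and the regular expression are assumptions made for illustration, and the parser only understands temperature lines in the format shown in the output above. Each write is synced to disk because the machine loses power without warning, so anything still buffered would be lost.

```python
import os
import re
import subprocess
import time
from datetime import datetime

# Matches temperature lines such as "Core 0:        +47.0°C  (high = ...)".
# Lines like "Adapter: ISA adapter" do not match because no number follows.
TEMP_LINE = re.compile(r"^(?P<label>[^:]+):\s+\+?(?P<temp>-?\d+(?:\.\d+)?)°C")


def parse_temps(text):
    """Return {label: degrees C} for each temperature line in sensors output.

    If two adapters report the same label, the later reading wins.
    """
    temps = {}
    for line in text.splitlines():
        match = TEMP_LINE.match(line)
        if match:
            temps[match.group("label").strip()] = float(match.group("temp"))
    return temps


def log_forever(path="sensors.log", interval=30):
    """Append one timestamped reading to `path` every `interval` seconds."""
    while True:
        output = subprocess.run(
            ["sensors"], capture_output=True, text=True
        ).stdout
        with open(path, "a") as log:
            log.write(f"{datetime.now().isoformat()} {parse_temps(output)}\n")
            # Force the line to disk: the host may power off at any moment.
            log.flush()
            os.fsync(log.fileno())
        time.sleep(interval)
```

Calling log_forever() starts the loop. The last line in the file then shows the temperatures shortly before the power went away.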

1 Rookie

 • 

20 Posts

February 4th, 2025 00:43

Rey: 1) Good idea. I gave that a go but didn't find much dust in it. There's a whole section under the motherboard where a 0X7C1K could be installed which I haven't looked at yet. I want to do the tests that can be done without totally taking it apart first.

2) This problem first occurred when I was running it on one PSU, but I didn't record which PSU it was. I will do a further set of tests to find out if it occurs on a single PSU and whether both PSUs trigger it.

3) The system has 1 CPU and it's an E5-2620 v3, which is listed as having a "Typical TDP: 85 W". That is a long way from the biggest CPUs supported in that system. Also, when it first happened the CPU was almost idle.

I would really hope that iDRAC can keep going regardless of any CPU failure, that's really the point of it. As long as the CPU doesn't short-circuit the PSU or do anything else equally destructive the iDRAC should keep working. Past experience with Dell servers shows that they can run reliably with CPU temperature approaching the "high" level. I have not yet seen a Dell server exceed "high" or even have to run its fans at full speed. Past experience with Intel CPUs on white-box systems is that they run reliably when exceeding the "high" temperature for hours every day.

1 Rookie

 • 

20 Posts

February 4th, 2025 01:11

I just had it happen again with a single PSU connected. I went to look at it when it stopped responding to pings and it was at the boot screen warning that advanced ECC isn't supported with only a single DIMM. I pressed F1 to continue and then it went into the state where all lights are off and the power button does nothing. I plugged the power into the other PSU and then it started working, so I'm testing that now. If that PSU allows it to work for a couple of days then I'll swap the PSU into the other bay and see if it's a PSU problem or a bay problem.
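The PSU versus bay reasoning can be written down as a small decision function. This is only a sketch of the logic: the PSU labels "A" and "B", the bay numbers and the function name are made up for illustration. With an intermittent fault a run that passes is weak evidence, so the result is a hint about where to look next and not a verdict.

```python
def diagnose(results):
    """Interpret swap tests given as (psu, bay, failed) tuples.

    A fault "follows" a PSU (or a bay) when only one of them is ever
    involved in a failure and at least one other was also tested.
    """
    psus = {psu for psu, _, _ in results}
    bays = {bay for _, bay, _ in results}
    failing_psus = {psu for psu, _, failed in results if failed}
    failing_bays = {bay for _, bay, failed in results if failed}

    if not failing_psus:
        return "no failure observed"

    psu_suspect = len(failing_psus) == 1 and len(psus) > 1
    bay_suspect = len(failing_bays) == 1 and len(bays) > 1

    if psu_suspect and not bay_suspect:
        return f"fault follows PSU {next(iter(failing_psus))}"
    if bay_suspect and not psu_suspect:
        return f"fault follows bay {next(iter(failing_bays))}"
    if len(failing_psus) > 1 and len(failing_bays) > 1:
        # Every PSU and bay combination fails, so look at what they share.
        return "fault is common to both: suspect a shared component"
    return "inconclusive: swap the PSUs between bays and retest"
```

For example, diagnose([("A", 1, True), ("B", 2, True)]) points at a shared component, which is the case where neither swapping PSUs nor swapping bays makes the fault go away.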

1 Rookie

 • 

20 Posts

February 4th, 2025 01:18

This time it went down while I was writing the above post, didn't even run for long enough to complete one post. I went to it and it was totally off (power button not responsive) and then a few seconds later it powered on again and booted up (the system is configured to power on when mains power is restored after an outage).
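Since nothing lands in the iDRAC logs, the time of each outage has to be recorded from another machine. Below is a minimal Python sketch that pings the server at a fixed interval and prints a line whenever it goes down or comes back. The function names are hypothetical, and the "-c" and "-W" flags assume the Linux iputils version of ping, where "-W" is a timeout in seconds.

```python
import subprocess
import time
from datetime import datetime


def is_up(host):
    """Send one ICMP echo request and report whether a reply came back."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def transitions(samples):
    """Turn [(timestamp, up)] samples into [(timestamp, "down" or "up")].

    Only changes of state are reported, not every sample.
    """
    events = []
    previous = None
    for stamp, up in samples:
        if previous is not None and up != previous:
            events.append((stamp, "up" if up else "down"))
        previous = up
    return events


def watch(host, interval=30):
    """Poll `host` forever and print a timestamped line on each change."""
    previous = None
    while True:
        up = is_up(host)
        if previous is not None and up != previous:
            state = "up" if up else "down"
            print(f"{datetime.now().isoformat()} {host} is {state}", flush=True)
        previous = up
        time.sleep(interval)
```

Running watch() against the server's address from a second machine gives the time of each power-off and each recovery, accurate to within one polling interval. That makes it easy to tell how long the system stayed up between outages.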

1 Rookie

 • 

20 Posts

February 4th, 2025 03:25

It was running iDRAC 2.40.40.40, I upgraded it to 2.86.86.86 and will see if the problem still happens.

1 Rookie

 • 

20 Posts

February 4th, 2025 08:29

It still happens with the latest iDRAC firmware.

1 Rookie

 • 

20 Posts

February 15th, 2025 11:17

Could this be due to a faulty Power Backplane board (J14R7 / 0J14R7) or the Power Distribution Board cable (D47T0)?

There doesn't seem to be anything else that could do it.

1 Rookie

 • 

20 Posts

February 26th, 2025 09:02

I just replaced the Power Backplane board J14R7 0J14R7 and it's been running for just over 24 hours. As it was an intermittent fault I don't count this as proof that it is fixed (I'll wait another couple of days to be sure) but previously it never lasted 8 hours so it seems most likely.

The old Power Backplane had no puffy capacitors, no scorch marks, and no other indication that it was faulty. I also used all the same cables so it doesn't look like a cable issue.

Thanks for all the debugging advice to rule out other things. I'll mark it as closed in a couple of days if it doesn't recur.
