Unsolved

K

1 Rookie

 • 

1 Message

176

November 27th, 2024 13:41

Samsung NVMe 990 Pro in Dell Poweredge R760xs issue

Dear Dell users,


Over the last few months, PCIe NVMe SSDs have been vanishing from our PowerEdge R760xs.

Because we use the server only for our development team and not for public customers, we did not buy the original SSD solutions from Dell. Instead we purchased 2 × 4 TB Samsung 990 Pro NVMe SSDs and built a bootable software RAID 1 (mirror) with mdadm. Our OS is Debian 12 bookworm (LTS kernel 6.1.0).

After a few months without any problems, one of the SSDs in one specific slot began to disappear from time to time. This now happens about once a week, mostly at night during backups to our QNAP NAS via NFS.

In Linux it looks like this:
journalctl --since "2024-10-13 03:00:00" _KERNEL_SUBSYSTEM=nvme
Oct 13 03:22:53 volta kernel: nvme nvme1: I/O 729 (I/O Cmd) QID 3 timeout, aborting
Oct 13 03:22:53 volta kernel: nvme nvme1: I/O 321 (I/O Cmd) QID 5 timeout, aborting
Oct 13 03:22:53 volta kernel: nvme nvme1: I/O 322 (I/O Cmd) QID 5 timeout, aborting
Oct 13 03:22:53 volta kernel: nvme nvme1: I/O 323 (I/O Cmd) QID 5 timeout, aborting
Oct 13 03:22:53 volta kernel: nvme nvme1: I/O 914 (I/O Cmd) QID 6 timeout, aborting
Oct 13 03:22:53 volta kernel: nvme nvme1: I/O 915 (I/O Cmd) QID 6 timeout, aborting
Oct 13 03:22:53 volta kernel: nvme nvme1: I/O 919 (I/O Cmd) QID 9 timeout, aborting
Oct 13 03:22:53 volta kernel: nvme nvme1: I/O 920 (I/O Cmd) QID 9 timeout, aborting
Oct 13 03:22:53 volta kernel: nvme nvme1: Abort status: 0x0
Oct 13 03:22:53 volta kernel: nvme nvme1: I/O 932 (I/O Cmd) QID 9 timeout, aborting
Oct 13 03:22:53 volta kernel: nvme nvme1: Abort status: 0x0
Oct 13 03:22:53 volta kernel: nvme nvme1: Abort status: 0x0
Oct 13 03:22:53 volta kernel: nvme nvme1: I/O 41 (I/O Cmd) QID 10 timeout, aborting
Oct 13 03:22:53 volta kernel: nvme nvme1: I/O 296 (I/O Cmd) QID 12 timeout, aborting
Oct 13 03:23:23 volta kernel: nvme nvme1: I/O 321 QID 5 timeout, reset controller
Oct 13 03:23:54 volta kernel: nvme nvme1: I/O 4 QID 0 timeout, reset controller
Oct 13 03:24:45 volta kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
Oct 13 03:24:45 volta kernel: nvme nvme1: Abort status: 0x371
Oct 13 03:24:45 volta kernel: nvme nvme1: Abort status: 0x371
Oct 13 03:24:45 volta kernel: nvme nvme1: Abort status: 0x371
Oct 13 03:24:45 volta kernel: nvme nvme1: Abort status: 0x371
Oct 13 03:24:45 volta kernel: nvme nvme1: Abort status: 0x371
Oct 13 03:24:45 volta kernel: nvme nvme1: Abort status: 0x371
Oct 13 03:24:45 volta kernel: nvme nvme1: Abort status: 0x371
Oct 13 03:24:45 volta kernel: nvme nvme1: Abort status: 0x371
Oct 13 03:25:06 volta kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
Oct 13 03:25:06 volta kernel: nvme nvme1: Removing after probe failure status: -19
Oct 13 03:25:26 volta kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
Oct 15 03:20:12 volta kernel: nvme nvme1: I/O 14 (I/O Cmd) QID 1 timeout, aborting
Oct 15 03:20:12 volta kernel: nvme nvme1: I/O 15 (I/O Cmd) QID 1 timeout, aborting
Oct 15 03:20:12 volta kernel: nvme nvme1: I/O 3 (I/O Cmd) QID 6 timeout, aborting
Oct 15 03:20:12 volta kernel: nvme nvme1: I/O 4 (I/O Cmd) QID 6 timeout, aborting
Oct 15 03:20:12 volta kernel: nvme nvme1: I/O 5 (I/O Cmd) QID 6 timeout, aborting
Oct 15 03:20:12 volta kernel: nvme nvme1: I/O 6 (I/O Cmd) QID 6 timeout, aborting
Oct 15 03:20:12 volta kernel: nvme nvme1: I/O 7 (I/O Cmd) QID 6 timeout, aborting
Oct 15 03:20:12 volta kernel: nvme nvme1: I/O 8 (I/O Cmd) QID 6 timeout, aborting
Oct 15 03:20:12 volta kernel: nvme nvme1: I/O 14 QID 1 timeout, reset controller
Oct 15 03:20:12 volta kernel: nvme nvme1: I/O 6 QID 0 timeout, reset controller
Oct 15 03:20:12 volta kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
Oct 15 03:20:12 volta kernel: nvme nvme1: Abort status: 0x371
Oct 15 03:20:12 volta kernel: nvme nvme1: Abort status: 0x371
Oct 15 03:20:12 volta kernel: nvme nvme1: Abort status: 0x371
Oct 15 03:20:12 volta kernel: nvme nvme1: Abort status: 0x371
Oct 15 03:20:12 volta kernel: nvme nvme1: Abort status: 0x371
Oct 15 03:20:12 volta kernel: nvme nvme1: Abort status: 0x371
Oct 15 03:20:12 volta kernel: nvme nvme1: Abort status: 0x371
Oct 15 03:20:12 volta kernel: nvme nvme1: Abort status: 0x371
Oct 15 03:20:12 volta kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
Oct 15 03:20:12 volta kernel: nvme nvme1: Removing after probe failure status: -19
Oct 15 03:20:12 volta kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
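
Because each incident produces dozens of near-identical lines, it can help to condense the journal output into one count per day and device. The following is a minimal sketch, assuming the default short journalctl line format shown above; the function name `summarize` and the regular expression are illustrative and not part of any tool mentioned in this thread.

```python
import re
from collections import Counter

# Matches kernel NVMe lines in the short journalctl format, e.g.
# "Oct 13 03:22:53 volta kernel: nvme nvme1: I/O 729 (I/O Cmd) QID 3 timeout, aborting"
LINE_RE = re.compile(
    r"^(?P<day>\w{3}\s+\d+) \d\d:\d\d:\d\d \S+ kernel: "
    r"nvme (?P<dev>nvme\d+): (?P<msg>.*)$"
)


def summarize(lines):
    """Count I/O timeouts and controller removals per (day, device)."""
    timeouts = Counter()
    removals = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not an NVMe kernel message in the expected format
        key = (match["day"], match["dev"])
        msg = match["msg"]
        if "timeout" in msg:
            timeouts[key] += 1
        elif msg.startswith("Removing after probe failure"):
            removals[key] += 1
    return timeouts, removals
```

Feeding it the lines printed by the journalctl command above (for example via `sys.stdin`) gives one timeout count and one removal count per night, which makes it easier to compare kernels or slots over several weeks.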

We tried many things to get rid of the problem:

We replaced both NVMe SSDs with new ones: same thing
We bought new PCIe-to-NVMe adapters (a different brand): same thing
We swapped the SSDs completely from one slot to the other: the issue stayed on the SAME PCIe slot
We swapped only the adapters but kept the SSDs in their old slots: the issue was mostly on the same slot (but one incident was on the other slot)

We installed other kernels manually: with kernel 6.10.11 the problem remains. We are now testing 6.11.15 and have had no problems for a week so far. I pray!


We were unable to reproduce the issue by stressing the NVMe drives with hdparm -t and -T, so we have to wait days to find out whether a change has worked.

Dell's offer was to use SAS SSDs. The cost for 2 × 4 TB would be € 10,000 (yes, ten thousand!)
I am trying to find another solution (other PCIe adapters, WD Black instead of Samsung).
But the server is only 2U high (2 HE), so taller cards do not fit :-(

Does anybody have a hint?


Thanx


KlausBecker

1 Rookie

 • 

2 Posts

December 5th, 2024 08:54

Potential Solutions

1. BIOS Configuration Adjustments:

   - Disable VMD (Intel Volume Management Device): Some users have reported that disabling VMD in the BIOS helped stabilize their SSDs. This setting can sometimes interfere with NVMe detection.


   - Set NVMe Configuration to Gen4: Changing the NVMe settings from auto to Gen4 has resolved similar issues for others.

2. Power Management Settings:

   - Disable Hybrid Sleep: On Windows hosts, hybrid sleep can put SSDs into a low-power state, leading to detection issues. Disabling this feature in the advanced power settings might help. This does not apply to the Debian installation described above.


   - Adjust Activity LED Settings: Turning off the activity LED has also been noted as a fix by some users, as it may reduce unnecessary load on the SSD.

3. Firmware and Driver Updates:

   - Ensure that both your SSD firmware and system drivers are up to date. Samsung has released firmware updates addressing various issues with the 990 Pro series, so it’s worth checking if you have the latest version installed.

4. Hardware Checks:

   - Test Different Slots: Since you've already tried swapping slots, ensure that the slots themselves are functioning properly. Testing with another known working SSD in the same slot can help determine if the issue is specific to that slot or the SSD itself.

   - Check Power Supply Unit (PSU): In some cases, a faulty PSU can lead to inconsistent power delivery to components, causing them to disappear intermittently. If possible, test with a different PSU.

5. Alternative Storage Solutions:

   - If these solutions do not work and Dell has suggested SAS-SSDs, it may be worth considering this option despite the higher cost, especially if reliability is critical for your backups.

Community Insights
Many users have reported similar issues with Samsung 990 Pro SSDs disappearing under load or during specific operations like backups or gaming. The problem often appears related to firmware bugs or compatibility issues with certain motherboards and configurations. Engaging with Samsung support for potential RMA options could also be beneficial if the SSD is still under warranty.

Good luck!

2 Intern

 • 

623 Posts

February 9th, 2025 02:07

A similar issue, with a solution, was reported on a Linux forum (Arch Linux BBS, Kernel & Hardware: "[Solved] NVME disk dropping off"). The solution refers to the Arch Linux wiki (SSD - NVMe - Troubleshooting: 5.1 Controller failure due to broken APST support; 5.2 Controller failure due to broken suspend support). It isn't clear which kernel parameter from 5.1 or 5.2 resolved the issue, but it is likely a pointer in the right direction.
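
To see whether an APST-related parameter is already active on a given boot, the kernel command line can be inspected. The sketch below assumes the parameter in question is `nvme_core.default_ps_max_latency_us` (setting it to 0 disables APST in the Linux NVMe driver); the thread does not confirm that this is the parameter that fixed the linked case, and the helper names are hypothetical.

```python
# Assumption: this is the APST-related parameter meant by the wiki section;
# the thread itself does not name it.
APST_PARAM = "nvme_core.default_ps_max_latency_us"


def apst_latency_from_cmdline(cmdline):
    """Return the value given for APST_PARAM on a kernel command line, or None."""
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if name == APST_PARAM and sep:
            return value
    return None


def apst_disabled(cmdline):
    """True if the command line sets the maximum APST latency to 0."""
    return apst_latency_from_cmdline(cmdline) == "0"
```

On a running system the input would be the contents of `/proc/cmdline`, for example `apst_disabled(open("/proc/cmdline").read())`. The value currently in effect can also be read from `/sys/module/nvme_core/parameters/default_ps_max_latency_us`.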

1 Rookie

 • 

1 Message

April 16th, 2025 22:35

Hi, I had exactly the same problem on a Precision 3460 with a Samsung 990 Pro 2TB: the SSD disappeared from time to time. My recommendation would be to enable “Full Power Mode” (which prevents the SSD from going into a sleep or idle state) using the Samsung Magician software. Because I am running Proxmox, the SSD is not visible as a Samsung drive inside a VM, so in my situation the easiest solution was to remove the SSD from my server, install it in an ordinary Win11 PC with Samsung Magician, change the power settings, and then put the prepared SSD back into the Precision 3460. Works well so far 😊
