July 30th, 2024 20:17
Dell 7875T NVMe FlexBay PCIe reset issues
I tried to raise this with support, providing all these exact details. Two of my tickets got cancelled, though, without any comms from Dell (more specifically: one got "Resolved" and one... disappeared from the UI). Using the chat I got as far as "please reinstall windows" 😂. I'll keep on trying to get some human support attention for my new $10k workstation, but in the meantime, I'm hoping to get some ideas here.
______________________________________________________________________________________________
Summary
On a brand new Dell 7875T workstation I am experiencing frequent PCIe resets for NVMe SSDs in the M.2 FlexBay. The issue is easily reproducible and it happens under both Windows and Linux, with every NVMe drive I tested.
Following various troubleshooting attempts (see below), this seems like a hardware problem to me. Hopefully it is just a bad motherboard or cable in my machine, and not an overall design fault, such as a PCIe cable that is too long.
Machine configuration
- Dell Precision 7875 Tower
- AMD Threadripper Pro 7985WX
- 1x 16GB DDR5-4800 ECC RAM
- AMD Radeon Pro W6400 4GB
- 1350W Platinum PSU
- Sandisk/WD SN740 256GB PCIe 3.0 x4 (“Class 35”)
- Windows 11 Pro
- BIOS 1.5.0
Problem description
Windows
With the Dell Windows environment, after a clean install from the OOB image and applying all the updates, the Event Viewer registers the following errors every few seconds:
Note the timestamps and frequency of the errors - they keep on continuously coming, every time I press F5 to refresh. There are two slightly different types of errors:
Microsoft-Windows-WHEA-Logger 17 None "A corrected hardware error has occurred. Component: PCI Express Endpoint Error Source: Generic Primary Bus:Device:Function: 0x6:0x0:0x0 Secondary Bus:Device:Function: 0x0:0x0:0x0 Primary Device Name:PCI\VEN_15B7&DEV_5015&SUBSYS_501515B7&REV_01 Secondary Device Name:PCI\VEN_1022&DEV_14A4&SUBSYS_0C611028&REV_01"
Microsoft-Windows-WHEA-Logger 17 None "A corrected hardware error has occurred. Component: PCI Express Root Port Error Source: Generic Primary Bus:Device:Function: 0x0:0x3:0x3 Secondary Bus:Device:Function: 0x6:0x8:0x0 Primary Device Name:PCI\VEN_1022&DEV_14A5&SUBSYS_0C611028&REV_01 Secondary Device Name:"
However, they are actually reporting the same event. The first error reports the reset from the NVMe SSD perspective (“PCI Express Endpoint”) and the second reports the same event from the PCIe chipset perspective (“PCI Express Root Port” connected to the FlexBay).
The devices listed are:
- 0000:00:00.0 / 1022:14A4 - Advanced Micro Devices, Inc. [AMD] Device Dell Device 0c61
- 0000:00:03.3 / 1022:14A5 - Advanced Micro Devices, Inc. [AMD] Device Dell Device 0c61
- 0000:06:00.0 / 15B7:5015 - Sandisk Corp PC SN740 NVMe SSD (DRAM-less)
SN740 SSD is the drive that the workstation has been delivered with.
(I attached full Windows logs to my original request, in various formats)
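For anyone who wants to correlate these events in bulk, here is a minimal Python sketch that pulls the component, bus:device.function and vendor:device ID out of a WHEA-Logger message. The regular expression is only derived from the two messages quoted above, and the function name `parse_whea` is made up for illustration; other WHEA event variants may be laid out differently.

```python
import re

# Field layout inferred from the two event texts quoted above (an assumption,
# not an official format description).
WHEA_RE = re.compile(
    r"Component:\s*(?P<component>.+?)\s+Error Source:.*?"
    r"Primary Bus:Device:Function:\s*"
    r"(?P<bdf>0x[0-9A-Fa-f]+:0x[0-9A-Fa-f]+:0x[0-9A-Fa-f]+).*?"
    r"Primary Device Name:\s*PCI\\VEN_(?P<vendor>[0-9A-Fa-f]{4})"
    r"&DEV_(?P<device>[0-9A-Fa-f]{4})",
    re.S,
)


def parse_whea(message):
    """Return component, Linux-style BDF and vendor:device ID, or None."""
    m = WHEA_RE.search(message)
    if m is None:
        return None
    # Windows prints each of bus, device and function as a separate hex number.
    bus, dev, fn = (int(part, 16) for part in m["bdf"].split(":"))
    return {
        "component": m["component"],
        "bdf": f"{bus:02x}:{dev:02x}.{fn:x}",
        "id": f"{m['vendor']}:{m['device']}".lower(),
    }
```

Fed the second message above, this yields the root port at 00:03.3 with ID 1022:14a5, which can be compared directly against `lspci` output on Linux.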
Linux
Very similar errors are happening on Linux (kernel 6.9.9-200.fc40.x86_64, up-to-date Fedora 40):
[230.737508] {8}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
[230.737518] {8}[Hardware Error]: It has been corrected by h/w and requires no further action
[230.737522] {8}[Hardware Error]: event severity: corrected
[230.737525] {8}[Hardware Error]: Error 0, type: corrected
[230.737529] {8}[Hardware Error]: section_type: PCIe error
[230.737531] {8}[Hardware Error]: port_type: 4, root port
[230.737533] {8}[Hardware Error]: version: 0.2
[230.737536] {8}[Hardware Error]: command: 0x0407, status: 0x0010
[230.737539] {8}[Hardware Error]: device_id: 0000:00:03.3
[230.737543] {8}[Hardware Error]: slot: 8
[230.737545] {8}[Hardware Error]: secondary_bus: 0x06
[230.737547] {8}[Hardware Error]: vendor_id: 0x1022, device_id: 0x14a5
[230.737550] {8}[Hardware Error]: class_code: 060400
[230.737552] {8}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0002
[230.737555] {8}[Hardware Error]: Error 1, type: corrected
[230.737558] {8}[Hardware Error]: section_type: PCIe error
[230.737560] {8}[Hardware Error]: port_type: 4, root port
[230.737562] {8}[Hardware Error]: version: 0.2
[230.737565] {8}[Hardware Error]: command: 0x0407, status: 0x0010
[230.737568] {8}[Hardware Error]: device_id: 0000:00:03.3
[230.737571] {8}[Hardware Error]: slot: 8
[230.737572] {8}[Hardware Error]: secondary_bus: 0x06
[230.737575] {8}[Hardware Error]: vendor_id: 0x1022, device_id: 0x14a5
[230.737577] {8}[Hardware Error]: class_code: 060400
[230.737579] {8}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0002
[230.737637] pcieport 0000:00:03.3: AER: aer_status: 0x00000080, aer_mask: 0x00000000
[230.737647] pcieport 0000:00:03.3: [ 7] BadDLLP
[230.737652] pcieport 0000:00:03.3: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
[230.737662] pcieport 0000:00:03.3: AER: aer_status: 0x00000080, aer_mask: 0x00000000
[230.737667] pcieport 0000:00:03.3: [ 7] BadDLLP
[230.737671] pcieport 0000:00:03.3: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
Note the same PCIe root port device ID as for Windows errors: 0000:00:03.3, 1022:14a5.
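The `aer_status: 0x00000080` value in the log is the AER Correctable Error Status register with only bit 7 set, which is what the kernel prints as `[ 7] BadDLLP`. As a rough illustration, the Python sketch below decodes that bitmask and counts the errors per device in a saved `dmesg` dump. The short names follow the ones the Linux kernel prints; the function names and the line pattern are my own assumptions, based only on the log excerpt above.

```python
import re
from collections import Counter

# Bit positions in the PCIe AER Correctable Error Status register,
# with the short names the Linux kernel uses when printing them.
AER_CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

# Pattern assumed from the "pcieport ... AER: aer_status: ..." lines above.
AER_LINE = re.compile(
    r"pcieport (?P<bdf>[0-9a-f:.]+): AER: aer_status: (?P<status>0x[0-9a-fA-F]+)"
)


def decode_status(status):
    """Return the names of all correctable-error bits set in `status`."""
    return [name for bit, name in sorted(AER_CORRECTABLE_BITS.items())
            if status & (1 << bit)]


def summarize(dmesg_text):
    """Count decoded correctable errors per (device, error name)."""
    counts = Counter()
    for m in AER_LINE.finditer(dmesg_text):
        for name in decode_status(int(m["status"], 16)):
            counts[(m["bdf"], name)] += 1
    return counts
```

Running `summarize()` over the output of `dmesg` before and after swapping a drive or slot gives a quick, comparable error count instead of eyeballing the log.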
Impact
The errors result in an occasional system reset - although infrequent and hard to reproduce.
Troubleshooting
System updates, reinstallation, etc.
The system has been reinstalled to a factory-default Dell Windows 11. The BIOS has been updated to the most recent version (1.5.0) and all the Windows Updates have been applied.
Any additional components that were not part of original delivery (extra RAM and SSDs) have been removed - all the troubleshooting was done in a factory-clean state.
Running Dell Diagnostics
Running BIOS Diagnostics gets stuck at “Testing Disk 1” and stays there, for at least 1h. Occasionally it might get stuck on “Testing FAN DDR 0” instead, but with the same “5 min 25 seconds remaining”.
The test cannot be aborted by pressing ESC; the machine has to be power-cycled.
Using different NVMe drives
I tried to understand the problem by trying different drives, with both Windows and Linux:
| Model | Relevant characteristics | Reproducible |
|---|---|---|
| Sandisk/WD SN740 256GB | M.2 2230, PCIe 3.0 x4 (factory default) | yes |
| Western Digital Red SN700 2TB | M.2 2280, PCIe 3.0 x4 | yes |
| Corsair MP600 Pro NH 8TB | M.2 2280, PCIe 4.0 x4 | yes |
| Western Digital Black SN770M 2TB | M.2 2230, PCIe 4.0 x4 | yes |
| Western Digital SN720 512GB | M.2 2280, PCIe 3.0 x4, OPAL encryption | yes |
| Toshiba RC100 256GB | M.2 2240, PCIe 2.0 x2 | yes |
The issue is reproducible on all these drives. It is somewhat less frequent on slow drives (PCIe3 ones and RC100 in particular), but all of them experienced it in some way.
Lowering PCIe bus speed
Following the above observation, I tried forcing a lower PCIe speed, both in BIOS setup and manually in Linux (using the setpci command). I tried to force both PCIe Gen3 and Gen2. That did not eliminate the issue.
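For reference, one common way to force a link speed with setpci is to write the Target Link Speed field (bits 3:0 of the Link Control 2 register, offset 0x30 in the PCI Express capability) and then set the Retrain Link bit (bit 5 of Link Control, offset 0x10) on the root port. The Python sketch below only builds the command strings and does not run anything. It illustrates the general technique and is not necessarily the exact commands used here; the helper name `setpci_force_gen` is hypothetical.

```python
# Target Link Speed encodings used in the Link Control 2 register.
TARGET_SPEED = {1: "2.5 GT/s", 2: "5 GT/s", 3: "8 GT/s", 4: "16 GT/s", 5: "32 GT/s"}


def setpci_force_gen(bdf, gen):
    """Build setpci commands that cap the link at PCIe `gen` and retrain it.

    `bdf` is the root port, e.g. "00:03.3". The commands must be run as root
    and are returned as strings so they can be reviewed before use.
    """
    if gen not in TARGET_SPEED:
        raise ValueError(f"unsupported PCIe generation: {gen}")
    return [
        # Link Control 2 (CAP_EXP+0x30): write bits 3:0 = target speed.
        f"setpci -s {bdf} CAP_EXP+30.w={gen:04x}:000f",
        # Link Control (CAP_EXP+0x10): set bit 5 to retrain the link.
        f"setpci -s {bdf} CAP_EXP+10.w=0020:0020",
    ]


if __name__ == "__main__":
    for cmd in setpci_force_gen("00:03.3", 2):
        print(cmd)
```

The `value:mask` form makes setpci change only the masked bits, so the other fields in both registers are left alone. After retraining, `lspci -vv` shows the negotiated speed in the `LnkSta` line.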
Using different FlexBay
I happen to have a spare DWPC700 M.2 FlexBay. I tried:
- Using just the SSD tray from there
- Replacing entire FlexBay module
Neither of these eliminated the issue.
Reseating PCIe cable
I tried reseating the PCIe cable that is connected to the FlexBay. That did not eliminate the issue.
The failure is specific to the FlexBay M.2 slot
Testing the onboard M.2 slots
After moving the NVMe SSD to the first M.2 slot on the motherboard:
- BIOS diagnostics fully pass within a few minutes
- There are no PCIe reset error messages in either Windows or Linux.
Additionally: through some of the above FlexBay testing, both onboard M.2 slots have been occupied with Corsair MP600 Pro NH 8TB SSDs - which never signaled problems either.
Testing a PCIe→M.2 adapter
Similarly, installing the NVMe SSD in a passive M.2 adapter and plugging it directly to a PCIe slot in the motherboard results in no issues. Both x16 and x8 slots were tested.
Booting the OS from FlexBay SATA drive
Installing an OS (both Windows and Linux) on a removable SATA drive works without issues.
Then, inserting a FlexBay M.2 NVMe SSD brings the errors back (coming from that secondary device).
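A slot-by-slot comparison like the one above can also be put into numbers on Linux. Recent kernels expose per-device AER counters in sysfs as `aer_dev_correctable`, one `Name count` pair per line. The Python sketch below reads that file and reports which counters moved between two snapshots. It assumes the kernel actually exposes and updates these counters for the port in question, which may not hold when errors are reported firmware-first via APEI as in the log above. The function names are illustrative only.

```python
from pathlib import Path


def read_aer_correctable(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read the correctable AER counters for a device such as "0000:00:03.3"."""
    path = Path(sysfs_root) / bdf / "aer_dev_correctable"
    counters = {}
    for line in path.read_text().splitlines():
        parts = line.split()
        # Expected shape: "<name> <count>", e.g. "BadDLLP 12".
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def counter_delta(before, after):
    """Return only the counters that changed between two snapshots."""
    return {
        name: count - before.get(name, 0)
        for name, count in after.items()
        if count != before.get(name, 0)
    }
```

Taking a snapshot, waiting a minute with the SSD in a given slot, and taking another gives an errors-per-minute figure that can be compared between the FlexBay, the onboard M.2 slots and a PCIe adapter.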
Conclusion
As the issue is reproducible across different operating systems and SSDs, and the only common factor is the use of the FlexBay, I assume this is either a problem with a specific piece of my hardware:
- Mainboard
- PCIe cable to FlexBay
- FlexBay assembly (not the tray, I tried swapping that)
or an actual design issue
- PCIe cable too long to handle an M.2 drive
- Bad PCIe training at boot, not taking cable length into account (BIOS?)
mufl0n
October 5th, 2024 09:34
Another month later: this is finally resolved.
My initial assessment was correct. The problem was the FlexBay NVMe backplane (the rear part of the FlexBay cage, P/N: "9GG5N ASSY Mechanical, Top, Rubber, 1S"). In hindsight, that was the most logical explanation: a faulty mainboard would likely have more issues than a single port, and a cable is not likely to produce periodic errors 😉
The P/Ns are a bit confusing, as, from the stickers, the parts involved were:
- CN-09GG6N-FCW00-3CL-P003-A00 (metal cage)
- CN-04TXHV-FCW00-3AO-00IF-A00 (backplane board)
But, nevertheless, 9GG5N included both – and the PCIe cable (CN-0DN0PJ-AP200-39S-003L-A00) as well. After replacing the backplane, the errors disappeared.
Overall the support experience was truly terrible. Tickets were cancelled. Support people were clueless. The technicians were competent, but they were given wrong parts - three times. Support repeatedly asked for "what error messages are you seeing". All that despite all above troubleshooting and pretty much telling them what to do. And all that taking over two months, for something that was a 5min fix. The timeline:
2024-07-15: New machine arrives, OS installed, issue spotted.
2024-07-21: After a week of troubleshooting, raised ticket #1, all above details attached (PDF, logs, etc).
2024-07-23: Ticket #1 gets marked as "Complete" without any comms.
2024-07-24: Raised ticket #2, this time including more description inline (+attachments again)
2024-07-24: Generic response for #2, acknowledging the problem as "Linux issue" 😏
2024-07-29: Ticket #2… disappears from the system.
2024-07-31: Raised ticket #3. This time in German (I’m in Switzerland)
2024-08-10: Email: "Ticket #1 is open" (it is not)
2024-08-17: Ticket #3… disappears from the system.
2024-09-02: Called the hotline. Looks like nobody looked at my inputs. Got a confirmation email, linked to ticket #2. To a wrong address (saw it only thanks to catch-all). Asked them to fix it. They promise a technician "this Thu/Fri".
2024-09-18: Tech #1 arrives.
- With… a Flexbay module 😲 (...that I already tried replacing myself and described that).
- Tech calls Dell. Question: "what error messages are you seeing?". Support says they haven’t seen any of my troubleshooting. Will send me an email and I can attach the documents there.
- Got the email. To a wrong address again. With yet different (none of above) ticket number linked. Replied, attached everything again.
2024-09-23: Next visit scheduled. Part list looks… suspicious:
- T2RP3 / JX33G - presumably mainboard (makes sense)
- JX33G, which is… the Flexbay module again! 🤦
2024-09-24: Tech #2 arrives. Indeed, with a new MB and another Flexbay module.
- Takes the new MB out of the package, looks at it "whoa… it’s clearly bent!"
- We agree to try replacing it, just to get a data point.
- MB replaced, Flexbay now completely dead. Front USB (where the bending was) dead. MB declared DOA.
2024-09-24: I follow up, reminding Dell that I told them twice what parts are needed.
2024-09-27: Next visit scheduled. Part list looks... suspicious:
- T2RP3 / JX33G - new mainboard (makes sense)
- R10HW ... power cable?
- MN4YT ... "PCIe holder"??
2024-10-02: Tech #3 arrives
- And indeed, has: new mainboard, PCIe power cable (??) and… the plastic holder for long GFX cards (!!) 🤦
- MB and power cable replaced. Machine is bootable again. Flexbay still dead.
- We notice that new PCIe power cable seems broken. Previous one swapped in. It works.
[ Machine is now back to original state - works, throws PCIe errors ]
- Tech calls Dell. "What error messages are you seeing?" 🤦. Support says they haven't seen any of my inputs.
- Tech reads out the exact part codes to the person on the phone, from stickers in the machine.
2024-10-04: Tech #4 arrives. Has the backplane assembly. Takes 5min to swap it, no errors ever since.
mufl0n
August 31st, 2024 13:11
One month later:
I bought the system on the premise of being able to use an M.2 drive in the FlexBay so, for now, I'm pretty much a proud owner of a 10'000 USD brick with a Dell logo.