R720 bay 12 (the 13th) has a higher CRC error rate than others

Short story:
I have concluded that there is a design issue with Dell R720 servers
that have 16 x 2.5" drive bays: bay 12 (the 13th) shows significantly
higher CRC error counts than all the others. You can check this with
the following command:

# for D in $(seq 0 15); do echo "=== Drive $D ==="; smartctl --all -d sat+megaraid,$D /dev/sdb | grep -E "Device Model|Firmware Version|^199"; done

Parameter 199 is the SAS/SATA bus CRC error count.

I have tried this on four R720 servers, bought at different times but all with the same drive configuration. On each of them, parameter 199 is zero or very low for every drive bay except bay 12 (the 13th; the first bay is bay 0).

Is this a known design issue with the R720s?
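To make the per-bay comparison easier to read, the check can be reduced to one line per bay showing only the raw value of parameter 199. This is a minimal sketch, not a tested tool: the function names `crc_raw` and `scan_bays` are made up for illustration, and it assumes the usual `smartctl -A` attribute table layout, where the raw value is the 10th whitespace-separated column.

```shell
#!/bin/sh
# Sketch: report the raw value of SMART attribute 199 for each bay.
# Assumes the standard `smartctl -A` table, with RAW_VALUE in column 10.

# Read `smartctl -A` output on stdin and print the raw value of the
# attribute whose ID is 199. Prints nothing if the attribute is absent.
crc_raw() {
    awk '$1 == 199 { print $10 }'
}

# Query bays 0..15 behind the MegaRAID controller, one line per bay.
# The device node defaults to /dev/sdb, as in the command above.
scan_bays() {
    dev="${1:-/dev/sdb}"
    for bay in $(seq 0 15); do
        raw=$(smartctl -A -d "sat+megaraid,$bay" "$dev" | crc_raw)
        printf 'bay %2d: %s\n' "$bay" "${raw:-n/a}"
    done
}
```

Run `scan_bays` as root (optionally with a different device node as its argument). Bays that report no parameter 199, for example empty ones, show up as `n/a`.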

Long story:

We have a bunch of Dell R720 servers and by and large have been very
satisfied. Yesterday I was setting some new servers up.

We leave bays 0 and 1 carrying the standard electromechanical 600GB drives as a mirrored pair, and bays 8 to 15 are fitted with SSDs (Samsung 840 Pro, 512GB) in their own RAID set.

The RAID controller is the following, according to "lspci | grep -i raid":

03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2008 [Falcon] (rev 03)

When setting up the virtual drive on one of the new servers, the controller decided it hated the drive in bay 12, and despite removing and reinserting the drive, deleting the virtual drive, etc., it wouldn't touch it.

Fortunately I had a spare drive, and that one was fine. On returning to the office, I checked the "bad" drive on a desktop PC and could see no reason for an issue, except for parameter 199. The drive is brand new, fresh out of the packaging, with a power-on hour count of only 1 and no obvious sign of problems, and a long (full) smartctl offline self-test passes without an issue.
