
1 Rookie • 5 Posts • August 25th, 2020 08:00

RAID group in DEG state, failover to hot spare not possible *and* all LUNs in faulted state while data is still accessible

Good afternoon,

I have two issues with a VNX5600. For some reason all of my LUNs in Pool 0 show the status Faulted, yet (luckily :-)) all of my data is still accessible. I believe this to be a stale state rather than a real fault, but I do not have a clear way to remedy it. I am thinking about rebooting the SPs one at a time to see if that clears the state.
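For reference, this is roughly how I have been checking the fault state from the CLI (a sketch using naviseccli; <SP_A_IP> is a placeholder for the SP A management address, and exact switches can vary by OE release):

naviseccli -h <SP_A_IP> faults -list        # faults as reported by the array
naviseccli -h <SP_A_IP> storagepool -list   # Pool 0 status, including any faulted state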

My second issue is more severe. One of my raid groups is in DEG state. We proactively replaced two disks in this RG, and on one of them the rebuild appears to have gotten stuck. We replaced that disk again and all disks now look healthy, but the RG remains in DEG state. Triage shows that one disk has 1 soft error, but when we select it and choose "copy to hotspare" the array replies "cannot copy to hotspare as the RG is in degraded mode". This feels like a chicken-and-egg situation, and I do not have a good plan of approach for remedying it.
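For completeness, the state I am describing can be pulled per raid group and per disk along these lines (a sketch; <rg_id> and <b_e_d> are placeholders for the raid group number and the disk position in bus_enclosure_disk form):

naviseccli -h <SP_A_IP> getrg <rg_id>                      # RG state, type and member disks
naviseccli -h <SP_A_IP> getdisk <b_e_d>                    # per-disk state and error counters
naviseccli -h <SP_A_IP> copytohotspare <b_e_d> -initiate   # the step that is being refused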

Looking forward to your insights!

Regards
Remco

Moderator • 7.6K Posts • August 25th, 2020 16:00

Hello remcoetten,

Are you seeing the fault on both SP management ports or just a single SP management port? What is the RAID level of your disk group, and how many drives are in it? When you replaced the drives the first time, did you replace both at the same time, or did you replace the first drive and wait for the rebuild to complete? What is the current OE on your VNX5600?
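If it helps with collecting those answers, both can be read from the CLI (a sketch; <SP_A_IP> and <rg_id> are placeholders):

naviseccli -h <SP_A_IP> getagent        # the Revision field shows the current OE
naviseccli -h <SP_A_IP> getrg <rg_id>   # RAID type and member drives of the disk group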

1 Rookie • 5 Posts • August 25th, 2020 23:00

Hi Sam,

My end customer replaced the disks, I believe, one by one. Both SPs are reporting the fault. See the output from Triage:

[MLU Information]
+------+-----+---------------+----------+------+-----+-----+-------+------------+-------+----------+--------+----------+-------+-------+---------+--------+
| Pool |  ID | Name          | Capacity | Type | Def | Cur | Alloc | Compressed | Dedup | Dedup    | Dedup  | Dedup %  |   No. |   VNX | Faulted | Public |
|      |     |               |          |      |     |     |       |            |       | State    | Status | Complete |    of |   SMP |         | State  |
|      |     |               |          |      |     |     |       |            |       |          |        |          | Tiers |   Lun |         |        |
|      |     |               |          |      |     |     |       |            |       |          |        |          |       | Count |         |        |
+------+-----+---------------+----------+------+-----+-----+-------+------------+-------+----------+--------+----------+-------+-------+---------+--------+
|    0 |   1 | LUN 1         |  8192 GB | DLU  | SPB | SPB | SPB   | No         | No    | Disabled | OK     | 0        |     2 |     0 | Yes     | Ready  |
|      |     | (DATASTORE01) |          |      |     |     |       |            |       |          |        |          |       |       |         |        |
|    0 |   2 | LUN 2         |  8192 GB | DLU  | SPA | SPA | SPA   | No         | No    | Disabled | OK     | 0        |     2 |     0 | Yes     | Ready  |
|      |     | (DATASTORE02) |          |      |     |     |       |            |       |          |        |          |       |       |         |        |
|    0 |   3 | LUN 3         |  8192 GB | DLU  | SPB | SPB | SPB   | No         | No    | Disabled | OK     | 0        |     2 |     0 | Yes     | Ready  |
|      |     | (DATASTORE03) |          |      |     |     |       |            |       |          |        |          |       |       |         |        |
|    0 |   4 | LUN 4         |  8192 GB | DLU  | SPA | SPA | SPA   | No         | No    | Disabled | OK     | 0        |     2 |     0 | Yes     | Ready  |
|      |     | (DATASTORE04) |          |      |     |     |       |            |       |          |        |          |       |       |         |        |
|    0 |   5 | LUN 5         |  8192 GB | DLU  | SPB | SPB | SPB   | No         | No    | Disabled | OK     | 0        |     2 |     0 | Yes     | Ready  |

As for the RG, please see below:

Raid                    Hard  Soft  PFA& Abort Remap  Xfer Tmout   Par   Bad Inval Recon Recov  Affected
Group   RgType  State  Media Media  Hdwr ByDev  Errs  Errs  Errs   ity  Blks Sects Sects ByDrv  Disks
97      r6      ENA        0     6     0     0     0     0     5     0     3     0     2     3  2.1.9  
103     r5      ENA        0     1     0     0     0     0     0     0     0     0     0     1  1.1.1  
104     r5      ENA        0     3     0     0     0     0     0     0     0     0     0     3  0.1.21 
105     r5      ENA        0     0     0     0     0     0     1     0     0     0     4     0  0.0.9   0.1.16 
106     r5      ENA        0     3     0     0     0     0     0     0     0     0     0     3  0.1.13  0.1.14 
107     r5      ENA        0     1     0     0     0     0     0     0     0     0     0     1  0.1.7  
109     r5      ENA        0    24     0     0     0     0     0     0     0     0     0    24  1.0.20  1.0.22  1.0.23 
110     r5      ENA        0    10     0     0     0     0     0     0     0     0     0    10  1.0.16  1.0.19 
111     r5      ENA        0     5     0     0     0     0     0     0     0     0     0     5  1.0.10  1.0.12 
112     r5      ENA        0    10     0     0     0     0     0     0     5     0     3     5  1.0.6  
113     r5      ENA        0    16     0     0     0     0     0     0     0     0     0    16  1.0.1   1.0.2  
115     r5      ENA        0     7     0     0     0     0     0     0     0     0     0     7  0.0.20  0.0.23 
117     r5      DEG        0     1     0     0     0     0     0     0     0     0     0     1  0.0.13 
119     r5      ENA        0     0     0     0     0     0     0     0     0     0    10     0  0.0.6  

I have Triage/spcollects available if need be.

Regards

Remco

Moderator • 7.6K Posts • August 26th, 2020 13:00

Hello remcoetten,

You can try rebooting the SPs one at a time to see if that clears the fault. Since the DG is RAID 5, if both drives that failed are part of that DG then you will need to restore from backup. If you look at the SP collects, check which drives failed and which DG they are part of. I suspect they are both part of the same DG, and that is why the rebuild stopped.
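A sketch of the one-at-a-time reboot (<SP_A_IP> and <SP_B_IP> are placeholders); always wait for the first SP to come fully back online, and for any trespassed LUNs to settle, before rebooting the peer:

naviseccli -h <SP_A_IP> rebootsp        # reboots the SP you are connected to
naviseccli -h <SP_A_IP> faults -list    # once SP A is back, confirm state before continuing
naviseccli -h <SP_B_IP> rebootsp        # then reboot the peer the same way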

1 Rookie • 5 Posts • August 26th, 2020 23:00

Hi Sam,

Thanks for confirming my idea about rebooting the SPs one at a time. I will schedule this.

With regard to the degraded RG, that is just it: there are no failed drives, and all drives report as optimal. So I have no clue how to remedy this situation.
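For completeness, this is how I checked the drives (a sketch; placeholders as before, and switches may differ slightly per OE release):

naviseccli -h <SP_A_IP> getdisk -state   # per-disk state; every member reports Enabled
naviseccli -h <SP_A_IP> getcrus          # hardware/enclosure view as a cross-check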

Regards, Remco

Moderator • 7.6K Posts • August 27th, 2020 08:00

Hello remcoetten,

I would reboot the SPs first and see if the faults are still present. If they are, I will need a fresh set of SP collects so that I can see what is going on.
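For reference, a fresh set can be generated and retrieved per SP roughly like this (a sketch; on some OE releases spcollect may need to be run with service credentials):

naviseccli -h <SP_A_IP> spcollect               # start SP collect generation on SP A
naviseccli -h <SP_B_IP> spcollect               # and on SP B
naviseccli -h <SP_A_IP> managefiles -list       # after a few minutes, list the resulting zip
naviseccli -h <SP_A_IP> managefiles -retrieve   # then pull it to your workstation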

1 Rookie • 5 Posts • August 27th, 2020 23:00

Hi Sam,

Thanks for helping. However, I'm a bit afraid of what could happen when rebooting the SPs with an RG in a degraded state. Can I send you a copy of the SPcollects first so you can have a look? This is a university, and I need to be as sure as possible that they do not sustain downtime.
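In the meantime I will capture the pre-reboot state so we can compare afterwards (a sketch; placeholders as before):

naviseccli -h <SP_A_IP> faults -list   # baseline of faults before the reboot
naviseccli -h <SP_A_IP> getlun         # record current LUN ownership to spot trespasses later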

Thanks!

Remco

Moderator • 7.6K Posts • August 28th, 2020 16:00

Hello remcoetten,

I will send you a private message so that you can send me the logs.
