22 Posts
0
1843
February 7th, 2017 09:00
ScaleIO failed capacity
Hello, we have a problem with our ScaleIO cluster.
Failed capacity is reported when only a single SDS is down.
48830707 2017-02-07 17:55:20.649 SDS_DISCONNECTED ERROR SDS: devvirtp0024l00 (id: 1ce2de4e00000003) decoupled.
48830716 2017-02-07 17:55:21.650 MDM_DATA_FAILED CRITICAL The system is now in DATA FAILURE state. Some data is unavailable.
48830745 2017-02-07 17:55:23.490 SDS_RECONNECTED INFO SDS: devvirtp0024l00 (ID 1ce2de4e00000003) reconnected
48830770 2017-02-07 17:55:24.649 MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
The cluster is running 3 fault sets.
There is enough capacity:
SDS Summary:
Total 15 SDS Nodes
15 SDS nodes have membership state 'Joined'
15 SDS nodes have connection state 'Connected'
51.7 TB (52935 GB) total capacity
28.5 TB (29227 GB) unused capacity
201.9 GB (206752 MB) snapshots capacity
15.3 TB (15682 GB) in-use capacity
15.1 TB (15480 GB) thin capacity
15.3 TB (15682 GB) protected capacity
Any ideas?
Thanks,
Matas
pawelw1
306 Posts
1
February 8th, 2017 05:00
Hi Matas,
In a correctly configured, healthy cluster, I can't recall a situation when a single disconnected SDS would cause a DATA FAILURE state.
Can you please check in the GUI (or via CLI: scli --query_sds) whether any devices are in the Error state?
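For example, something along these lines (a rough sketch; --query_all_sds and the --sds_name selector are assumptions based on my install and may differ slightly between ScaleIO versions):

    # List all SDS nodes and their membership/connection state
    scli --query_all_sds

    # Details for the SDS that disconnected, including its devices and their states
    scli --query_sds --sds_name devvirtp0024l00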
Probably the best way to investigate this is through a Service Request, so we can view all the logs and see exactly what was going on - please open an SR and we'll take it from there.
Thank you,
Pawel
SanjeevMalhotra
138 Posts
0
February 7th, 2017 22:00
Was there any rebuild activity already going on when the SDS disconnected?
This can happen when some rebuild activity is already in progress and another SDS fails. In that case both copies of certain data may not be available.
The following two events show that even after the SDS reconnected, data was still in a degraded state:
48830745 2017-02-07 17:55:23.490 SDS_RECONNECTED INFO SDS: devvirtp0024l00 (ID 1ce2de4e00000003) reconnected
48830770 2017-02-07 17:55:24.649 MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
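Before repeating the test, it is worth confirming that nothing is in flight. A rough sketch, assuming a logged-in scli session and that the rebuild/rebalance and capacity counters appear in the --query_all output on your version:

    # One-shot check for in-flight rebuild/rebalance activity
    scli --query_all | grep -i -E 'rebuild|rebalance'

    # Capacity should be fully protected (no degraded or failed) before taking an SDS down
    scli --query_all | grep -i -E 'protected|degraded|failed'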
Matas1
22 Posts
1
February 7th, 2017 23:00
Hi, no rebuild or re-balance activity was noticed when the SDS was shut off.
Data was degraded after the SDS reconnected because I/O was happening and some data needed to be rebuilt. Also, the DRL was cleared, so a degraded state is normal after the SDS reconnects. But failed capacity is not expected and is unacceptable.
We tried it on 4 SDS nodes (each time the cluster was healthy) and every time failed data was detected.
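For reference, the state transitions during such a test can be watched with something like the following (a rough sketch; it assumes the degraded/failed/protected counters show up in --query_all on this version):

    # Poll the system state every 5 seconds while the SDS is down
    watch -n 5 "scli --query_all | grep -i -E 'degraded|failed|protected|rebuild'"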
Anything else?
Matas1
22 Posts
0
February 9th, 2017 04:00
Hello, it seems we were hitting a SW bug with checksum protection enabled. We disabled checksum protection, then tried maintenance mode and shutting an SDS off again, and it worked perfectly. So the root cause was checksum protection. Thank you.
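In case it helps anyone else: as far as I know, checksum protection is a per-storage-pool setting, so it can be checked from scli (a rough sketch; pd1 and sp1 are placeholder names, and the exact command for changing the mode differs between releases, so verify it with --help first):

    # Show the current settings for a storage pool, including its checksum mode
    scli --query_storage_pool --protection_domain_name pd1 --storage_pool_name sp1

    # Find the exact checksum-related commands available in your scli version
    scli --help | grep -i checksum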