nwfx
1 Rookie
•
2 Posts
0
November 13th, 2024 18:01
Unbalanced single node on an H500 cluster
Greetings.
We have a 24-node Isilon H500 cluster running OneFS 8.2.1.0 with a total capacity of 3.8PB.
Something very odd has happened, and we're yet to find a fix: the cluster is badly unbalanced on a single node. 23 nodes sit at 52-54% disk usage, while one node is at 101% capacity.
So far we've tried the AutoBalance and AutoBalanceLin jobs without success, and we've also run the MultiScan job with no real effect.
We could use some wisdom on how to resolve this. Hardware-wise the cluster is in good condition and everything is operational; the event log simply reports "No Events found!"
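For reference, this is how the imbalance shows up for us (standard status view, output trimmed):

# Per-node health and capacity overview; one node shows ~101% used, the rest 52-54%
isi status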
DELL-Josh Cr
Moderator
•
9.2K Posts
0
November 14th, 2024 13:43
Hi,
Thanks for your question.
Are you doing a file or full array rebalance? Are there a few large files that are not balancing? Are there snapshots? What is the output of the following commands?

isi job reports list --job-type=autobalance
isi job reports list --job-type=autobalancelin
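If you want to rule out the snapshot angle, this will list them (output can be long, so trim as needed):

isi snapshot snapshots list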
Let us know if you have any additional questions.
DELL-Sheron G
Moderator
•
227 Posts
0
December 4th, 2024 14:50
When was the last time a Collect job or MultiScan ran on this cluster?
One node can pick up a larger amount of capacity if it went into a read-only or offline state in the past and Collect or MultiScan has not run since.
I'd start with Collect.
Mind you, it's a job that takes time.
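Something like this to kick it off and then check on it (reusing the report syntax from above):

# Start a Collect job; the job engine prints the ID it assigns
isi job jobs start Collect

# Review the job's status and history once it is going
isi job reports list --job-type=collect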
Cheers,
nwfx
1 Rookie
•
2 Posts
0
December 4th, 2024 15:09
Alright. Thanks, guys.
I can finally confirm that the problem is solved. It took some time to find the root cause, but things now look fine and the cluster is slowly rebalancing.
Apparently that particular node ran for about two years with only one of its two 40Gbit external interfaces up; the other was down due to an issue with the DAC cable. I believe this is what kept the node from balancing, so it just slowly filled up over time.
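For anyone hitting the same thing, this is roughly how we spotted the dead link (interface names will differ per cluster):

# Link state of every network interface on every node; our failed 40Gbit port showed as down here
isi network interfaces list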
After fixing that I ran MultiScan and AutoBalance, but the cluster only actually started balancing once I kicked off the AutoBalanceLin job. The node is now at 78% and falling. The disks in that node are also quite busy, at 99%, while the job is running.
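We're keeping an eye on the rebalance roughly like this (the node number below is just a placeholder; substitute your problem node's LNN):

# Running jobs and their state; AutoBalanceLin should show as Running
isi job jobs list

# Per-drive busy percentages on the node that is draining
isi statistics drive --nodes=24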