1 Rookie

2 Posts

November 13th, 2024 18:01

Single unbalanced node on an H500 cluster

Greetings.

We have a 24-node Isilon H500 cluster running OneFS 8.2.1.0 with a total capacity of 3.8 PB.
Something very odd has happened, and we have yet to find a fix. A single node in the cluster is badly unbalanced: 23 nodes have disk usage of 52-54%, while one node is at 101% capacity.
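
To put the imbalance in numbers, here is a rough Python sketch that uses only the figures quoted above. It assumes all 24 nodes have equal capacity, that 3.8 PB is a decimal figure (3800 TB), and that 53% stands in for the 52-54% range; none of that is confirmed for this cluster.

```python
# Figures taken from the post; equal-sized nodes is an assumption.
TOTAL_CAPACITY_TB = 3800.0   # 3.8 PB, decimal units assumed
NODE_COUNT = 24
TYPICAL_USAGE = 0.53         # midpoint of the reported 52-54%
FULL_NODE_USAGE = 1.01       # the single node reported at 101%


def balanced_usage(typical, outlier, node_count):
    """Average usage if the data were spread evenly across equal-sized nodes."""
    return (typical * (node_count - 1) + outlier) / node_count


def data_to_move_tb(total_tb, node_count, outlier, target):
    """Amount of data the full node has to shed to reach the target usage."""
    per_node_tb = total_tb / node_count
    return (outlier - target) * per_node_tb


target = balanced_usage(TYPICAL_USAGE, FULL_NODE_USAGE, NODE_COUNT)
excess = data_to_move_tb(TOTAL_CAPACITY_TB, NODE_COUNT, FULL_NODE_USAGE, target)
print(f"balanced target: {target:.1%}")
print(f"data to move off the full node: {excess:.0f} TB")
```

Under those assumptions a fully balanced cluster would sit at about 55% per node, so the full node would need to shed roughly 73 TB.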

So far we've tried the AutoBalance and AutoBalanceLin jobs with no success. We've also run the MultiScan job, with no real effect.

We could use some wisdom on how to resolve this. Hardware-wise the cluster is in good condition. Everything is operational.

Moderator

9.2K Posts

November 14th, 2024 13:43

Hi,

Thanks for your question.

Are you doing a file or full array rebalance? Are there a few large files that are not balancing? Are there snapshots? What is the output of the following commands?

isi job reports list --job-type=autobalance

isi job reports list --job-type=autobalancelin

Let us know if you have any additional questions.

Moderator

227 Posts

December 4th, 2024 14:50

When was the last time a Collect or MultiScan job ran on this cluster?
One node can end up with a larger share of used capacity if it went into a read-only or offline state in the past and Collect or MultiScan has not run.
I'd start with Collect.
Mind you, it's a job that takes time.
Cheers,

1 Rookie

2 Posts

December 4th, 2024 15:09

Alright. Thanks, guys.

I can finally confirm that the problem is solved. It took some time to find the root cause, but it now seems to be fine and the cluster is slowly balancing.

Apparently that particular node ran for about 2 years with just a single 40 Gbit external interface up. The other one was down due to an issue with the DAC fiber cable. I believe this caused the node not to balance and to just fill up slowly over time.

After fixing that I ran MultiScan and AutoBalance, but the cluster only actually started balancing once I started the AutoBalanceLin job. The node is now at 78% and decreasing. The disks inside that node are also quite busy, at 99%, while the job is running.
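
For anyone repeating this, the job order described above can be written down as a small dry-run sketch in Python. It only prints command lines and does not execute anything. The `isi job jobs start <type>` form and the job type spellings are my assumption about the OneFS 8.x CLI, so check them against `isi job jobs start --help` on your own cluster before using them.

```python
import shlex

# Job order as described in this post; the spellings are assumed OneFS job type names.
JOB_SEQUENCE = ["MultiScan", "AutoBalance", "AutoBalanceLin"]


def start_job_command(job_type):
    """Build the argv for starting one Job Engine job (assumed CLI syntax)."""
    return ["isi", "job", "jobs", "start", job_type]


def plan(job_types):
    """Return one printable command line per job, in the given order."""
    return [shlex.join(start_job_command(job)) for job in job_types]


# Dry run: print the commands instead of running them.
for line in plan(JOB_SEQUENCE):
    print(line)
```

Keeping it as a printed plan lets each command be reviewed and run by hand, one job at a time.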
