Unsolved

April 20th, 2017 16:00

ECS Node Failure caused an outage

For an eight-node setup, why would a single node failure cause an outage, meaning I/O stopped and no data was transferred?

It's a CloudPools setup from Isilon, and we are using HAProxy in front of the 8 nodes. When we simulated a single node failure by bringing down the public-data interface, I/O stopped on us. I do not understand why. After we brought the interface back up, it took a while for ECS to recognize that the node and its disks were up again.

I do not understand that behavior. It's not what we were expecting. We can't go live in Prod until this is resolved.

Any clue? Support fumbled around our system and didn't find anything.

April 21st, 2017 08:00

Also, have you verified that HAProxy will attempt to route the request to a different node? (This is related to Jason's question asking whether you are health checking on your HAProxy.)

Have you verified whether you can do I/O if you bypass HAProxy and send requests directly to one of the nodes that is still up? Or do you still have the I/O problem?
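A quick way to do the direct check is to probe each node's data port without going through HAProxy. The following is only a minimal sketch: the node addresses are hypothetical placeholders, and port 9020 is an assumption for the S3 HTTP data port, so substitute whatever your deployment actually uses. It only confirms that a TCP connection can be opened; it does not prove that object I/O succeeds.

```python
import socket

# Hypothetical node addresses; replace with the data IPs of your own nodes.
NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
# Assumption: the S3 HTTP data service listens on this port in your setup.
S3_PORT = 9020


def node_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def probe_all(nodes, port, timeout=2.0):
    """Probe every node directly and return a {host: reachable} mapping."""
    return {host: node_reachable(host, port, timeout) for host in nodes}


if __name__ == "__main__":
    for host, ok in probe_all(NODES, S3_PORT).items():
        print(f"{host}:{S3_PORT} {'reachable' if ok else 'UNREACHABLE'}")
```

If the surviving nodes answer here but I/O through HAProxy still stalls, that points at the proxy configuration rather than at the nodes themselves.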

April 21st, 2017 08:00

This is not expected behavior.  Are you health checking on your HAProxy?  Are you running network separation?

If you have the SR, I can look into it.
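For context on the health-checking question: when checks are enabled on an HAProxy server line, a backend is only marked down after a number of consecutive failed checks (fall) and marked up again after a number of consecutive successful checks (rise). The sketch below is a simplified Python model of that idea, not HAProxy's actual code; the class name is made up for illustration, and the defaults of rise=2 and fall=3 are assumed to match HAProxy's documented defaults.

```python
class ServerHealth:
    """Simplified model of rise/fall style health tracking for one backend."""

    def __init__(self, rise=2, fall=3, up=True):
        self.rise = rise      # consecutive passes needed to mark a down server up
        self.fall = fall      # consecutive failures needed to mark an up server down
        self.up = up          # current state as seen by the load balancer
        self._streak = 0      # consecutive results contradicting the current state

    def record(self, check_passed):
        """Feed one health-check result and return the resulting state."""
        if check_passed == self.up:
            # Result agrees with the current state, so any streak is reset.
            self._streak = 0
        else:
            self._streak += 1
            threshold = self.fall if self.up else self.rise
            if self._streak >= threshold:
                self.up = check_passed
                self._streak = 0
        return self.up


if __name__ == "__main__":
    health = ServerHealth()
    for result in [False, False, False, True, True]:
        state = "UP" if health.record(result) else "DOWN"
        print(f"check {'passed' if result else 'failed'} -> server {state}")
```

The point of the model is that, without checks configured, nothing ever flips the state, so the proxy keeps sending requests to a node that can no longer serve them.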

April 26th, 2017 13:00

Hi guys, I was able to engage our ECS SE, and the issue was with how we downed the interface. To simulate a node failure, we only ran ifconfig public.data down instead of ifconfig public down (the physical interface), and that created an event that caused I/O to stop on the ECS. ifconfig public down would isolate the node completely, and HAProxy should respond better once that node loses its heartbeat. To take a node down for an HA failover test, we can also put the node into maintenance mode. Our procedure for HA failover testing was wrong. Thank you for your responses.
