Start a Conversation

Solved!

1 Rookie • 80 Posts

August 24th, 2024 00:43

Failure of primary VLT peer causes major outages (S4048-ON OS9)

We have a four-node Microsoft Failover Cluster, with each server equipped with a pair of NICs configured in a Switch Embedded Team (SET). Each NIC within the team is connected to one of the two peers in the VLT domain, with a single link per connection. The VLT domain connects to an "access" switch via a VLT port-channel with LACP, facilitating client access.

We have followed best practices and official documentation to ensure that SET and VLT are configured correctly. However, during fault/failure simulations, we consistently observe catastrophic outages affecting the cluster, but only when the tests are conducted against the "primary" VLT peer. These issues include nodes being dropped from the cluster, VMs failing, crashing, or entering a paused state, and Cluster Shared Volumes (CSVs) disconnecting.

For example, the following conditions will cause our cluster to enter a failed state and lose network connectivity for an unacceptable amount of time:

  • Reloading the primary VLT peer by pulling the power or by issuing the reload command
  • Administratively shutting down all server ports, VLT port-channel uplink and VLTi

The individual links to the servers fail over gracefully. Killing the VLTi on the secondary VLT peer also results in a graceful failover. Reloading the secondary VLT peer causes a graceful failover as well.

We expect each peer to handle failures similarly, but they clearly do not. We’re out of ideas... and almost out of drywall to bang our heads against. Any assistance would be greatly appreciated.

1 Rookie • 80 Posts

September 4th, 2024 20:31

We finally determined the cause of our issue - Spanning Tree.

Per Dell's documentation, we enabled RSTP to avoid loops during initial configuration. We thought that we had disabled it afterwards, but we either forgot to save the config or we were just mistaken. Anyway, RSTP was also enabled on our access switch.


What we were experiencing was spanning tree detecting a topology change when the primary peer failed. Affected ports would cycle through the RSTP discarding and learning states before returning to forwarding, and traffic was temporarily disrupted on the access switch's downlink to the surviving VLT peer. Since topology change notifications are distributed across the entire layer-2 network and spanning tree was enabled on the VLT peers, traffic was also temporarily disrupted on the surviving VLT peer at the ports attached to the cluster nodes. This explains the temporary interruption to both client access and cluster communications.
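For anyone hitting similar symptoms: the topology-change activity is observable on the switch itself. A hedged sketch of the OS9 show commands (exact output fields vary by firmware release):

```
! Summary of RSTP port roles and states on this switch
FTOS# show spanning-tree rstp brief

! Detailed view, including the count of topology changes seen
! and the port on which the last topology change was received
FTOS# show spanning-tree rstp
```

A steadily incrementing topology-change counter during failover tests would have pointed at spanning tree much earlier.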

We have two options moving forward:

  • Disable spanning tree on the access switch's downlink interfaces that make up the LACP LAG towards the VLT domain, and globally on both VLT peers; or
  • Keep spanning tree enabled and configure the interfaces attached to end stations as edge ports.

None of the folks here, including myself, are network engineers. This leaves us wondering what the best path is. My best educated guess is that since VLT is "loop-free by design" we can confidently disable spanning tree in the locations mentioned above. We would also have to ensure we don't inadvertently introduce loops in the future by making configuration or patching errors.
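For reference, a hedged sketch of both options in OS9 syntax (interface names are placeholders; verify the exact commands against the configuration guide for your release):

```
! Option 1: disable RSTP globally on each VLT peer
FTOS(conf)# protocol spanning-tree rstp
FTOS(conf-rstp)# disable

! Option 2: keep RSTP enabled, but declare server-facing ports
! as edge ports so they transition straight to forwarding and
! do not generate topology-change notifications when they flap
FTOS(conf)# interface tengigabitethernet 1/1
FTOS(conf-if-te-1/1)# spanning-tree rstp edge-port
```

Edge ports would also need to be configured on the access switch's server-facing interfaces if spanning tree stays enabled there.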

Closing remarks: while this is mostly a blunder on our part for not having as strong an understanding of networking as we should, I feel Dell's documentation could be improved. Even though we may have forgotten to disable spanning tree on the VLT peers, I wouldn't have thought that to be an issue, since the VLT section of the Dell docs doesn't appear to imply otherwise. It encourages you to turn spanning tree on, after all, and provides a configuration example that would lead you to believe you are following best practices.

There's also no mention of recommended spanning tree configuration for switches attached to the VLT domain (in the VLT section of the docs, anyway). STP's potential to cause traffic issues with VLT is mentioned in an entirely separate section. I would think it would be called out in the VLT section if it were a concern.

Moderator • 3.9K Posts

August 26th, 2024 01:30

Hello,

I'm going to merge the other post that you created into this thread, since it's the same issue.

My best recommendation is to contact support and raise a case so an engineer can analyze your environment, since you have mentioned that you followed the documentation. Here in the forum, we don't have the ability to analyze both the switch configurations and the logs. Moreover, a Windows engineer might also need to have a look at the cluster logs.

1 Rookie • 80 Posts

August 26th, 2024 01:35

We have a 4-node Windows failover cluster utilizing Switch Embedded Teaming (SET) and two Dell S4048-ON switches utilizing VLT, with one link from each NIC in the SET to each VLT peer. We have simulated multiple failure scenarios, and we only ever experience catastrophic results when the failure involves the primary VLT peer. Nodes drop from the cluster, cluster shared volumes lose connection, VMs crash...

If we reload the VLT peers by pulling the power or by issuing the reload command, our cluster folds if it was the primary VLT peer that was reloaded.

If we shutdown all the interfaces manually (including the VLTi), our cluster folds if that was done on the primary VLT peer.

We can reload and kill interfaces on the secondary peer all day without issue. 

Note: This comment was created from a merged conversation originally titled Primary VLT Peer failure causes catastrophic failure for Windows Cluster

1 Rookie • 80 Posts

August 26th, 2024 12:52

@DELL-Joey C​ Thanks for combining. Not sure what happened there. After I created the original post, I was presented with a 404 page. It didn't show up in my profile or anywhere on the forum, so I re-posted sometime later.

Moderator • 4.4K Posts

August 26th, 2024 14:23

Hello,

 

To verify; are you saying you followed this document?

Dell Configuration Guide for the S4048–ON System 9.14.2.5

Configure Virtual Link Trunking

https://dell.to/4e1G6hD

 

Make sure to review the Important Points to Remember and the Configuration Notes.

One point to call out:

Dell EMC Networking strongly recommends that the VLTi (VLT interconnect) be a static LAG and that you disable LACP on the VLTi.

 

I would recommend, as Joey did, to contact Support directly and an engineer can do a remote session to get a look with you.

The forum is not set up for that type of engagement.

1 Rookie • 80 Posts

August 26th, 2024 18:05

@DELL-Charles R​ Yes, that is correct. We followed the exact steps from pages 1007 and 1008:

  • Created port-channel 128
  • Assigned member interfaces
  • no switchport
  • no shutdown
  • Established the peering relationship with `peer-link port-channel 128`

Given that there were no configuration steps taken to enable LACP, we are under the impression that it is disabled. Is it implied that LACP is enabled by default when you establish the VLTi?
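For context, the steps described above amount to something like the following sketch (port numbers and domain ID are placeholders, not the actual configuration). In OS9, a port-channel whose members are assigned statically with `channel-member` is a static LAG; LACP only comes into play if `port-channel-protocol lacp` is configured on the member interfaces, so a VLTi built this way should indeed have LACP disabled:

```
! Physical members of the VLTi
FTOS(conf)# interface fortyGigE 1/49
FTOS(conf-if-fo-1/49)# no switchport
FTOS(conf-if-fo-1/49)# no shutdown

! Static LAG used as the VLT interconnect
FTOS(conf)# interface port-channel 128
FTOS(conf-if-po-128)# channel-member fortyGigE 1/49,1/50
FTOS(conf-if-po-128)# no shutdown

! VLT domain referencing the static LAG
FTOS(conf)# vlt domain 1
FTOS(conf-vlt-domain)# peer-link port-channel 128
```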


Moderator • 4.4K Posts

August 26th, 2024 20:44

Hello,

 

Try this KB article, #000118094; I think it will help:

How to set up Virtual Link Trunking (VLT) on Dell Networking OS9 Switches

https://dell.to/474oMGl

 

Is this a configuration that was working and now does not work?

     If so, do you know of anything that was done around the time the issue started?

 

 

1 Rookie • 80 Posts

August 26th, 2024 21:09

@DELL-Charles R​ No, technically it never worked as expected. Our initial testing led us to believe everything was ok, but we were only doing simple pings to the servers. It wasn't until we installed failover clustering that we noticed deeper issues. The steps we took are nearly identical to what was in that How-To doc btw. 

What appears to be happening is that cluster communication (heartbeat/hello messages) between the Windows Server nodes gets severed whenever the primary VLT peer goes down.

To confirm this, we set up an unused LOM adapter on a “Management” network that ran through a separate management switch in the rack and enabled cluster communication on it. The results were better after simulating failures on the primary VLT peer. We didn’t observe the major symptoms (nodes losing membership, crashing VMs…) from above, but the cluster still threw some warnings/errors.

It is my understanding that VLT ports on the secondary will be held down if the VLTi goes down while the backup heartbeat indicates that both peers are still active; this is to prevent loops. I would expect our cluster to behave the way it has been behaving if killing the VLTi were the only way we were simulating failures. However, we have also been pulling the power to simulate total primary peer loss, and traffic just does not want to flow over the secondary until a large amount of time has passed (more than 2 minutes).
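For anyone debugging similar behavior, the peer roles, VLTi/backup-link status, and hold-down timers can be inspected on each peer. A hedged sketch of the relevant OS9 commands (the `delay-restore` value shown is the commonly cited default, not a recommendation for this environment):

```
! Shows peer role (primary/secondary), VLTi and backup-link
! status, and the timers currently in effect
FTOS# show vlt brief

! delay-restore holds VLT ports down for this many seconds
! after a peer rejoins, while forwarding tables re-synchronize
FTOS(conf)# vlt domain 1
FTOS(conf-vlt-domain)# delay-restore 90
```

A long delay-restore would affect recovery after the failed peer returns, though not the initial failover itself.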

We have tried with and without static LAGs between the servers (because SET does not support LACP). We're beginning to think that Switch Embedded Teaming and VLT just do not play nice together.

Moderator • 3.9K Posts

August 27th, 2024 04:18

Hi,

 

One last thing to check: have you tried the NLB configuration? https://dell.to/4fXJg7N. Perhaps you have come across this article already.

 

 

1 Rookie • 80 Posts

August 27th, 2024 11:30

@DELL-Joey C​ I have not come across that yet! I will read through that this morning. Thanks!

1 Rookie • 80 Posts

August 27th, 2024 14:54

@DELL-Joey C​ After spending some time looking over that article and into Microsoft NLB with my team, we've determined that it is not applicable to our current environment. Microsoft Failover Clustering and NLB clustering are not synonymous, and the two cannot be configured together; it is one or the other. Regardless, we did try to configure the switch for NLB per the doc you sent, but our results were the same.

Moderator • 3.9K Posts

August 28th, 2024 01:39

Hi,

 

Thanks for the feedback. I think it is time to contact support to analyze the configuration based on your environment settings. They might need both network and Microsoft experts to look into the deployment.
