Unsolved
1 Rookie
•
23 Posts
0
643
December 28th, 2023 14:53
VLT split-brain after OS10 upgrade on one S5248F node?
Hi all,
I have an almost unbelievable problem.
We finally got rid of the broadcast storm issues (https://www.dell.com/community/en/conversations/networking-general/cpu-load-of-basenas-service-in-os10/647f9638f4ccf8a8de8f5bc6) and can resume normal core updates.
The firmware upgrade was easy, but I ran into a problem while trying to upgrade OS10 from 10.5.2.9 to 10.5.5.7:
I followed all the manuals I was able to find, plus the usual "how to upgrade a VLT deployment with minimal loss of traffic" procedure. When the node that had been secondary until then reloads into 10.5.5.7, it declares itself primary (alongside the existing primary) and knocks down most of the network in the facility.
The problem is easy to work around by reloading back into 10.5.2.9, but that blocks the upgrade.
Am I doing something terribly wrong here? The Dell SmartFabric OS10 Installation, Upgrade, and Downgrade Guide clearly states that this is a supported upgrade.
I suspect a change in commands between versions (not sure why Dell does this from time to time...), but I wasn't able to find anything in the release notes going all the way back to 10.5.3.
The VLT config is very basic:
vlt-domain 1
 backup destination 172.16.20.1
 discovery-interface ethernet1/1/49-1/1/52
 peer-routing
 primary-priority 65535
 vlt-mac 00:00:00:00:00:02
PS:
While the split-brain state was in effect, logs from other systems show that at least some trunks and at least some orphaned ports were down, but there is also evidence that at least some trunks stayed up (access switch to access switch), though certainly not all.
The ESXi servers also seem to have logged redundancy loss rather than connection loss (which was very fortunate for us, since iSCSI storage is used).
Log from node 2, which was not reloaded at the time (it became primary on the first reload of node 1):
%IFM_OSTATE_DN: Interface operational state is down :ethernet1/1/52 - moment of reloading node 1
%IFM_OSTATE_DN: Interface operational state is down :ethernet1/1/51
%IFM_OSTATE_DN: Interface operational state is down :ethernet1/1/49
%IFM_OSTATE_DN: Interface operational state is down :ethernet1/1/50
%IFM_OSTATE_DN: Interface operational state is down :port-channel1000
%IFM_OSTATE_DN: Interface operational state is down :vlan4094
%STP_ROOT_CHANGE: STP:Root Brg Chg RSTP root changed.
%STP_ROOT_CHANGE: STP:Root Brg Chg My ID:* OldRt:* NewRt:* - obscuring MACs
%IFM_OSTATE_DN: Interface operational state is down :mgmt1/1/1
%IFM_OSTATE_UP: Interface operational state is up :mgmt1/1/1
%VLT_VLTi_LINK_DOWN: VLT interconnect link between unit 2 and unit 1 is down
%VLT_PEER_DOWN: VLT unit 1 is down
%IFM_OSTATE_DN: Interface operational state is down :mgmt1/1/1
%IFM_OSTATE_UP: Interface operational state is up :mgmt1/1/1
%VLT_ELECTION_ROLE: VLT unit 2 role transitioned from secondary to primary
%IFM_OSTATE_DN: Interface operational state is down :mgmt1/1/1
%IFM_OSTATE_UP: Interface operational state is up :mgmt1/1/1
%IFM_OSTATE_DN: Interface operational state is down :mgmt1/1/1
%IFM_OSTATE_UP: Interface operational state is up :mgmt1/1/1
%IFM_OSTATE_DN: Interface operational state is down :mgmt1/1/1
%IFM_OSTATE_UP: Interface operational state is up :mgmt1/1/1
%VLT_HB_UP: VLT peer heartbeat link is up - node 1 came back but as primary in evident split-brain state and then one more reload is attempted after about 20 min
%IFM_OSTATE_DN: Interface operational state is down :mgmt1/1/1
%IFM_OSTATE_UP: Interface operational state is up :mgmt1/1/1
%IFM_OSTATE_DN: Interface operational state is down :mgmt1/1/1
%IFM_OSTATE_UP: Interface operational state is up :mgmt1/1/1
%IFM_OSTATE_DN: Interface operational state is down :mgmt1/1/1
%IFM_OSTATE_UP: Interface operational state is up :mgmt1/1/1
%IFM_OSTATE_DN: Interface operational state is down :mgmt1/1/1
%IFM_OSTATE_UP: Interface operational state is up :mgmt1/1/1
%VLT_HB_DOWN: VLT peer heartbeat link is down
%IFM_OSTATE_DN: Interface operational state is down :mgmt1/1/1
%IFM_OSTATE_UP: Interface operational state is up :mgmt1/1/1
%IFM_OSTATE_UP: Interface operational state is up :ethernet1/1/52
%IFM_OSTATE_UP: Interface operational state is up :ethernet1/1/51
%IFM_OSTATE_UP: Interface operational state is up :ethernet1/1/49
%IFM_OSTATE_UP: Interface operational state is up :ethernet1/1/50
%IFM_OSTATE_UP: Interface operational state is up :port-channel1000
%IFM_OSTATE_UP: Interface operational state is up :vlan4094 - node 1 reloaded into old NOS version
%VLT_PEER_UP: VLT unit 1 is up
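For anyone reading a log like this, the VLT events are easy to lose among the mgmt1/1/1 flaps. A minimal Python sketch for pulling them out is below. It assumes only the "%TAG: message" shape visible in the log above; the function names and the shortened sample string are made up for illustration and are not part of any Dell tooling.

```python
import re

# Matches "%TAG: message"; the message runs until the next "%" or the end of the line.
EVENT_RE = re.compile(r"%([A-Za-z_]+): ([^%\n]*)")


def split_events(raw: str) -> list[tuple[str, str]]:
    """Split raw log text (fused or one event per line) into (tag, message) pairs."""
    return [(tag, msg.strip()) for tag, msg in EVENT_RE.findall(raw)]


def vlt_events(raw: str) -> list[tuple[str, str]]:
    """Keep only the VLT_* events, in their original order."""
    return [(tag, msg) for tag, msg in split_events(raw) if tag.startswith("VLT_")]


# Shortened sample built from events that appear in the log above.
sample = (
    "%IFM_OSTATE_DN: Interface operational state is down :vlan4094"
    "%VLT_VLTi_LINK_DOWN: VLT interconnect link between unit 2 and unit 1 is down"
    "%VLT_PEER_DOWN: VLT unit 1 is down"
    "%VLT_ELECTION_ROLE: VLT unit 2 role transitioned from secondary to primary"
    "%VLT_HB_UP: VLT peer heartbeat link is up"
)

for tag, msg in vlt_events(sample):
    print(f"{tag:22} {msg}")
```

Filtering this way makes the sequence easier to follow: VLTi down, peer down, role change on node 2, then the heartbeat coming back while the VLTi is still down.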
DELL-Joey C
Moderator
•
3.9K Posts
0
December 29th, 2023 09:34
Hi,
I'm not that good at networking, but I'll try to help out. Ultimately, if we're unable to locate the issue, or if it turns out to be a firmware issue, you will need to contact the support line to log a case.
I looked up the user guide for 10.5.5; it says to make sure both peer devices are running the same firmware (page 1819: https://dell.to/3NJ4g5z). Can you check on that?
(edited)
NenadBosnjak
1 Rookie
•
23 Posts
0
December 29th, 2023 10:43
Thanks, but both nodes were on the latest firmware (-26) before I attempted the NOS update.
I cannot open a support ticket because both nodes are out of support (a much longer story).
I'm currently trying the latest 10.5.3, because I suspect this might be related to Dell's silly decision to change the default MTU for OS10 in 10.5.4.4.
NenadBosnjak
1 Rookie
•
23 Posts
0
December 29th, 2023 11:30
As kind of expected, updating one node to 10.5.3.9 went without issues and VLT converged just fine.
Next would be the latest 10.5.4, but before that I'll probably change the MTU on mgmt 1/1/1 (HB) to jumbo...
(edited)
NenadBosnjak
1 Rookie
•
23 Posts
0
December 29th, 2023 16:02
A little follow-up in case anybody reading has an idea; Google basically only gives me this forum topic now ;)
MTU doesn't look like the culprit here:
I updated both nodes to 10.5.3.9 and all is good, as mentioned.
If I try to update one node to 10.5.4.10, I get the same problem as with 10.5.5.7.
This time I managed to get a little more logging out of the attempt:
- Some time after boot, the VLT backup link comes up and ping from the reloaded switch works. Interestingly, mgmt1/1/1 stays at MTU 1532 (possibly by design, although the default changed to 9216 on the reloaded switch).
- VLTi interfaces stay down on both sides (VLT failure).
- VLT is in split-brain, with both nodes set as primary.
- The log has nothing relevant (to me, that is).
- The VLTi MTU is, and has been from the start, automatically 9216.
As mentioned, I deployed the latest switch firmware as the first step, and I'm officially out of clues.
NenadBosnjak
1 Rookie
•
23 Posts
0
January 3rd, 2024 10:42
One thing I was able to find now: in the user manuals after 10.5.1 there seems to be this recommendation:
"
While configuring a VLT MAC address, if the 8th bit of the MAC address is a 1, then the MAC address is considered to be a multicast MAC address. There are locally defined MAC addresses. For these addresses, the second least significant bit in the first byte must be a 1, which signifies a locally defined address.
The correct MAC addresses must have xxxxxx10 bits set in the first octet, such as x2, x6, xA, xE, and so on.
While manually configuring MAC addresses for VLT, make the 7th bit a 1 - to signify a locally assigned address - and the 8th bit a 0 - to signify a unicast address - which essentially means that you must use one of the following formulas
"
We use vlt-mac "00:00:00:00:00:02", but I'm still not sure how this could break VLT when upgrading OS10 on one node from 10.5.3 to 10.5.4. Nothing in the logs or anywhere else suggests that 10.5.4 started to penalize a "wrong" MAC format.
I'm also not sure whether the MAC can be changed on the fly now without causing a network outage.
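For reference, the bit rule quoted above can be checked with a few lines of Python. This is only a sketch of the standard first-octet test (0x01 marks multicast, 0x02 marks locally administered); the function names are made up for illustration, and nothing here shows whether OS10 actually enforces the rule or whether it is related to the split-brain.

```python
def first_octet_flags(mac: str) -> tuple[bool, bool]:
    """Return (is_multicast, is_locally_administered) for a MAC address string."""
    octets = mac.replace("-", ":").split(":")
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets, got {len(octets)}: {mac!r}")
    first = int(octets[0], 16)
    # Lowest bit of the first octet: multicast. Second-lowest bit: locally administered.
    return bool(first & 0x01), bool(first & 0x02)


def matches_quoted_rule(mac: str) -> bool:
    """True when the first octet ends in binary ...10 (local and unicast)."""
    multicast, local = first_octet_flags(mac)
    return local and not multicast


for mac in ("00:00:00:00:00:02", "02:00:00:00:00:02"):
    print(mac, matches_quoted_rule(mac))
```

By this test 00:00:00:00:00:02 does not match the quoted rule, because its 02 sits in the last octet and the first octet is 00, while 02:00:00:00:00:02 does match.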
(edited)
Greig_Mitchell
1 Rookie
•
15 Posts
0
January 18th, 2024 10:23
Hello,
Did you ever get a resolution to this?
I have a couple of Dell S4128F switches in a VLT configuration that need upgrading, but after reading this I'm now cautious about going to 10.5.4.4 or later.
As for the VLT MAC address, we never set it manually during the initial VLT configuration; the system appears to have set it automatically.
We haven't changed the default MTU either, so ours are still at 1532.
I do remember reading in one of the OS10 release notes that a loop could occur if a static LAG was used. I know the VLTi port-channel interface between the two switches uses a static LAG by default, but as far as I know this was always recommended by Dell?
(edited)
NenadBosnjak
1 Rookie
•
23 Posts
0
January 25th, 2024 09:27
Hi Greig
No, and I'm even more puzzled now that Google returns this topic as one of only a few results for any query related to "Dell OS10, vlt, split brain" ;)
Not an invitation to just do it, but knowing my luck lately, I'm fairly sure you will not have any issues ;)
The automatic MAC can be kept; it is one of the nodes' system MACs (I don't remember off the top of my head whether it's the higher or the lower one). AFAIR keeping it is not recommended, because it slows down primary election on, e.g., cluster boot.
In the meantime I got a new pair of S5232f, but they already came with 10.5.5.5, so I wasn't keen on downgrading them for testing while that would block other things from being installed.
(edited)
Greig_Mitchell
1 Rookie
•
15 Posts
0
January 25th, 2024 17:11
Hi Nenad,
Thanks for the update.
Yeah, for the automatic VLT MAC address it uses the primary node's MAC. I don't know what the implications of manually changing it would be, but given that it's working and we haven't experienced any issues, I'll probably leave it as is.
What do you have in terms of Spanning Tree and VLT priorities?
Our S4128F pair acts as our core/distribution, and we are using RSTP with the primary node set to 4096 and the secondary node set to 8192.
We have used the same values for our VLT priorities as well.
NenadBosnjak
1 Rookie
•
23 Posts
0
January 26th, 2024 06:48
Hi @Greig_Mitchell
True, automatic VLT MAC is AFAIK fully supported.
STP is almost the same as on your side: RSTP with the primary at 0 and the secondary at 4096, and all other switches in the network higher than that.
VLT primary-priority is 1 on one node and 65535 on the other.
We never had any issues with VLT except when I try to upgrade to 10.5.4.
NenadBosnjak
1 Rookie
•
23 Posts
0
May 23rd, 2024 07:33
Some updates:
I'm unable to reproduce the problem on a freshly bought pair of S5248f (for a second facility) with a configuration identical to the one on my pair, but of course isolated from the rest of the network (just connected to each other) ;)
Not sure what to assume here, since the only network devices we have left are Dell OS6 and Dell OS10.
Is Dell OS6 now also not compatible with Dell OS10?
The only difference I can think of is that my switches are HW rev 02 and the new ones are rev 04.