6 Operator
•
14.4K Posts
•
56.2K Points
0
5714
April 30th, 2015 02:00
Is this indication of faulty card/port?
For some time we have been getting timeouts on backups, and we noticed the error count on a network interface increasing. In our case it is veth1, which is an LACP bond of eth4b and eth5b. The idea was that the cable or SFP might be the culprit, so we asked the floor management guys to check it. They pulled it out and put it back in as a test, and the port went missing for good.
In essence, this is your typical DD890 with dual 10GbE cards, something like this:
# system show hardware
Slot  Vendor        Device                    Ports
----  ------------  ------------------------  --------------
0     Intel         82576 Gigabit             0a, 0b
1     Qlogic Corp.  QLE2562 8Gb FC            1a, 1b
2     LSI Logic     SAS31601E                 2a, 2b, 2c, 2d
3     LSI Logic     SAS31601E                 3a, 3b, 3c, 3d
4     Intel         Dual Port 10GbE(82599EB)  4a, 4b
5     Intel         Dual Port 10GbE(82599EB)  5a, 5b
6     EMC           DD00 NVRAM Card
----  ------------  ------------------------  --------------
net show hardware no longer shows port 4b at all:
# net show hardware
Port   Speed    Duplex   Supp Speeds  Hardware Address   Physical  Link Status
-----  -------  -------  -----------  -----------------  --------  -----------
eth0a  unknown  unknown  10/100/1000  00:8c:fa:19:68:d9  Copper    no
eth0b  unknown  unknown  10/100/1000  00:8c:fa:19:68:d8  Copper    no
eth4a  10Gb/s   full     1000/10000   90:e2:ba:30:a2:1c  Fiber     yes
eth5a  10Gb/s   full     1000/10000   90:e2:ba:30:a2:1c  Fiber     yes
eth5b  10Gb/s   full     1000/10000   90:e2:ba:30:a2:1d  Fiber     yes
-----  -------  -------  -----------  -----------------  --------  -----------
One more thing that puzzles me is this:
# system show ports
** Hardware access error.
Now, floor management checked the SFP on the switch and there are many errors there, so they will try to change it. Fair enough, but I'm not sure how that would cause one port to disappear or, even worse, stop us from seeing the other ports, as in the last command above.
We plan to reboot the DD box and see if that makes any difference. It crossed my mind that the cable might not be supported or something, but this setup ran error-free for two years, so I discarded that idea. As I write this I have started generating a support bundle to see more logs, but I thought perhaps someone has an idea or two, or has seen this before.


jbrooksuk
208 Posts
1
April 30th, 2015 03:00
Hi,
I've never seen a port go missing completely like that before.
I've seen one DOA card once, that never appeared at install at all - but it's very rare.
I would suspect an HBA port failure on the DD itself. I doubt it's the SFP optical GBIC; more likely it is the HBA itself.
I can't see that this would relate to anything outside the DD.
Even if it comes back after a reboot, the intermittent errors you were already seeing on this bond suggest that a total failure of eth4b is actually a good thing: just have EMC replace the HBA in slot 4.
Regards, Jonathan
ble1
0
April 30th, 2015 03:00
Thanks Jonathan... will keep you updated with progress.
jbrooksuk
208 Posts
0
April 30th, 2015 03:00
Please do, I'd like to know the outcome
bobt5
49 Posts
0
April 30th, 2015 09:00
Hi Hrvoje,
A reboot will most likely fix it. Often, when the SFP is removed, the driver does not reset once it is reinstalled and the port goes missing. The reboot will reset the driver.
ble1
0
April 30th, 2015 10:00
Yes, after the reboot the card is back. Normally we see errors during the night, which is when most of the backups run. The counters were reset by the restart, so I'm keeping an eye on them. For 4b I see the following:
eth4b Link encap:Ethernet HWaddr 90:E2:BA:30:A2:1D
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9000 Metric:1
RX packets:32359121 errors:2 dropped:0 overruns:0 frame:2
TX packets:273833325 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:290229885652 (270.2 GiB) TX bytes:1801864939090 (1.6 TiB)
It is too early to say anything, but two RX errors that are also counted as frame errors normally indicate CRC errors on the receiving side. If so, and if it continues, I assume we will need to replace the cable and, if that doesn't change anything, move to another port on the switch. Of course, I could also request that the card in the DD be replaced, but I'm still not 100% sure the problem is on the DD side (and it is easier to replace a cable or move to another port than to have the card replaced).
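For anyone who wants to track these counters between checks, here is a minimal Python sketch that pulls the RX counters out of ifconfig-style text like the output above and expresses the errors per million received packets. The helper names (`rx_counters`, `errors_per_million`) are made up for illustration, and the sketch assumes the counter line keeps the exact layout shown in this post.

```python
import re

# Sample text copied from the eth4b output posted above.
SAMPLE = """\
eth4b Link encap:Ethernet HWaddr 90:E2:BA:30:A2:1D
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9000 Metric:1
RX packets:32359121 errors:2 dropped:0 overruns:0 frame:2
TX packets:273833325 errors:0 dropped:0 overruns:0 carrier:0
"""

# Assumption: the RX line always has these five fields in this order.
RX_LINE = re.compile(
    r"RX packets:(?P<packets>\d+) errors:(?P<errors>\d+) "
    r"dropped:(?P<dropped>\d+) overruns:(?P<overruns>\d+) frame:(?P<frame>\d+)"
)


def rx_counters(text):
    """Return the RX counters from the first RX line found in text."""
    match = RX_LINE.search(text)
    if match is None:
        raise ValueError("no RX counter line found")
    return {name: int(value) for name, value in match.groupdict().items()}


def errors_per_million(counters):
    """RX errors per million received packets (0.0 if nothing was received)."""
    if counters["packets"] == 0:
        return 0.0
    return counters["errors"] * 1_000_000 / counters["packets"]


counters = rx_counters(SAMPLE)
print(counters)
print(f"{errors_per_million(counters):.4f} RX errors per million packets")
```

Comparing this ratio across several samples is more telling than the raw error count, because the count alone grows with traffic volume.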
jbrooksuk
208 Posts
0
May 1st, 2015 00:00
Oh... the SFP in the DD was removed/reinserted. Sorry, I thought the switch SFP was removed/reinserted. Schoolboy error, sorry.
Yeah, pulling it at the DD live may well do that, as I guess you've proved and Bob suggested.
Maybe shut down and swap the SFP and its LC cable between 4b and 5b (DD end), then see if the issue moves.
If it does, swap just the cables between them (this won't require a reboot) and check again.
EMC don't ship the SFP on its own, so if the problem is at the DD HBA (probably the SFP), the whole card will need to be replaced anyway.
Cheers, Jonathan
ble1
0
May 8th, 2015 03:00
Quick follow-up. After the reboot of the DD, the first night was really good (no errors on the backup end, and just two drops on the DD itself). Later on the counts started to increase. Looking at drops, I see an issue on both cards, but only one with frame errors. Since the restart, when the counters were reset, I see:
hcrvelin@DD# ifconfig -a
eth0a Link encap:Ethernet HWaddr 00:8C:FA:19:68:D9
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
eth0b Link encap:Ethernet HWaddr 00:8C:FA:19:68:D8
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
eth4a Link encap:Ethernet HWaddr 90:E2:BA:30:A2:1C
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:1615185786 errors:0 dropped:1745 overruns:0 frame:0
TX packets:1005165519 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:6913545373721 (6.2 TiB) TX bytes:114814514470 (106.9 GiB)
eth4b Link encap:Ethernet HWaddr 90:E2:BA:30:A2:1D
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9000 Metric:1
RX packets:1232062737 errors:1036145 dropped:0 overruns:0 frame:1036145
TX packets:7053393747 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:11495133058857 (10.4 TiB) TX bytes:26895605379812 (24.4 TiB)
eth5a Link encap:Ethernet HWaddr 90:E2:BA:30:A2:1C
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:2581217549 errors:0 dropped:10677 overruns:0 frame:0
TX packets:778629305 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:9185121513238 (8.3 TiB) TX bytes:144690072643 (134.7 GiB)
eth5b Link encap:Ethernet HWaddr 90:E2:BA:30:A2:1D
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9000 Metric:1
RX packets:5918696425 errors:300 dropped:730 overruns:0 frame:300
TX packets:1157056711 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:30897381366464 (28.1 TiB) TX bytes:3036809152591 (2.7 TiB)
veth1.802 Link encap:Ethernet HWaddr 90:E2:BA:30:A2:1D
inet addr: Bcast: Mask:
inet6 addr: fe80::92e2:baff:fe30:a21d/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:9000 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:529527715 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:2975886163017 (2.7 TiB)
veth0 Link encap:Ethernet HWaddr 90:E2:BA:30:A2:1C
inet addr: Bcast: Mask:
inet6 addr: fe80::92e2:baff:fe30:a21c/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:4196403335 errors:0 dropped:12422 overruns:0 frame:0
TX packets:1783794824 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:16098666886959 (14.6 TiB) TX bytes:259504587113 (241.6 GiB)
veth1 Link encap:Ethernet HWaddr 90:E2:BA:30:A2:1D
inet addr: Bcast: Mask:
inet6 addr: fe80::92e2:baff:fe30:a21d/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:9000 Metric:1
RX packets:7150759162 errors:1036445 dropped:730 overruns:0 frame:1036445
TX packets:8210450458 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:42392514425321 (38.5 TiB) TX bytes:29932414532403 (27.2 TiB)
So, we will replace the card in slot 4 now. I'm still not 100% sure this is a card issue, since I see those errors on both 4b (very many) and 5b (the two ports that make up veth1). As this is seen on both cards I'm still a bit puzzled (and the nature of L2L3 hashing is such that most traffic would prefer one interface in the bond, so this might explain why 4b is more affected).
Anyway, support will replace it, which brings me to one question. They require the serial number of the card, which I do not have and could not find in the autosupport. Is there a way (SE mode?) to get it? Otherwise I guess we need to shut down the box and pull out the card to read it.
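The numbers in the ifconfig output above can be cross-checked with a short Python sketch: the veth1 RX counters should be the sum of its two members, and dividing errors by packets shows how differently the two ports are affected. The values are copied from that output; the variable names are only illustrative.

```python
# RX counters for the two members of veth1, copied from the output above.
members = {
    "eth4b": {"packets": 1232062737, "errors": 1036145, "dropped": 0, "frame": 1036145},
    "eth5b": {"packets": 5918696425, "errors": 300, "dropped": 730, "frame": 300},
}
# RX counters reported for the bond itself.
veth1 = {"packets": 7150759162, "errors": 1036445, "dropped": 730, "frame": 1036445}

# Sum each counter over the members and compare with the bond's own counters.
totals = {key: sum(m[key] for m in members.values()) for key in veth1}
print("members add up to veth1:", totals == veth1)

# RX errors per million received packets, and each member's share of all errors.
per_million = {
    name: m["errors"] * 1_000_000 / m["packets"] for name, m in members.items()
}
error_share = {name: m["errors"] / veth1["errors"] for name, m in members.items()}

for name in members:
    print(f"{name}: {per_million[name]:.3f} errors per million packets, "
          f"{error_share[name]:.2%} of all veth1 RX errors")
```

On these figures almost all of the veth1 errors come from eth4b, even though eth5b received several times more packets, so traffic preference alone does not account for the difference in error counts.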
bobt5
49 Posts
0
May 8th, 2015 09:00
Not sure why they need the SN of the card. I never asked for it when I was in support.
All they should need is whether it's a 10/100/1000 or 10GbE card, dual or quad port, and copper or optical. All of that should be available in an ASUP.
In SE mode, se lspci -v does show a device serial number, but it's just the MAC address.
ASUP Hardware VPD:
Controllers
umem0 Micro Memory NVRAM 0x6 (6.251) 1024 MBytes
umem1 Micro Memory NVRAM 0x6 (6.251) 1024 MBytes
ioc0 LSI Logic SAS1068E B3 105 011b0a00 0
ioc1 LSI Logic SAS1078 C2 105 011b0000 0
ioc2 LSI Logic SAS1068E B3 105 011b0a00 0
ioc3 LSI Logic SAS1068E B3 105 011b0a00 0
06:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
15:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
16:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
ASUP Net show config:
eth5a Link encap:Ethernet HWaddr 90:E2:BA:0F:D8:50
inet addr:10.110.133.230 Bcast:10.110.135.255 Mask:255.255.248.0
inet6 addr: 2620:0:170:1a01:92e2:baff:fe0f:d850/64 Scope:Global
inet6 addr: fe80::92e2:baff:fe0f:d850/64 Scope:Link
UP BROADCAST NOTRAILERS RUNNING MULTICAST MTU:1500 Metric:1
RX packets:147879695 errors:5 dropped:3620966 overruns:0 frame:5
TX packets:1167251 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:11344602452 (10.5 GiB) TX bytes:237554886 (226.5 MiB)
se lspci -v
16:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
Subsystem: Intel Corporation Ethernet Server Adapter X520-2
Flags: bus master, fast devsel, latency 0, IRQ 25
Memory at c0c80000 (64-bit, prefetchable) [size=512K]
I/O ports at 3020 [size=32]
Memory at c0d04000 (64-bit, prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
Capabilities: [a0] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Device Serial Number 90-e2-ba-ff-ff-0f-d8-50
Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
Kernel driver in use: ixgbe
Kernel modules: ixgbe
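In the output above, the Device Serial Number (90-e2-ba-ff-ff-0f-d8-50) is the eth5a MAC address (90:E2:BA:0F:D8:50) with ff-ff inserted in the middle. A minimal Python sketch of that mapping, assuming the same pattern holds for other ports on this card; `dsn_to_mac` is a hypothetical helper name, not a DD OS command.

```python
def dsn_to_mac(dsn):
    """Convert an lspci-style Device Serial Number back to a MAC address.

    Assumes the 8-octet form seen in this thread: the first three MAC octets,
    then ff-ff, then the last three MAC octets.
    """
    octets = dsn.strip().lower().split("-")
    if len(octets) != 8 or octets[3:5] != ["ff", "ff"]:
        raise ValueError(f"unexpected serial number layout: {dsn!r}")
    return ":".join(octets[:3] + octets[5:])


print(dsn_to_mac("90-e2-ba-ff-ff-0f-d8-50"))
```

This is why the value is of no use for an RMA: it identifies the port, not the physical card.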
jbrooksuk
208 Posts
0
May 8th, 2015 10:00
You shouldn't need the SN of the HBA, but this should show it (ignore the part number, EMC won't recognise it):
enclosure show chassis 1
No need for SE.
It is odd that you don't have balanced TX/RX traffic across both members of the LACP bond.
Veth0 looks much more balanced...
Are they both on the same switch or on different switches? If different, do you have VPC running on the switches?
You shouldn't see too much of a preference for one of the bond members.
If you're brave and can live with a single 10GbE link, maybe pull eth4b and see what happens to your error counts.
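To put numbers on the imbalance, here is a small Python sketch that computes each member's share of RX and TX bytes per bond, using the byte counters from the ifconfig output posted earlier in the thread. The `shares` helper is only illustrative.

```python
# (RX bytes, TX bytes) per bond member, copied from the ifconfig output
# posted earlier in the thread.
bonds = {
    "veth0": {
        "eth4a": (6913545373721, 114814514470),
        "eth5a": (9185121513238, 144690072643),
    },
    "veth1": {
        "eth4b": (11495133058857, 26895605379812),
        "eth5b": (30897381366464, 3036809152591),
    },
}


def shares(members, index):
    """Fraction of the bond's bytes carried by each member (0 = RX, 1 = TX)."""
    total = sum(counters[index] for counters in members.values())
    return {name: counters[index] / total for name, counters in members.items()}


for bond, members in bonds.items():
    rx, tx = shares(members, 0), shares(members, 1)
    for name in members:
        print(f"{bond} {name}: RX {rx[name]:.1%}  TX {tx[name]:.1%}")
```

On these counters veth0 splits roughly 43/57 in both directions, while on veth1 eth4b carries about 90% of the transmitted bytes and eth5b about 73% of the received bytes.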
ble1
0
May 8th, 2015 12:00
Support claims the SN is obligatory for an RMA at EMC. Not sure if that's something new or not.
ble1
0
May 8th, 2015 12:00
Tried the enclosure command (I believe its output is also part of the autosupport), but...
hcrvelin@DD# enclosure show chassis 1
This command may take up to a minute to complete. Please wait...
Enclosure 1
Chassis:
Chassis Part Number WC0479015001
Chassis Serial Number TWNW32CD00F
BMC Device Revision 1
BMC Firmware Revision 2.4
IPMI Version 2.0
BIOS Version 2.05
BIOS Release Date 09/19/2012
Chassis Sub-components:
Name        Product ID   Part No.      Serial No.  HW Rev
----------  -----------  ------------  ----------  ------
MLB         5241         1395A2535001  BE2CNK0081  na
Backplane   2U12BPY-001  1395A2303401  2H2BBK0361  na
LP Riser    LPHLY-001    1395A2303601  BJ2BBK0456  na
FHFL Riser  FHFLY-001    1395A2303501  BL2BBK0049  na
----------  -----------  ------------  ----------  ------
No card serial number there. If I ask for show all within the first enclosure, I only get the MAC addresses. For example:
4 Dual Port 10GbE(82599EB) 0.9-3 10GbE eth4a 90:e2:ba:30:a2:1c
eth4b 90:e2:ba:30:a2:1d
5 Dual Port 10GbE(82599EB) 0.9-3 10GbE eth5a 90:e2:ba:30:a2:1c
eth5b 90:e2:ba:30:a2:1d
As for the balanced/unbalanced thing, this seems to be the case with the L2L3 setting. I have seen some reports where traffic is balanced when using L3L4 instead. These cards are connected to different switches (VPC).
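To illustrate why the hash setting matters, here is a simplified Python sketch loosely modelled on the layer2+3 and layer3+4 transmit hash policies described in the Linux bonding documentation. It is an assumption that DD OS behaves similarly; the exact hash it uses is not shown in this thread, and the MAC addresses, IP addresses, and ports below are made up.

```python
import ipaddress


def _fold(value, slaves):
    """Fold the upper bits down and pick a bond member by modulo."""
    value ^= value >> 16
    value ^= value >> 8
    return value % slaves


def _ip_xor(src_ip, dst_ip):
    return int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))


def hash_l2l3(src_mac, dst_mac, src_ip, dst_ip, slaves=2):
    """MAC and IP addresses only: every flow between two hosts picks the same member."""
    macs = int(src_mac.replace(":", ""), 16) ^ int(dst_mac.replace(":", ""), 16)
    return _fold((macs & 0xFFFFFFFF) ^ _ip_xor(src_ip, dst_ip), slaves)


def hash_l3l4(src_ip, dst_ip, src_port, dst_port, slaves=2):
    """IP addresses and ports: different connections between two hosts can spread out."""
    return _fold((src_port ^ dst_port) ^ _ip_xor(src_ip, dst_ip), slaves)


# Hypothetical backup client opening 16 connections to one server port.
SRC_MAC, DST_MAC = "90:e2:ba:00:00:01", "90:e2:ba:00:00:02"
SRC_IP, DST_IP = "10.0.0.10", "10.0.0.20"
flows = [(40000 + i, 2049) for i in range(16)]

l2l3 = [hash_l2l3(SRC_MAC, DST_MAC, SRC_IP, DST_IP) for _ in flows]
l3l4 = [hash_l3l4(SRC_IP, DST_IP, sport, dport) for sport, dport in flows]

print("L2L3 members used:", sorted(set(l2l3)))
print("L3L4 members used:", sorted(set(l3l4)))
```

With only a handful of large backup clients, an address-only hash can pin most of the traffic to one member, which would be consistent with the skew seen on veth1. Note that this only governs the transmit side of each device; the switch applies its own hash to the traffic it sends toward the DD.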
bobt5
49 Posts
0
May 8th, 2015 13:00
What's the SR#?
I want to take a look and see who is handling this.
ble1
0
May 8th, 2015 15:00
I'll try to find out. We do not have a direct EMC contract; support goes through an EMC partner.
Scouser1
1 Message
0
August 17th, 2015 07:00
Did you get a resolution to this one?
ble1
0
August 21st, 2015 01:00
We replaced the card, but the errors did not go away. Most likely it is something on the switch side, I guess. Everything continues to function despite the errors.