Nortel 5520 Stack problem during failure

I've got a stack of two 5520s with MLTs configured (split across each unit). When I test for a failure by unplugging the top unit, I lose all connectivity for about 15-30 seconds, and then it recovers. (My desktop is plugged into the bottom unit.) My first thought was STP, but I can turn off STP on all ports and still have the problem.

Any ideas?
 
what software revision?
how is IP configured (stack IP for management, or separate switch IPs on both)?

is the top switch in your scenario the Master/BASE? (check the dip switches at the back)

when you say you power down the top switch, do you lose connectivity to the switch's management interface, or all forwarding across all interfaces?

-do you lose pings to the switch
-do you lose pings to other PCs on the same switch

are all devices in the same VLAN?

what is the 5520 uplinked to (via MLT)?

15-30 seconds sounds about right for spanning tree convergence.

where is the MLT running/connecting to upstream?
try disabling STP on the MLT trunk ports that are bundled at the 5520s and the upstream switches, and test again.

shouldn't matter. STP only converges once across the MLT...once both ports (or more) are brought up in the MLT and STP has converged (and begins forwarding), you can remove any of the links or plug them back in without any STP convergence (that is, until ALL of the links go down and 1 or more come back up; then it has to re-converge)
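for what it's worth, disabling STP participation on an MLT and its member ports on the 5.x CLI looks roughly like this (same syntax family the ACG generates; the group and port numbers here are just placeholders):

Code:
configure terminal
! keep the MLT group out of the STG
mlt spanning-tree 1 stp 1 learning disable
! and disable STP learning on the member ports themselves
interface fastEthernet ALL
spanning-tree port 1/1,2/1 learning disable
exit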
 
Well, unfortunately - when doing further testing, I apparently powered up the switch again before unit 2 was done failing over - resulting in a complete loss of config. :( :(

But, it's not in production, and it'll only be a little work to get back going.

To answer your questions:

Yes I lost connectivity to all devices plugged in, including pinging the management IP.

These are used for a VMware setup, all MLTs are for ESX hosts (2 nics each).
 
Ok, tried again.

Continuous ping to:

Management IP
ESX Host (MLT)
VM running on ESX Host
vCenter Server (LACP)
Another PC on switch

This time when unplugging, I lost one packet to the other PC - all others failed for 10+ seconds.
 
very strange.
is the top switch the BASE?

forwarding at layer-2 should not be disrupted on the slave switch(es) when the base goes down. i don't recall having seen this in my nortel days.

did you also drop packets to a device elsewhere in the network (traversing your MLTs)?
 
Yes, top switch is the base - sorry for not answering that.

Yes, I lost connectivity to a VM, served by an ESX host that has a 2 member MLT trunk.
 
! Model = Ethernet Routing Switch 5520-48T-PWR
! Software version = v5.1.2.035

I was using revision 6.something, but it behaved really badly when recovering from the same failure - it would kill the good working port on the unit that stayed up (MLTs only, not LACP).
 
Here's the relevant config info -

Code:
! Embedded ASCII Configuration Generator Script
! NOTE:  This file may be split into multiple files.
!        It is noted at the end of this file if this
!        is the case.
! Model = Ethernet Routing Switch 5520-48T-PWR
! Software version = v5.1.2.035
enable
configure terminal
!
! *** CORE ***
!
autosave enable
mac-address-table aging-time 300
autotopology
!
! *** STP (Phase 1) ***
!
spanning-tree cost-calc-mode dot1d
spanning-tree port-mode normal
spanning-tree stp 1 priority 8000
spanning-tree stp 1 hello-time 2
spanning-tree stp 1 forward-time 15 max-age 20
spanning-tree stp 1 tagged-bpdu disable tagged-bpdu-vid 4001
spanning-tree stp 1 multicast-address 01:80:c2:00:00:00
no spanning-tree 802dot1d-port-compliance enable
!
! *** VLAN ***
!
vlan configcontrol flexible
vlan name 1 "VLAN #1"
vlan create 90 name "VLAN #90" type port
vlan create 108 name "VLAN #108" type port
vlan create 301 name "Hardware VLAN" type port
vlan create 302 name "ESX Mgmt VLAN" type port
vlan create 303 name "Backup VLAN" type port
vlan create 304 name "vMotion VLAN" type port
vlan create 305 name "Testing VLAN" type port
vlan ports 1/1-14 tagging unTagPvidOnly  filter-untagged-frame disable filter-unregistered-frames enable priority 0 
vlan ports 1/15 tagging unTagAll  filter-untagged-frame disable filter-unregistered-frames enable priority 0 
vlan ports 1/16 tagging tagAll  filter-untagged-frame disable filter-unregistered-frames enable priority 0 
vlan ports 1/17-48 tagging unTagAll  filter-untagged-frame disable filter-unregistered-frames enable priority 0 
vlan ports 2/1-14 tagging unTagPvidOnly  filter-untagged-frame disable filter-unregistered-frames enable priority 0 
vlan ports 2/15 tagging unTagAll  filter-untagged-frame disable filter-unregistered-frames enable priority 0 
vlan ports 2/16 tagging tagAll  filter-untagged-frame disable filter-unregistered-frames enable priority 0 
vlan ports 2/17-48 tagging unTagAll  filter-untagged-frame disable filter-unregistered-frames enable priority 0 
vlan members 1 1/ALL,2/ALL
vlan members 90 1/1-14,1/48,2/1-14,2/48
vlan members 108 1/1-14,1/16,1/48,2/1-14,2/16,2/48
vlan members 301 1/1-19,2/1-18,2/36,2/40,2/42
vlan members 302 1/1-14,1/16,2/1-14,2/16
vlan members 303 1/1-14,1/16,2/1-14,2/16
vlan members 304 1/1-14,1/16,2/1-14,2/16
vlan members 305 1/1-14,1/16,2/1-14,2/16
vlan ports 1/1-14 pvid 302
vlan ports 1/15 pvid 301
vlan ports 1/16 pvid 1
vlan ports 1/17-19 pvid 301
vlan ports 1/20-48 pvid 1
vlan ports 2/1-14 pvid 302
vlan ports 2/15 pvid 301
vlan ports 2/16 pvid 1
vlan ports 2/17-18 pvid 301
vlan ports 2/19-35 pvid 1
vlan ports 2/36 pvid 301
vlan ports 2/37-39 pvid 1
vlan ports 2/40 pvid 301
vlan ports 2/41 pvid 1
vlan ports 2/42 pvid 301
vlan ports 2/43-48 pvid 1
vlan igmp unknown-mcast-no-flood disable
vlan igmp 1 snooping disable
vlan igmp 1 proxy disable robust-value 2 query-interval 125
vlan igmp 90 snooping disable
vlan igmp 90 proxy disable robust-value 2 query-interval 125
vlan igmp 108 snooping disable
vlan igmp 108 proxy disable robust-value 2 query-interval 125
vlan igmp 301 snooping disable
vlan igmp 301 proxy disable robust-value 2 query-interval 125
vlan igmp 302 snooping disable
vlan igmp 302 proxy disable robust-value 2 query-interval 125
vlan igmp 303 snooping disable
vlan igmp 303 proxy disable robust-value 2 query-interval 125
vlan igmp 304 snooping disable
vlan igmp 304 proxy disable robust-value 2 query-interval 125
vlan igmp 305 snooping disable
vlan igmp 305 proxy disable robust-value 2 query-interval 125
no auto-pvid
!
! *** Interface ***
!
interface FastEthernet ALL
default auto-negotiation-advertisements port 1/ALL,2/ALL 
shutdown port 1/20-44,2/19-35,2/37-39,2/41,2/43-44 
no shutdown port 1/1-19,1/45-48,2/1-18,2/36,2/40,2/42,2/45-48 
snmp trap link-status port 1/ALL,2/ALL enable
speed port 1/ALL,2/ALL auto
duplex port 1/ALL,2/ALL auto
exit
interface FastEthernet ALL 
rate-limit port 1/ALL,2/ALL both 0 
exit
!
! *** MLT (Phase 1) ***
!
no mlt
mlt 1 name "Blade 1" enable member 1/1,2/1 learning normal
mlt 1 learning normal
mlt 1 loadbalance advance
mlt 2 name "Blade 2" enable member 1/2,2/2 learning normal
mlt 2 learning normal
mlt 2 loadbalance advance
mlt 3 name "Blade 3" enable member 1/3,2/3 learning normal
mlt 3 learning normal
mlt 3 loadbalance advance
mlt 4 name "Blade 4" enable member 1/4,2/4 learning normal
mlt 4 learning normal
mlt 4 loadbalance advance
mlt 5 name "Blade 5" enable member 1/5,2/5 learning normal
mlt 5 learning normal
mlt 5 loadbalance advance
mlt 6 name "Blade 6" enable member 1/6,2/6 learning normal
mlt 6 learning normal
mlt 6 loadbalance advance
mlt 7 name "Blade 7" enable member 1/7,2/7 learning normal
mlt 7 learning normal
mlt 7 loadbalance advance
mlt 8 name "Blade 8" enable member 1/8,2/8 learning normal
mlt 8 learning normal
mlt 8 loadbalance advance
mlt 9 name "Blade 9" enable member 1/9,2/9 learning normal
mlt 9 learning normal
mlt 9 loadbalance advance
mlt 10 name "Blade 10" enable member 1/10,2/10 learning normal
mlt 10 learning normal
mlt 10 loadbalance advance
mlt 11 name "Blade 11" enable member 1/11,2/11 learning normal
mlt 11 learning normal
mlt 11 loadbalance advance
mlt 12 name "Blade 12" enable member 1/12,2/12 learning normal
mlt 12 learning normal
mlt 12 loadbalance advance
mlt 13 name "Blade 13" enable member 1/13,2/13 learning normal
mlt 13 learning normal
mlt 13 loadbalance advance
mlt 14 name "Blade 14" enable member 1/14,2/14 learning normal
mlt 14 learning normal
mlt 14 loadbalance advance
!
! *** LACP ***
!
lacp system-priority 32768
lacp port-mode default
interface fastEthernet ALL
lacp key port 1/1-15 1
lacp mode port 1/1-15 off
lacp key port 1/16 100
lacp mode port 1/16 active
lacp key port 1/17-48,2/1-15 1
lacp mode port 1/17-48,2/1-15 off
lacp key port 2/16 100
lacp mode port 2/16 active
no lacp aggregation port 1/1-15,1/17-48,2/1-15,2/17-48 enable
lacp mode port 2/17-48 off
lacp key port 2/17-48 1
lacp priority port 1/ALL,2/ALL 32768
lacp timeout-time port 1/ALL,2/ALL long
lacp aggregation port 1/16,2/16 enable
exit
!
! *** STP (Phase 2) ***
!
spanning-tree stp 1 add-vlan 1
spanning-tree stp 1 add-vlan 90
spanning-tree stp 1 add-vlan 108
spanning-tree stp 1 add-vlan 301
spanning-tree stp 1 add-vlan 302
spanning-tree stp 1 add-vlan 303
spanning-tree stp 1 add-vlan 304
spanning-tree stp 1 add-vlan 305
interface FastEthernet ALL
spanning-tree port 1/15-48 learning normal 
spanning-tree port 2/15-48 learning normal 
spanning-tree port 1/15-48 priority 80
spanning-tree port 2/15-48 priority 80
spanning-tree bpdu-filtering port 1/ALL timeout 120
no spanning-tree bpdu-filtering port 1/ALL enable
spanning-tree bpdu-filtering port 2/ALL timeout 120
no spanning-tree bpdu-filtering port 2/ALL enable
exit
interface FastEthernet ALL
spanning-tree port 1/1-14 learning disable
exit
interface FastEthernet ALL
spanning-tree port 2/1-14 learning disable
exit
!
! *** VLAN Phase 2***
!
vlan mgmt 1
!
! *** MLT (Phase 2) ***
!
mlt spanning-tree 1 stp 1 learning disable
mlt spanning-tree 2 stp 1 learning disable
mlt spanning-tree 3 stp 1 learning disable
mlt spanning-tree 4 stp 1 learning disable
mlt spanning-tree 5 stp 1 learning disable
mlt spanning-tree 6 stp 1 learning disable
mlt spanning-tree 7 stp 1 learning disable
mlt spanning-tree 8 stp 1 learning disable
mlt spanning-tree 9 stp 1 learning disable
mlt spanning-tree 10 stp 1 learning disable
mlt spanning-tree 11 stp 1 learning disable
mlt spanning-tree 12 stp 1 learning disable
mlt spanning-tree 13 stp 1 learning disable
mlt spanning-tree 14 stp 1 learning disable
!
! *** AUR ***
!
stack auto-unit-replacement enable
!
! *** AAUR ***
!
stack auto-unit-replacement-image enable
!
! *** Brouter Port ***
!
interface fastEthernet ALL
exit
!
! ACG configuration generation completed
!
 
very strange. 5.1.2 is fairly stable and i had no issues across multiple deployments with anything relatively simple like this.

i changed positions and no longer have access to all of my nortel labs, gear, etc... so i'm going by memory.

are you in a position to open a case with nortel? do you have support?

it is very strange that you are losing layer-2 forwarding with DMLT (an uplink on each switch) when the base loses power.

losing forwarding for 15-30 seconds definitely sounds like a spanning tree converging.

is the management IP (switch/stack) on a different VLAN (routed) than your PC? can you try putting your PC on the same VLAN as the management IP, and confirm whether the same happens (whether you still lose connectivity)?

can you look at STP info after it comes back up (when you reset the base) and see what has converged and how long ago?
 
Yes my PC is on a different VLAN than the management IP, so I'll try it again on the management VLAN.

Unfortunately no support contract - just software. These were purchased before Nortel changed their support terms, so we were grandfathered into software (or so I understand).
 
Ok - even worse. Before, I was managing via (and pinging) the gateway for the VLAN my desktop was on. This time I put my desktop on a port in VLAN 1, and set up a ping to the actual management IP.

I unplugged the base switch - and couldn't ping the actual management IP - PERIOD. Not until I plugged the base back in could I see it again.

How do I check - ?
can you look at STP info after it comes back up (when you reset the base) and see what has converged and how long ago?

All I've been able to find is TimeSinceTopologyChange in STG in JDM. It does show the time since the outage.
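For the record, the same counters should also be visible from the CLI; from memory (the exact command name may vary by release), something like this shows the TimeSinceTopologyChange and NumTopologyChanges values that JDM displays under STG:

Code:
show spanning-tree stp 1 config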
 
remember, the management IP is put on the STACK...hence, it will not be active unless there is a stack.

if you only have 2 switches and you power one off, you no longer have a stack :) (management IP no longer reachable)

if you were to have 3 switches and powered one down, you would still be able to ping the management IP.

you can try putting IP addresses on each individual switch under Switch IP, and then have a 3rd IP for the master Stack IP (this is the one you should be using anyway to troubleshoot/manage the switches).
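the general shape on the 5.x CLI is something like the following - the exact keywords are from memory and may differ by release, and the addresses are placeholders:

Code:
configure terminal
! stack management IP (only reachable while the stack is intact)
ip address stack 10.1.1.10 netmask 255.255.255.0
! per-unit switch IPs (reachable even when the stack breaks)
ip address unit 1 10.1.1.11
ip address unit 2 10.1.1.12
ip default-gateway 10.1.1.1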


very strange, though.
i may have missed it before...if you have 2 PCs in the same VLAN on the bottom switch (pinging each other) and you power off the base/top switch, do you lose connectivity?

and you say everything else (devices in other VLANs (hence, routed) and devices passing through the uplinks) loses forwarding for about 15-30 seconds?

Normally, with the (D)MLT, once a single link comes up, STG runs and after 30 seconds the link starts forwarding...as you add links, STG is already running so there is no further convergence. Same as if there are 4 links and you disable 3 so only 1 is active; there is no STG convergence.

There may be a possibility that, since the top switch is the base, STG is reconverging the MLT (with just the bottom-switch link up), but this still shouldn't happen.

what happens when you power down the bottom switch (and leave base/top running) and run your tests then?
 
very strange, though.
i may have missed it before...if you have 2 PCs in the same VLAN on the bottom switch (pinging each other) and you power off the base/top switch, do you lose connectivity?

and you say everything else (devices in other VLANs (hence, routed) and devices passing through the uplinks) loses forwarding for about 15-30 seconds?

Tried this just now. On power down - lost pings for about 10 seconds. On power up, after boot (watching via console port) - lost pings for 30 seconds.

When testing before - yes it didn't matter if it was on the same VLAN or routed - lost it either way.

Off to test powering down unit 2...
 
Here is a chart of my results when powering down/up unit 2

Code:
                        Power Down      Power Up
PC -> PC (Same VLAN)    Up              Down - 30 sec
PC -> Gateway           Up              Down - 30 sec
MLT - Same VLAN         Up              Down - 4 pkts
MLT - Gateway           Up              Down - 4 pkts
MLT - Diff VLAN         Up              Down - 32 pkts
LACP - Same VLAN        Down - 30 sec   Down - 30 sec
LACP - Gateway          Down - 30 sec   Down - 30 sec
LACP - Diff VLAN        Down - 30 sec   Down - 30 sec
 
can you try adding a separate 'switch ip' to both switches (while keeping the stack ip) and then see if you lose connectivity to the switch that doesn't get powered down?

check STG stats in DM or the CLI at the time.

losing forwarding for 30 seconds sounds like STG.
 
Ok - man this is complicated.

I configured each switch with an IP. When a power down occurs, it takes about 10 seconds between losing the stack IP and coming up on the switch IP.

STG Topo only changes if the base unit goes down. HOWEVER - connectivity is still lost for ~10 sec during a power down, and ~30 sec during a power up (as shown in the previous chart).

I'm stumped!
 
Yes it is, as it is not connected to any other network equipment yet.

However - I FINALLY FIXED IT!

STP was the root of all evils, as suspected. Back when this whole thing got started I tried using RSTP, but couldn't get it working to my satisfaction. Apparently when I changed it back to standard STP, I didn't double-check that the ports in question had FastStart enabled, or shortTimeout for LACP. I know I checked them at least once, but the fix was to ensure they were set correctly. That explains why the MLTs worked (only dropped for about a sec) but everything else was kaput.

SO - MLTs = no STP, Ports = FastStart, LACP=shortTimeout.
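In CLI terms, that combination looks roughly like this (same syntax family as the ACG config above; the port ranges are just examples for this setup):

Code:
configure terminal
interface fastEthernet ALL
! FastStart on the end-station ports
spanning-tree port 1/17-48,2/17-48 learning fast
! short LACP timeout on the LACP trunk ports
lacp timeout-time port 1/16,2/16 short
exit
! the MLT groups themselves stay out of STP
mlt spanning-tree 1 stp 1 learning disable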

Now I can power either unit down/up and only lose a few seconds of connectivity at most. Also key to my confusion was that much of the time I was testing connectivity from the server that uses LACP. Since it wasn't working right, it made the problem appear worse. Evidence of this is that the MLT'd blade servers could talk to each other when the LACP server could not.

Thanks SYN ACK, here's an ACK to your wisdom!
 
it's still funny that spanning tree reconverged the entire switch #2 when the base went down.
hmm,
 
I don't think it did. Well, maybe it did. I'm chalking that up to testing from the LACP'd server. The real problem with the testing I was doing was that I only had a few physical boxes to play with, and I'd have to bounce between rooms to see each box after pulling a plug. That made it pretty damn complicated and VERY hard to keep every result straight. On top of that I had several other things going on at the time that weren't helping either. I must have bounced that thing 30 times trying to figure out what was going on, and now I can't honestly remember much - I'm having to re-read the thread to remember :p

At least it's working better now and I won't have to worry about HA trying to down a VM because of an outage that should be graceful.
 