Pretty much anything that has a state is logged somewhere in the Nexus and you can get lost in a labyrinth of cryptic troubleshooting messages related to virtually any process in the switch. In this post I'm focusing on spanning tree logs as they're pretty universal.
Imagine the scenario shown below:
We have three sites connected over the WAN. We blew the budget on dark fibres out of the Cardiff site so we've had to skimp on switches and only have one per site, with the Caerphilly switch being root bridge. The link between Caerphilly and Newport is a metro Ethernet circuit which doesn't forward link loss.
Now imagine there's a failure within the carrier network which results in a total loss of traffic across the circuit between Caerphilly and Newport. No ports go down; however, after a short time spanning tree will notice that BPDUs have stopped arriving and converge to use the indirect route via Cardiff. If the user's port is in p2p mode rather than edge, he is going to see a 30 second outage (two 15 second forward-delay intervals with default timers) while his port transitions back to forwarding, even with RSTP.
How would you even know this had happened (aside from users complaining bitterly)? If you're really wily you may notice your traffic statistics look a bit odd, but if the primary link is restored relatively quickly, that kind of thing gets lost in 5-minute roll-ups and natural variation quite easily. Since no interfaces went down, there will be nothing in your logs (by default).
Luckily, the Nexus logs every STP port state transition in its event history and keeps them seemingly forever. If the link flapped 6 months ago there's a good chance you could still prove it, as long as you haven't rebooted the switch. These logs can be retrieved using the command show spanning-tree internal event-history all - note that it's pretty verbose and you probably want to narrow it down if you have a lot of VLANs. The first section for each STP instance is the overall state history, mostly concerned with who the root is and how it is best reached:
Newport# show spanning-tree internal event-history all | begin VLAN0055
VDC01 VLAN0055
<snip>
77) Transition at 643104 usecs after Tue Jul 7 07:44:47 2015
Root: 8037.000c.a45e.321c Cost: 0 Age: 0 Root Port: none Port: none [STP_TREE_EV_MULTI_FLUSH_LOCAL]
78) Transition at 762615 usecs after Tue Jul 7 07:44:49 2015
Root: 8037.000c.a45e.321c Cost: 0 Age: 0 Root Port: none Port: Ethernet1/1 [STP_TREE_EV_UPDATE_TOPO_RCVD_SUP_BPDU]
79) Transition at 763013 usecs after Tue Jul 7 07:44:49 2015
Root: 8037.000c.ac6d.43ba Cost: 4 Age: 0 Root Port: Ethernet1/1 Port: none [STP_TREE_EV_MULTI_FLUSH_LOCAL]
80) Transition at 722769 usecs after Tue Jul 7 07:44:51 2015
Root: 8037.000c.ac6d.43ba Cost: 4 Age: 1 Root Port: Ethernet1/1 Port: Ethernet1/1 [STP_TREE_EV_MULTI_FLUSH_RCVD]
81) Transition at 832764 usecs after Tue Jul 7 07:44:51 2015
Root: 8037.000c.ac6d.43ba Cost: 4 Age: 1 Root Port: Ethernet1/1 Port: Ethernet1/2 [STP_TREE_EV_MULTI_FLUSH_RCVD]
82) Transition at 752841 usecs after Tue Jul 7 07:44:52 2015
Root: 8037.000c.ac6d.43ba Cost: 4 Age: 1 Root Port: Ethernet1/1 Port: Ethernet1/2 [STP_TREE_EV_MULTI_FLUSH_RCVD]
83) Transition at 782964 usecs after Tue Jul 7 07:44:53 2015
Root: 8037.000c.ac6d.43ba Cost: 4 Age: 1 Root Port: Ethernet1/1 Port: Ethernet1/1 [STP_TREE_EV_MULTI_FLUSH_RCVD]
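The logs are quite verbose, but the "Root Port: none" entries (77 and 78) make it clear that the primary path to the root was lost: with a cost of 0 and no root port, Newport was acting as root itself. Entry 79 shows the path re-gained within a few seconds, with the root reached via Ethernet1/1 at a cost of 4. Just a minor flap within the carrier network and a few seconds' impact?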
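If you have a lot of VLANs, picking root changes out of this output by eye gets tedious. As a rough sketch (not a Cisco tool), the Python below parses tree-level entries in the format shown above and reports each point where the root bridge ID changed. The function names parse_tree_history and root_changes are made up for illustration, and the regular expressions assume the exact line layout seen in this capture; other NX-OS releases may format it differently.

```python
import re
from datetime import datetime, timedelta

# Header line of each entry, e.g.
# "78) Transition at 762615 usecs after Tue Jul 7 07:44:49 2015"
HEADER = re.compile(r"^\s*(\d+)\) Transition at (\d+) usecs after (.+?)\s*$")
# Detail line of a tree-level entry (layout assumed from the capture above).
TREE = re.compile(
    r"Root: (\S+) Cost: (\d+) Age: (\d+) Root Port: (\S+) Port: (\S+) \[(\w+)\]"
)

# Entries 77-79 copied from the capture above.
SAMPLE = """\
77) Transition at 643104 usecs after Tue Jul 7 07:44:47 2015
Root: 8037.000c.a45e.321c Cost: 0 Age: 0 Root Port: none Port: none [STP_TREE_EV_MULTI_FLUSH_LOCAL]
78) Transition at 762615 usecs after Tue Jul 7 07:44:49 2015
Root: 8037.000c.a45e.321c Cost: 0 Age: 0 Root Port: none Port: Ethernet1/1 [STP_TREE_EV_UPDATE_TOPO_RCVD_SUP_BPDU]
79) Transition at 763013 usecs after Tue Jul 7 07:44:49 2015
Root: 8037.000c.ac6d.43ba Cost: 4 Age: 0 Root Port: Ethernet1/1 Port: none [STP_TREE_EV_MULTI_FLUSH_LOCAL]
"""


def parse_tree_history(text):
    """Return a list of dicts, one per tree-level transition, in log order."""
    events, when = [], None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            # Whole-second timestamp plus the microsecond offset.
            stamp = datetime.strptime(m.group(3), "%a %b %d %H:%M:%S %Y")
            when = stamp + timedelta(microseconds=int(m.group(2)))
            continue
        m = TREE.search(line)
        if m and when is not None:
            events.append({
                "time": when,
                "root": m.group(1),
                "cost": int(m.group(2)),
                "root_port": m.group(4),
                "event": m.group(6),
            })
            when = None
    return events


def root_changes(events):
    """Yield (time, old_root, new_root) whenever the root bridge ID changes."""
    for prev, cur in zip(events, events[1:]):
        if prev["root"] != cur["root"]:
            yield cur["time"], prev["root"], cur["root"]


for when, old, new in root_changes(parse_tree_history(SAMPLE)):
    print(f"{when}: root changed from {old} to {new}")
```

In practice you would feed it the saved output of the show command rather than a pasted sample; the parsing is deliberately strict so that unexpected lines are skipped rather than misread.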
Below the main state history are the individual port histories. Let's look at our user's port and see what happened there:
VDC01 VLAN0055 <Ethernet1/10>
<snip>
7) Transition at 762694 usecs after Tue Jul 7 07:44:49 2015
State: BLK Role: Desg Age: 2 Inc: no [STP_PORT_MULTI_STATE_CHANGE]
8) Transition at 640356 usecs after Tue Jul 7 07:45:04 2015
State: LRN Role: Desg Age: 2 Inc: no [STP_PORT_STATE_CHANGE]
9) Transition at 642846 usecs after Tue Jul 7 07:45:19 2015
State: FWD Role: Desg Age: 2 Inc: no [STP_PORT_STATE_CHANGE]
Oh. Right at the same time as the WAN dropped out, our user's port went into blocking for 15s then learning for another 15 before finally transitioning to forwarding again. Ouch... and we never would have known were it not for the STP event history!
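To put a number on the outage rather than eyeballing timestamps, something like the following Python sketch could be used. It parses port-level entries in the format shown above and measures each stretch the port spent out of forwarding. The helper names parse_port_history and outage_windows are hypothetical, and the state label FWD and the line layout are taken from this capture only, so treat them as assumptions for any other software version.

```python
import re
from datetime import datetime, timedelta

# Header line of each entry, e.g.
# "7) Transition at 762694 usecs after Tue Jul 7 07:44:49 2015"
HEADER = re.compile(r"^\s*\d+\) Transition at (\d+) usecs after (.+?)\s*$")
# Detail line of a port-level entry (layout assumed from the capture above).
PORT = re.compile(r"State: (\S+) Role: (\S+) Age: (\d+) Inc: (\S+) \[(\w+)\]")

# Entries 7-9 for Ethernet1/10, copied from the capture above.
SAMPLE = """\
7) Transition at 762694 usecs after Tue Jul 7 07:44:49 2015
State: BLK Role: Desg Age: 2 Inc: no [STP_PORT_MULTI_STATE_CHANGE]
8) Transition at 640356 usecs after Tue Jul 7 07:45:04 2015
State: LRN Role: Desg Age: 2 Inc: no [STP_PORT_STATE_CHANGE]
9) Transition at 642846 usecs after Tue Jul 7 07:45:19 2015
State: FWD Role: Desg Age: 2 Inc: no [STP_PORT_STATE_CHANGE]
"""


def parse_port_history(text):
    """Return [(timestamp, state, role), ...] in log order."""
    events, when = [], None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            # Whole-second timestamp plus the microsecond offset.
            stamp = datetime.strptime(m.group(2), "%a %b %d %H:%M:%S %Y")
            when = stamp + timedelta(microseconds=int(m.group(1)))
            continue
        m = PORT.search(line)
        if m and when is not None:
            events.append((when, m.group(1), m.group(2)))
            when = None
    return events


def outage_windows(events, forwarding="FWD"):
    """Return (start, end, duration) for each stretch spent out of forwarding."""
    windows, start = [], None
    for when, state, _role in events:
        if state != forwarding and start is None:
            start = when  # port has just left forwarding
        elif state == forwarding and start is not None:
            windows.append((start, when, when - start))
            start = None
    return windows


for start, end, duration in outage_windows(parse_port_history(SAMPLE)):
    print(f"not forwarding from {start} to {end} ({duration.total_seconds():.1f}s)")
```

A window that is still open at the end of the log (the port never returned to forwarding) is simply not reported, which keeps the sketch short but is worth bearing in mind when reading the result.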
Side Note
You can save yourself the effort of reading the incredibly verbose event history by setting the logging level of spanning tree to something more useful, such as informational:
Newport(config)# logging level spanning-tree 6
Note that the logging level for the local buffer or syslog server also needs to be set to at least the same severity (6, informational), otherwise the newly verbose messages will be filtered out before they are recorded.
Also, user ports should be forced into edge mode to avoid STP convergence causing massive disruption to them:
Newport(config-if)# spanning-tree port type edge
The switch should "guess" correctly, but it's probably best not to take the chance that a user port accidentally goes into p2p mode.