Friday, 12 April 2013

LACP miscellanea

According to the statistics, a few people have stumbled across this blog while searching for 7750-specific information relating to LACP. Here are a few answers that were left out of my main LACP article:

What is the failover time for a LAG / etherchannel? 

The answer to this question varies considerably depending on the setup. If a device notices a bundled interface going physically down then it should unbundle it immediately, causing very low loss (50ms should be achievable).

If an interface remains physically up after a failure (e.g. where there is transmission equipment or EoMPLS between the two devices), also known as a silent failure, failover can take up to 3 times the LACP timer, because the peer is only declared lost after three consecutive LACPDUs have been missed. So the impact would be up to 3 seconds using fast timers (1 second interval) or up to 90 seconds using slow timers (30 second interval). Most lower-end Cisco kit only supports slow timers.
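The arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration: the function and constant names are made up for the example, and the 1 second and 30 second intervals are the standard fast and slow LACP periods.

```python
# Standard LACPDU transmit intervals, in seconds.
LACP_PERIOD_SECONDS = {"fast": 1, "slow": 30}

# A peer is timed out after this many consecutive missed LACPDUs.
MISSED_PDUS_BEFORE_TIMEOUT = 3


def worst_case_detection_seconds(timer: str) -> int:
    """Upper bound on the time taken to notice a silent failure."""
    return MISSED_PDUS_BEFORE_TIMEOUT * LACP_PERIOD_SECONDS[timer]


for timer in ("fast", "slow"):
    print(f"{timer}: up to {worst_case_detection_seconds(timer)} s")
```

This is an upper bound only; the actual loss depends on how long before the failure the last LACPDU arrived.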

In the event of a single fibre fault or other asymmetric failure, you may see a combination of these effects, where traffic in one direction heals faster than in the other. There are other corner cases, such as administratively shutting down an interface: some devices send an out-of-sync LACPDU to tell the other end that the link is about to go away, which helps speed up convergence. It is really best to lab test different failure scenarios where possible.

Be aware that when using load-balanced LAGs, the impact on some streams may be zero. Typically, traffic that is hashed onto one link in the bundle will not suffer loss when a different link in the bundle fails.
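To illustrate why only some streams are hit, here is a rough Python sketch of hash-based link selection. The CRC32-modulo hash, the port names and the flow tuples are all assumptions made for the example; real devices use their own vendor-specific hashing algorithms and input fields.

```python
import zlib


def pick_link(flow, links):
    """Map a flow to one member link using a deterministic hash."""
    key = "|".join(str(field) for field in flow).encode()
    return links[zlib.crc32(key) % len(links)]


# Hypothetical bundle members and flows (src IP, dst IP, protocol, src port, dst port).
links = ["port-1", "port-2", "port-3", "port-4"]
flows = [("10.0.0.1", f"10.0.1.{n}", 6, 49152 + n, 443) for n in range(12)]

failed = "port-2"
before = {flow: pick_link(flow, links) for flow in flows}

# Only flows that were hashed onto the failed link are interrupted.
hit = [flow for flow, link in before.items() if link == failed]
untouched = [flow for flow, link in before.items() if link != failed]

print(f"{len(hit)} flows were on {failed} and must fail over")
print(f"{len(untouched)} flows were on other links and see no loss")
```

Whether the untouched flows stay on the same link after the failure depends on how the device redistributes its hash buckets, which this sketch does not attempt to model.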

How can I transport, rather than terminate, LACP through epipe services on the Alcatel-Lucent 7750? 

The answer to this is pretty straightforward, but I know I looked in the wrong place when I first needed to use the feature.

All you need to do is to configure "lacp-tunnel" under the configure -> port -> ethernet context.

How can I transport, rather than terminate, LACP through a QinQ tunnel on a Cisco switch? 

Again, this is pretty straightforward. There are a load of different protocols that can optionally be tunneled on a dot1q-tunnel port, but we just need LACP enabled:

Simply configure "l2protocol-tunnel point-to-point lacp" under the dot1q-tunnel interface.

Normally it makes sense to tunnel everything (STP, CDP, VTP, LLDP, LACP, PAgP, ...) for consistency. Either be a tunnel or don't!

What is the valid range for LAG IDs on the 7750?

For IOM-based systems (i.e. SR-7, SR-12), the usable LAG ID range is 1 to 200. For integrated IOM systems such as SR-1 and ESS-1, the LAG ID range is 1 to 64.

Can you tell me something unusual about LACP on the 7750?

When the Rx fibre for an LACP-speaking port loses light (i.e. fails), right before the port gets pulled down the 7750 sends an LACP out-of-sync message to inform the other end that it is going away. This is useful for single fibre faults and can drastically improve convergence times, particularly where transmission equipment between the two LACP peers does not forward link loss.

What does "mux: during state COLLECTING_DISTRIBUTING, got event 5(in_sync) (ignored)" mean in a Cisco debug?

As best I can tell, it means that an LACPDU was received with the sync bit set, indicating that the far end is ready to use the link, but the link was already collecting / distributing (i.e. in use), so no change in state was required.

What does "lag number : partner oper state bits changed on member port : [expired false -> true]" mean on a 7750 debug?

This means that a particular port's state machine moved into the expired state due to missing three inbound LACPDUs from the peer. Once the port reaches the expired state it is removed from the bundle but the peer parameters are remembered for a further 3 intervals, after which point the peer information is flushed and the port enters the defaulted state.
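The sync and expired flags mentioned in these debug messages are individual bits of the one-octet state field carried in every LACPDU. Below is a small Python sketch for decoding that octet when reading a packet capture. The bit order follows IEEE 802.3ad; the function name and the sample values are just for illustration.

```python
# Flags in the LACP actor/partner state octet, least significant bit first.
STATE_BITS = (
    "activity",         # bit 0: 1 = active, 0 = passive
    "timeout",          # bit 1: 1 = short (fast timers), 0 = long (slow timers)
    "aggregation",      # bit 2: link may be aggregated
    "synchronization",  # bit 3: the "sync" bit
    "collecting",       # bit 4
    "distributing",     # bit 5
    "defaulted",        # bit 6: using default partner information
    "expired",          # bit 7: receive state machine is in the expired state
)


def decode_state(octet: int) -> dict:
    """Split an LACP state octet into named boolean flags."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("state must fit in one octet")
    return {name: bool((octet >> bit) & 1) for bit, name in enumerate(STATE_BITS)}


# 0x3D: active, slow timers, in sync, collecting and distributing.
healthy = decode_state(0x3D)
print([name for name, is_set in healthy.items() if is_set])

# 0x87: active, fast timers, aggregatable, expired and no longer in sync.
expired = decode_state(0x87)
print([name for name, is_set in expired.items() if is_set])
```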

Do the LACP keys need to match at both ends of a LAG?

No - the LACP key is locally significant and corresponds one-to-one with a LAG or etherchannel ID. It is used to check consistency (i.e. to catch crossed cables), so it must be identical for all members within a LAG. However, the device at each end can select any value it likes for any particular LAG.
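As a rough sketch of how the key catches crossed cables: a device can only bundle ports that share the same local key and that also see the same partner system and partner key. The Python below models just that grouping rule. It is a simplification (a real implementation compares more fields, such as system priority), and the port names, MAC address and key values are invented for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class PortInfo(NamedTuple):
    name: str
    actor_key: int       # our own key for this port
    partner_system: str  # system ID learned from the peer's LACPDUs
    partner_key: int     # key learned from the peer's LACPDUs


def group_into_bundles(ports):
    """Group ports that are allowed to aggregate with each other."""
    bundles = defaultdict(list)
    for port in ports:
        bundle_id = (port.actor_key, port.partner_system, port.partner_key)
        bundles[bundle_id].append(port.name)
    return dict(bundles)


ports = [
    PortInfo("1/1/1", 32768, "00:11:22:33:44:55", 7),
    PortInfo("1/1/2", 32768, "00:11:22:33:44:55", 7),
    # Crossed cable: this port landed in a different LAG on the peer.
    PortInfo("1/1/3", 32768, "00:11:22:33:44:55", 9),
]

for bundle_id, members in group_into_bundles(ports).items():
    print(bundle_id, members)
```

Note that the local key (32768) and the partner key (7) differ, yet 1/1/1 and 1/1/2 still bundle happily. Only 1/1/3, which reports a different partner key, is kept apart.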

Why do ports get "suspended" from an etherchannel?

Basically a port gets suspended if its configuration is not in line with that of the port channel with which it is associated. The most common way to accidentally arrive in this state is for a member trunk port to have a different allowed VLAN list than its parent port-channel interface. While IOS allows member ports to be reconfigured, it is much more sensible to make the configuration changes to the port-channel interface - the changes are then pushed down to the member ports automatically, avoiding this kind of conflict.