Friday, 21 December 2012

All sorts of things about LACP and LAGs

A lot of people consider link aggregation groups (LAG / etherchannel / portchannel / MLT) to be pretty basic functionality that "just works" and don't really think any more about it. As with many networking technologies, there is a lot of intelligence responsible for creating the smooth veneer of simplicity.

The basic concept of the LAG is that multiple physical ports are combined into one logical bundle. This provides benefits including:
  • Increased capacity - traffic may be balanced across the member ports to provide increased aggregate throughput
  • Link redundancy - the LAG bundle can survive the loss of one or more member links
LAGs may be statically configured or signalled using standards based LACP, which is the main focus of this post. There is also the Port Aggregation Protocol (PAgP), which is similar in many regards to LACP, but is Cisco proprietary and not in common usage. I won't discuss PAgP in this post.

Load Balancing Operation

One important point to bear in mind with LAGs is that traffic is not dynamically assigned across member links but rather is "sprayed" using a deterministic hash algorithm. Depending on the platform and configuration, a number of parameters may feed into the algorithm including:
  • Source and/or destination MAC address
  • Source and/or destination IP address
  • Source and/or destination TCP / UDP port numbers
  • Ingress interface
  • Service ID or MPLS label
  • System specific information (chassis MAC or system IP)
Ultimately the hash will take in some combination of parameters and decide onto which member link the frame should be placed. Note that, since all the input to the algorithm is either permanently static (e.g. chassis MAC) or static for a given flow (e.g. source and destination MAC), all traffic for a particular flow will always be placed onto the same link. This has the following effects:
  • Order is maintained for frames within a flow - the different member links, particularly on a WAN, may have different delay characteristics. If frames for a single flow were sprayed onto multiple member links, frames could be re-ordered in transit.
  • Traffic for a single flow cannot exceed the bandwidth of a single member link.
  • Traffic balance across member links is largely dependent on the diversity of the offered traffic. If the number of flows is low, some links may be saturated while others are under-utilised. The same effect can be seen if there are many flows but load is proportionally concentrated in just a few of them.
  • When traffic passes through multiple hops using LAGs at each stage, polarisation can occur. This is where repeated application of the same hash function at each hop causes traffic to become unevenly distributed across the links. One link may be running at 100% and dropping excess traffic while another is almost idle. Passing system specific information into the algorithm is designed to mitigate this by ensuring that each hop hashes in a slightly different way.
  • Upstream and downstream traffic for a single flow will not necessarily traverse the same link. Since the devices at each end of a LAG hash traffic independently, there is no guarantee that both legs of a conversation will pass along the same member link.
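
As a rough illustration of the points above, here is a minimal Python sketch of hash-based link selection. The choice of CRC32, the function name pick_member_link and the particular fields fed into the hash are all assumptions made for the example; real platforms use their own (often undocumented) hash functions and inputs.

```python
import zlib

def pick_member_link(src_mac: str, dst_mac: str, chassis_mac: str, num_links: int) -> int:
    """Return the index of the member link a frame would be placed on."""
    # Per-flow fields plus a per-system value (the chassis MAC) feed the hash.
    # CRC32 is only a stand-in for whatever hash a real platform uses.
    key = f"{src_mac}|{dst_mac}|{chassis_mac}".encode()
    return zlib.crc32(key) % num_links

# Eight hypothetical flows towards one destination, hashed across a 2-port LAG.
flows = [("00:11:22:33:44:%02x" % i, "00:aa:bb:cc:dd:01") for i in range(8)]
for src, dst in flows:
    link = pick_member_link(src, dst, "00:0a:aa:2e:af:ea", 2)
    print(f"{src} -> {dst}: member link {link}")
```

Every frame of a given flow produces the same hash input and therefore lands on the same link, which is why ordering is preserved and why one flow can never use more than one link's worth of bandwidth. Changing the chassis MAC changes the mapping, which is the idea behind using system specific information to reduce polarisation.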

Active / Standby Operation

In addition to the "normal" load balancing mode of operation, it is also possible to configure a LAG to operate in an active/standby fashion. In fact, it is possible to combine the two modes and have an arbitrary number of links active and passing traffic while an arbitrary number remain on standby pending a fault on the active link(s).

Active / standby groups are generally used when resilience is required, but it is not desirable for the LAG to pass more than a certain amount of traffic or for the available bandwidth to vary. Typical use cases are service provider environments where the customer only pays for a certain bandwidth and corporate networks with a highly over-subscribed core.

Rules for LAGs

In order to be able to aggregate ports together certain rules must be obeyed. Fundamentally, the member ports must be homogeneous, but more specifically every member port must agree on the following:
  • Speed & Duplex - Since traffic is distributed by a simple hash, it is not possible to combine links of different speeds in the same bundle.
  • Encapsulation - i.e. all ports must use the same number of 802.1Q VLAN tags. For switches this means they must all be access or all be trunk. For routers such as the 7750 this means that the Ethernet encap type (null, dot1q or qinq) must agree between members. For switches in access mode, all member ports must be in the same VLAN.
  • For the 7750, the port type (access, network or hybrid) must agree across members and for the LAG
  • MTU - all member port MTUs must match and for Cisco switches, the same MTU must be configured on the port channel.
Note: the physical media type, i.e. copper or fibre, does not necessarily need to match between all LAG members.

Static Configuration

The simplest method of building a LAG does not involve any signalling or protocols at all and simply specifies the member ports to be aggregated. Here's an example of doing that on two different platforms:

Alcatel-Lucent 7750:
A:7750# configure port 2/2/[19..20] ethernet mode access
*A:7750# configure port 2/2/[19..20] ethernet autonegotiate limited
*A:7750# configure port 2/2/[19..20] no shutdown
*A:7750# configure lag 1
*A:7750>config>lag$ mode access
*A:7750>config>lag$ port 2/2/19
*A:7750>config>lag$ port 2/2/20
*A:7750>config>lag$ no shutdown

Cisco 2950:
2950#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
2950(config)#int range fa0/19 - 20
2950(config-if-range)#switchport mode access
2950(config-if-range)#channel-group 1 mode on
Creating a port-channel interface Port-channel 1

2950(config-if-range)#no shut

In this setup, as soon as a port becomes physically up it becomes a member of the LAG bundle. The only, fairly minor, advantage of this is that the configuration is very simple. The disadvantage is that there is no method to detect any kind of cabling or configuration errors.

Note: The lack of any kind of misconfiguration detection makes static LAGs very dangerous to deploy in production networks.


LACP is the standards based protocol used to signal LAGs. It detects and protects the network from a variety of misconfiguration and fault conditions, ensuring that links are only aggregated into a bundle if they are consistently configured and cabled.

LACP must be configured in one of two modes:
  • Active mode - the device immediately sends LACP messages (LACPDUs) when the port comes up and must reach an agreement with the attached port before traffic will pass.
  • Passive mode - the device does not generate LACPDUs until it receives them. If no LACPDUs are received then the port aggregates as though statically configured. If LACPDUs are received then an agreement must be reached with the peer before traffic will pass.
In practice it is rare to find passive mode used in any properly designed network as it should be clearly and consistently defined which links will use LACP ahead of deployment.

Minimal LACP configuration

The minimal configuration is still very straightforward, requiring little additional CLI:

Alcatel-Lucent 7750:
A:7750# configure port 2/2/[19..20] ethernet mode access
*A:7750# configure port 2/2/[19..20] ethernet autonegotiate limited
*A:7750# configure port 2/2/[19..20] no shutdown
*A:7750# configure lag 1
*A:7750>config>lag$ mode access
*A:7750>config>lag$ lacp active

*A:7750>config>lag$ port 2/2/19
*A:7750>config>lag$ port 2/2/20
*A:7750>config>lag$ no shutdown

Cisco 2950:
2950#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
2950(config)#int range fa0/19 - 20
2950(config-if-range)#switchport mode access
2950(config-if-range)#channel-group 1 mode active
Creating a port-channel interface Port-channel 1

2950(config-if-range)#no shut
There is, of course, a lot more going on behind the scenes but most parameters assume default values which are perfectly acceptable for most situations.

LACP Terms and Parameters

There are a number of LACP-specific terms and parameter names that must be understood in order to make sense of LACP debug output and packet traces.

The first and arguably most fundamental concept is that of actors and partners. One of the really nice debugging features of LACP is that it echoes the parameters it receives back to the sender. To avoid confusion, the term actor is used to designate the parameters and flags pertaining to the sending node, while the term partner is used to designate the sending node's view of its peer's parameters and flags.

Per System:
Each network device has a LACP System ID. This is a 48 bit value which generally defaults to the chassis MAC address. The system ID is sent within every LACPDU and makes it easy to check that a LAG goes to the device you expect.

Each device also has a 16 bit LACP System Priority. If the two peers disagree about which links should be active and which standby, the system priority decides whose port priorities are honoured. The lowest numerical value wins.
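
The comparison can be sketched in a few lines of Python. The function name controlling_system is made up for the example, and the use of the system ID as a tie-breaker when priorities are equal is an assumption based on how the standard compares system identifiers. The values are the ones that appear in the 7750 output later in this post.

```python
def controlling_system(local, remote):
    """Each argument is a (system_priority, system_id) pair.

    The numerically lowest priority wins. If the priorities are equal,
    the lower system ID is assumed to break the tie.
    """
    def sort_key(system):
        priority, system_id = system
        return (priority, int(system_id.replace(":", ""), 16))

    return local if sort_key(local) <= sort_key(remote) else remote

local = (40960, "00:0a:aa:2e:af:ea")    # the 7750 in the later example
remote = (32768, "00:12:da:ab:fe:21")   # its partner
print(controlling_system(local, remote))
```

Here the partner has the lower priority value, so its port priorities would decide which links are active.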

Per LAG:
Each LAG on a system will have a unique 16 bit LACP key, the purpose of which is to differentiate one LAG from another within the protocol. This number is locally significant and may or may not match between peers. The main purpose of the LACP key is to allow a system to detect cabling faults - if different LACP keys are received on members of the same LAG then we are connected to two different LAGs at the far end and, obviously, aggregating those together would be a bad idea.

LACP Flags:
The following flags are used to communicate state between systems:
  • Activity - Set to indicate LACP active mode, cleared to indicate passive mode
  • Timeout - Set to indicate the device is requesting a fast (1s) transmit interval of its partner, cleared to indicate that a slow (30s) transmit interval is being requested.
  • Aggregation - Set to indicate that the port is configured for aggregation (typically always set)
  • Synchronisation - Set to indicate that the system is ready and willing to use this link in the bundle to carry traffic. Cleared to indicate the link is not usable or is in standby mode.
  • Collecting - Set to indicate that traffic received on this interface will be processed by the device. Cleared otherwise.
  • Distributing - Set to indicate that the device is using this link to transmit traffic. Cleared otherwise.
  • Expired - Set to indicate that no LACPDUs have been received by the device during the past 3 intervals. Cleared when at least one LACPDU has been received within the past three intervals.
  • Defaulted - When set, indicates that no LACPDUs have been received during the past 6 intervals. Cleared when at least one LACPDU has been received within the past 6 intervals. Once the defaulted flag transitions to set, any stored partner information is flushed. 
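
On the wire these eight flags are packed into a single state byte, which is what shows up as "Port State" values such as 0x3D and 0x3F in the Cisco output later in this post. A minimal Python sketch of the decode follows; the bit positions follow the order in which the standard defines the flags (activity in the least significant bit, expired in the most significant), and the names FLAG_BITS and decode_port_state are just for illustration.

```python
# Bit masks for the LACP state byte, least significant bit first.
FLAG_BITS = [
    ("activity", 0x01),
    ("timeout", 0x02),
    ("aggregation", 0x04),
    ("synchronisation", 0x08),
    ("collecting", 0x10),
    ("distributing", 0x20),
    ("defaulted", 0x40),
    ("expired", 0x80),
]

def decode_port_state(state: int) -> dict:
    """Map each flag name to True (set) or False (cleared)."""
    return {name: bool(state & mask) for name, mask in FLAG_BITS}

for value in (0x3D, 0x3F):
    flags = decode_port_state(value)
    print(hex(value), [name for name, is_set in flags.items() if is_set])
```

The only difference between 0x3D and 0x3F is the timeout bit: 0x3D is requesting slow LACPDUs while 0x3F is requesting fast ones.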

Bringing Links into Service

Assuming that the local configuration is consistent and LACPDUs are being exchanged across the link, the following flow chart roughly describes how to decide the value of the synchronisation, distributing and collecting flags.

If by the end your collecting / distributing flags are set then the link will be used for sending and receiving traffic. If not, it won't.

LACP Fault Detection

LACP can detect almost every conceivable patching error and will refuse to aggregate when that would be inappropriate. Following are a number of improper LAG topologies along with a description of how LACP detects and protects the network against them.

Split LAG

In the above scenario, LACP inspects the system ID field of incoming LACPDUs and refuses to aggregate any links whose system ID does not match that of the existing member(s).

Crossed LAGs

In the above scenario, LACP detects the cabling fault by inspecting the key ID on the incoming LACPDUs and refuses to aggregate any links whose key does not match that of the existing member(s).

Looped LAG

In the above scenario, LACP detects the cabling fault by inspecting the system ID and key of the incoming LACPDU. Some systems (e.g. Alcatel-Lucent 7750) allow different LAGs to be interconnected on the same chassis, however it is never allowed for two member ports of the same LAG to be connected.
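
The three checks above can be summarised in a short Python sketch. This is only a simplified model of the decision, not how any particular implementation is written; the function name check_candidate_link and the return strings are invented for the example, and the values are borrowed from the outputs shown later in this post.

```python
from typing import Optional, Tuple

# (system ID, key) as carried in the actor fields of a received LACPDU
Peer = Tuple[str, int]

LOCAL: Peer = ("00:0a:aa:2e:af:ea", 32777)   # our own system ID and LAG key
PARTNER: Peer = ("00:12:da:ab:fe:21", 1)     # learned from existing member(s)

def check_candidate_link(local: Peer, existing_partner: Optional[Peer], received: Peer) -> str:
    """Classify a link that is asking to join a LAG."""
    if received == local:
        # Our own system ID and key came back: two ports of the same LAG are connected.
        return "looped"
    if existing_partner is None:
        # First member of the bundle, nothing to compare against yet.
        return "ok"
    if received[0] != existing_partner[0]:
        return "split"      # a different system at the far end
    if received[1] != existing_partner[1]:
        return "crossed"    # same system, but a different LAG on it
    return "ok"

print(check_candidate_link(LOCAL, PARTNER, ("00:12:da:ab:fe:21", 1)))
print(check_candidate_link(LOCAL, PARTNER, ("00:03:ab:cd:aa:a1", 1)))
print(check_candidate_link(LOCAL, PARTNER, ("00:12:da:ab:fe:21", 2)))
print(check_candidate_link(LOCAL, PARTNER, LOCAL))
```

Note that a LACPDU carrying our own system ID but a different key is not treated as a loop here, which matches the behaviour of systems that allow two different LAGs on the same chassis to be interconnected.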

Unidirectional Link Failure

In the scenario above, a unidirectional link failure has occurred so that LACPDUs are being lost in the direction A to B, but the ports remain physically up. LACPDUs that are lost are indicated in grey. In this situation, system B responds to the loss of three consecutive LACPDUs by clearing its synchronisation, collecting and distributing flags and setting its expired flag. System A responds immediately to the loss of sync by clearing its synchronisation, collecting and distributing flags.

LACP Troubleshooting

The most important part of troubleshooting LAGs is to properly understand the meaning and purpose of all the parameters, particularly the flags, before you begin. After that point, it is just a matter of knowing what CLI commands will show you the required information.

I recommend starting with the basics and working up:
  • Are the member ports physically up?
  • Are all member ports configured consistently (see LAG Rules above)?
  • Can you be sure the topology is as we expect?
    • Use LLDP or CDP if available
    • Use system ID, key and port ID values from the LACPDUs otherwise
  • Determine which end is unhappy (hint, it won't be sending sync).
  • Verify that messages are passing bi-directionally and are not being blocked by any kind of filter (hint, check that the partner details are populated on LACPDUs)
After following these checks you should be able to trace 95% of LAG problems. I, personally, prefer to check the flags, etc, using a packet capture. But then I would, because that's my answer to everything. Below are some CLI methods to gather the same information.

Alcatel-Lucent 7750

To get almost all the information you could ever want, use "show lag [number] detail":

A:7750# show lag 1 detail
LAG Details
Description        : N/A
Lag-id              : 1                     Mode                 : access
Adm                 : up                    Opr                  : up
Thres. Exceeded Cnt : 2                     Port Threshold       : 0
Thres. Last Cleared : 12/21/2012 10:59:59   Threshold Action     : down
Dynamic Cost        : false                 Encap Type           : null
Configured Address  : 00:0a:aa:2e:af:ea     Lag-IfIndex          : 1342177281
Hardware Address    : 00:0a:aa:2e:af:ea     Adapt Qos (access)   : distribute
Hold-time Down      : 0.0 sec               Port Type            : standard
Per FP Ing Queuing  : disabled
LACP                : enabled               Mode                 : active
LACP Transmit Intvl : fast                  LACP xmit stdby      : enabled
Selection Criteria  : highest-count         Slave-to-partner     : disabled
Number of sub-groups: 1                     Forced               : -
System Id           : 00:0a:aa:2e:af:ea     System Priority      : 40960
Admin Key           : 32777                 Oper Key             : 32777
Prtr System Id      : 00:12:da:ab:fe:21     Prtr System Priority : 32768
Prtr Oper Key       : 1
Standby Signaling   : lacp

Port-id        Adm     Act/Stdby Opr     Primary   Sub-group     Forced  Prio
2/2/19         up      active    up      yes       1             -       32768
2/2/20         up      active    up                1             -       32768

Port-id        Role      Exp   Def   Dist  Col   Syn   Aggr  Timeout  Activity
2/2/19         actor     No    No    Yes   Yes   Yes   Yes   Yes      Yes
2/2/19         partner   No    No    Yes   Yes   Yes   Yes   No       Yes
2/2/20         actor     No    No    Yes   Yes   Yes   Yes   Yes      Yes
2/2/20         partner   No    No    Yes   Yes   Yes   Yes   No       Yes

In this output you can see the local and remote flags, system IDs, system priorities and keys in use, whether the underlying ports are functioning and, if sub-groups are in use, whether local ports are active or standby. Note also that it shows you which port in the LAG is primary - if you want to edit anything such as MTU, QoS, etc, then you need to do it on the primary port. Your changes will then be pushed to the other ports automatically.

If you need to verify that LACPDUs are being received, you can use "debug lag [lag-id number] [port port-id] pkt". This will produce a debug message for every LACPDU sent or received, optionally filtered by LAG or by individual port:

A:7750# debug lag lag-id 1 pkt
980 2012/12/21 21:23:56.73 GMT MINOR: DEBUG #2001 Base LAG
Xmit LACPDU on PortId 2/2/19"

981 2012/12/21 21:23:56.80 GMT MINOR: DEBUG #2001 Base LAG
LACPDU rcvd on PortId 2/2/19"

A little light on detail, admittedly, but enough to prove whether they are arriving or not.

For more interactive debugging, a better choice might be "debug lag [lag-id number] [port port-id] sm" to indicate what is happening to the state machine for a given lag or port:

A:7750# debug lag lag-id 1 sm
852 2012/12/21 18:55:37.67 GMT MINOR: DEBUG #2001 Base LAG
LagId 1: partner oper state bits changed on member 2/2/20 : [sync FALSE -> TRUE]

853 2012/12/21 18:55:37.67 GMT MINOR: DEBUG #2001 Base LAG
LagId 1 mem. 2/2/20 :triggerMap 0 -> e after Rx SM"

854 2012/12/21 18:55:37.67 GMT MINOR: DEBUG #2001 Base LAG
LagId 1 mem. 2/2/20 :running selection logic"

855 2012/12/21 18:55:37.67 GMT MINOR: DEBUG #2001 Base LAG

The above is quite verbose as it generates state machine transitions every time a LACPDU is sent or received, but it is really the best way to troubleshoot state transitions.

Cisco 2950

There are a few LACP related show commands on IOS and the useful information is spread between them. Starting at the simple end, a high level overview of the LAGs on the system can be obtained using the command "show etherchannel":

2950#show etherchannel
                Channel-group listing:

Group: 1
Group state = L2
Ports: 2   Maxports = 16
Port-channels: 1 Max Port-channels = 16
Protocol:   LACP


To find the local LACP system ID, use "show lacp sys-id":

2950#show lacp sys-id 

Note that the part before the comma is actually the system priority.

Useful information about the remote device (our partner) can be found using "show lacp neighbor":

2950#show lacp neighbor 
Flags:  S - Device is requesting Slow LACPDUs
        F - Device is requesting Fast LACPDUs
        A - Device is in Active mode       P - Device is in Passive mode

Channel group 1 neighbors
Partner's information:
                  LACP port                        Oper    Port     Port
Port      Flags   Priority  Dev ID         Age     Key     Number   State
Fa0/19    FA      32768     0003.abcd.aaa1   3s    0x8009  0x8894   0x3F
Fa0/20    FA      32768     0003.abcd.aaa1   3s    0x8009  0x8893   0x3F

This shows some useful information such as the timeout and activity flags, plus it allows you to verify the LACP keys being received on each port for consistency. If you need more information, add the "detail" keyword:

2950#show lacp neighbor detail
Flags:  S - Device is requesting Slow LACPDUs
        F - Device is requesting Fast LACPDUs
        A - Device is in Active mode       P - Device is in Passive mode

Channel group 1 neighbors
Partner's information:
          Partner               Partner                     Partner
Port      System ID             Port Number     Age         Flags
Fa0/19     40960,0003.abcd.aaa1  0x8894           11s        FA

          LACP Partner         Partner         Partner
          Port Priority        Oper Key        Port State
          32768                0x8009          0x3F

          Port State Flags Decode:
          Activity:   Timeout:   Aggregation:   Synchronization:
          Active      Long       Yes            Yes

          Collecting:   Distributing:   Defaulted:   Expired:
          Yes           Yes             No           No
          Partner               Partner                     Partner
Port      System ID             Port Number     Age         Flags
Fa0/20     40960,0003.abcd.aaa1  0x8893           11s        FA

          LACP Partner         Partner         Partner
          Port Priority        Oper Key        Port State
          32768                0x8009          0x3F

          Port State Flags Decode:
          Activity:   Timeout:   Aggregation:   Synchronization:
          Active      Long       Yes            Yes

          Collecting:   Distributing:   Defaulted:   Expired:
          Yes           Yes             No           No

Note that contrary to what you might expect, the "Port State Flags Decode" sections actually refer to the local flags rather than those being sent by the remote device. As you can see, in this example the remote end is requesting fast timeouts but the local end is requesting slow.

A fairly detailed overview of the local and remote state can be seen using the "show etherchannel detail" command:

2950#show etherchannel detail
                Channel-group listing:

Group: 1
Group state = L2
Ports: 2   Maxports = 16
Port-channels: 1 Max Port-channels = 16
Protocol:   LACP
                Ports in the group:
Port: Fa0/19

Port state    = Up Mstr In-Bndl
Channel group = 1           Mode = Active          Gcchange = -
Port-channel  = Po1         GC   =   -             Pseudo port-channel = Po1
Port index    = 0           Load = 0x00            Protocol =   LACP

Flags:  S - Device is sending Slow LACPDUs   F - Device is sending fast LACPDUs.
        A - Device is in active mode.        P - Device is in passive mode.

Local information:
                            LACP port     Admin     Oper    Port     Port
Port      Flags   State     Priority      Key       Key     Number   State
Fa0/19    SA      bndl      32768         0x1       0x1     0x13     0x3D

Partner's information:
                  LACP port                        Oper    Port     Port
Port      Flags   Priority  Dev ID         Age     Key     Number   State
Fa0/19    FA      32768     0003.abcd.aaa1  26s    0x8009  0x8894   0x3F

Age of the port in the current state: 0d:00h:00m:24s
Port: Fa0/20

Port state    = Up Mstr In-Bndl
Channel group = 1           Mode = Active          Gcchange = -
Port-channel  = Po1         GC   =   -             Pseudo port-channel = Po1
Port index    = 0           Load = 0x00            Protocol =   LACP

Flags:  S - Device is sending Slow LACPDUs   F - Device is sending fast LACPDUs.
        A - Device is in active mode.        P - Device is in passive mode.

Local information:
                            LACP port     Admin     Oper    Port     Port
Port      Flags   State     Priority      Key       Key     Number   State
Fa0/20    SA      bndl      32768         0x1       0x1     0x14     0x3D

Partner's information:
                  LACP port                        Oper    Port     Port
Port      Flags   Priority  Dev ID         Age     Key     Number   State
Fa0/20    FA      32768     0003.abcd.aaa1   0s    0x8009  0x8893   0x3F

Age of the port in the current state: 0d:00h:00m:27s
                Port-channels in the group:

Port-channel: Po1    (Primary Aggregator)
Age of the Port-channel   = 0d:00h:00m:50s
Logical slot/port   = 1/0          Number of ports = 2
HotStandBy port = null
Port state          = Port-channel Ag-Inuse
Protocol            =   LACP

Ports in the Port-channel:
Index   Load   Port     EC state        No of bits
  0     00     Fa0/19   Active             0
  0     00     Fa0/20   Active             0

Time since last port bundled:    0d:00h:00m:28s    Fa0/19

For more interactive troubleshooting, there are debug commands present but be careful - on my (admittedly ancient) switch, LACP debugs were only available chassis-wide and were pretty verbose. The packet level debug ("debug lacp packet") for a single LACPDU is shown below:

2950#debug lacp packet
Link Aggregation Control Protocol packet debugging is on
19w0d: LACP :lacp_bugpak: Send LACP-PDU packet via Fa0/20
19w0d: LACP : packet size: 124
19w0d: LACP: pdu: subtype: 1, version: 1
19w0d: LACP: Act: tlv:1, tlv-len:20, key:0x1, p-pri:0x8000, p:0x14, p-state:0x3D,
s-pri:0x8000, s-mac:0012.da12.abcd
19w0d: LACP: Part: tlv:2, tlv-len:20, key:0x8009, p-pri:0x8000, p:0x8893, p-state:0x3F,
s-pri:0xA000, s-mac:0003.abcd.aaa1
19w0d: LACP: col-tlv:3, col-tlv-len:16, col-max-d:0x8000
19w0d: LACP: term-tlv:0 termr-tlv-len:0

Pretty detailed, so watch your CPU!

A rather useful alternative is "debug lacp fsm" - again this provides a very high volume of output but is the only practical way to see detailed info on state transitions via CLI:

2950#debug lacp fsm
Link Aggregation Control Protocol fsm debugging is on
19w0d:     lacp_mux Fa0/19 - mux: during state WAITING, got event 4(ready)
19w0d: @@@ lacp_mux Fa0/19 - mux: WAITING -> ATTACHED
19w0d: LACP: Fa0/19 lacp_action_mx_attached entered
19w0d: LACP: Fa0/19 Attaching mux to aggregator
19w0d:     lacp_mux Fa0/19 - mux: during state ATTACHED, got event 5(in_sync)
19w0d: @@@ lacp_mux Fa0/19 - mux: ATTACHED -> COLLECTING_DISTRIBUTING
19w0d: LACP: Fa0/19 lacp_action_mx_collecting_distributing entered
19w0d: LACP: Fa0/19 Enabling collecting and distributing
19w0d:     lacp_rx Fa0/19 - rx: during state CURRENT, got event 5(recv_lacpdu)
19w0d: @@@ lacp_rx Fa0/19 - rx: CURRENT
2950# -> CURRENT
19w0d: LACP: Fa0/19 lacp_action_rx_current entered
19w0d:     lacp_mux Fa0/19 - mux: during state COLLECTING_DISTRIBUTING, got event 5(in_sync) (ignored)
19w0d:     lacp_ptx Fa0/19 - ptx: during state FAST_PERIODIC, got event 3(pt_expired)
19w0d: @@@ lacp_ptx Fa0/19 - ptx: FAST_PERIODIC -> PERIODIC_TX
19w0d: LACP: Fa0/19 lacp_action_ptx_fast_periodic_exit entered

Very verbose indeed. Be careful with CPU load.

Frankly, if you can, it is better to troubleshoot with a port mirror and packet capture. The protocol is very good at telling you what it is doing as in addition to the periodic LACPDUs, triggered updates are generated whenever anything material such as sync state changes. Use a capture filter (see previous blog post "tshark one-liners" for more info) when capturing on links with a lot of user data.


The value of the timeout flag sent by a device indicates the interval at which it expects the partner to send LACPDUs. The partner then should honour the request and send at the indicated interval.

The timeout value does not have to agree between peers. While it is not a recommended configuration, it is possible to bring up a LAG with one end sending every second and the other sending every 30 seconds. In this case, the end requesting fast timers will detect a silent failure in under 3 seconds while the end requesting slow timers will take up to 90 seconds to detect the same fault.
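
The arithmetic behind those figures is simply three missed LACPDUs at the interval that end has requested. A tiny Python sketch, with made-up constant and function names:

```python
FAST_INTERVAL_S = 1        # interval requested when the timeout flag is set
SLOW_INTERVAL_S = 30       # interval requested when the timeout flag is cleared
MISSED_BEFORE_EXPIRY = 3   # consecutive LACPDUs that must be lost

def worst_case_detection_s(requested_interval_s: int) -> int:
    """Upper bound on the time taken to notice that LACPDUs have stopped arriving."""
    return MISSED_BEFORE_EXPIRY * requested_interval_s

print("fast end:", worst_case_detection_s(FAST_INTERVAL_S), "seconds")
print("slow end:", worst_case_detection_s(SLOW_INTERVAL_S), "seconds")
```

Each end's detection time depends only on the interval it has asked its partner to use, which is why the two ends of a mismatched LAG react so differently to the same silent failure.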

The configuration of sub-groups (and even whether to use sub-groups) does not have to agree between peers. The failure characteristics are often better if one end is configured with active / standby subgroups while the other is configured without any subgroups. In that case, as soon as the end with sub-groups decides to switch a new sub-group to active, the partner is already sending sync on all available links and will immediately put traffic onto the newly active sub-group.

The Alcatel-Lucent 7750 (and probably others, I've just not looked) sends an out of sync LACPDU upon detecting a LAG member go physically down. Normally that won't get through to the other end but in the event of a single fibre failure, for example, it serves to inform the partner that the link is no longer usable and should be removed from the LAG bundle. This improves failover times considerably in the case where link loss is not forwarded (tens or hundreds of milliseconds as compared to 2 - 3 seconds).


If you got this far, you should probably download the IEEE 802.1AX-2008 standard.

Thursday, 22 November 2012

tshark one-liners

Since most of the hits on this blog seem to come from tshark filter related searches, and since I spend a good part of my day either running or analysing packet captures, I thought it might be useful to create a series of "tshark one-liners" in homage to the brilliant "sed one-liners" collection compiled by Eric Pement.

These are capture filters, not display filters, and are equally applicable to Wireshark, tshark and tcpdump, since they all use the same pcap filter syntax. In Wireshark the capture filter options are now hidden away and you have to double click on the interface under capture options to set or adjust the filter string.

The filters are broadly grouped by purpose and I will try to add more as I think of them. Please comment if there is something you think I have missed or would like added.

Note: if you want to strip off VLAN, MPLS, PPPoE or GRE headers from an existing pcap file, please see this post: Removing VLAN/MPLS/PPPoE/GRE Encapsulation


Match 802.1D spanning tree:
"ether dst 01:80:c2:00:00:00" (manpages say "ether proto stp" but I've had trouble with that)

Match Cisco PVST+:
"ether dst 01:00:0c:cc:cc:cd"

Match Cisco CDP / VTP / DTP / PAgP / UDLD:
"ether dst 01:00:0c:cc:cc:cc"

Match LLDP:
"ether proto 0x88cc"

Match LACP (slow protocols):
"ether dst 01:80:c2:00:00:02"

General IP
Match host A communicating with host B:
"host A && host B"

Match host A communicating with anything on network B:
"host A && net B"
or, if you don't like CIDR notation:
"host A && net B mask M"

Match ARP:
"ether proto 0x0806"

Match DHCP:
"udp port 67 || udp port 68"


Match any traffic with at least one VLAN tag:

Match traffic with exactly one VLAN tag:
"vlan && not vlan"

Match traffic with an SVLAN of 100 and any CVLAN:
"vlan 100 && vlan"

Match traffic where the first VLAN tag has an 802.1p marking of:
0: "vlan && ether[14] & 224 == 0"
1: "vlan && ether[14] & 224 == 32"
2: "vlan && ether[14] & 224 == 64"
3: "vlan && ether[14] & 224 == 96"
4: "vlan && ether[14] & 224 == 128"
5: "vlan && ether[14] & 224 == 160"
6: "vlan && ether[14] & 224 == 192"
7: "vlan && ether[14] & 224 == 224"

Note: to match the second VLAN tag use "vlan && vlan && ether[18] & 224" on the left hand side of the equality.
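
The numbers on the right hand side are just the 802.1p value shifted into the top three bits of the first TCI byte, and the byte offset moves on by 4 for each additional tag. A small Python sketch that generates these filter strings (the function name pcp_filter is made up for the example):

```python
def pcp_filter(pcp: int, tag_index: int = 1) -> str:
    """Build a capture filter matching an 802.1p value on the Nth VLAN tag."""
    if not 0 <= pcp <= 7:
        raise ValueError("802.1p values are 0-7")
    if tag_index < 1:
        raise ValueError("tag_index starts at 1")
    # The first tag's TCI starts at byte 14; each further tag adds 4 bytes.
    offset = 14 + 4 * (tag_index - 1)
    # One "vlan" keyword per tag that must be present.
    prefix = " && ".join(["vlan"] * tag_index)
    # 224 (0xE0) masks the top three bits, so the value is shifted left by 5.
    return f"{prefix} && ether[{offset}] & 224 == {pcp << 5}"

print(pcp_filter(5))
print(pcp_filter(3, tag_index=2))
```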


Match traffic with at least one MPLS label:

Match traffic with exactly one MPLS label (match S bit of first label):
"mpls && ether[16] & 1 == 1"

Match traffic with a first or single label of 12345:
"mpls 12345"

Match traffic with an inner (e.g. service) label of 67890:
"mpls && mpls 67890"

Match traffic with exactly three MPLS labels (e.g. traffic on facility bypass FRR):
"mpls && mpls && mpls && ether[24] & 1 == 1"

Match 6PE traffic:
With transport label: "mpls && mpls 2"
Without transport label (after PHP): "mpls 2"

Match traffic with an EXP marking (on the first label) of:
0: "mpls && ether[16] & 14 == 0"
1: "mpls && ether[16] & 14 == 2"
2: "mpls && ether[16] & 14 == 4"
3: "mpls && ether[16] & 14 == 6"
4: "mpls && ether[16] & 14 == 8"
5: "mpls && ether[16] & 14 == 10"
6: "mpls && ether[16] & 14 == 12"
7: "mpls && ether[16] & 14 == 14"

Note: to match the EXP marking of the second label, use "mpls && mpls && ether[20] & 14" on the left hand side of the equality.
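
To see why byte 16 is the one to test, it helps to unpack a label stack entry by hand. Each entry is 32 bits: a 20 bit label, 3 bits of EXP, the bottom-of-stack (S) bit and an 8 bit TTL. The following Python sketch builds and parses one entry; the helper names are invented for the example and the offsets assume an untagged Ethernet frame, where the first entry occupies bytes 14 to 17.

```python
import struct

def build_mpls_entry(label: int, exp: int, s: int, ttl: int) -> bytes:
    """Pack one 4 byte MPLS label stack entry."""
    return struct.pack("!I", (label << 12) | (exp << 9) | (s << 8) | ttl)

def parse_mpls_entry(entry: bytes) -> dict:
    """Unpack one 4 byte MPLS label stack entry."""
    (word,) = struct.unpack("!I", entry)
    return {
        "label": word >> 12,
        "exp": (word >> 9) & 0x7,
        "s": (word >> 8) & 0x1,
        "ttl": word & 0xFF,
    }

entry = build_mpls_entry(label=12345, exp=5, s=1, ttl=64)
print(parse_mpls_entry(entry))

# The third byte of the entry is ether[16] when the entry starts at byte 14.
third_byte = entry[2]
print("ether[16] & 14 =", third_byte & 14)   # EXP shifted left by one bit
print("ether[16] & 1  =", third_byte & 1)    # S bit
```

The third byte of the entry carries the last four bits of the label, then EXP, then the S bit, so masking with 14 isolates EXP (as twice its value) and masking with 1 isolates the S bit.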


Match any Ethernet multicast:
"ether multicast"

Match IP multicast traffic:
"ip multicast"

Match IGMP traffic:
"ip proto 2" (the manpages say "ip proto igmp" but I've had trouble with that)

Match PIM traffic:
"ip proto 0x67" (the manpages say "ip proto pim" but I've had trouble with that)


Match all OSPF:
"ip proto 89"

Match specific OSPF packet types:
Hello: "ip proto 89 && ip[20:2] == 0x0201"
DBD: "ip proto 89 && ip[20:2] == 0x0202"
LSR: "ip proto 89 && ip[20:2] == 0x0203"
LSU: "ip proto 89 && ip[20:2] == 0x0204"
LSAck: "ip proto 89 && ip[20:2] == 0x0205"


Match all IS-IS traffic:

Match specific IS-IS PDU types:
"l1", "l2", "iih", "lsp", "snp", "csnp" or "psnp"


Note: These rules do not handle multi-segment messages very well but they are good enough for most purposes.

Match only BGP OPEN messages:
"tcp port 179 && tcp[50] == 1"

Match only BGP UPDATE messages:
"tcp port 179 && tcp[50] == 2"

Match only BGP NOTIFICATION messages:
"tcp port 179 && tcp[50] == 3"

Match only BGP KEEPALIVE messages:
"tcp port 179 && tcp[50] == 4"


Match only L2TP control messages:
"udp port 1701 && udp[8:2] & 0x80ff == 0x8002"

Match L2TP control messages for tunnel ID 1234:
"udp port 1701 && udp[8:2] & 0x80ff == 0x8002 && udp[12:2] == 1234"

Match L2TP data messages for tunnel ID 1234:
"udp port 1701 && udp[8:2] & 0x80ff == 0x0002 && udp[10:2] == 1234"

Match L2TP control messages for session ID 5678:
"udp port 1701 && udp[8:2] & 0x80ff == 0x8002 && udp[14:2] == 5678"

Match L2TP data messages for session ID 5678:
"udp port 1701 && udp[8:2] & 0x80ff == 0x0002 && udp[12:2] == 5678"

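The tunnel and session ID offsets differ between control and data messages because control messages always carry the optional Length field, which pushes the IDs two bytes further in. Here is a small sketch of that layout using the standard library only; l2tp_ids is a hypothetical name and the sample headers are hand-built:

```python
import struct

T_BIT = 0x8000  # control message
L_BIT = 0x4000  # Length field present


def l2tp_ids(udp_payload):
    """Return (is_control, tunnel_id, session_id) from an L2TPv2 header."""
    (flags,) = struct.unpack_from("!H", udp_payload, 0)
    # When present, the 2-byte Length field sits between the flags and the IDs.
    offset = 4 if flags & L_BIT else 2
    tunnel_id, session_id = struct.unpack_from("!HH", udp_payload, offset)
    return bool(flags & T_BIT), tunnel_id, session_id


# Control header: T, L and S bits set, version 2, length 12, Ns 0, Nr 0
control = struct.pack("!HHHHHH", 0xC802, 12, 1234, 5678, 0, 0)
# Data header with no optional fields, version 2
data = struct.pack("!HHH", 0x0002, 1234, 5678)

print(l2tp_ids(control))  # (True, 1234, 5678)
print(l2tp_ids(data))     # (False, 1234, 5678)
```

Adding the 8 byte UDP header gives udp[12:2] and udp[14:2] for control messages, and udp[10:2] and udp[12:2] for data messages. Note that the data message filters above assume the sender does not set the L bit on data packets; if it does, the control message offsets apply instead.
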

Note: Offsets will need to be manually increased by 4 bytes for each VLAN tag or MPLS label present.

Match PPPoE discovery phase (PADI / PADO / PADR / PADS / PADT):

Match PPPoE session phase (i.e. PPP traffic):

Match PPPoE LCP messages:
"pppoes && ether[20:2] == 0xc021"

Match PPPoE CHAP authentication messages:
"pppoes && ether[20:2] == 0xc223"

Monday, 19 November 2012

Using Capture Filters to Match Higher Layer Protocols

In my previous post I went through some of the tricks that can be used to match MPLS and / or 802.1Q tagged traffic in packet filters. That's a great benefit when analysing traffic on carrier networks or large corporate networks but it only goes up to the transport layer (i.e. TCP and UDP port numbers).

Sometimes it's very desirable to filter on upper layer protocol information which has no corresponding parameters in the pcap-filter syntax. Take for example a situation where you are monitoring a busy BGP route reflector and only want to see NOTIFICATION messages without all the KEEPALIVEs and UPDATEs cluttering things up. It's possible to match these cases quite easily using a display filter, however your capture files could get quite large in relation to the amount of useful data. Once again it would be nice to be able to restrict at source, using a capture filter.

The following method can be used reliably for some protocols, somewhat reliably for a few and is completely inapplicable to others. In general if your protocol uses a fixed packet format or you want to match part of a fixed-format header then you're in luck.

Many protocols, such as RADIUS, encode their parameters using attribute / value pairs (AVPs) or type / length / value (TLV) format, which can present parameters in an arbitrary order. If the parameter you want to match is in an AVP or TLV, your results are likely to be variable at best. Remember that capture filters work on fixed offsets and cannot cycle through parameters until the right one is found. If you're lucky the particular implementation you're looking at may put the AVPs / TLVs into the same order every time and your value may be early enough in the list not to get 'bumped' by other parameters inserted before it. In general, though, this technique is unlikely to work well.
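
To make the ordering problem concrete, here is a small sketch using a RADIUS-style encoding (1 byte type, 1 byte length including the two header bytes, then the value). The helper names and attribute values are invented for the example:

```python
import struct


def encode_tlvs(tlvs):
    """Encode (type, value) pairs as 1-byte type, 1-byte length, value."""
    out = b""
    for tlv_type, value in tlvs:
        out += struct.pack("!BB", tlv_type, len(value) + 2) + value
    return out


def find_tlv(data, wanted):
    """Walk the TLVs in order and return the first value of the wanted type."""
    loc = 0
    while loc + 2 <= len(data):
        tlv_type, length = struct.unpack_from("!BB", data, loc)
        if length < 2:
            break  # malformed, stop rather than loop forever
        if tlv_type == wanted:
            return data[loc + 2:loc + length]
        loc += length
    return None


port = b"\x00\x00\x00\x07"
first = encode_tlvs([(5, port), (1, b"alice")])
second = encode_tlvs([(1, b"alice"), (5, port)])

# A fixed-offset test (what a capture filter can do) only works for one ordering
print(first[0] == 5, second[0] == 5)             # True False
# A parser that walks the list finds the attribute either way
print(find_tlv(first, 5) == find_tlv(second, 5))  # True
```

A capture filter is limited to the first kind of test, which is why it only works when the implementation happens to emit its attributes in a stable order.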


If at all possible, the best approach is to get a few sample packets of the data you want to capture. The captures should be taken from the same point in the network where you intend to run the real mirror to avoid any differences in encapsulation that would throw out the offsets. Generally it's possible to 'seed' such packets by, for example, manually clearing sessions.

In our example, we want to just see the BGP packets which contain a NOTIFICATION message. We start by obtaining a sample capture, obtained by shutting down a BGP session at one end while sniffing at the point where we intend to monitor. Below is the capture we get, with the interesting packet selected:

Here we can see there are two VLAN headers, beyond which we can see the IP details and the expanded decode of the BGP message. Logically, if we want to catch all the NOTIFICATION messages, we need to do the following:
  • Parse and discard the VLAN tags so that IP can be decoded correctly
  • Match only TCP traffic using either source or destination port 179
  • Of this, match only those packets of type NOTIFICATION
Starting at the first line, we can begin to write our capture filter. Assuming that we don't care which VLAN IDs are being used, just that they are present, the following will match traffic with any two VLAN tags:

"vlan && vlan"

Note that this not only matches traffic which has two VLAN headers - it also adjusts the decoding offset. This is critical to the success of the filter as in a normal, untagged frame the IP header would start directly after the Ethernet header at offset 14 (decimal). With two VLAN tags of 4 bytes each, the IP header will actually be at offset 22 (decimal). By matching the VLAN tags in this way, the capture filter knows that the IP will start further into the frame. The same happens with the "mpls" and "pppoes" keywords, so if you have these headers make sure you match them.

So next we want to make sure that only BGP packets are matched - this is as simple as you would expect using "tcp port 179" - this will match either a source or a destination port of 179 so you don't have to worry about which end initiated the BGP session. Let's add it to the expression:

"vlan && vlan && tcp port 179"

Now this rule will match any double-tagged BGP traffic. The tricky part is that there are no capture filter keywords for matching BGP packet types and we want to do precisely that. The only option remaining for us is to match bytes at a given offset. Eek!

It takes a little getting used to but for fixed format headers it can be very reliable. I find the easiest way to do this is to:
  • Select the field you want to match in Wireshark
  • Find the offset in the packet where that value is stored
  • Set a filter to match the required value at the required offset

So in our example, I have selected the BGP message type. This is at offset 005C hex / 92 decimal and a type of NOTIFICATION is encoded as a byte of value 3. A simple filter to match this would be "ether[92] == 3". Matching this on its own would get all the BGP NOTIFICATIONs, but also a load of other junk so let's combine it with the rest of our filter:

"vlan && vlan && tcp port 179 && ether[92] == 3"

OK, we can be pretty sure now that this will only match genuine NOTIFICATIONs. The BGP header up to and including the type field is fixed length, so it is not going to move, and the value of 3 always means NOTIFICATION. Now let's test it out with a real capture on the same conversation as above:

root@sniff:~# tshark -i eth1 "vlan && vlan && tcp port 179 && ether[92] == 3"

Running as user "root" and group "root". This could be dangerous.
Capturing on eth1
  0.000000 ->      BGP NOTIFICATION Message
^C1 packet captured

Perfect. Exactly what we wanted to capture!

Note: It's easy to see the offset from the start of the frame by just looking at the packet capture, but where possible you should consider using offsets from IP or TCP. That way, if you want to re-use your filter with more or less encap, you can just add or remove VLANs, MPLS, etc, without having to re-calculate the offsets.
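
The arithmetic behind the example offset can be sketched as follows. The function name is hypothetical, and the 32 byte TCP header is inferred from the numbers in this capture (92 - 22 - 20 - 18 = 32, which would be consistent with 12 bytes of TCP options) rather than something stated in the decode:

```python
def bgp_type_offsets(vlan_tags=0, mpls_labels=0, ip_header_len=20, tcp_header_len=20):
    """Return the offset of the BGP type byte from the start of each layer."""
    # BGP header: 16-byte marker, 2-byte length, then the 1-byte type.
    from_tcp = tcp_header_len + 16 + 2
    from_ip = ip_header_len + from_tcp
    # 14-byte Ethernet header plus 4 bytes per VLAN tag or MPLS label.
    from_ether = 14 + 4 * (vlan_tags + mpls_labels) + from_ip
    return {"ether": from_ether, "ip": from_ip, "tcp": from_tcp}


print(bgp_type_offsets(vlan_tags=2, tcp_header_len=32))
# {'ether': 92, 'ip': 70, 'tcp': 50}
```

With these numbers "vlan && vlan && tcp port 179 && tcp[50] == 3" tests the same byte as the ether[92] version, but keeps working if a tag is added or removed. If the TCP options vary between peers the type byte will still move; in that case the header length can be read from the packet itself, along the lines of "tcp[((tcp[12] & 0xf0) >> 2) + 18] == 3".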

You will have to use your imagination and ingenuity to work out whether this technique can be used to match your interesting traffic reliably. There are many aspects I have not covered which may prove essential, depending on what you are trying to do, for example:
  • It is possible to match multiple byte fields using [offset:size] notation in place of the simple [offset] used in this example
  • It is possible to bitmask values using the normal bitwise operators, so for example to check if the least significant bit of byte 80 is set, the expression "ether[80] & 1 == 1" can be used
  • Offsets within a protocol can be used, i.e. ip[12]. Offsets like this start from the beginning of the layer being referenced.
  • It is possible to put together some very complex filter statements using AND (&&), OR (||) and NOT (!) operators in conjunction with parentheses.
See the pcap-filter manpage for further details. With experimentation you can almost certainly filter out most of the junk even if it is not possible to cut it out altogether.
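
If you want to check an expression against a sample frame before going anywhere near a live capture, the [offset:size] and bitmask operations are easy to mimic. This is only a sketch of the semantics, not how libpcap implements them; the helper name field and the sample frame are made up:

```python
def field(frame, offset, size=1):
    """Mimic ether[offset:size]: read a big-endian integer from the frame."""
    if size not in (1, 2, 4):
        raise ValueError("pcap filters read 1, 2 or 4 bytes at a time")
    if offset < 0 or offset + size > len(frame):
        raise IndexError("offset is beyond the end of the frame")
    return int.from_bytes(frame[offset:offset + size], "big")


# Dummy MACs, 802.1Q TPID, TCI (priority 5, VLAN 100), IPv4 ethertype, padding
frame = bytes(12) + b"\x81\x00" + b"\xa0\x64" + b"\x08\x00" + bytes(62)

print(field(frame, 12, 2) == 0x8100)        # ether[12:2] == 0x8100
print(field(frame, 14) & 224 == 160)        # ether[14] & 224 == 160
print(field(frame, 14, 2) & 0x0fff == 100)  # ether[14:2] & 0x0fff == 100
```

All three comparisons print True for this frame.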

Final Tip

While you are practising with these filters you will probably find that you make mistakes with offsets and generally defining the filter correctly. One good way to learn and also to prove your filters work before deploying them is to take a live capture at the point where you plan to sniff, then, on a non-production box, use tcpreplay to pass the traffic while capturing with your filter applied.


RFC 4271 - A Border Gateway Protocol 4 (BGP-4)
pcap-filter manpage

Sunday, 18 November 2012

Simulating a broken LNS

A common requirement when testing a LAC is to confirm its reaction when various failure codes are returned by the LNS. In theory you would expect the LAC to react to an LNS failure in the same way (i.e. try another) irrespective of the error type or code returned, but as we all know theory and practice don't always align and that is why we test.

I recently had to prove exactly this area of functionality and found that, while it is relatively easy to put an LNS together which will terminate sessions, it's actually quite hard to get a real LNS to return error messages. Would you believe that they appear to be designed not to fail?

So the aim was:
  • To have an 'LNS' which could be configured to reject incoming start control connection requests (SCCRQs)
  • To be able to configure the result code, error code and, to make the packet captures easier to read and more authentic, the error message contained within the StopCCN message
  • Ideally, to be able to service requests arriving on multiple IP addresses
As usual, the answer to this problem turned out to be scapy.


The script shown below does exactly what I needed but doesn't exactly work how you might expect. In order to reduce reconfiguration between test cases I have made it respond to queries arriving on any IP address - it does this by inspecting the incoming SCCRQ's source and destination MAC and IP addresses, then flipping them around on the response. That means that it does not attempt to bind to port 1701 on the host, therefore if the LAC sends an SCCRQ to the host's real IP it will get an ICMP unreachable and a StopCCN back. This is almost certainly not what you want.

The intended use case for this script is to have the LAC attempt to connect to an LNS which is "behind" the host running scapy, i.e. the last hop router should have a static route directing traffic for the LNS via the scapy host, in effect creating the following topology:

Alternatively, you could use a static ARP entry on the gateway router to direct traffic for an address on the attached LAN to the scapy host.


Usage is simple - firstly run scapy, then call 'execfile("")' to load the script. You must create an instance of "LNS" and then, if the defaults do not suit, set the following member values:
  • interface (default "eth1")
  • resultcode (default 0)
  • errorcode (default 4)
  • errormessage (default "Internal error")
The script will sit there and close as many sessions as you care to offer it. Press control-C to stop.


root@scapyhost:~/Projects/BrokenLNS# scapy
WARNING: No route found for IPv6 destination :: (no default route?)
Welcome to Scapy (2.0.1)
>>> execfile('')
>>> lns = LNS()
>>> lns.resultcode = 1
>>> lns.errorcode = 6
>>> lns.errormessage = "Oh, no!"
Received L2TP packet from
Got an SCCRQ
Sending spoofed StopCCN from  to
Sent 1 packets.
Received L2TP packet from


import os
from scapy.all import *

# L2TP header flags
CONTROL = 32768
L = 16384
S = 2048

# AVP flags
MANDATORY = 32768
HIDDEN = 16384

# AVP attribute types (RFC 2661)
CONTROLMESSAGE = 0  # Message Type
ERRORMESSAGE = 1    # Result Code (result code, error code and error message)
TUNNELID = 9        # Assigned Tunnel ID

# Control Message Types
SCCRQ = '\x00\x01'
SCCRP = '\x00\x02'
StopCCN = '\x00\x04'

def word(value):
# Generates a two byte representation of the provided number
  return(chr((value >> 8) & 255) + chr(value & 255))

def AVP(bitmask, vendor, attribute_type, data):
# Generates an L2TP AVP using the given attribute number and payload
  length = len(data) + 6
  return(word(bitmask + (length % 1024)) + word(vendor) + word(attribute_type) + data)

def genL2TP(flags, tunid, sessid, ns, nr, payload):
# Generates an L2TP payload with the given parameters and AVP payload
  length = len(payload) + 12
  return(word(flags | 2) + word(length) + tunid + sessid + word(ns) + word(nr) + payload)

def getAVP(avp, payload):
# Returns the value of the first AVP whose vendor ID and attribute type match
# the four bytes given in avp, or None if no such AVP is present
  loc = 0
  while(loc + 6 <= len(payload)):
    avp_type = payload[loc+2:loc+6]
    avp_len = ((ord(payload[loc:loc+1]) & 3) * 256) + ord(payload[loc+1:loc+2])
    # Uncomment the following line if you want to see info on every AVP checked
#    print "Got AVP " + str(ord(avp_type[0:1])).zfill(2) + str(ord(avp_type[1:2])).zfill(2)  + str(ord(avp_type[2:3])).zfill(2) + str(ord(avp_type[3:4])).zfill(2) + " of length " + str(avp_len) + " value " + payload[loc+6:loc+avp_len]
    if avp_type == avp:
      return(payload[loc+6:loc+avp_len])
    if avp_len < 6:
      # Malformed AVP, stop rather than loop forever
      break
    loc = loc + avp_len
  return(None)

class LNS(Automaton):
  interface = "eth1"
  resultcode = 0
  errorcode = 4
  errormessage = "Internal error"

# Define possible states
# Since this is so simple we only need one state :)
  @ATMT.state(initial=1)
  def WAIT(self):
    pass

# Define transitions
# Transitions from WAIT
  @ATMT.receive_condition(WAIT)
  def receive_sccrq(self,pkt):
    if (UDP in pkt) and pkt.dport==1701:
      print "Received L2TP packet from " + pkt[IP].src
      # scapy's built in L2TP handling doesn't deal well with control messages so
      # we just grab the raw data from beyond the UDP header
      payload = pkt[UDP].build_payload()
      # Check what type of L2TP message arrived by chopping off the header and passing
      # the rest to getAVP
      packet_type = getAVP(word(0) + word(CONTROLMESSAGE), payload[12:])
      if(packet_type == SCCRQ):
        # If we get an SCCRQ, generate a StopCCN in response.
        print "Got an SCCRQ"
        client_ip = pkt[IP].src
        server_ip = pkt[IP].dst
        client_mac = pkt[Ether].src
        server_mac = pkt[Ether].dst
        # The LAC's assigned tunnel ID becomes the tunnel ID in our reply header
        tun_id = getAVP(word(0) + word(TUNNELID), payload[12:])
        print "Sending spoofed StopCCN from " + server_ip + "  to " + client_ip + "."
        sendp(Ether(src=server_mac, dst=client_mac)/IP(src=server_ip, dst=client_ip)/UDP(sport=1701, dport=1701)/Raw(load=genL2TP(CONTROL | L | S, tun_id, word(0), 0, 1, AVP(MANDATORY, 0, CONTROLMESSAGE, StopCCN) + AVP(MANDATORY, 0, ERRORMESSAGE, word(self.resultcode) + word(self.errorcode) + self.errormessage) + AVP(MANDATORY, 0, TUNNELID, word(12345)))), iface=self.interface)
        raise self.WAIT()
      elif(packet_type == SCCRP):
        print "is an SCCRP"
      elif(packet_type == StopCCN):
        print "is a StopCCN"
      else:
        print "is a ZLB or non-control message"
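
The AVP handling can be exercised on its own, without scapy or a LAC to hand. Below is a rough Python 3 port of the helper functions, written as a sketch for checking the encoding; the function names and the stopccn_avps wrapper are mine, and the attribute numbers are the RFC 2661 values for Message Type (0), Result Code (1) and Assigned Tunnel ID (9):

```python
import struct

MANDATORY = 0x8000
MESSAGE_TYPE, RESULT_CODE, ASSIGNED_TUNNEL_ID = 0, 1, 9
STOPCCN = 4


def word(value):
    """Two-byte big-endian representation of a number."""
    return struct.pack("!H", value)


def avp(flags, vendor, attribute_type, data):
    """Encode one AVP: flags plus 10-bit length, vendor ID, type, value."""
    length = len(data) + 6
    return word(flags | (length & 0x3FF)) + word(vendor) + word(attribute_type) + data


def get_avp(vendor, attribute_type, payload):
    """Return the value of the first matching AVP, or None."""
    wanted = word(vendor) + word(attribute_type)
    loc = 0
    while loc + 6 <= len(payload):
        length = struct.unpack_from("!H", payload, loc)[0] & 0x3FF
        if length < 6:
            break  # malformed AVP, stop rather than loop forever
        if payload[loc + 2:loc + 6] == wanted:
            return payload[loc + 6:loc + length]
        loc += length
    return None


def stopccn_avps(result, error, message, tunnel_id):
    """Build the AVP list for a StopCCN message."""
    return (avp(MANDATORY, 0, MESSAGE_TYPE, word(STOPCCN))
            + avp(MANDATORY, 0, ASSIGNED_TUNNEL_ID, word(tunnel_id))
            + avp(MANDATORY, 0, RESULT_CODE,
                  word(result) + word(error) + message.encode("ascii")))


body = stopccn_avps(1, 6, "Oh, no!", 12345)
print(get_avp(0, MESSAGE_TYPE, body))  # b'\x00\x04'
print(get_avp(0, RESULT_CODE, body))   # b'\x00\x01\x00\x06Oh, no!'
```

Round-tripping the AVPs like this is a quick way to catch length or offset mistakes before putting packets on the wire.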

Sunday, 28 October 2012

Using Scapy to test PPPoE AC-Cookie validation

AC-Cookies are a mechanism designed to help mitigate certain denial of service attacks against PPPoE access concentrators. To understand the function it is important to first understand the normal flow of the PPPoE discovery process, which is as follows:

  1. The PPPoE client sends a broadcast PADI (initiate) message
  2. Any PPPoE access concentrators willing to service the client respond with a unicast PADO (offer) message
  3. The client selects which access concentrator to use and unicasts a PADR (request) message asking for a session to be established
  4. The access concentrator unicasts a PADS (session) message to the client to indicate that the session has been established
If an attacker is able to spoof PADI and PADR messages from a number of MAC addresses, a large amount of PPPoE state can be created in the access concentrator. An AC-Cookie is an unpredictable (to the client) value which is attached to the PADO message which must be echoed back in the PADR in order for it to be accepted by the access concentrator. Since the AC-Cookie cannot be predicted by the client, if the correct value is echoed back to the concentrator then it is extremely unlikely to have been spoofed and it is therefore safe for the access concentrator to allocate resources to the session.
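
How the cookie is generated is up to the access concentrator. RFC 2516 mentions an HMAC over the host MAC address, keyed with a secret known only to the concentrator, as one possible construction. The sketch below illustrates that idea only; the function names, the choice of SHA-256 and the 16 byte truncation are my assumptions and not taken from any particular implementation:

```python
import hashlib
import hmac
import os

SECRET = os.urandom(16)  # known only to the access concentrator


def make_cookie(client_mac, secret=SECRET):
    """Derive an AC-Cookie from the client MAC address."""
    mac_bytes = bytes.fromhex(client_mac.replace(":", ""))
    return hmac.new(secret, mac_bytes, hashlib.sha256).digest()[:16]


def cookie_is_valid(client_mac, cookie, secret=SECRET):
    """Check the cookie echoed in a PADR against the sender's MAC address."""
    return hmac.compare_digest(make_cookie(client_mac, secret), cookie)


cookie = make_cookie("00:11:22:33:44:55")
print(cookie_is_valid("00:11:22:33:44:55", cookie))  # True
print(cookie_is_valid("00:11:22:33:44:66", cookie))  # False
```

The attraction of a scheme like this is that the concentrator can recompute the cookie from the PADR itself, so it does not need to hold any per-client state until a valid PADR arrives.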

This is all well and good but what if you need to prove the mechanism works or to show what error messages are generated on receipt of invalid AC-Cookies? As usual with my blog posts I have had to do this so I thought I would share the code. It's not going to win any awards but it works. All you need is scapy (I use 2.0.1, later should be fine).

Usage is pretty straightforward; simply run scapy, instantiate an object of type PPPoESession, override options as appropriate and then instruct it to "run()".

For example, to verify that a valid PPPoE session will come up:

root@client-pc:~/Projects/PPPoED# scapy
Welcome to Scapy (2.0.1)
>>> execfile("")

>>> p=PPPoESession()
>>> p.outif="eth0"
[ debugging messages removed ]

Received PADS

Once the PADS is received, the process is complete and control returns to the console.

To verify that the access concentrator checks the value of AC-Cookies returned in PADR messages, we can set the script to reply using garbage values for the AC-Cookie tag as follows:

root@client-pc:~/Projects/PPPoED# scapy
Welcome to Scapy (2.0.1)
>>> execfile("")
>>> p=PPPoESession()
>>> p.randomcookie=True
>>> p.retries=200

This will send a normal PADI and wait for a PADO before sending, up to the configured number of retries, PADR messages with randomised AC-Cookie tag values. When a PADS is received or the number of retries is exceeded, control returns to the console.


RFC 2516, Section 9


import os
from scapy.all import *

class PPPoESession(Automaton):
  randomcookie = False
  retries = 100
  outif = "eth0"
  # Client MAC address; taken from outif if not set
  mac = None
  # Host-Uniq tag value; any 4 bytes will do (this particular value is an assumption)
  hostuniq = "\x00\x00\x00\x01"
  ac_mac = "ff:ff:ff:ff:ff:ff"
  ac_cookie = ""
  sess_id = 0
# Method to recover an AC-Cookie from the tags
  def getcookie(self, payload):
    loc = 0
    while(loc + 4 <= len(payload)):
      att_type = payload[loc:loc+2]
      att_len = (256 * ord(payload[loc+2:loc+3])) + ord(payload[loc+3:loc+4])
      print "Got attribute " + str(ord(att_type[:1])).zfill(2) + str(ord(att_type[1:])).zfill(2)  + " of length " + str(att_len) + " value " + payload[loc+4:loc+4+att_len]
      if att_type == "\x01\x04":
        self.ac_cookie = payload[loc+4:loc+4+att_len]
        print "Got AC-Cookie of " + self.ac_cookie
      loc = loc + att_len + 4
# Define possible states
  @ATMT.state(initial=1)
  def START(self):
    pass
  @ATMT.state()
  def WAIT_PADO(self):
    pass
  @ATMT.state()
  def GOT_PADO(self):
    pass
  @ATMT.state()
  def WAIT_PADS(self):
    pass
  @ATMT.state(error=1)
  def ERROR(self):
    pass
  @ATMT.state(final=1)
  def END(self):
    pass
# Define transitions
# Transitions from START
  @ATMT.condition(START)
  def send_padi(self):
    print "Send PADI"
    if self.mac is None:
      self.mac = get_if_hwaddr(self.outif)
    # Tags: empty Service-Name (0x0101) and a 4 byte Host-Uniq (0x0103)
    sendp(Ether(src=self.mac, dst="ff:ff:ff:ff:ff:ff")/PPPoED()/Raw(load='\x01\x01\x00\x00'+'\x01\x03\x00\x04'+self.hostuniq),iface=self.outif)
    raise self.WAIT_PADO()
# Transitions from WAIT_PADO
  @ATMT.timeout(WAIT_PADO, 3)
  def timeout_pado(self):
    print "Timed out waiting for PADO"
    self.retries -= 1
    if(self.retries < 0):
      print "Too many retries, aborting."
      raise self.ERROR()
    raise self.START()
  @ATMT.receive_condition(WAIT_PADO)
  def receive_pado(self,pkt):
    if (PPPoED in pkt) and (pkt[PPPoED].code==7):
      print "Received PADO"
      self.ac_mac = pkt[Ether].src
      # Use the PPPoE length field to ignore any Ethernet padding after the tags
      self.getcookie(str(pkt[PPPoED].payload)[:pkt[PPPoED].len])
      raise self.GOT_PADO()
# Transitions from GOT_PADO
  @ATMT.condition(GOT_PADO)
  def send_padr(self):
    print "Send PADR"
    if(self.randomcookie):
      print "Random cookie being used"
      # Replace the real cookie with random bytes of the same length
      # (assumed behaviour of the randomcookie option)
      self.ac_cookie = os.urandom(len(self.ac_cookie))
    sendp(Ether(src=self.mac, dst=self.ac_mac)/PPPoED(code=25)/Raw(load='\x01\x01\x00\x00'+'\x01\x03\x00\x04'+self.hostuniq+'\x01\x04\x00'+chr(len(self.ac_cookie))+self.ac_cookie),iface=self.outif)
    raise self.WAIT_PADS()
# Transitions from WAIT_PADS
  @ATMT.timeout(WAIT_PADS, 1)
  def timeout_pads(self):
    print "Timed out waiting for PADS"
    self.retries -= 1
    if(self.retries < 0):
      print "Too many retries, aborting."
      raise self.ERROR()
    raise self.GOT_PADO()
  @ATMT.receive_condition(WAIT_PADS)
  def receive_pads(self,pkt):
    if (PPPoED in pkt) and (pkt[PPPoED].code==101):
      print "Received PADS"
      self.sess_id = pkt[PPPoED].sessionid
      raise self.END()
  @ATMT.receive_condition(WAIT_PADS)
  def receive_padt(self,pkt):
    if (PPPoED in pkt) and (pkt[PPPoED].code==167):
      print "Received PADT"
      raise self.ERROR()

Friday, 31 August 2012

A quickie about logging in PuTTY

As a rule, I log every CLI session I ever run. It has saved my neck on countless occasions and just this morning saved me a couple of hours of effort tallying serial numbers. It's surprising how often I use these logs to find some forgotten bit of info or to prove that "it really was all working when I left it"!

Every saved session I create in PuTTY is configured to log into a common directory using a file name in the format year-month-day-time-hostname.log, which gives me a nice, searchable, date-sorted record of what I typed and saw for every session I've ever opened.

The best way to do this is by just configuring Session -> Logging to "All session output" and the file name to something appropriate (the string I use is "&Y-&M-&D-&T-&H.log"). If you like, while you're there, go into Window and set the scrollback buffer to 2 million lines. Finally, return to the Session screen and save as "Default Settings".

From then on all the sessions you create, including ones where you just enter the IP and connect, will be logged. Note that any pre-existing saved sessions will need to be edited if they were not originally set to log.

Do it now, you know it makes sense!

Monday, 23 January 2012

Using Capture Filters with Encapsulated Packets

One of the most annoying things I found when I started working on carrier networks was that while Wireshark's display filters worked perfectly, the capture filters frequently did not. I would regularly set up a capture filter only to find that no packets at all were saved - that's a real pain if you want to pull a few easily described packets out of a 50 Mbps stream across a period of 20 minutes.

After a while I realised that my problem was related to encapsulation. Unlike the hierarchical and detailed display filters, capture filters have to be really fast - that basically means using bit masks and comparing values at fixed offsets. With plain old untagged Ethernet frames the filters work fine, however as soon as you add 802.1Q tags, PPP or MPLS, all the offsets are no longer valid and anything you match will be purely coincidental.

Luckily there are filter keywords to handle that situation. All of the following adjust the offsets for you each time they are used:

vlan [x] - matches a single VLAN tag, the ID of which may optionally be specified by the user
pppoes - matches a PPPoE session header
mpls [x] - matches a single MPLS label, the number of which may optionally be specified by the user

These are very flexible - for example if you are capturing QinQ traffic, you could match all the SMTP packets using:

vlan && vlan && tcp port 25

If you know the VLAN IDs (or MPLS labels) in use, you can narrow the selection based on those. To show all the IGMP passing over a particular MPLS pseudowire with VLAN ID 200, you could use:

mpls 131066 && mpls 131068 && vlan 200 && pppoes && ip proto 2

For a long time I was using makeshift capture filters along the lines of "ether[39] = 2" to match pertinent bytes in the packet (see my next blog post for info on that) however you will probably agree this is much simpler. These filters are equally applicable to Wireshark, Tshark and tcpdump so they may be useful even when forced to capture using some really obscure UNIX box. For Tshark and tcpdump don't forget to put quotes around any expressions that use the ampersand (&).

Friday, 13 January 2012

IGMP Testing, part 1

Maybe it's just my famous inability to find things that are right in front of me but I've needed some tools over the last week that would let me 'play' with IGMP and I've drawn (almost) a total blank.

Firstly, I wanted to generate a good old-fashioned flood of reports to test processing performance and rate limiting.

Plan A was to use the tester for this - despite not having a specific tool for flood testing it does let you create streams of, more or less arbitrary, hand crafted packets. That's the theory, anyway. After carefully putting together a stream profile that should have given me join after join for cycling group numbers I put it to the test - only to find that it had other ideas and was generating complete garbage. By garbage I mean not even the IP headers were correct - the protocol was coming out set to 0xfd (unknown) rather than 0x02 for IGMP and, strangely, the source and destination IPs were populated with the group ID and source that should have been in the report payload. Based on bitter past experiences I didn't waste my time trying to fix that.

OK, time for plan B - back to the packet crafting on a PC. I thought I'd be spoiled for choice but, for Linux anyway, the only option for generating arbitrary IGMP seemed to be nemesis. Nemesis seems to be exactly what I want but it is no longer maintained and won't compile on a modern distro - at least *I* couldn't get it to compile.

My favourite scapy knows what IGMP is from its protocol ID but doesn't have a stack for it, so there was no straightforward way to use that.

Dead end. I couldn't find anything to build me one packet let alone throw 1000 out per second.

Then it occurred to me that, actually, in normal use the tester can generate valid joins at a civil pace... So I mirrored the tester port and sniffed a genuine join off the wire, whittled the capture file down to the single frame I wanted and fed it to tcpreplay. Yay.

One small problem - it could only manage 100pps and I needed 1000. I noticed it was generating a message every time it sent a packet saying it had re-opened the file, which gave me a hunch that the file operations and CLI might be a bottleneck. I solved that problem the same way as the first - by sniffing the 100pps output for a while and then replaying *that* at full tilt. 960pps... Not quite 1000pps but close enough!

At the last moment it occurred to me that it would be a more convincing test if I cycled the group IDs rather than always reporting on one group. I went back to scapy and, with Wireshark in the other hand, started to play. I thought if I just loaded in the original join packet I could use a loop to tweak a byte or two for the group ID and dump it out to a file which I could then replay.

When I did that I noticed that my router still only showed one group as joined. Rubbish. I had obviously missed something. Looking at the generated file in Wireshark I could see that its checksum was incorrect.

Scapy could re-calculate the IP checksum for me but it didn't understand IGMP so that was going to be a programming exercise. The checksum is only 2 bytes in the payload so it wasn't too hard to adjust. I won't bore you with the maths, check out RFC 3376 if you're curious.
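
For the curious, the IGMP checksum is the same 16-bit ones' complement sum used by IP, calculated over the whole IGMP message with the checksum field set to zero. The sketch below is not my original script; it assumes a plain 8 byte IGMPv2 membership report (type 0x16) for brevity, and the function names and group range are made up:

```python
import socket
import struct


def internet_checksum(data):
    """16-bit ones' complement of the ones' complement sum of the data."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def igmpv2_report(group):
    """Build an IGMPv2 membership report for the given group address."""
    # Type 0x16, max response time 0, checksum 0 for now, then the group
    body = struct.pack("!BBH4s", 0x16, 0, 0, socket.inet_aton(group))
    return body[:2] + struct.pack("!H", internet_checksum(body)) + body[4:]


reports = [igmpv2_report("239.1.1.%d" % n) for n in range(1, 101)]
print(reports[0].hex())  # 1600f9fcef010101
# A message carrying a correct checksum sums to zero when re-checked
print(all(internet_checksum(r) == 0 for r in reports))  # True
```

Changing the group address changes the sum, which is why simply patching a byte or two in a captured join produced packets that the router ignored.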

Finally, with that done, I had a pcap file full of valid joins over 100 groups and the ability to fire them out at (nearly) 1000pps. I can't help thinking it should have been easier, though!

Source code to follow - it's very scruffy and fairly fragile but might be useful to someone else... You never know!