Networking Bodges: June 2015

Re-assembling IP Fragments in PCAP Files

Some time ago I created "stripe", a tool for stripping layers of encapsulation headers from PCAP files, leaving plain payload (typically IP) over Ethernet. "stripe" works with a variety of encapsulation types, from simple VLAN tags up to GRE and GTP. However, one thing it couldn't handle was packets that had been fragmented after being encapsulated.

One user, LisbethS, suggested that I should build IP fragment reassembly into stripe, but my first thought was that the two should be separate utilities. As I thought about it more, though, I realised that the two functions (decapsulation and re-assembly) were actually intertwined - if you treat them as separate processes then you can't re-assemble IP that is encapsulated within something else, nor can you decapsulate GRE or GTP that has been fragmented. The key, then, was to do both re-assembly and decapsulation as part of the same process.

Reassembling Packet Fragments

RFC 815 describes a minimal way to re-assemble IP fragments which is not that complex in principle, so I thought I'd add the functionality. I soon realised it wasn't quite as straightforward as I thought and you have to be very careful about the order of operations.
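
To give a feel for the RFC 815 approach, here is a minimal Python sketch of its "hole descriptor" idea. It is not stripe's actual code: the function name, the tuple layout and the use of byte offsets (real IP headers count in 8-byte units) are all assumptions made for illustration.

```python
import math


def reassemble(fragments):
    """Rebuild one datagram payload from (offset, more_fragments, data) tuples.

    Offsets are plain byte offsets here (an assumption for simplicity).
    Returns the payload, or None if any hole is still unfilled.
    """
    holes = [(0, math.inf)]  # start with one hole covering the whole datagram
    buffer = bytearray()
    for offset, more_fragments, data in fragments:
        if not data:
            continue
        first, last = offset, offset + len(data) - 1
        remaining = []
        for hole_first, hole_last in holes:
            if first > hole_last or last < hole_first:
                remaining.append((hole_first, hole_last))  # no overlap, keep it
                continue
            if first > hole_first:
                remaining.append((hole_first, first - 1))  # gap left before us
            if last < hole_last and more_fragments:
                remaining.append((last + 1, hole_last))    # gap left after us
        holes = remaining
        if len(buffer) <= last:
            buffer.extend(bytes(last + 1 - len(buffer)))   # grow with zero bytes
        buffer[first:last + 1] = data
    return None if holes else bytes(buffer)


payload = b"ABCDEFGHIJKLMNOPQRSTUVWX"
out_of_order = [(16, False, payload[16:]), (0, True, payload[:8]), (8, True, payload[8:16])]
assert reassemble(out_of_order) == payload
```

The point to notice is that the final fragment (the one with "more fragments" clear) is what closes off the open-ended hole, so the fragments can arrive in any order.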

For example, if you have a packet that gets encapsulated then subsequently fragmented, then trying to decapsulate without reassembling first will fail (the first fragment decapsulates to a partial frame, then the subsequent fragment(s) fail to decapsulate). On the other hand, if you re-assemble first then decapsulate then you don't catch the case where a packet is fragmented before encapsulation. Neither approach can catch the case where a packet is fragmented, then encapsulated, then subsequently fragmented again.

To cut a long story short, the answer appears to be that you need to iteratively re-assemble, decapsulate until fragments are found, re-assemble again, decapsulate again... until there are no more fragments and everything is fully decapsulated.
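
The loop is easier to see in a toy model. The sketch below is only an illustration under invented assumptions, not how stripe is written: packets are byte strings where a leading "E" stands for one encapsulation header, "F" marks a fragment (followed by one byte each of ident, offset and more-flag) and "D" marks plain data.

```python
def fragment(pkt, ident, size):
    """Split pkt into toy fragments: b"F" + ident + offset + more-flag + chunk."""
    chunks = [pkt[i:i + size] for i in range(0, len(pkt), size)]
    return [
        b"F" + bytes([ident, i * size, int(i < len(chunks) - 1)]) + chunk
        for i, chunk in enumerate(chunks)
    ]


def reassemble(packets):
    """Join every complete fragment group; pass everything else through."""
    out, groups = [], {}
    for pkt in packets:
        if pkt[:1] == b"F":
            groups.setdefault(pkt[1], []).append((pkt[2], pkt[3], pkt[4:]))
        else:
            out.append(pkt)
    for ident, frags in groups.items():
        frags.sort()
        contiguous = frags[0][0] == 0 and all(
            a[0] + len(a[2]) == b[0] for a, b in zip(frags, frags[1:])
        )
        if contiguous and frags[-1][1] == 0:
            out.append(b"".join(chunk for _, _, chunk in frags))
        else:  # incomplete group: keep the fragments as they were
            out.extend(b"F" + bytes([ident, off, more]) + chunk
                       for off, more, chunk in frags)
    return out


def decapsulate(pkt):
    """Strip toy b"E" headers until data or a fragment is exposed."""
    while pkt[:1] == b"E":
        pkt = pkt[1:]
    return pkt


def strip_all(packets):
    """Alternate re-assembly and decapsulation until a pass changes nothing."""
    while True:
        before = packets
        packets = [decapsulate(p) for p in reassemble(packets)]
        if packets == before:
            return packets


# Fragmented, then encapsulated, then fragmented again.
inner = b"Dhello world"
tunnelled = [b"E" + f for f in fragment(inner, ident=1, size=4)]
capture = [piece for n, pkt in enumerate(tunnelled)
           for piece in fragment(pkt, ident=10 + n, size=3)]
assert strip_all(capture) == [inner]
```

Neither a single re-assembly pass nor a single decapsulation pass recovers the payload in that example; it takes two rounds of each, which is why the loop runs until nothing changes.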

Anyway, stripe now does both decapsulation and IP fragment re-assembly, meaning that it can take a pcap file containing fragmented and / or encapsulated packets, strip off all the encapsulation, re-assemble the fragments and write the result out to a new pcap file.

Download

The latest version is available for download at https://github.com/theclam/stripe - it is available as source code (compiles without dependencies on almost any Linux distro) and there are also Mac and Windows binaries for easy download.

UPDATE:

I've managed to recreate some of the SEGFAULTs that people have been kindly reporting to me. It turns out there was a typo / n00b mistake (I'm not sure which, most of this is coded way too late at night) which I have now corrected. If you tried before and got an error, it may be fixed now. The memory leaks have also been reduced from "raging" to "moderate" :)

Quirks of Traceroute over MPLS Networks

When working on MPLS networks, particularly with VRFs, a few questions about traceroutes come up time and time again.

In this post I'll try to provide answers to such questions as:

Why do only two of my service provider's hops show in a traceroute?
I am seeing loss / high latency on my traceroute as soon as it enters my provider's MPLS core. The provider says it's due to the destination site's link being congested but it appears way before that in the trace. What's going on?
My firewall is attached to an MPLS provider's PE and it is complaining about spoofed packets. When I examine the packets I see they are ICMP unreachables destined to another site on the MPLS WAN. Why would I see this?
What does it mean when my traceroute shows MPLS: whatever?

To understand the answer to any of these questions, you will need to appreciate a) how traceroutes work (check Wikipedia if unsure) and b) the concept of IP / MPLS TTL propagation.

Life Without IP / MPLS TTL Propagation

IP and MPLS both have a TTL field which is used to prevent routing loops from causing packets to circulate forever. IP / MPLS TTL propagation is an optional mechanism which copies the IP TTL into the MPLS shim header upon encapsulation, and copies it back from the MPLS shim header into the IP packet when decapsulation occurs.

In order to see why we'd ever want TTL propagation, let's take a worked example of CE1 tracing to CE2 without TTL propagation.

Hop 1

The first hop is pretty straightforward. CE1 sends a packet towards CE2 with a TTL of 1.

PE1 receives the packet and decrements the TTL, which reaches zero. PE1 sends a TTL expired message to the sender (CE1).

Hop 2

Next, CE1 sends a packet towards CE2 with an IP TTL of 2.

PE1 decrements the TTL, finds it to be non-zero (1) and decides to forward the packet on via MPLS. Since IP / MPLS TTL propagation is disabled, the MPLS shim header gets a default TTL of 255.

P1 receives the labelled frame, decrements the MPLS TTL, finds it to be non-zero (254) and label switches it.

P2 receives the labelled frame, decrements the MPLS TTL, finds it to be non-zero (253) and label switches it.

PE2 receives the labelled frame, pops the MPLS label off and looks at the IP within. When PE2 decrements the IP TTL it becomes zero, so a TTL expired message is sent to the sender. Note, since PE2 has an interface address within the VRF, the TTL expired message is sourced from that address rather than the interface where the original packet entered.

Hop 3

Finally, CE1 sends a packet towards CE2 with an IP TTL of 3.

PE1 decrements the TTL, finds it to be non-zero (2) and decides to forward the packet on via MPLS. Since IP / MPLS TTL propagation is disabled, the MPLS shim header again gets a default TTL of 255.

P1 receives the labelled frame, decrements the MPLS TTL, finds it to be non-zero (254) and label switches it.

P2 receives the labelled frame, decrements the MPLS TTL, finds it to be non-zero (253) and label switches it.

PE2 receives the labelled frame, pops the MPLS label off and looks at the IP within. PE2 decrements the IP TTL, finds it to be non-zero (1) and forwards the packet.

CE2 receives the packet, finds that it is destined to itself and responds to the sender, completing the traceroute.

The Output

As you can see, with TTL propagation disabled, the entry and exit PEs show up in the traceroute but pure label switched hops (P routers) do not:

CE1#trace 2.2.2.2

Type escape sequence to abort.
Tracing the route to lo-0.ce2.wormtail.co.uk (2.2.2.2)

1 ge-2-1.pe1.wormtail.co.uk (192.168.1.1) 4 msec 4 msec 2 msec
2 ge-2-1.pe2.wormtail.co.uk (192.168.2.1) 4 msec 4 msec 4 msec
3 ge-0.ce2.wormtail.co.uk (192.168.2.100) 8 msec 7 msec 6 msec
CE1#

Sometimes that's desirable, particularly if the carrier wants to hide how many hops it takes to get from A to B.

In any case, we can answer the first question:

Why do only two of my service provider's hops show in a traceroute?

Your provider has disabled IP/MPLS TTL propagation, meaning that label-switched hops do not appear in traceroutes.

Enabling TTL Propagation

Let's consider a second worked example showing the same traceroute running over the same network but with TTL propagation enabled.

Hop 1

The first hop doesn't get as far as being MPLS encapsulated so it behaves exactly the same as with propagation disabled, i.e. CE1 sends a packet towards CE2 with an IP TTL of 1, PE1 decrements that to zero and generates a TTL expired.

Hop 2

Now CE1 sends a packet towards CE2 with an IP TTL of 2:

PE1 receives the packet and decrements the TTL to 1. This is non-zero so the packet is forwarded using MPLS. Since IP - MPLS TTL propagation is enabled on PE1, the MPLS TTL is copied from the already decremented IP TTL value.

Now, when P1 receives the labelled packet it decrements the TTL and finds zero. P1 sends a TTL expired message towards the sender.

Hop 3

Hop 3 behaves in essentially the same manner as hop 2 - the MPLS TTL decrements to zero and a TTL expired message is sent to the sender.

Hop 4

Now CE1 sends a packet towards CE2 with an IP TTL of 4:

The behaviour to note here is that when PE2 pops the MPLS label, it copies the MPLS TTL back into the IP TTL. The IP TTL is now 1, as if it had been decremented at each hop. Finally, PE2 decrements the new IP TTL, finds that it is zero and sends a TTL expired.

Hop 5

Finally, CE1 sends a packet towards CE2 with an IP TTL of 5:

Essentially, PE1 decrements and copies into the MPLS TTL, each label switch hop decrements the MPLS TTL, PE2 copies the MPLS TTL back into the IP TTL, then forwards the packet on to CE2. CE2 realises that the packet is destined to itself and sends a response to CE1.

In this case, every hop is visible in the traceroute because the TTL behaves consistently and is decremented at each hop:

CE1#trace 2.2.2.2

Type escape sequence to abort.
Tracing the route to lo-0.ce2.wormtail.co.uk (2.2.2.2)

1 ge-2-1.pe1.wormtail.co.uk (192.168.1.1) 6 msec 4 msec 3 msec
2 xe-1-1.p1.wormtail.co.uk (10.1.100.2) 4 msec 4 msec 4 msec
3 xe-1-1.p2.wormtail.co.uk (10.100.101.2) 5 msec 4 msec 5 msec
4 ge-2-1.pe2.wormtail.co.uk (192.168.2.1) 2 msec 7 msec 4 msec
5 ge-0.ce2.wormtail.co.uk (192.168.2.100) 6 msec 6 msec 4 msec
CE1#

Ah, a beautiful traceroute with all the hops on... now we can troubleshoot stuff :)
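
Both walkthroughs can be reproduced with a toy model. This is just a sketch of the TTL handling as described in this post for the CE1 - PE1 - P1 - P2 - PE2 - CE2 path; the function names are made up, and real routers differ in detail (for example in exactly how the egress PE chooses the TTL to write back).

```python
def responder(ip_ttl, propagate):
    """Return the router that answers a probe CE1 sends with this IP TTL."""
    # PE1 routes on IP: decrement, and expire if it hits zero.
    ip_ttl -= 1
    if ip_ttl == 0:
        return "PE1"
    # PE1 pushes a label: copy the IP TTL if propagating, else use 255.
    mpls_ttl = ip_ttl if propagate else 255
    # P routers only look at the MPLS TTL.
    for p_router in ("P1", "P2"):
        mpls_ttl -= 1
        if mpls_ttl == 0:
            return p_router
    # PE2 pops the label, copying the MPLS TTL back if propagating.
    if propagate:
        ip_ttl = mpls_ttl
    ip_ttl -= 1
    if ip_ttl == 0:
        return "PE2"
    return "CE2"  # the destination answers for itself


def traceroute(propagate):
    """Send probes with TTL 1, 2, 3... until the destination replies."""
    hops, ttl = [], 1
    while not hops or hops[-1] != "CE2":
        hops.append(responder(ttl, propagate))
        ttl += 1
    return hops


print(traceroute(propagate=False))  # ['PE1', 'PE2', 'CE2']
print(traceroute(propagate=True))   # ['PE1', 'P1', 'P2', 'PE2', 'CE2']
```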

Hold On!

Now, this all seems good at first glance, but we've forgotten something. One of the greatest features of MPLS VRFs is that the intelligence is only needed around the edge - the entry / exit points of the network (PEs) need to know about a VRF's routes, but the P routers switching labels in the middle only need to know how to reach the PEs and don't have any VRF routes at all.

So how do the P routers reply to traceroutes run within a VRF? They can't look in the VRF routing table to decide how to return the TTL expired message to the host, as they don't have a VRF table. Maybe they see the label that is in use and somehow know what the return label should be? Nope - label-switched paths are unidirectional, so there's no way for a P node to know what labels it would need to attach to a TTL expired message to get it back to the sender.

The answer is that the P routers play a clever little trick. The P router doesn't know how to reach the sender, but if it forwards a TTL expired message in the same way as it would have forwarded the original packet, eventually it will reach the egress PE, which does know how to reach the sender. It sounds a little counter-intuitive, so let's have a diagram:

P1 receives a labelled packet, decrements the MPLS TTL and finds it to be zero. P1 doesn't know anything about the VRF where this traffic originated so it peels off all the labels and looks at the IP packet inside. It generates a TTL expired message ready to respond to the sender and copies the label stack from the original packet onto the message it has just generated. Then it label switches that instead of the original packet. This means it gets sent onwards to P2, then PE2, where the labels are removed and the IP packet can be routed. Seeing that the ICMP TTL expired message is destined for CE1, PE2 applies the appropriate label and sends it back into the network.
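
The path that reply takes can be written out with a few lines of Python. This is only a sketch under assumptions: the names are invented, the return path is taken to be the exact reverse of the forward path, and only P1 and P2 are treated as lacking VRF routes.

```python
FORWARD_PATH = ["CE1", "PE1", "P1", "P2", "PE2", "CE2"]
LABEL_SWITCHED = {"P1", "P2"}  # routers with no VRF routes


def ttl_expired_path(expired_at, u_turn_at="PE2"):
    """List the routers a TTL expired message visits on its way to CE1."""
    i = FORWARD_PATH.index(expired_at)
    if expired_at not in LABEL_SWITCHED:
        # The router can route within the VRF, so it replies directly.
        return FORWARD_PATH[i::-1]
    j = FORWARD_PATH.index(u_turn_at)
    onward = FORWARD_PATH[i:j + 1]   # carried along the original label stack
    back = FORWARD_PATH[j - 1::-1]   # routed back towards the sender
    return onward + back


print(ttl_expired_path("PE1"))  # ['PE1', 'CE1']
print(ttl_expired_path("P1"))   # ['P1', 'P2', 'PE2', 'P2', 'P1', 'PE1', 'CE1']
```

So the reply from P1 crosses P2 and PE2 before it ever heads back towards CE1, even though P1 is only two hops from the sender.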

Indirect FECs

Now, as if that wasn't quirky enough, we will look at a very similar setup but where the destination of the traceroute is not on the PE's directly attached network but is one hop away:

For efficiency reasons, many vendors allocate a label per next-hop rather than a label per VRF. When the PE receives such a label, it automatically adds the encapsulation for the next hop device's MAC and throws the frame out of the appropriate interface without any kind of routing lookup. Hurray, we burnt a label to save a TCAM lookup.

This has the effect of extending the quirk in the previous example - because the PE does not make a routing decision for these packets, the U-turning of ICMP TTL expired messages occurs one hop further along, at the CE:

No big deal, you might think, but remember: the difference between a PE and a CE is that the PE is normally in the provider's PoP on big fat 10 Gbps links while the CE could be out in the sticks somewhere on the end of a 2 Mbps DSL line... U-turning at the CE means that responses from label-switched hops near the beginning of a trace will appear to inherit any packet loss and / or latency being experienced by hops further along, up to and including the PE - CE link.

I am seeing loss / high latency on my traceroute as soon as it enters my provider's MPLS core. The provider says it's due to the destination site's link being congested but it appears way before that in the trace. What's going on?

When you traceroute to a prefix beyond a CE device, the TTL expired messages from label-switched hops usually have to be U-turned by the CE device, which is on the other end of the congested link. Therefore label-switched hops will often show packet loss / high latency in a traceroute even though the issue is further downstream.
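
Some back-of-envelope arithmetic shows how big the effect can be. The per-link delays below are invented numbers (1 ms per core link, 40 ms across a congested PE2 - CE2 link), and the helper names are hypothetical; the sketch simply adds up the links each probe and reply has to cross.

```python
NODES = ["CE1", "PE1", "P1", "P2", "PE2", "CE2"]
# Hypothetical one-way delay in ms for each link, in path order.
# The last entry is the congested PE2 - CE2 link.
DELAY_MS = [1, 1, 1, 1, 40]


def one_way(a, b):
    """Sum the link delays between two routers on the path."""
    i, j = sorted((NODES.index(a), NODES.index(b)))
    return sum(DELAY_MS[i:j])


def reported_rtt(hop, u_turn_at=None):
    """Round-trip time CE1 would measure for the reply from a given hop."""
    if u_turn_at is None:  # hop can route the reply straight back
        return 2 * one_way("CE1", hop)
    # Probe out to the hop, reply onward to the U-turn point, then back.
    return one_way("CE1", hop) + one_way(hop, u_turn_at) + one_way(u_turn_at, "CE1")


print(reported_rtt("P1", u_turn_at="PE2"))  # 8
print(reported_rtt("P1", u_turn_at="CE2"))  # 88
print(reported_rtt("CE2"))                  # 88
```

With the U-turn at the CE, P1 looks every bit as slow as CE2 itself, despite sitting two hops from the sender.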

Weird Firewall Logs

If we consider MPLS into the datacentre, we can often tweak the example above and replace the CE router with a firewall:

Now, to recap:

Label-switched hops don't have routing knowledge to send a TTL expired message, so they attach the forward path label
The PE knows that the label corresponds to a prefix behind the firewall so it encapsulates with the firewall's MAC as the destination and throws it straight out of the interface

So the firewall receives a packet on its WAN interface which is destined for something out of its WAN interface. Most firewalls don't like that - Cisco ASA firewalls, for example, by default will not route any traffic back out of the same interface where it entered.

The traffic split-horizon rule can be overridden in config, but there is another issue to overcome - any self-respecting firewall will perform uRPF checks on incoming packets before too much else happens. Unless the firewall has a default route pointing out into MPLS land then we are going to have a problem when packets arrive from the carrier's label switched core, often sourced from a public IP.

Finally, unless your firewall rules are pretty sloppy, the traffic will be dropped by a filter anyway!

Therefore, the firewall logs spoofed packets or at least dropped ICMP unreachable packets with a source IP of the carrier's PE and a destination IP of some device in a remote site.
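
For anyone unfamiliar with uRPF, the strict form of the check boils down to "would I route back to this source out of the interface the packet came in on?". The sketch below illustrates that with Python's ipaddress module; the routing table, interface names and function names are all invented for the example and don't reflect any particular firewall's implementation.

```python
import ipaddress

# Hypothetical firewall routing table: (prefix, outgoing interface).
ROUTES = [
    (ipaddress.ip_network("192.168.0.0/16"), "wan"),      # MPLS WAN sites
    (ipaddress.ip_network("172.16.0.0/24"), "inside"),    # servers behind the firewall
    (ipaddress.ip_network("0.0.0.0/0"), "internet"),      # default route, not via MPLS
]


def best_route(routes, address):
    """Longest-prefix match: return the interface used to reach address."""
    addr = ipaddress.ip_address(address)
    matches = [(net, ifname) for net, ifname in routes if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]


def passes_strict_urpf(routes, source, in_interface):
    """Accept only if the route back to the source uses the arrival interface."""
    return best_route(routes, source) == in_interface


# A TTL expired message sourced from a core router arrives on the WAN side.
print(passes_strict_urpf(ROUTES, "10.1.100.2", "wan"))   # False
# Ordinary traffic from another WAN site passes.
print(passes_strict_urpf(ROUTES, "192.168.1.1", "wan"))  # True
```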

So, mystery solved:

My firewall is attached to an MPLS provider's PE and it is complaining about spoofed packets. When I examine the packets I see they are ICMP unreachables destined to another site on the MPLS WAN. Why would I see this?

Your carrier is using label per next-hop with IP - MPLS TTL propagation and some remote device tried to traceroute towards an address behind your firewall.

Though, if you jumped to this section, you probably want to read the preceding sections to make sense of that!

MPLS Information in Traceroutes

If your router supports the feature, you may see MPLS labels quoted in a traceroute, as shown here:

PE1#trace 10.255.255.2
Type escape sequence to abort.
Tracing the route to lo-0.pe2.wormtail.co.uk (10.255.255.2)
VRF info: (vrf in name/id, vrf out name/id)
1 xe-1-1.p1.wormtail.co.uk (10.1.100.2) [MPLS: Label 17 Exp 0] 1 msec 6 msec 2 msec
2 xe-1-1.p2.wormtail.co.uk (10.100.101.2) [MPLS: Label 42 Exp 0] 6 msec 2 msec 2 msec
3 xe-1-1.pe2.wormtail.co.uk (10.2.101.1) 4 msec 4 msec 3 msec
PE1#

Having the labels can be useful for troubleshooting, not that I've ever used them. The presence of the EXP marking could be useful for debugging QoS issues, at least. I always assumed that there was some kind of MPLS trick used to relay the information back but then I noticed you sometimes get it when you trace through an external provider's network which hands over to you on IP.

It's actually done using ICMP extensions for MPLS (RFC 4950) - basically the whole MPLS label stack as it arrived at the label switch router is wrapped up in an ICMP extension header and copied wholesale into the unreachable or TTL expired message. As long as the ICMP message gets to you, so will the label stack.

Most hosts don't support the extension, so you just see a normal-looking traceroute result, but recent routers generally do and can decode the information and present it to you.
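
Decoding the label stack itself is simple once you have the bytes. Each entry is a 32-bit word holding a 20-bit label, 3 EXP bits, a bottom-of-stack bit and an 8-bit TTL. The sketch below decodes a run of such entries; locating them inside the ICMP extension structure is left out, and the function name and sample bytes are made up for illustration.

```python
import struct


def decode_label_stack(data):
    """Decode consecutive 32-bit MPLS label stack entries from raw bytes.

    len(data) must be a multiple of 4. Decoding stops at the entry
    that has the bottom-of-stack bit set.
    """
    entries = []
    for (word,) in struct.iter_unpack("!I", data):
        entry = {
            "label": word >> 12,             # top 20 bits
            "exp": (word >> 9) & 0x7,        # 3 bits
            "bottom": bool((word >> 8) & 1), # bottom-of-stack flag
            "ttl": word & 0xFF,              # low 8 bits
        }
        entries.append(entry)
        if entry["bottom"]:
            break
    return entries


# Made-up sample: label 42 on top of label 17, both with EXP 0 and TTL 1.
sample = bytes.fromhex("0002a001" "00011101")
for entry in decode_label_stack(sample):
    print(entry)
```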

Here's a sample packet inspected in Wireshark:

So:

What does it mean when my traceroute shows MPLS: whatever?

Your traceroute has run through an MPLS backbone somewhere, and the MPLS label stack is included in the ICMP message for troubleshooting purposes.

Fairly self-explanatory, but good to know.

References

Wikipedia article on Traceroute
RFC 4950 (MPLS extensions to ICMP)
Per CE and per-VRF labelling mode on Cisco IOS-XR

Networking Bodges

Friday, 19 June 2015

JunOS Traceroute Error Codes

Monday, 15 June 2015

Re-assembling IP Fragments in PCAP Files


Sunday, 14 June 2015

Quirks of Traceroute over MPLS Networks
