Sunday 14 June 2015

Quirks of Traceroute over MPLS Networks

Working on MPLS networks, particularly with VRFs, there are a few questions that come up time and time again about traceroutes. 

In this post I'll try to provide answers to such questions as:
  • Why do only two of my service provider's hops show in a traceroute?
  • I am seeing loss / high latency on my traceroute as soon as it enters my provider's MPLS core. The provider says it's due to the destination site's link being congested but it appears way before that in the trace. What's going on?
  • My firewall is attached to an MPLS provider's PE and it is complaining about spoofed packets. When I examine the packets I see they are ICMP unreachables destined to another site on the MPLS WAN. Why would I see this? 
  • What does it mean when my traceroute shows MPLS: whatever?

To understand the answer to any of these questions, you will need to appreciate a) how traceroutes work (check Wikipedia if unsure) and b) the concept of IP / MPLS TTL propagation.

Life Without IP / MPLS TTL Propagation


IP and MPLS both have a TTL field which is used to prevent routing loops from causing packets to circulate forever. IP / MPLS TTL propagation is an optional mechanism which copies the IP TTL into the MPLS shim header upon encapsulation, and copies it back from the MPLS shim header into the IP packet when decapsulation occurs. 

In order to see why we'd ever want TTL propagation, let's take a worked example of CE1 tracing to CE2 without TTL propagation.

Hop 1


The first hop is pretty straightforward. CE1 sends a packet towards CE2 with a TTL of 1.


PE1 receives the packet and decrements the TTL, which reaches zero. PE1 sends a TTL expired message to the sender (CE1).

Hop 2


Next, CE1 sends a packet towards CE2 with an IP TTL of 2.


PE1 decrements the TTL, finds it to be non-zero (1) and decides to forward the packet on via MPLS. Since IP MPLS TTL propagation is disabled, the MPLS shim header gets a default TTL of 255.

P1 receives the labelled frame, decrements the MPLS TTL, finds it to be non-zero (254) and label switches it.

P2 receives the labelled frame, decrements the MPLS TTL, finds it to be non-zero (253) and label switches it.

PE2 receives the labelled frame, pops the MPLS label off and looks at the IP within. When PE2 decrements the IP TTL it becomes zero, so a TTL expired message is sent to the sender. Note, since PE2 has an interface address within the VRF, the TTL expired message is sourced from that address rather than the interface where the original packet entered.

Hop 3


Finally, CE1 sends a packet towards CE2 with an IP TTL of 3.


PE1 decrements the TTL, finds it to be non-zero (2) and decides to forward the packet on via MPLS. Since IP MPLS TTL propagation is disabled, the MPLS shim header again gets a default TTL of 255.

P1 receives the labelled frame, decrements the MPLS TTL, finds it to be non-zero (254) and label switches it.

P2 receives the labelled frame, decrements the MPLS TTL, finds it to be non-zero (253) and label switches it.

PE2 receives the labelled frame, pops the MPLS label off and looks at the IP within. PE2 decrements the IP TTL, finds it to be non-zero (1) and forwards the packet.

CE2 receives the packet, finds that it is destined to itself and responds to the sender, completing the traceroute.

The Output


As you can see, with TTL propagation disabled, the entry and exit PEs show up in the traceroute but pure label switched hops (P routers) do not:

CE1#trace 2.2.2.2

Type escape sequence to abort.
Tracing the route to lo-0.ce2.wormtail.co.uk (2.2.2.2)

  1 ge-2-1.pe1.wormtail.co.uk (192.168.1.1) 4 msec 4 msec 2 msec
  2 ge-2-1.pe2.wormtail.co.uk (192.168.2.1) 4 msec 4 msec 4 msec
  3 ge-0.ce2.wormtail.co.uk (192.168.2.100) 8 msec 7 msec 6 msec
CE1#

Sometimes that's desirable, particularly if the carrier wants to hide how many hops it takes to get from A to B.

In any case, we can answer the first question:
Why do only two of my service provider's hops show in a traceroute?

Your provider has disabled IP/MPLS TTL propagation, meaning that label-switched hops do not appear in traceroutes.

Enabling TTL Propagation


Let's consider a second worked example showing the same traceroute running over the same network but with TTL propagation enabled.

Hop 1


The first hop doesn't get as far as being MPLS encapsulated so it behaves exactly the same as with propagation disabled, i.e. CE1 sends a packet towards CE2 with an IP TTL of 1, PE1 decrements that to zero and generates a TTL expired.

Hop 2


Now CE1 sends a packet towards CE2 with an IP TTL of 2:


PE1 receives the packet and decrements the TTL to 1. This is non-zero so the packet is forwarded using MPLS. Since IP - MPLS TTL propagation is enabled on PE1, the MPLS TTL is copied from the already decremented IP TTL value.

Now, when P1 receives the labelled packet it decrements the TTL and finds zero. P1 sends a TTL expired message towards the sender.

Hop 3


Hop 3 behaves in essentially the same manner as hop 2 - the MPLS TTL decrements to zero and a TTL expired message is sent to the sender.

Hop 4


Now CE1 sends a packet towards CE2 with an IP TTL of 4:


The behaviour to note here is that when PE2 pops the MPLS label, it copies the MPLS TTL back into the IP TTL. The IP TTL is now 1, as if it had been decremented at each hop. Finally, PE2 decrements the new IP TTL, finds that it is zero and sends a TTL expired.

Hop 5


Finally, CE1 sends a packet towards CE2 with an IP TTL of 5:


Essentially, PE1 decrements and copies into the MPLS TTL, each label switch hop decrements the MPLS TTL, PE2 copies the MPLS TTL back into the IP TTL, then forwards the packet on to CE2. CE2 realises that the packet is destined to itself and sends a response to CE1.

In this case, every hop is visible in the traceroute because the TTL behaves consistently and is decremented at each hop:

CE1#trace 2.2.2.2

Type escape sequence to abort.
Tracing the route to lo-0.ce2.wormtail.co.uk (2.2.2.2)

  1 ge-2-1.pe1.wormtail.co.uk (192.168.1.1) 6 msec 4 msec 3 msec
  2 xe-1-1.p1.wormtail.co.uk (10.1.100.2) 4 msec 4 msec 4 msec
  3 xe-1-1.p2.wormtail.co.uk (10.100.101.2) 5 msec 4 msec 5 msec
  4 ge-2-1.pe2.wormtail.co.uk (192.168.2.1) 2 msec 7 msec 4 msec
  5 ge-0.ce2.wormtail.co.uk (192.168.2.100) 6 msec 6 msec 4 msec
CE1#


Ah, a beautiful traceroute with all the hops on... now we can troubleshoot stuff :)

Hold On!


Now, this all seems good at first glance, but we've forgotten something. One of the greatest features of MPLS VRFs is that the intelligence is only needed around the edge. - The entry / exit points of the network (PEs) need to know about a VRF's routes, but the P routers switching labels in the middle only need to know how to reach the PEs and don't have any VRF routes at all.

So how do the P routers reply to traceroutes run within a VRF? They can't look in the VRF routing table to decide how to return the unreachable message to the host, as they don't have a VRF table. Maybe they see the label that is in use and somehow know what the return label should be? Nope - label-switched paths are unidirectional, so there's no way for a P node to know what labels it would need to attach to an unreachable message to get it back to the sender.

The answer is that the P routers play a clever little trick. The P router doesn't know how to reach the sender, but if it forwards a TTL expired message in the same way as it would have done the original packet eventually it will reach the egress PE which does know how to reach the sender. It sounds a little counter-intuitive, so let's have a diagram:




P1 receives a labelled packet, decrements the MPLS TTL and finds it to be zero. P1 doesn't know anything about the VRF where this traffic originated so it peels off all the labels and looks at the IP packet inside. It generates a TTL expired message ready to respond to the sender and copies the label stack from the original packet onto the message it has just generated. Then it label switches that instead of the original packet. This means it gets sent onwards to P2, then PE2, where the labels are removed and the IP packet can be routed. Seeing that the ICMP TTL expired message is destined for CE1, PE2 applies the appropriate label and sends it back into the network.

Indirect FECs


Now, as if that wasn't quirky enough, we will look at a very similar setup but where the destination of the traceroute is not on the PE's directly attached network but is one hop away:



For efficiency reasons, many vendors allocate a label per next-hop rather than a label per VRF. When the PE receives such a label, it automatically adds the encapsulation for the next hop device's MAC and throws the frame out of the appropriate interface without any kind of routing lookup. Hurray, we burnt a label to save a TCAM lookup.

This has the effect of extending the quirk in the previous example - because the PE does not make a routing decision for these packets, the U-turning of ICMP unreachables occurs one hop further along, at the CE:

No big deal, you might think, but remember: the difference between a PE and a CE is that the PE is normally in the provider's PoP on big fat 10 Gbps links while the CE could be out in the sticks somewhere on the end of a 2 Mbps DSL line... U-turning at the CE means that responses from label-switched hops near the beginning of a trace will appear to inherit any packet loss and / or latency being experienced by hops further along, up to and including the PE - CE link.

I am seeing loss / high latency on my traceroute as soon as it enters my provider's MPLS core. The provider says it's due to the destination site's link being congested but it appears way before that in the trace. What's going on?

When you traceroute to a prefix beyond a CE device, the unreachables from label-switched hops usually have to be U-turned by the CE device, which is on the other end of the congested link. Therefore label-switched hops will often show packet loss / high latency in a traceroute even though the issue is further downstream.


Weird Firewall Logs


If we consider MPLS into the datacentre, we can often tweak the example above and replace the CE router with a firewall:



Now, to recap:

  • Label-switched hops don't have routing knowledge to send a TTL expired message, so they attach the forward path label
  • The PE knows that the label corresponds to a prefix behind the firewall so it encapsulates with the firewall's MAC as the destination and throws it straight out of the interface
So the firewall receives a packet on its WAN interface which is destined for something out of its WAN interface. Most firewalls don't like that - Cisco ASA firewalls, for example, by default will not route any traffic back out of the same interface where it entered.

The traffic split-horizon rule can be overridden in config, but there is another issue to overcome - any self-respecting firewall will perform uRPF checks on incoming packets before too much else happens. Unless the firewall has a default route pointing out into MPLS land then we are going to have a problem when packets arrive from the carrier's label switched core, often sourced from a public IP.

Finally, unless your firewall rules are pretty sloppy, the traffic will be dropped by filter anyway!

Therefore, the firewall logs spoofed packets or at least dropped ICMP unreachable packets with a source IP of the carrier's PE and a destination IP of some device in a remote site.

So, mystery solved:

My firewall is attached to an MPLS provider's PE and it is complaining about spoofed packets. When I examine the packets I see they are ICMP unreachables destined to another site on the MPLS WAN. Why would I see this?

Your carrier is using label per next-hop with IP - MPLS TTL propagation and some remote device tried to traceroute towards an address behind your firewall.

Though, if you jumped to this section, you probably want to read the preceding sections to make sense of that!

MPLS Information in Traceroutes


If your router supports the feature, you may see MPLS labels quoted in a traceroute, as shown here:

PE1#trace 10.255.255.2
Type escape sequence to abort.
Tracing the route to lo-0.pe2.wormtail.co.uk (10.255.255.2)
VRF info: (vrf in name/id, vrf out name/id)
  1 xe-1-1.p1.wormtail.co.uk (10.1.100.2) [MPLS: Label 17 Exp 0] 1 msec 6 msec 2 msec
  2 xe-1-1.p2.wormtail.co.uk (10.100.101.2) [MPLS: Label 42 Exp 0] 6 msec 2 msec 2 msec
  3 xe-1-1.pe2.wormtail.co.uk (10.2.101.1) 4 msec 4 msec 3 msec
PE1#


Having the labels can be useful for troubleshooting, not that I've ever used them. The presence of the EXP marking could be useful for debugging QoS issues, at least. I always assumed that there was some kind of MPLS trick used to relay the information back but then I noticed you sometimes get it when you trace through an external provider's network which hands over to you on IP.

It's actually done using ICMP extensions for MPLS (RFC 4950) - basically the whole MPLS label stack as it arrived at the label switch router is wrapped up in an ICMP extension header and copied wholesale into the unreachable or TTL expired message. As long as the ICMP message gets to you, so will the label stack.

Most hosts don't support the extension, so you just see a normal-looking traceroute result, but recent routers generally do and can decode the information and present it to you.

Here's a sample packet inspected in Wireshark:

So:

What does it mean when my traceroute shows MPLS: whatever?

Your traceroute has run through an MPLS backbone somewhere, the MPLS label stack is included in the unreachable message for troubleshooting purposes.
Fairly self explanatory, but good to know.

References


Wikipedia article on Traceroute
RFC 4950 (MPLS extensions to ICMP)
Per CE and per-VRF labelling mode on Cisco IOS-XR

2 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. Nice write up. Thanks for this nice explanation.
    Any documents for MPLS QoS to share.

    ReplyDelete