Saturday, 23 January 2016

A Down in the Weeds look at Route Distinguishers

I was recently involved in a discussion on reddit about VRF route targets and route distinguishers, and I noticed that there was a lot of misinformation flying around. That doesn't really surprise me, as a lot of the folks on there are learning, and I've heard some jarring misconceptions on the topic come from senior guys who have worked with MPLS for years. Most of the route target stuff was straightened out quite quickly and I will not get into any of that here. The route distinguisher debate, however, went on longer and covered some areas that seemed to be new or controversial to a lot of people.

The crux of the issue is that a lot of people believe the route distinguisher to be only locally significant, and apparently there are many resources on the Internet which say this. I'll grant you that many are ambiguous. For example, the first hit on Google for "route distinguisher and route target" says that "The route distinguisher has only one purpose, to make IPv4 prefixes globally unique. It is used by the PE routers to identify which VPN a packet belongs to". The well-respected packet life blog says "As its name implies, a route distinguisher (RD) distinguishes one set of routes (one VRF) from another. It is a unique number prepended to each route within a VRF to identify it as belonging to that particular VRF or customer." To be fair, it goes on to clarify that "An RD is carried along with a route via MP-BGP when exchanging VPN routes with other PE routers", which hints at the global significance.

In this post I hope to prove to anyone who is interested that route distinguishers are, in fact, both locally and globally significant and to demonstrate why this is important to understand.

Local Significance


If you've got this far then I assume you are already familiar with what route targets and route distinguishers do. If not, I suggest you read up and play in the lab for a while before venturing on.

The reason for needing a route distinguisher locally within a device is to extend the normal IPv4 prefixes that are known within each VRF in order to make them unique. Any locally learned IPv4 prefixes (connected, static or learned via an IPv4 routing protocol) are extended with the route distinguisher assigned to the VRF, as shown here:
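
The byte layout behind that extension can be sketched in a few lines of Python. This is an illustration rather than anything taken from the routers: it assumes the type 0 RD encoding (2-byte type, 2-byte AS number, 4-byte assigned number) and uses the two RDs that appear in the lab outputs further down. The function names are made up for the example.

```python
import ipaddress
import struct


def encode_rd_type0(asn: int, number: int) -> bytes:
    """Pack a type 0 RD: 2-byte type (0), 2-byte AS number, 4-byte assigned number."""
    return struct.pack("!HHI", 0, asn, number)


def vpnv4_address(rd: bytes, ipv4: str) -> bytes:
    """A VPNv4 address is the 8-byte RD followed by the 4-byte IPv4 address."""
    return rd + ipaddress.IPv4Address(ipv4).packed


# Same IPv4 network, two different RDs (values borrowed from the lab below).
a = vpnv4_address(encode_rd_type0(100, 2439), "192.168.1.0")
b = vpnv4_address(encode_rd_type0(100, 2458), "192.168.1.0")

print(len(a), a.hex())
print(len(b), b.hex())
print("distinct:", a != b)
```

The point of the sketch is simply that the same IPv4 network becomes two different 12-byte addresses once the RD is prepended.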


It is also true that different PEs may use different route distinguishers for the same VRF without breaking anything:

PE1#show run vrf A
Building configuration...

Current configuration : 316 bytes
ip vrf A
 rd 100:2439
 route-target export 100:100
 route-target import 100:100
!
!
interface FastEthernet1/0
 ip vrf forwarding A
 ip address 192.168.1.1 255.255.255.0
 speed auto
 duplex auto
!
router bgp 100
 !
 address-family ipv4 vrf A
  redistribute connected
  redistribute static
 exit-address-family
!
end

PE1#show ip route vrf A

Routing Table: A


Gateway of last resort is not set

      192.168.1.0/24 is variably subnetted, 2 subnets, 2 masks
C        192.168.1.0/24 is directly connected, FastEthernet1/0
L        192.168.1.1/32 is directly connected, FastEthernet1/0
B     192.168.24.0/24 [200/0] via 10.255.255.2, 00:04:03
PE1#



PE2#show run vrf A
Building configuration...

Current configuration : 317 bytes
ip vrf A
 rd 100:2458
 route-target export 100:100
 route-target import 100:100
!
!
interface FastEthernet1/0
 ip vrf forwarding A
 ip address 192.168.24.1 255.255.255.0
 speed auto
 duplex auto
!
router bgp 100
 !
 address-family ipv4 vrf A
  redistribute connected
  redistribute static
 exit-address-family
!
end

PE2#show ip route vrf A

Routing Table: A

Gateway of last resort is not set

B     192.168.1.0/24 [200/0] via 10.255.255.1, 00:03:44
      192.168.24.0/24 is variably subnetted, 2 subnets, 2 masks
C        192.168.24.0/24 is directly connected, FastEthernet1/0
L        192.168.24.1/32 is directly connected, FastEthernet1/0
PE2#


So it's easy to see how the idea got started that RDs are only locally significant:

Route distinguishers don't need to match between devices in the same VRF in order for routes to be shared between them.

Global Significance


The first clue to the global significance of the route distinguisher is that it is carried in the MP-BGP updates:

PE1#show bgp vpnv4 unicast all 192.168.24.0/24 
BGP routing table entry for 100:2439:192.168.24.0/24, version 8
Paths: (1 available, best #1, table A)
  Not advertised to any peer
  Refresh Epoch 2
  Local, imported path from 100:2458:192.168.24.0/24 (global)
    10.255.255.2 (metric 3) from 10.255.255.200 (10.255.255.200)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:100
      Originator: 10.255.255.2, Cluster list: 10.255.255.200
      mpls labels in/out nolabel/28
      rx pathid: 0, tx pathid: 0x0
BGP routing table entry for 100:2458:192.168.24.0/24, version 7
Paths: (1 available, best #1, no table)
  Not advertised to any peer
  Refresh Epoch 2
  Local
    10.255.255.2 (metric 3) from 10.255.255.200 (10.255.255.200)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:100
      Originator: 10.255.255.2, Cluster list: 10.255.255.200
      mpls labels in/out nolabel/28
      rx pathid: 0, tx pathid: 0x0
PE1#

Interestingly, we have 2 different prefixes here. One is the original (100:2458:192.168.24.0/24) which we learned over the network, while the other is the same IPv4 prefix but prepended with the RD of the VRF which imports it (100:2439:192.168.24.0/24). If we imported it into multiple VRFs then we would have an additional copy for each RD used by the respective VRFs.
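
As a rough model of that import step (a sketch, not how IOS implements it internally), the function below copies a received VPNv4 route into every VRF whose import route targets match, re-keying each copy with the importing VRF's RD. VRF A and the received route come from the output above; VRFs C and D and their RD and RT values are hypothetical, added only to show the "one copy per importing VRF" behaviour.

```python
def import_into_vrfs(route, vrfs):
    """Return one re-keyed copy of `route` per VRF whose import RTs match."""
    copies = []
    for name, vrf in vrfs.items():
        # The import decision looks only at route targets, never at the RD.
        if set(route["rts"]) & set(vrf["import_rts"]):
            copies.append({
                **route,
                "rd": vrf["rd"],  # the copy is keyed by the importing VRF's RD
                "table": name,
                "imported_from": f'{route["rd"]}:{route["prefix"]}',
            })
    return copies


received = {"rd": "100:2458", "prefix": "192.168.24.0/24", "rts": ["100:100"]}
vrfs = {
    "A": {"rd": "100:2439", "import_rts": ["100:100"]},
    "C": {"rd": "100:9999", "import_rts": ["100:100"]},  # hypothetical second importer
    "D": {"rd": "100:7777", "import_rts": ["100:300"]},  # hypothetical, RT does not match
}

copies = import_into_vrfs(received, vrfs)
for copy in copies:
    print(f'{copy["rd"]}:{copy["prefix"]} in table {copy["table"]}, '
          f'imported from {copy["imported_from"]}')
```

Note that the RD of the received route plays no part in the match, which is why PE1 and PE2 can use different RDs for VRF A.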

If the RD were only locally significant then why would the protocol designers send it? You may be thinking "otherwise you couldn't overlap prefixes!", but surely route targets would be enough to achieve this? If you heard a prefix of 10.0.0.0/8 announced with a route target imported by VRF A then you would import it into VRF A and not VRF B; if you heard a different announcement for 10.0.0.0/8 with a route target imported by VRF B then you would import it into VRF B and not VRF A.

That could kind of work, in theory, but it would essentially break the whole BGP paradigm as you would have multiple copies of the same prefix in use concurrently for different purposes. BGP likes to determine the best path and only offers that into the FIB. With a unique RD against each of the two 10.0.0.0/8 routes, BGP is able to do its best path determination and pass the two, now different, routes into their respective VRFs.

So the route distinguishers overcome that problem, but is that the only reason why they are carried in MP-BGP? That would be a fairly weak argument for global significance, but the best path point here touches on a much stronger case.

The Route Reflector Problem


One key thing to bear in mind which is often forgotten in the grand scheme of things is that the PE is not the only place where BGP best path calculations happen. Any MPLS network of even moderate scale will be using BGP route reflectors to keep the number of BGP sessions under control, and the route reflectors themselves perform a best path determination on the routes they receive before sending them out to their route reflector clients.

This extends the previous case to all the route reflector's clients, so essentially the entire AS. Let's take an example where the admin has been sloppy and has failed to keep RDs globally unique:



Notice that VRF A uses import / export RT 100:100 and VRF B uses import / export RT 100:200. The network administrator has tried to assign unique route distinguishers per VRF per device, but has made an error and overlapped the route distinguishers used on PE1's VRF A and PE3's VRF B (both are 100:2439).

The two VRFs are completely distinct from one another and they are not even present on the same PEs. We can see that VRF A is only learning VRF A's routes and VRF B is only learning VRF B's routes:

PE1#show ip vrf
  Name                             Default RD          Interfaces
  A                                100:2439            Fa1/0
PE1#show ip route vrf A
Routing Table: A

Gateway of last resort is not set

      192.168.1.0/24 is variably subnetted, 2 subnets, 2 masks
C        192.168.1.0/24 is directly connected, FastEthernet1/0
L        192.168.1.1/32 is directly connected, FastEthernet1/0
B     192.168.24.0/24 [200/0] via 10.255.255.2, 00:05:01
PE1#

PE2#show ip vrf
  Name                             Default RD          Interfaces
  A                                100:2458            Fa1/0
PE2#show ip route vrf A

Routing Table: A

Gateway of last resort is not set

B     192.168.1.0/24 [200/0] via 10.255.255.1, 00:05:15
      192.168.24.0/24 is variably subnetted, 2 subnets, 2 masks
C        192.168.24.0/24 is directly connected, FastEthernet1/0
L        192.168.24.1/32 is directly connected, FastEthernet1/0
PE2#

PE3#show ip vrf
  Name                             Default RD          Interfaces
  B                                100:2439            Fa1/0
PE3#show ip route vrf B

Routing Table: B

Gateway of last resort is not set

      192.168.3.0/24 is variably subnetted, 2 subnets, 2 masks
C        192.168.3.0/24 is directly connected, FastEthernet1/0
L        192.168.3.1/32 is directly connected, FastEthernet1/0
B     192.168.19.0/24 [200/0] via 10.255.255.4, 00:01:54
PE3#

PE4#show ip vrf
  Name                             Default RD          Interfaces
  B                                100:2895            Fa1/0
PE4#show ip route vrf B

Routing Table: B

Gateway of last resort is not set

B     192.168.3.0/24 [200/0] via 10.255.255.3, 00:02:21
      192.168.19.0/24 is variably subnetted, 2 subnets, 2 masks
C        192.168.19.0/24 is directly connected, FastEthernet1/0
L        192.168.19.1/32 is directly connected, FastEthernet1/0
PE4#

Now, let's introduce an additional subnet on VRF A. It uses the same address space as VRF B but they are completely separate so that should be fine (right?!).

PE1(config)#ip route vrf A 192.168.3.0 255.255.255.0 192.168.1.10

Customer A is now happy as their new network is reachable over the VRF, but all of a sudden we have customer B on the phone, complaining that their site (which used to work) is off the air. Looking into PE4 we can see why:

PE4#show ip route vrf B

Routing Table: B

Gateway of last resort is not set

      192.168.19.0/24 is variably subnetted, 2 subnets, 2 masks
C        192.168.19.0/24 is directly connected, FastEthernet1/0
L        192.168.19.1/32 is directly connected, FastEthernet1/0
PE4#

The route to 192.168.3.0/24 has disappeared! Why is that? Looking on the route reflector gives us the answer:


RR#show bgp vpnv4 uni all 192.168.3.0/24
BGP routing table entry for 100:2439:192.168.3.0/24, version 14
Paths: (2 available, best #1, no table)
  Advertised to update-groups:
     3
  Refresh Epoch 1
  Local, (Received from a RR-client)
    10.255.255.1 (metric 3) from 10.255.255.1 (10.255.255.1)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:100
      mpls labels in/out nolabel/25
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  Local, (Received from a RR-client)
    10.255.255.3 (metric 3) from 10.255.255.3 (10.255.255.3)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:100:200
      mpls labels in/out nolabel/28
      rx pathid: 0, tx pathid: 0
RR#

Both 192.168.3.0/24 prefixes are being advertised with the same RD but different route targets. The route reflector has, therefore, seen the two "identical" prefixes and has chosen a best path. For want of a better metric it has fallen through to a tie-breaker and picked the path from the lowest router ID, which in this lab is also the lowest next hop address (PE1, VRF A):



Since the route reflector only advertises best paths to its clients, nobody gets to hear about the route from PE3, VRF B. The route advertised from PE1 has a route target of 100:100, which doesn't match any VRF on PE4, so PE4 just discards the route, leaving it with no way to reach the 192.168.3.0/24 network.

This proves that:

If you fail to apply globally unique route distinguishers on at least a per VRF basis, changes in one VRF can impact on another. This is irrespective of whether there are devices common to the two VRFs and occurs even when their route targets are completely different.
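
The failure can be reproduced with a toy model of the route reflector. This is a simplified sketch: the real best path algorithm has many earlier steps, and here everything is assumed equal so that only a lowest-router-ID tie-break remains. The router IDs, RD and route targets are taken from the outputs above; the "corrected" RD 100:2440 is a made-up value.

```python
from ipaddress import IPv4Address


def rr_best_paths(paths):
    """Keep one best path per (RD, prefix), tie-breaking on lowest router ID."""
    best = {}
    for path in paths:
        key = (path["rd"], path["prefix"])
        current = best.get(key)
        if current is None or (
            IPv4Address(path["router_id"]) < IPv4Address(current["router_id"])
        ):
            best[key] = path
    return list(best.values())


def pe_import(paths, import_rt):
    """A PE keeps only the reflected paths carrying a route target it imports."""
    return [p for p in paths if p["rt"] == import_rt]


def pe4_view(pe3_rd):
    """What PE4 (VRF B, import RT 100:200) learns for 192.168.3.0/24."""
    paths = [
        {"rd": "100:2439", "prefix": "192.168.3.0/24",
         "rt": "100:100", "router_id": "10.255.255.1"},  # PE1, VRF A
        {"rd": pe3_rd, "prefix": "192.168.3.0/24",
         "rt": "100:200", "router_id": "10.255.255.3"},  # PE3, VRF B
    ]
    return pe_import(rr_best_paths(paths), import_rt="100:200")


overlapping = pe4_view("100:2439")  # the sloppy configuration from the lab
unique = pe4_view("100:2440")       # hypothetical corrected RD on PE3

print("overlapping RDs:", overlapping)
print("unique RDs:", [p["router_id"] for p in unique])
```

With the overlapping RD the reflector only passes on PE1's path, which PE4 cannot import; with a unique RD both paths survive and PE4 learns the route from PE3.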

Policy at the PE

A similar but more subtle example of where globally unique route distinguishers are a benefit is in the case where you have a multi-homed network connected or routed via two PEs for resilience.
For administrative reasons (say the purple link is cheaper, or faster), we want to use the purple link to reach this prefix whenever it's available. Routes learned over the purple link are tagged with community 100:123 to allow upstream PEs to recognise this. Let's compare the case where both PEs use the same RD vs. the case where each PE uses a unique RD for the same VRF. Firstly, the same RD:



PE1 and PE2 are set to use the same RD. PE3 wants to use purple routes so it is set up with a policy to favour anything with a 100:123 community attached, as follows:

ip vrf A
 rd 100:2512
 import map A-import-map
 route-target export 100:100
 route-target import 100:100
!
route-map A-import-map permit 10
 match community purple
 set local-preference 200
!
route-map A-import-map permit 20
 set local-preference 100
!
ip community-list standard purple permit 100:123

For some reason, though, our traffic all goes out via the orange link. What is happening here is that the route reflector is again receiving two identical prefixes. This does not cause a reachability problem as both prefixes reside within the same VRF, but it does mean that the route reflector makes a best path determination and discards one of the routes. PE3 only receives one route so its policy has to take what it can get:

PE3#show bgp vpnv4 uni all 192.168.200.0/24
BGP routing table entry for 100:2439:192.168.200.0/24, version 61
Paths: (1 available, best #1, no table)
  Not advertised to any peer
  Refresh Epoch 4
  200
    10.255.255.1 (metric 4) from 10.255.255.200 (10.255.255.200)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:100
      Originator: 10.255.255.1, Cluster list: 10.255.255.200
      mpls labels in/out nolabel/30
      rx pathid: 0, tx pathid: 0x0
BGP routing table entry for 100:2512:192.168.200.0/24, version 68
Paths: (1 available, best #1, table A)
  Not advertised to any peer
  Refresh Epoch 4
  200, imported path from 100:2439:192.168.200.0/24 (global)
    10.255.255.1 (metric 4) from 10.255.255.200 (10.255.255.200)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:100
      Originator: 10.255.255.1, Cluster list: 10.255.255.200
      mpls labels in/out nolabel/30
      rx pathid: 0, tx pathid: 0x0
PE3#

Clearly this is not doing what we want. The local VRF table only has one option, and that's the orange route. Let's try the same thing but with unique RDs per VRF per PE:



Now we see this at PE3:

PE3#show bgp vpnv4 uni all 192.168.200.0/24
BGP routing table entry for 100:2439:192.168.200.0/24, version 61
Paths: (1 available, best #1, no table)
  Not advertised to any peer
  Refresh Epoch 4
  200
    10.255.255.1 (metric 4) from 10.255.255.200 (10.255.255.200)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:100
      Originator: 10.255.255.1, Cluster list: 10.255.255.200
      mpls labels in/out nolabel/30
      rx pathid: 0, tx pathid: 0x0
BGP routing table entry for 100:2458:192.168.200.0/24, version 62
Paths: (1 available, best #1, no table)
  Not advertised to any peer
  Refresh Epoch 4
  200
    10.255.255.2 (metric 4) from 10.255.255.200 (10.255.255.200)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Community: 6553723
      Extended Community: RT:100:100
      Originator: 10.255.255.2, Cluster list: 10.255.255.200
      mpls labels in/out nolabel/29
      rx pathid: 0, tx pathid: 0x0
BGP routing table entry for 100:2512:192.168.200.0/24, version 64
Paths: (2 available, best #1, table A)
  Not advertised to any peer
  Refresh Epoch 4
  200, imported path from 100:2458:192.168.200.0/24 (global)
    10.255.255.2 (metric 4) from 10.255.255.200 (10.255.255.200)
      Origin incomplete, metric 0, localpref 200, valid, internal, best
      Community: 6553723
      Extended Community: RT:100:100
      Originator: 10.255.255.2, Cluster list: 10.255.255.200
      mpls labels in/out nolabel/29
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 4
  200, imported path from 100:2439:192.168.200.0/24 (global)
    10.255.255.1 (metric 4) from 10.255.255.200 (10.255.255.200)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:100:100
      Originator: 10.255.255.1, Cluster list: 10.255.255.200
      mpls labels in/out nolabel/30
      rx pathid: 0, tx pathid: 0
PE3#

Now we can see that two routes are received (purple and orange) and our route map has taken effect, pushing the purple route up to a better local preference and causing it to be selected into the VRF A table.
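
One detail in that output is easy to miss: `Community: 6553723` is the 100:123 tag, shown as a single 32-bit number because the router is displaying communities in the old decimal format. The conversion is just a 16-bit split, sketched below with made-up helper names.

```python
def decode_community(value: int) -> str:
    """Split a 32-bit standard community into its AA:NN halves."""
    return f"{value >> 16}:{value & 0xFFFF}"


def encode_community(text: str) -> int:
    """Inverse of decode_community: "AA:NN" back to the 32-bit number."""
    high, low = (int(part) for part in text.split(":"))
    return (high << 16) | low


print(decode_community(6553723))    # prints 100:123
print(encode_community("100:123"))  # prints 6553723
```

On IOS, `ip bgp-community new-format` should make the router display the AA:NN form directly.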

In summary, if you use the same route distinguisher at more than one point where the same IP prefix is learned, the best path determination will occur at the route reflector, not the receiving PE. This best path determination is likely to be quite coarse and applying per-VRF policies on route reflectors is inappropriate. Using unique RDs ensures that multiple copies of the same IP prefix can be learned by other PEs, allowing the best path determination to be done by the receiving PE using arbitrary local policies on a per-VRF basis.
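
The two cases can be compared side by side in a small simulation. It is only a sketch under simplifying assumptions: the reflector's best path choice is reduced to a lowest-address tie-break (the lab's next hop addresses double as router IDs), and PE3's import map is modelled as "local preference 200 if community 100:123 is present, otherwise 100". Addresses, RDs and the community come from the outputs above.

```python
from ipaddress import ip_address


def reflect(paths):
    """Route reflector: advertise one best path per (RD, prefix)."""
    best = {}
    for path in paths:
        key = (path["rd"], path["prefix"])
        if key not in best or (
            ip_address(path["next_hop"]) < ip_address(best[key]["next_hop"])
        ):
            best[key] = path
    return list(best.values())


def pe3_select(paths):
    """PE3: apply the import map, then prefer the highest local preference."""
    scored = [
        {**p, "local_pref": 200 if "100:123" in p["communities"] else 100}
        for p in paths
    ]
    return max(scored, key=lambda p: p["local_pref"])


def chosen_next_hop(pe1_rd, pe2_rd):
    orange = {"rd": pe1_rd, "prefix": "192.168.200.0/24",
              "next_hop": "10.255.255.1", "communities": []}
    purple = {"rd": pe2_rd, "prefix": "192.168.200.0/24",
              "next_hop": "10.255.255.2", "communities": ["100:123"]}
    return pe3_select(reflect([orange, purple]))["next_hop"]


same_rd = chosen_next_hop("100:2439", "100:2439")
unique_rd = chosen_next_hop("100:2439", "100:2458")
print("same RD   ->", same_rd)    # orange, via PE1
print("unique RD ->", unique_rd)  # purple, via PE2
```

With a shared RD the policy on PE3 never sees the purple path, so it cannot prefer it; with unique RDs both paths arrive and the import map decides.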

Fast Failover


The final example is one of the most widely seen use cases for unique RD per VRF per PE. Let's take a look at the failover times for a route to move between PE1 and PE2 in the following scenario:



In the case of matching RDs, only one route for the destination is learned throughout the network, so when a failure occurs a series of BGP updates needs to occur before traffic can switch paths. In a real environment this chain of updates may take time. In a scaled environment (and for illustrative purposes in this lab), there may be hierarchical route reflectors and these may be configured with an update delay. Here is an example failover with two-tiered route reflectors and an update delay of 10s:

CE2#ping 1.1.1.1 repeat 100000

Type escape sequence to abort.
Sending 100000, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.......UUUUUUUUUUUUUUUUU
UUUUU.UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU.UUUUUUUUUUUUUUUU
UUUUUUUUUUUUUUUUUUUUUUUUUUU.UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
UUUUU.UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU.!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.
Success rate is 37 percent (130/344), round-trip min/avg/max = 4/8/152 ms
CE2#

This failover takes around 30 seconds due to cascading updates being batched and delayed multiple times. The "U" marks above show that the edge PE has no route to the destination, due to having received the withdraw from the primary path but not yet having received the advertisement from the standby. The diagram below shows the BGP updates which need to take place before routing converges to the standby path:


Compare this to the output when unique RDs are set, meaning that the alternate path is already learned throughout the network but is simply not selected by the ingress PE:


 
CE2#ping 1.1.1.1 repeat 100000

Type escape sequence to abort.
Sending 100000, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.......!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.
Success rate is 94 percent (131/139), round-trip min/avg/max = 5/9/16 ms
CE2#

As you can see, the failover is much faster, around 10-15s, and there are no unreachables as the PE always has a route in its table (even if it is a stale one for a while).

Super Fast Failover


This can be improved further by using label per VRF mode in addition to unique RDs. Without going into too much detail, the standard mode for Cisco IOS is to generate a label per prefix. The LFIB of the generating PE will have an entry saying "if I receive label X, I will stick encap Y on it and throw it out of interface Z". This can be changed as follows:

PE2(config)#mpls label mode vrf A protocol bgp-vpnv4 per-vrf 

In label per VRF mode, the same label is advertised for all prefixes advertised from within a particular VRF - the corresponding LFIB entry essentially says "rip off the label and route the packet that follows". When in label per VRF mode, we don't wait for any BGP updates at all because the egress PE where the primary link just failed can instantly use the standby route, which it already knows thanks to the unique RDs. Traffic gets U-turned back into the MPLS network while the BGP convergence occurs, but the traffic at least arrives:
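
The difference between the two label modes can be sketched as two lookup functions on the egress PE. This is a conceptual model of the behaviour described above, not of the real LFIB: the label values, the interface name, the /32 destination and the "backup-pe" next hop are all hypothetical.

```python
def egress_per_prefix(label, lfib, links_up):
    """Per-prefix label: the entry points straight at one outgoing interface."""
    interface = lfib[label]
    return f"send out {interface}" if links_up[interface] else "drop"


def egress_per_vrf(dest, vrf_routes, links_up):
    """Per-VRF label: pop the label, then do an IP lookup in the VRF table."""
    for route in vrf_routes[dest]:  # candidate routes, best first
        if route["type"] == "local":
            if links_up[route["interface"]]:
                return f"send out {route['interface']}"
            continue  # local path is down, fall through to the next route
        # A remote route learned via BGP: push labels and U-turn the packet.
        return f"relabel towards {route['next_hop']}"
    return "drop"


lfib = {30: "Fa1/0"}  # hypothetical per-prefix label bound to the CE link
vrf_routes = {
    "1.1.1.1/32": [
        {"type": "local", "interface": "Fa1/0"},
        # Known only because the backup PE advertises under its own RD.
        {"type": "remote", "next_hop": "backup-pe"},
    ]
}
links_up = {"Fa1/0": False}  # the primary CE link has just failed

per_prefix_result = egress_per_prefix(30, lfib, links_up)
per_vrf_result = egress_per_vrf("1.1.1.1/32", vrf_routes, links_up)
print("per-prefix:", per_prefix_result)
print("per-VRF:   ", per_vrf_result)
```

The remote entry in the VRF table is what the unique RDs buy us: without it, the per-VRF lookup would have nothing to fall back on either.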



Traffic temporarily hops via the primary PE into the secondary, restoring connectivity while BGP takes its sweet time to converge. Once the routing updates have propagated, traffic will go directly to the secondary PE. Failover times here are much more impressive:

CE2#ping 1.1.1.1 repeat 100000

Type escape sequence to abort.
Sending 100000, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 99 percent (200/201), round-trip min/avg/max = 5/7/12 ms
CE2#

One ping lost / two seconds to fail over. Much better, and only possible with unique RDs!

Using unique RDs allows for much faster failover times, due to decreased numbers of BGP updates being required to converge following failures. This is particularly true when using label per VRF mode, since egress PEs can U-turn traffic without waiting for any BGP convergence at all.

6 comments:

  1. Hi Foeh,

    I am currently working my way to get my CCIE and I can honestly say this post is the only post I could find on the internet which clearly explains for what are RDs *actually* used for. Nobody really explained it in this detail, possibly because they themselves have only a superficial and "mechanic" understanding of it.

    I have only one issue that I don't seem to overcome. When you say:

    "That could kind of work, in theory, but it would essentially break the whole BGP paradigm as you would have multiple copies of the same prefix in use concurrently for different purposes. BGP likes to determine the best path and only offers that into the FIB. With a unique RD against each of the two 10.0.0.0/8 routes, BGP is able to do its best path determination and pass the two, now different, routes into their respective VRFs."

    I don't really understand what you mean here. How would this "break" the BGP paradigm exactly? When you configure multiple VRFs on a router, BGP keeps a BGP table for each VRF. Let's assume for a moment that RDs do not exist, so you only have route-targets. Each route you receive from a neighbor (which on Cisco you configure per-VRF) sends you routes with a unique route-target even if the route itself is not unique. BGP keeps all of these routes in a common "waiting area" and then you use import/export route-targets to get the routes that you want in your VRF.

    How exactly would this break BGP?

    If you could go a little deeper on this I would be really grateful.

    Thanks a lot.

    1. Hi,

      Thank you for your kind words and for a good, tough question.

      As you say, all the routes are kept in a common waiting area, i.e. the VPNv4 table. If it were not for the RD then two VRF's routes for the same IPv4 prefix would be the same destination and residing in the same table - BGP will always try to decide a best path, even if it is using some arbitrary method to tie-break, and that is fundamental to its operation (remember BGP is primarily designed to scale).

      So let's consider what happens if we don't tie-break: If we were to keep multiple copies of (basically) the same route then where would we draw the line? Would we keep 2 routes that are identical but with a different MED? Different AS_PATH? Different communities? We can't use having different next-hops as a determinant since one PE could have the same route in 2 VRFs. Do we advertise all the copies we have to our peers or just the best?

      How would updates work? i.e. if the MED changes, BGP currently just sends an update with the new value and this overwrites the previous entry but under this scheme you would then have 2 routes. Do you have to withdraw the old one and then re-advertise? The implications would be widespread. Table sizes would increase dramatically, plus you would have a lot more updates and recalculations (announcements would no longer be constrained to the area where the update causes the best path to change). The RIB would also have to offer non-best routes to the FIB... This is before we even think about route reflectors!

      Maybe I'm being melodramatic. In any case, we are lucky to have the RD so we don't need to worry about any of this.

      Good luck with your CCIE!

      Foeh

    2. Thank you Foeh, your reasoning makes perfect sense.

      Have a great day!

  2. What is the network tool you used to create these diagrams? Is it creately ?

    1. I'm afraid I have to be very boring and say that they were made in Visio. Sorry!

  3. THE BEST!!!!
    Found this after 3 hours of struggle
    Thank you
