Sunday, 7 May 2017

Building Chassis Cluster on Juniper SRX

For a while I've wanted to post about Juniper SRX chassis cluster - I had to do some in-depth troubleshooting on it once and found that the information I needed was scattered across several documents and proved tricky to bring together.

Anyway, a couple of weeks ago I found the time to create a YouTube video showing how to do the basic setup of a chassis cluster, how to tell it is working, how to manually fail over, how to recover a disabled node and finally how to remove the chassis cluster configuration. The video is here:


However, at just over half an hour long it's a bit unwieldy if you only want some of the information! For easy searching, I've decided to write this accompanying blog post.

Chassis Cluster Concepts


Chassis cluster is Juniper's hardware HA mechanism for the SRX series. The chassis cluster mechanism keeps configurations and connection state replicated between devices so that, in the event of a failure, a standby device can take over the functions of the cluster with little downtime.

When a pair of devices is configured for chassis cluster, a raft of new port types comes into play:

  • Control (CTRL) - The key to chassis cluster - responsible for replicating configuration and state, liveness checks and other housekeeping tasks. The CTRL port is usually chosen by JunOS and is fixed for a particular hardware type.
  • Fabric (fab0 / fab1) - Used to carry traffic between devices when a port goes down on the active device and traffic enters the standby (or for ports configured only on one member device). The fabric port is also used to avoid split brain in the event of a control link failure - if CTRL goes away but fabric remains, the two devices know not to both go active. Fabric ports are configured manually and there may be up to 2 pairs.
  • Out of Band Management (fxp0) - Used to manage the individual devices. Each node has its own fxp0; these interfaces do not fail over when there is a mastership change and always belong to the specific member, allowing management access to both devices irrespective of which is master. The fxp port is usually chosen by JunOS and will be fixed for a given hardware platform.
  • Redundant Ethernet (reth) - When a port on each device is configured for the same purpose, it is called a redundant Ethernet or "reth" interface. The active member of a reth moves as mastership changes and when there are connectivity failures. An arbitrary number of reth interfaces may be configured, depending on how many ports are available.

The diagram below shows our topology:


Preparing for Chassis Cluster


When setting up a chassis cluster I strongly recommend completely blowing away the configuration on your devices first. The SRX100 (and potentially other platforms) does not let you have Ethernet switching configuration while in chassis cluster mode, so even the vanilla factory default configuration can upset chassis cluster. Just clear everything, set the root password, then commit:

root> configure
Entering configuration mode

[edit]
root# delete
This will delete the entire configuration
Delete everything under this level? [yes,no] (no) yes

[edit]
root# set system root-authentication plain-text-password
New password:
Retype new password:


root# commit and-quit
commit complete
Exiting configuration mode


root>

Once this is done on both devices, we can begin the chassis cluster configuration.

Enabling Chassis Cluster


The first task with chassis cluster is to choose a cluster ID (from 0 to 255, however 0 means "no chassis cluster"). The cluster ID must match between members. In most cases you can just pick 1 but when there are multiple clusters on the same layer 2 network you will need to use different cluster IDs for each cluster.

At the same time the cluster ID is applied, we must also apply the node ID. Each device must have a unique node ID within the cluster; in fact 0 and 1 are the only valid node IDs, so in essence one device must be node 0 and the other node 1.

In order for this to take effect, the devices must be rebooted; this can be requested by adding the "reboot" keyword:

root> set chassis cluster cluster-id 1 node 0 reboot
Successfully enabled chassis cluster. Going to reboot now.

Note: this is done from the exec CLI context, not the configuration context. Perform the same command on the second device but using node 1 in place of node 0.
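
On the second device that means:

root> set chassis cluster cluster-id 1 node 1 reboot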

Once the devices have rebooted, you will notice that the prompt has changed to indicate the node number and activity status ("primary" or "secondary"). For a brief period following boot, the status on both devices will show as "hold" - during this time the device refrains from becoming active while it checks to see if there is another cluster member already serving.

{primary:node0}
root>


and

{secondary:node1}
root>


Once the devices have entered this state, configuration applied to one device will replicate to the other (in either direction). From here the rest of the chassis cluster configuration can be applied.

Note: If the devices both become active, check that their control link is up.
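
A quick way to check this is "show chassis cluster interfaces", which lists the control (and fabric) links along with their status - if the control interface shows as down, sort out the cabling or port before going any further:

{primary:node0}
root> show chassis cluster interfaces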

Redundancy Groups


Redundancy groups are primarily used to bundle resources that need to fail over together. Resources in the same redundancy group are always "live" on the same member firewall as each other, while different redundancy groups can be active on the same device or on different devices. This allows some resources to be active on one cluster member while others are active on the opposite device.

To begin with there is only one redundancy group, group 0, which decides which firewall is the active routing engine. As a best practice you should always influence which device will become master in the event of a simultaneous reboot. This is done by setting the priority as follows:

{primary:node0}[edit]
root# set chassis cluster redundancy-group 0 node 0 priority 100
{primary:node0}[edit]
root# set chassis cluster redundancy-group 0 node 1 priority 50


Note that group 0 is not and cannot be pre-emptive: a higher priority only takes effect when there is an election (i.e. at boot time), not if a higher priority device appears while a lower priority device is active.

Later on we will configure redundant Ethernet interfaces, which is where groups 1 and upwards come into play. We will create group 1 ready for the interfaces - note this can be set to pre-empt if you like:

{primary:node0}[edit]
root# set chassis cluster redundancy-group 1 node 0 priority 100
{primary:node0}[edit]
root# set chassis cluster redundancy-group 1 node 1 priority 50


Here I've set the priorities the same between the groups. It's very possible to have group 0 active on one device and group 1 active on the other, but it's messy: lots of traffic has to traverse the fabric link(s) and it increases your exposure - in that state, failure of either firewall would have some impact on connectivity, whereas when the groups are aligned you could lose the standby with no impact to service.

For this reason, I don't configure pre-empt - that way all groups should be active on the same device unless manually tweaked. If you'd rather it be revertive, use this command:

{primary:node0}[edit]
root@SRX-top# set chassis cluster redundancy-group 1 preempt


Applying Configuration to Single Devices


While it's useful to have exactly the same configuration on the firewalls for most things, it is also very handy to be able to keep some configuration unique per device. Good examples of this are the device hostname and management IP address.

This is achieved using groups called "node0" and "node1" which are applied per device using a special macro:

{primary:node0}[edit]
root# set groups node0 system host-name SRX-top
{primary:node0}[edit]
root# set groups node0 interfaces fxp0 unit 0 family inet address 172.16.1.1/24
{primary:node0}[edit]
root# set groups node1 system host-name SRX-bottom
{primary:node0}[edit]
root# set groups node1 interfaces fxp0 unit 0 family inet address 172.16.1.2/24

{primary:node0}[edit]
root# set apply-groups "${node}"


What we've done here is to create a node0 group containing the hostname and out of band IP for node 0, then the same for node 1. Finally, the apply-groups statement uses the "${node}" macro to apply the node0 group to node 0 and the node1 group to node 1.
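
Once this is committed, a quick sanity check is that each node renders its own values - for example, with the hostnames above, the prompts should end up looking something like:

{primary:node0}
root@SRX-top>

and

{secondary:node1}
root@SRX-bottom>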

Defining Fabric Interfaces


In order to allow traffic to traverse between the clustered devices, at least one fabric interface per node must be configured (up to two per node are allowed). In our case we will configure this on port 5 so that it is adjacent to the other special purpose ports. There are two "fab" virtual interfaces on the cluster: fab0 associated with node 0 and fab1 associated with node 1:

{primary:node0}[edit]
root# set interfaces fab0 fabric-options member-interfaces fe-0/0/5
{primary:node0}[edit]
root# set interfaces fab1 fabric-options member-interfaces fe-1/0/5


Note that ports on node 0 are denoted by fe-0/x/x while ports on node 1 are denoted by fe-1/x/x. If you want a dual fabric (either for resilience or to cope with a lot of inter-chassis traffic in the event of a failover) then just add the second interface on each side in exactly the same way.
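
For example, a hypothetical second fabric pair on port 6 of each node (use whichever ports you have spare) would simply be:

{primary:node0}[edit]
root# set interfaces fab0 fabric-options member-interfaces fe-0/0/6
{primary:node0}[edit]
root# set interfaces fab1 fabric-options member-interfaces fe-1/0/6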

Redundant Ethernet (reth) Interfaces


In order to be highly available, each traffic interface needs to have a presence on node 0 and node 1 (otherwise interfaces would be lost when a failover occurs). In the SRX chassis cluster world, this pairing of interfaces is done using a redundant Ethernet or "reth" virtual interface.

The first step in configuring redundant Ethernet interfaces is to decide how many are allowed (similar to ae interfaces):

{primary:node0}[edit]
root# set chassis cluster reth-count 5


Next we configure the member interfaces that will belong to each reth (note: on higher end SRX this will be gigether-options or ether-options rather than fastether-options):

{primary:node0}[edit]
root# set interfaces fe-0/0/0 fastether-options redundant-parent reth0
{primary:node0}[edit]
root# set interfaces fe-1/0/0 fastether-options redundant-parent reth0


...and assign the reth to a redundancy group (mentioned earlier):

{primary:node0}[edit]
root# set interfaces reth0 redundant-ether-options redundancy-group 1


At this point the reth can be configured like any other routed interface - units, address families, security zones, etc. are all used in exactly the same way as a normal port.

For reference, here's the full configuration as used in the video:

set groups node0 system host-name SRX-top
set groups node0 interfaces fxp0 unit 0 family inet address 172.16.1.1/24
set groups node1 system host-name SRX-bottom
set groups node1 interfaces fxp0 unit 0 family inet address 172.16.1.2/24
set apply-groups "${node}"
set system root-authentication encrypted-password "$1$X8eRYomW$Wbxj8V0ySW/5dQCXrkYD70"
set chassis cluster reth-count 5
set chassis cluster redundancy-group 0 node 0 priority 100
set chassis cluster redundancy-group 0 node 1 priority 50
set chassis cluster redundancy-group 1 node 0 priority 100
set chassis cluster redundancy-group 1 node 1 priority 50
set interfaces fe-0/0/0 fastether-options redundant-parent reth0
set interfaces fe-0/0/1 fastether-options redundant-parent reth1
set interfaces fe-1/0/0 fastether-options redundant-parent reth0
set interfaces fe-1/0/1 fastether-options redundant-parent reth1
set interfaces fab0 fabric-options member-interfaces fe-0/0/5
set interfaces fab1 fabric-options member-interfaces fe-1/0/5
set interfaces reth0 redundant-ether-options redundancy-group 1
set interfaces reth0 unit 0 family inet address 10.10.10.10/24
set interfaces reth1 redundant-ether-options redundancy-group 1
set interfaces reth1 unit 0 family inet address 192.168.0.1/24
set security nat source rule-set trust-to-untrust from zone trust
set security nat source rule-set trust-to-untrust to zone untrust
set security nat source rule-set trust-to-untrust rule nat-all match source-address 0.0.0.0/0
set security nat source rule-set trust-to-untrust rule nat-all then source-nat interface
set security policies from-zone trust to-zone untrust policy allow-all match source-address any
set security policies from-zone trust to-zone untrust policy allow-all match destination-address any
set security policies from-zone trust to-zone untrust policy allow-all match application any
set security policies from-zone trust to-zone untrust policy allow-all then permit
set security zones security-zone untrust interfaces reth0.0
set security zones security-zone trust interfaces reth1.0


Checking and Troubleshooting


Now that the configuration is in place, we should verify its status. There are a number of commands we can use to check the operation of chassis cluster; probably the most frequently used is "show chassis cluster status":

{primary:node0}
root@SRX-top> show chassis cluster status
Monitor Failure codes:
    CS  Cold Sync monitoring        FL  Fabric Connection monitoring
    GR  GRES monitoring             HW  Hardware monitoring
    IF  Interface monitoring        IP  IP monitoring
    LB  Loopback monitoring         MB  Mbuf monitoring
    NH  Nexthop monitoring          NP  NPC monitoring
    SP  SPU monitoring              SM  Schedule monitoring

Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 1
node0  100      primary        no      no       None
node1  50       secondary      no      no       None

Redundancy group: 1 , Failover count: 1
node0  100      primary        no      no       None
node1  50       secondary      no      no       None


From this you can see the configured priority for each node against each redundancy group, along with its operational status (i.e. whether it is acting as the primary or secondary).

Another useful command to verify proper operation is "show chassis cluster statistics":

{primary:node0}
root@SRX-top> show chassis cluster statistics
Control link statistics:
    Control link 0:
        Heartbeat packets sent: 424101
        Heartbeat packets received: 424108
        Heartbeat packet errors: 0
Fabric link statistics:
    Child link 0
        Probes sent: 834746
        Probes received: 834751
    Child link 1
        Probes sent: 0
        Probes received: 0
Services Synchronized:
    Service name                              RTOs sent    RTOs received
    Translation context                       0            0
    Incoming NAT                              0            0
<snip>

In this output we would expect to see the number of control link heartbeats to be steadily increasing over time (more than 1 per second) and the same for the probes. Usefully, if you have dual fabric links then it shows activity for each separately so that you can determine the health of both.

One of the most useful commands available (which sometimes in older versions was not visible in the CLI help and would not auto-complete but would still run if typed completely) is "show chassis cluster information":

{primary:node0}
root@SRX-top> show chassis cluster information
node0:
--------------------------------------------------------------------------
Redundancy Group Information:

    Redundancy Group 0 , Current State: primary, Weight: 255

        Time            From           To             Reason
        Apr 28 15:06:58 hold           secondary      Hold timer expired
        Apr 28 15:07:01 secondary      primary        Better priority (1/1)

    Redundancy Group 1 , Current State: primary, Weight: 255

        Time            From           To             Reason
        May  3 12:59:58 hold           secondary      Hold timer expired
        May  3 12:59:59 secondary      primary        Better priority (100/50)

Chassis cluster LED information:
    Current LED color: Green
    Last LED change reason: No failures
Control port tagging:
    Disabled

node1:
--------------------------------------------------------------------------
Redundancy Group Information:

    Redundancy Group 0 , Current State: secondary, Weight: 255

        Time            From           To             Reason
        Apr 28 15:06:34 hold           secondary      Hold timer expired

    Redundancy Group 1 , Current State: secondary, Weight: 255

        Time            From           To             Reason
        May  3 12:59:54 hold           secondary      Hold timer expired

Chassis cluster LED information:
    Current LED color: Green
    Last LED change reason: No failures
Control port tagging:
    Disabled


The brilliant part about this command is that it shows you a history of exactly when and why the firewalls last changed state. The "Reason" field is usually quite descriptive, giving reasons such as "Manual failover".

Manual Failover


Failing over the SRX chassis cluster is not quite as straightforward as with some other vendors' firewalls - for a start there are at least two redundancy groups to fail over, and on top of that the forced activity is 'sticky', i.e. you have to clear out the forced mastership to put the cluster back to normal.

So let's say we have node0 active on both redundancy groups:

{secondary:node1}
root> show chassis cluster status 
<snip> 
Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 0
node0  100      primary        no      no       None           
node1  50       secondary      no      no       None           

Redundancy group: 1 , Failover count: 2
node0  100      primary        no      no       None           
node1  50       secondary      no      no       None          

We can fail over redundancy group 1 as follows:

{secondary:node1}
root> request chassis cluster failover redundancy-group 1 node 1 
node1:
--------------------------------------------------------------------------
Initiated manual failover for redundancy group 1

Now when we check, we can see that node1 is the primary as expected but also its priority has changed to 255 and the "Manual" column shows "yes" for both devices. This indicates that node1 is forced primary and, effectively, can't be pre-empted even if that is set up on the group:

{secondary:node1}
root> show chassis cluster status 
<snip> 
Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 0
node0  100      primary        no      no       None           
node1  50       secondary      no      no       None           

Redundancy group: 1 , Failover count: 3
node0  100      secondary      no      yes      None           
node1  255      primary        no      yes      None           

If you have pre-empt enabled on the redundancy-group then you will need to leave it like this for as long as you want node1 to remain active. If not then you can clear the forced mastership out immediately:

root> request chassis cluster failover reset redundancy-group 1 
node0:
--------------------------------------------------------------------------
No reset required for redundancy group 1.

node1:
--------------------------------------------------------------------------
Successfully reset manual failover for redundancy group 1

Just remember to do this for both (or all) redundancy groups if you want to take node0 out of service for maintenance.
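
For example, to move redundancy group 0 (the routing engine) across to node1 as well, it's the same command with a different group number - and the same reset afterwards once the maintenance is done:

{secondary:node1}
root> request chassis cluster failover redundancy-group 0 node 1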

Fabric Links and Split Brain


In addition to transporting traffic between cluster members when redundancy-groups are active on different members, the fabric link or links carry keepalive messages. This not only ensures that the fabric links are usable but is also used as a method to prevent "split brain" in the event that the single control link goes down.

The logic that the SRX uses is as follows:

If the control link is lost but fabric is still reachable, the secondary node is immediately put into an "ineligible" state:

{secondary:node1}
root> show chassis cluster status    
<snip>
Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 0
node0  0        lost           n/a     n/a      n/a            
node1  50       ineligible     no      no       None           

Redundancy group: 1 , Failover count: 4
node0  0        lost           n/a     n/a      n/a            
node1  50       ineligible     no      no       None           

If the fabric link is also lost during the next 180s then the primary is considered to be dead and the secondary node becomes primary. If the fabric link is not lost during the 180s window then the standby device switches from "ineligible" to "disabled" - and this happens even if the control link recovers in the meantime, as shown here (the partner node changes from "lost" back to "primary"):

{ineligible:node1}
root> show chassis cluster status    
<snip>
Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 0
node0  100      primary        no      no       None           
node1  50       ineligible     no      no       None           

Redundancy group: 1 , Failover count: 4
node0  100      primary        no      no       None           
node1  50       ineligible     no      no       None           

Once 180s passes, the device will still go into a "disabled" state:

{ineligible:node1}
root> show chassis cluster status         
<snip> 
Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 0
node0  100      primary        no      no       None           
node1  50       disabled       no      no       None           

Redundancy group: 1 , Failover count: 4
node0  100      primary        no      no       None           
node1  50       disabled       no      no       None           

The output of "show chassis cluster information" makes it quite clear what happened:

        May  7 16:46:38 secondary      ineligible     Control link failure
        May  7 16:49:38 ineligible     disabled       Ineligible timer expired

From the disabled state, the node can never become active. To recover from becoming "disabled", the affected node must be rebooted (later releases allow auto recovery, but this seems to just reboot the standby device anyway and that idea rubs me up the wrong way).
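
Assuming you'd rather do that on your own terms, the recovery is simply a reboot of the disabled node from its own console or fxp0 session (the prompt shown here is indicative):

{disabled:node1}
root@SRX-bottom> request system reboot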

Removing Chassis Cluster


In order to remove chassis cluster from your devices, just go onto each node and run:

root@SRX-bottom> set chassis cluster disable    

For cluster-ids greater than 15 and when deploying more than one
cluster in a single Layer 2 BROADCAST domain, it is mandatory that
fabric and control links are either connected back-to-back or
are connected on separate private VLANS.

Also, while not absolutely required, I strongly recommend:

{secondary:node1}
root@SRX-bottom> request system zeroize 
warning: System will be rebooted and may not boot without configuration
Erase all data, including configuration and log files? [yes,no] (no) yes 

error: the ipsec-key-management subsystem is not responding to management requests
warning: zeroizing node1

Bye, bye, chassis cluster!

Thursday, 27 April 2017

Hacky on-the-spot netflow

Sometimes it would be really useful to see what flows are active over a link, i.e. what is talking to what, but you don't have a netflow collector available (or the time to set one up). I was in this situation recently and discovered that it's possible to get most of the useful information out of netflow using just a Linux box and some scripting. Easy peasy.

1 - Configure Netflow on the Router / Firewall


There's not much to say about this, it varies from platform to platform, vendor to vendor, but you just need to set the device up to send Netflow version 5 to your "collector" box.

A couple of examples are here:

Older IOS (12.x):


mls flow ip interface-full
ip flow-export version 5
ip flow-export destination x.x.x.x yyyy
interface Gix/x
  ip flow ingress
  mls netflow sampling

Juniper SRX:


set system ntp server pool.ntp.org
set interfaces fe-0/0/1 unit 0 family inet sampling input
set interfaces fe-0/0/1 unit 0 family inet sampling output
set forwarding-options sampling input rate 1024
set forwarding-options sampling family inet output flow-server x.x.x.x port yyyy
set forwarding-options sampling family inet output flow-server x.x.x.x version 5


2 - Capture the Netflow Packets


Use tcpdump / tshark / wireshark / whatever to capture the packets on the "collector" box. The only thing to be careful of is that you don't allow tcpdump to truncate / slice the packets, e.g.:

tcpdump -i eth0 -s 0 -w capfile.cap udp port yyyy and not icmp

The capture can be done on any box that your sampler can send flows to, provided you can then retrieve the capture file onto a *nix box with tshark installed. If tshark is installed on the capture box itself then you can also use it to dump the flows out there and then.

3 - Dump the Flow Data with tshark


This can be done on the collector box if tshark is available or can be done elsewhere if not. Basically we ask tshark to dump out verbose packet contents then use standard *nix utilities to mangle the output:

tshark -r capfile.cap -nnV | grep -e '       \(...Addr:\|...Port:\|Protocol:\)' | tr '\n' ' ' | sed 's/       SrcAddr:/\n/g;' | awk '{print $1 "\t" $4 "\t" $7 "\t" $9 "\t" $10 $11}' | sed 's/Protocol:6/TCP/g; s/Protocol:17/UDP/g; s/Protocol:1/ICMP/g;'

This prints out the flows as reported by your router / firewall in tab separated columns as follows: Source IP, Destination IP, Source port, Destination port, IP Protocol

For example:

192.168.10.10  10.10.100.99    24010   53      UDP
192.168.8.14   10.10.100.4     0       771     ICMP
172.16.44.9    10.10.100.86    54832   443     TCP



Of course this can be tailored to match whatever fields interest you (for example you may want to include ingress and egress interfaces to show traffic direction or byte counts to get an idea of flow size) but this will cover the basics.

Saturday, 11 March 2017

Setup and Troubleshooting of IPSec VPN between AWS and Juniper SRX Firewall

Setting up IPSec VPNs in AWS is pretty simple - virtually all the work is done for you and they even provide you with a config template to blow onto your device. There are only a couple of points to remember while doing this to make sure you get a good, working VPN at the end - in this post I'll quickly show the setup and how to troubleshoot some of the more likely snags that you could run into.

Setup - AWS End


To set up an IPSec VPN into an AWS VPC you require 3 main components - the Virtual Private Gateway (VPG), the Customer Gateway (CG) and the actual VPN connection.


The VPG is just a named device, like an IGW. Create a VPG and name it.


Attach the VPG to your VPC so that it can be used.

Next we need to create a Customer Gateway (CG) profile:


This defines the parameters of the opposite end of the tunnel (i.e. our SRX firewall), the most important being the IP address. For our simple case we'll just use static routing, but BGP is also an option.


Next we create a VPN connection profile:


The VPN connection profile basically ties the other two objects together and defines the IP prefix(es) that will be tunnelled over IPSec to the other end.

Once this is created you can download configuration templates for various device types; in our case we want the Juniper SRX template:


At this point the AWS VPN configuration is basically complete. Download the configuration template and open it in something which handles UNIX style end of line markers (e.g. Notepad++, WordPad) ready to configure the firewall end.

Setup - Juniper SRX End


Assuming some sort of working base build, the Juniper SRX configuration is almost a straight copy and paste from the configuration template. There are a couple of key exceptions:

  • IKE interface binding (lines 54 & 173 at time of writing) - you should override this with the "outside" interface of your firewall. For xDSL this will probably be pp0.0, for Ethernet based devices it could be fe-x/x/x.0 or vlan.x
  • Routing (lines 134 & 253 at time of writing) - the config template does not contain the actual routes you will need, or even a sensible default such as 172.31.0.0/16 to cover the default VPC. A rough sketch of these two tweaks follows this list.
  • It's probably worth un-commenting the traceoptions lines to give some debugging output in the event of tunnel problems.
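
As that rough sketch - the gateway name "gw-vpn-1" and tunnel unit st0.1 are placeholders, so use whatever names and units your downloaded template actually contains:

set security ike gateway gw-vpn-1 external-interface pp0.0
set routing-options static route 172.31.0.0/16 next-hop st0.1
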
Once the template is applied you may have the desired connectivity; if not, read on...

Troubleshooting


Firstly, we need to check phase 1 of the VPN (IKE) is up:

root@Lab-SRX> show security ike security-associations
Index   State  Initiator cookie  Responder cookie  Mode           Remote Address
4862528 UP     53a352fbe8fbf11a  26d9edf2e3a2d371  Main           52.56.146.67
4862529 UP     901117dbc101ce98  a1c21584e8cd22e2  Main           52.56.194.28


This shouldn't be a problem as the template basically takes care of all the proposals and whatnot being correct. If there aren't 2 SAs in an UP state then check you put the right IP address into the AWS Customer Gateway configuration.


Next, we check IPSec is up:

root@Lab-SRX> show security ipsec security-associations
  Total active tunnels: 2
  ID    Algorithm       SPI      Life:sec/kb  Mon lsys Port  Gateway
  <131073 ESP:aes-cbc-128/sha1 49d38075 3543/ unlim U root 500 52.56.146.67
  >131073 ESP:aes-cbc-128/sha1 b3b5474b 3543/ unlim U root 500 52.56.146.67
  <131074 ESP:aes-cbc-128/sha1 4df0b3b 3543/ unlim U root 500 52.56.194.28
  >131074 ESP:aes-cbc-128/sha1 2e1e40aa 3543/ unlim U root 500 52.56.194.28


This should show two tunnels in each direction (direction denoted by the "<" and ">"). Again, very little is likely to go wrong here as the template should cover everything.

Assuming that's good, we would now check IPSec statistics:

root@Lab-SRX> show security ipsec statistics
ESP Statistics:
  Encrypted bytes:             5472
  Decrypted bytes:             3024
  Encrypted packets:             36
  Decrypted packets:             36
AH Statistics:
  Input bytes:                    0
  Output bytes:                   0
  Input packets:                  0
  Output packets:                 0
Errors:
  AH authentication failures: 0, Replay errors: 0
  ESP authentication failures: 0, ESP decryption failures: 0
  Bad headers: 0, Bad trailers: 0

root@Lab-SRX>



Ideally we want to see both encrypted and decrypted packets - if one way isn't working then probably the (would be) sender is at fault. Verify that the configuration template was fully applied.

Next we check the secure tunnel interface statistics - a good idea is to ping the other end of the tunnel to see if the counters increase:

root@Lab-SRX> show interfaces st0 | match packets
    Input packets : 32
    Output packets: 0
    Input packets : 32
    Output packets: 0

root@Lab-SRX> ping 169.254.66.229
PING 169.254.66.229 (169.254.66.229): 56 data bytes
64 bytes from 169.254.66.229: icmp_seq=0 ttl=254 time=12.627 ms
64 bytes from 169.254.66.229: icmp_seq=1 ttl=254 time=12.342 ms
64 bytes from 169.254.66.229: icmp_seq=2 ttl=254 time=12.169 ms
64 bytes from 169.254.66.229: icmp_seq=3 ttl=254 time=12.314 ms
^C
--- 169.254.66.229 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max/stddev = 12.169/12.363/12.627/0.166 ms

root@Lab-SRX> show interfaces st0 | match packets
    Input packets : 36
    Output packets: 4
    Input packets : 32
    Output packets: 0

root@Lab-SRX>

A working ping to the other end with counters incrementing really indicates that the tunnel is formed OK and able to carry traffic. If this works but "real" traffic doesn't then there is most likely some basic configuration missing:


Check Intra-zone Traffic Permitted

By default you can't pass traffic between interfaces of the same zone on the SRX. It's common not to have more than one routed interface in a zone so this is easily overlooked. Just add it as follows:

root@Lab-SRX# set security policies from-zone trust to-zone trust policy allow-all match source-address any
root@Lab-SRX# set security policies from-zone trust to-zone trust policy allow-all match destination-address any
root@Lab-SRX# set security policies from-zone trust to-zone trust policy allow-all match application any
root@Lab-SRX# set security policies from-zone trust to-zone trust policy allow-all then permit
root@Lab-SRX# commit

You should now see your "real" traffic causing the VPN statistics to increment, even if the hosts at each end cannot communicate with one another.
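
One way to double-check this from the SRX side is to look for a flow session towards an AWS host and confirm that its outgoing interface is the st0 tunnel interface - the prefix here is just the default VPC range, so substitute your own:

root@Lab-SRX> show security flow session destination-prefix 172.31.0.0/16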

Check AWS Routing Table

One thing that is easily forgotten when creating a new VPG is that, in order for it to be used, a route entry must exist sending traffic for the tunnelled subnets via the VPG. This needs to be created manually:


Simply edit the routing table(s) applied to your network(s) and set the next hop for your tunnelled networks to be the VPG appliance. At this point you may find that traffic from AWS towards the SRX works but in the opposite direction it does not...

Check AWS Security Group


If at this stage you have one-way connectivity then almost certainly all you need to do is to allow the VPN range inbound on your security group(s). Remember that VPC security groups are stateful and all outbound traffic (and its replies) is allowed by default.

If required, simply add rules allowing the appropriate traffic from the IP block that is tunnelled back to the SRX. In this case to keep it simple we just allow open access:


If it still doesn't work, rollback the SRX config, blow away all the elements of the VPN and start again!

Saturday, 9 April 2016

Producing topology diagrams from OSPF database CLI output

I always imagined it should be possible to automatically produce a topology diagram from the information in the OSPF database of a router - in fact I've heard of products that allow you to do this by attaching a device into your network and joining the OSPF domain. For many cases that is too invasive or completely impractical - what would be really nice would be to be able to produce this directly from the CLI output of a "show" command.

After spending a bit of time looking around, I could not find a tool to do this so I went to work using Python and came up with a basic prototype in a couple of hours. The script doesn't actually do the plotting and layout but rather produces a DOT file and leaves the heavy lifting to GraphViz. With a little extra work I have now produced a working script which takes the output of "show ip ospf database router" and produces a DOT file which can be used to plot a topology map showing each OSPF router, complete with the links between them (including metrics) and any transit multi-access networks.

CLI output from Cisco IOS and Cisco ASA is supported (the output seems to be essentially the same) and, obviously, it doesn't matter which vendor's kit is attached into the network, provided you run "show ip ospf database router" on a supported platform.

The tool, not-so-snappily named "ospfcli2dot" is available from my github: https://github.com/theclam/ospfcli2dot

Example


Here's a simple example of a 4 router setup. R1, R2 and R3 all sit on a shared LAN, while R4 is attached point to point to R3 and R5:





The "show ip ospf database router" command can be run from any device in the network since all devices within an area share the same topology database. The output of this is quite verbose so will not be shown here. For the purposes of this example, I have just copied and pasted the output into a file called cli-output.txt.

Simply run the script against that file:

foeh@feeble ~/Projects/ospfcli2dot $ ./ospfcli2dot
ospfcli2dot - takes the output of "show ip ospf database router" and outputs a GraphViz DOT file corresponding to the network topology

v0.2 alpha, By Foeh Mannay, April 2016

Enter input filename: cli-output.txt
Enter output filename: example.dot
foeh@feeble ~/Projects/ospfcli2dot $ dot -Tgif -oexample.gif example.dot


This creates "example.gif", shown below:


As you can see, the metric is shown against each link and the script has automatically highlighted in red that one of the point to point links has different metrics in each direction.


Please give it a try and let me know how you get on!

Links


Download: https://github.com/theclam/ospfcli2dot

Sunday, 31 January 2016

Invalid Command Stopping Cisco 7600 Supervisor Redundancy Entering SSO / Hot Standby Mode

I recently ran into a problem when trying to apply a base build to a Cisco 7600 router with dual supervisors which didn't seem to be documented anywhere, so I thought I'd record the issue and the eventual fix here.

The gist of the problem was that the secondary supervisor would not go from cold standby to hot, so in other words if the active supervisor crashed, the chassis would have to reboot in order to use the standby supervisor. The system was showing the reason for this as software mismatch, even though the two cards had the same image installed:

BUILD#show redundancy states
       my state = 13 -ACTIVE
     peer state = 4  -STANDBY COLD
           Mode = Duplex
           Unit = Primary
        Unit ID = 5

Redundancy Mode (Operational) = rpr    Reason: Software mismatch
Redundancy Mode (Configured)  = sso
Redundancy State              = rpr
     Maintenance Mode = Disabled
 Communications = Up

   client count = 159
 client_notification_TMR = 30000 milliseconds
          keep_alive TMR = 9000 milliseconds
        keep_alive count = 1
    keep_alive threshold = 18
           RF debug mask = 0x0


I won't say exactly which image this was, but it was an SSO-capable release of IOS 15 and the two supervisors were *definitely* running the same code (one was copied from the other). The tale of software incompatibility seemed unlikely.

BUILD#show log
[snip]
*Jan  6 17:21:33.339: %SYS-SP-STDBY-5-RESTART: System restarted --
Cisco IOS Software, c7600s72033_sp Software (c7600s72033_sp-ADVIPSERVICESK9-M), Version 15.x(x)x, RELEASE SOFTWARE (xx)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2012 by Cisco Systems, Inc.
Compiled Mon 00-Jan-00 00:00 by prod_rel_team
*Jan  6 17:22:50.255 GMT: Config Sync: Bulk-sync failure due to Servicing Incompatibility. Please check full list of mismatched commands via:
  show redundancy config-sync failures mcl
*Jan  6 17:22:50.255 GMT: Config Sync: Starting lines from MCL file:
-ipv6 mfib hardware-switching replication-mode ingress
*Jan  6 17:22:50.255 GMT: %ISSU-SP-3-INCOMPATIBLE_PEER_UID: Setting image (c7600s72033_sp-ADVIPSERVICESK9-M), version (15.x(x)xx) on peer uid (6) as incompatible
*Jan  6 17:22:50.995 GMT: %RF-SP-5-RF_RELOAD: Peer reload. Reason: ISSU Incompatibility
*Jan  6 17:22:50.995 GMT: %OIR-SP-3-PWRCYCLE: Card in module 6, is being power-cycled (RF request)
*Jan  6 17:22:51.999 GMT: %PFREDUN-SP-6-ACTIVE: Standby processor removed or reloaded, changing to Simplex mode
*Jan  6 17:22:53.195 GMT: %SNMP-5-MODULETRAP: Module 6 [Down] Trap
*Jan  6 17:24:19.791 GMT: %ISSU-SP-3-PEER_IMAGE_INCOMPATIBLE: Peer image (c7600s72033_sp-ADVIPSERVICESK9-M), version (15.x(x)xx) on peer uid (6) is incompatible
*Jan  6 17:24:19.791 GMT: %ISSU-SP-3-PEER_IMAGE_INCOMPATIBLE: Peer image (c7600s72033_sp-ADVIPSERVICESK9-M), version (15.x(x)xx) on peer uid (6) is incompatible
*Jan  6 17:25:53.149 GMT: %PFREDUN-SP-4-INCOMPATIBLE: Defaulting to RPR mode (Runtime incompatible)
*Jan  6 17:25:54.154 GMT: %PFREDUN-SP-6-ACTIVE: Standby initializing for RPR mode
*Jan  6 17:25:58.471 GMT: %SYS-SP-3-LOGGER_FLUSHED: System was paused for 00:00:00 to ensure console debugging output.
*Jan  6 17:25:58.763 GMT: %FABRIC-SP-5-CLEAR_BLOCK: Clear block option is off for the fabric in slot 6.
*Jan  6 17:25:58.859 GMT: %FABRIC-SP-5-FABRIC_MODULE_BACKUP: The Switch Fabric Module in slot 6 became standby
*Jan  6 17:26:00.299 GMT: %SNMP-5-MODULETRAP: Module 6 [Up] Trap
*Jan  6 17:26:00.279 GMT: %DIAG-SP-6-BYPASS: Module 6: Diagnostics is bypassed
*Jan  6 17:26:00.375 GMT: %OIR-SP-6-INSCARD: Card inserted in slot 6, interfaces are now online
*Jan  6 17:26:06.435 GMT: %RF-SP-5-RF_TERMINAL_STATE: Terminal state reached for (RPR)

OK, so clearly it doesn't like the "ipv6 mfib hardware-switching replication-mode ingress" command for some reason. Why it would work on one and not the other is a mystery but hey... I don't have big plans for IPv6 multicast so I don't care what replication mode it's in - let's just delete the offending command:

BUILD#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
BUILD(config)#no ipv6 mfib hardware-switching replication-mode ingress
no ipv6 mfib hardware-switching replication-mode ingress
         ^
% Invalid input detected at '^' marker.

So I can't negate the command; in fact there's no "mfib" stanza under "no ipv6":

BUILD(config)#no ipv6 ?
  access-list        Configure access lists
  [snip]
  local              Specify local options
  mld                Global mld commands
  [snip]
  spd                Selective Packet Discard (SPD)

In fact, even the original command seems to be invalid:

BUILD(config)#ipv6 mfib hardware-switching replication-mode ?
% Unrecognized command

And yet here it is in the config from which we booted:

BUILD#show start | inc ipv6 
ipv6 unicast-routing
ipv6 mfib hardware-switching replication-mode ingress
no mls flow ipv6

?!?!

I guess it's one of those legacy commands they bodge the CLI to take but you can't see in the help. But it won't take the command anyway :| Eventually I found an equivalent command that it *would* take:

BUILD(config)#no ipv6 multicast hardware-switching replication-mode ingress 
Warning: This command will change the replication mode for all address families.
 BUILD(config)#do show run | inc ipv6
ipv6 unicast-routing
no mls flow ipv6
BUILD(config)#



At last, the problem config is gone! We're almost there but not quite: the previous failures are still held on the active supervisor even if the standby is reloaded, so we have to kick it to re-evaluate:

BUILD#show redundancy config-sync failures mcl 
Mismatched Command List
-----------------------
-ipv6 mfib hardware-switching replication-mode ingress

BUILD#redundancy config-sync validate mismatched-commands  
*Jan  7 08:26:28.600 GMT: CONFIG SYNC: MCL validation succeeded
*Jan  7 08:26:28.600 GMT: %ISSU-SP-3-PEER_IMAGE_REM_FROM_INCOMP_LIST: Peer image (c7600s72033_sp-ADVIPSERVICESK9-M), version (15.x(x)xx) on peer uid (6) being removed from the incompatibility list
BUILD#show redundancy config-sync failures mcl 
Mismatched Command List
-----------------------

The list is Empty

BUILD#redundancy reload peer 
Reload peer [confirm]
Preparing to reload peer

BUILD#

*Jan  7 08:27:16.096 GMT:  RP sending reload request to Standby. User: admin on console, Reason: Admin reload CLI

BUILD#

Eventually...
 

*Jan  7 08:33:37.532 GMT: %HA_CONFIG_SYNC-6-BULK_CFGSYNC_SUCCEED: Bulk Sync succeeded
*Jan  7 08:33:37.552 GMT: %RF-SP-5-RF_TERMINAL_STATE: Terminal state reached for (SSO)
*Jan  7 08:33:36.572 GMT: %PFREDUN-SP-STDBY-6-STANDBY: Ready for SSO mode
BUILD#show redundancy
Redundant System Information :
------------------------------
       Available system uptime = 15 hours, 20 minutes
Switchovers system experienced = 0
              Standby failures = 3
        Last switchover reason = none

                 Hardware Mode = Duplex
    Configured Redundancy Mode = sso
     Operating Redundancy Mode = sso
              Maintenance Mode = Disabled
                Communications = Up

Current Processor Information :
-------------------------------
               Active Location = slot 5
        Current Software state = ACTIVE
       Uptime in current state = 15 hours, 19 minutes
                 Image Version = Cisco IOS Software, c7600s72033_rp Software (c7600s72033_rp-ADVIPSERVICESK9-M), Version 15.x(x)xx, RELEASE SOFTWARE (fc2)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2012 by Cisco Systems, Inc.
Compiled Wed 01-Aug-12 20:15 by prod_rel_team
                          BOOT = sup-bootdisk:/c7600s72033-advipservicesk9-mz.15x-x.xx.bin,1;
                   CONFIG_FILE =
                       BOOTLDR =
        Configuration register = 0x2102

Peer Processor Information :
----------------------------
              Standby Location = slot 6
        Current Software state = STANDBY HOT
       Uptime in current state = 3 minutes
                 Image Version = Cisco IOS Software, c7600s72033_rp Software (c7600s72033_rp-ADVIPSERVICESK9-M), Version 15.x(x)xx, RELEASE SOFTWARE (fc2)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2012 by Cisco Systems, Inc.
Compiled Wed 01-Aug-12 20:15 by prod_rel_team
                          BOOT = sup-bootdisk:/c7600s72033-advipservicesk9-mz.15x-x.xx.bin,1;
                   CONFIG_FILE =
                       BOOTLDR =
        Configuration register = 0x2102
BUILD#

Win!

Saturday, 23 January 2016

Cisco Nexus Output Errors

A little while ago I was asked to investigate an IP based storage problem which had been traced back to a large number of output errors on the port facing a particular compute node. The port was on a Cisco Nexus 5000 series device and I could see that, while output errors were clocking up at a massive rate, the switch was giving me nothing to go on as to what kind of errors they were. Every one of the usual suspects (collisions, etc.) on the port showed nothing, and yet the output errors kept on climbing.

The ultimate answer turned out to be related to the fact that the Nexus 5k aims for low latency and as such performs cut-through switching. If you're not familiar with this term, please refer to this reasonably decent Cisco explanation; at a high level, though, there are two possible modes of transmission in switched networks:

1 - Store and Forward, where the entire frame is buffered into memory, the FCS is validated and then the frame is passed on. This mode can handle ports of differing speeds but obviously for large frames the serialisation delay becomes significant.
2 - Cut through, where just the header is checked for source / destination, plus any fields required for QoS / ACLs, then the rest of the frame is "cut through" onto the appropriate output port without buffering. This requires ports of an identical speed but offers lower latency.

One of the not-immediately-obvious side effects of cut through switching is that the FCS is only validated once the frame has been passed on, by which point it is too late to take any corrective action. Essentially, the forwarding switch has already passed a broken frame on and, although it knows this, it can do nothing about it in retrospect and so it just says "oh, well" and increments its error counters on the ingress and egress ports.

If you are seeing output errors on a port with no other real explanation of how they got there, check other ports of the same speed for input errors. In my case it was due to a fibre fault - corrupted frames were entering one port, being cut through to another and causing errors to clock up on both.
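
On the Nexus itself, one quick way to hunt for the offending ingress port is to dump the per-interface error counters and look for input-side errors (CRC / FCS) climbing in step with your output errors - the hostname prompt below is just a placeholder:

NEXUS# show interface counters errors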

A Down in the Weeds look at Route Distinguishers

I was recently involved in a discussion on reddit about VRF route targets and route distinguishers and I noticed that there was a lot of misinformation flying around. That doesn't really surprise me as a lot of the folks on there are learning and I've heard some jarring misconceptions on the topic come from  senior guys who have worked with MPLS for years. Most of the route target stuff was straightened out quite quickly and I will not get into any of that here, however the route distinguisher debate went on longer and covered some areas that seemed to be new or controversial to a lot of people.

The crux of the issue is that a lot of people believe the route distinguisher to be only locally significant - apparently there are many resources on the Internet which say this. I'll grant you that many are ambiguous, for example the first hit on Google for "route distinguisher and route target" says that "The route distinguisher has only one purpose, to make IPv4 prefixes globally unique. It is used by the PE routers to identify which VPN a packet belongs to". The well-respected packet life blog says "As its name implies, a route distinguisher (RD) distinguishes one set of routes (one VRF) from another. It is a unique number prepended to each route within a VRF to identify it as belonging to that particular VRF or customer." To be fair it goes on to clarify that "An RD is carried along with a route via MP-BGP when exchanging VPN routes with other PE routers", which suggests at the global significance.

In this post I hope to prove to anyone who is interested that route distinguishers are, in fact, both locally and globally significant and to demonstrate why this is important to understand.

Local Significance


If you've got this far then I assume you will already be familiar with what route targets and route distinguishers do; if not then I suggest you read up and play in the lab a while before venturing on.

The reason for needing a route distinguisher locally within a device is to extend the normal IPv4 prefixes that are known within each VRF in order to make them unique. Any locally learned IPv4 prefixes (connected, static or learned via an IPv4 routing protocol) are extended with the route distinguisher assigned to the VRF, as shown here:
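
To illustrate with the lab values used later in this post:

IPv4 prefix in VRF A on PE1    : 192.168.1.0/24
RD configured for VRF A on PE1 : 100:2439
Resulting VPNv4 prefix         : 100:2439:192.168.1.0/24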


It is also true that different PEs may use different route distinguishers for the same VRF without breaking anything:

PE1#show run vrf A
Building configuration...

Current configuration : 316 bytes
ip vrf A
 rd 100:2439
 route-target export 100:100
 route-target import 100:100
!
!
interface FastEthernet1/0
 ip vrf forwarding A
 ip address 192.168.1.1 255.255.255.0
 speed auto
 duplex auto
!
router bgp 100
 !
 address-family ipv4 vrf A
  redistribute connected
  redistribute static
 exit-address-family
!
end

PE1#show ip route vrf A

Routing Table: A


Gateway of last resort is not set

      192.168.1.0/24 is variably subnetted, 2 subnets, 2 masks
C        192.168.1.0/24 is directly connected, FastEthernet1/0
L        192.168.1.1/32 is directly connected, FastEthernet1/0
B     192.168.24.0/24 [200/0] via 10.255.255.2, 00:04:03
PE1#



PE2#show run vrf A
Building configuration...

Current configuration : 317 bytes
ip vrf A
 rd 100:2458
 route-target export 100:100
 route-target import 100:100
!
!
interface FastEthernet1/0
 ip vrf forwarding A
 ip address 192.168.24.1 255.255.255.0
 speed auto
 duplex auto
!
router bgp 100
 !
 address-family ipv4 vrf A
  redistribute connected
  redistribute static
 exit-address-family
!
end

PE2#show ip route vrf A

Routing Table: A

Gateway of last resort is not set

B     192.168.1.0/24 [200/0] via 10.255.255.1, 00:03:44
      192.168.24.0/24 is variably subnetted, 2 subnets, 2 masks
C        192.168.24.0/24 is directly connected, FastEthernet1/0
L        192.168.24.1/32 is directly connected, FastEthernet1/0
PE2#


So it's easy to see how the idea got started that RDs are only locally significant:

Route distinguishers don't need to match between devices in the same VRF in order for routes to be shared between them.

Global Significance


The first clue at the global significance of the route distinguisher is that it is carried in the MP-BGP updates:

PE1#show bgp vpnv4 unicast all 192.168.24.0/24 
BGP routing table entry for 100:2439:192.168.24.0/24, version 8
Paths: (1 available, best #1, table A)
  Not advertised to any peer
  Refresh Epoch 2
  Local, imported path from 100:2458:192.168.24.0/24 (global)
    10.255.255.2 (metric 3) from 10.255.255.200 (10.255.255.200)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:100
      Originator: 10.255.255.2, Cluster list: 10.255.255.200
      mpls labels in/out nolabel/28
      rx pathid: 0, tx pathid: 0x0
BGP routing table entry for 100:2458:192.168.24.0/24, version 7
Paths: (1 available, best #1, no table)
  Not advertised to any peer
  Refresh Epoch 2
  Local
    10.255.255.2 (metric 3) from 10.255.255.200 (10.255.255.200)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:100
      Originator: 10.255.255.2, Cluster list: 10.255.255.200
      mpls labels in/out nolabel/28
      rx pathid: 0, tx pathid: 0x0
PE1#

Interestingly, we have 2 different prefixes here. One is the original (100:2458:192.168.24.0/24) which we learned over the network, while the other is the same IPv4 prefix but prepended with the RD of the VRF which imports it (100:2439:192.168.24.0/24). If we imported it into multiple VRFs then we would have an additional copy for each RD used by the respective VRFs.

If the RD were only locally significant then why would the protocol designers send it? You may be thinking "otherwise you couldn't overlap prefixes!", but surely route targets would be enough to achieve this? If you heard a prefix of 10.0.0.0/8 announced with a route target imported by VRF A then you would import it into VRF A and not VRF B; if you heard a different announcement for 10.0.0.0/8 with a route target imported by VRF B then you would import it into VRF B and not VRF A.

That could kind of work, in theory, but it would essentially break the whole BGP paradigm as you would have multiple copies of the same prefix in use concurrently for different purposes. BGP likes to determine the best path and only offers that into the FIB. With a unique RD against each of the two 10.0.0.0/8 routes, BGP is able to do its best path determination and pass the two, now different, routes into their respective VRFs.

So the route distinguishers overcome that problem, but is that the only reason why they are carried in MP-BGP? That would be a fairly weak argument for global significance, but the best path point here touches on a much stronger case.

The Route Reflector Problem


One key thing to bear in mind which is often forgotten in the grand scheme of things is that the PE is not the only place where BGP best path calculations happen. Any MPLS network of even moderate scale will be using BGP route reflectors to keep the number of BGP sessions under control, and the route reflectors themselves perform a best path determination on the routes they receive before sending them out to their route reflector clients.

This extends the previous case to all the route reflector's clients, so essentially the entire AS. Let's take an example where the admin has been sloppy and has failed to keep RDs globally unique:



Notice that VRF A uses an import / export RT of 100:100 and VRF B uses an import / export RT of 100:200. The network administrator has tried to assign unique route distinguishers per VRF per device, but has made an error and overlapped the route distinguishers used on PE1's VRF A and PE3's VRF B.

The two VRFs are completely distinct from one another and they are not even present on the same PEs. We can see that VRF A is only learning VRF A's routes and VRF B is only learning VRF B's routes:

PE1#show ip vrf
  Name                             Default RD          Interfaces
  A                                100:2439            Fa1/0
PE1#show ip route vrf A
Routing Table: A

Gateway of last resort is not set

      192.168.1.0/24 is variably subnetted, 2 subnets, 2 masks
C        192.168.1.0/24 is directly connected, FastEthernet1/0
L        192.168.1.1/32 is directly connected, FastEthernet1/0
B     192.168.24.0/24 [200/0] via 10.255.255.2, 00:05:01
PE1#

PE2#show ip vrf
  Name                             Default RD          Interfaces
  A                                100:2458            Fa1/0
PE2#show ip route vrf A

Routing Table: A

Gateway of last resort is not set

B     192.168.1.0/24 [200/0] via 10.255.255.1, 00:05:15
      192.168.24.0/24 is variably subnetted, 2 subnets, 2 masks
C        192.168.24.0/24 is directly connected, FastEthernet1/0
L        192.168.24.1/32 is directly connected, FastEthernet1/0
PE2#

PE3#show ip vrf
  Name                             Default RD          Interfaces
  B                                100:2439            Fa1/0
PE3#show ip route vrf B

Routing Table: B

Gateway of last resort is not set

      192.168.3.0/24 is variably subnetted, 2 subnets, 2 masks
C        192.168.3.0/24 is directly connected, FastEthernet1/0
L        192.168.3.1/32 is directly connected, FastEthernet1/0
B     192.168.19.0/24 [200/0] via 10.255.255.4, 00:01:54
PE3#

PE4#show ip vrf
  Name                             Default RD          Interfaces
  B                                100:2895            Fa1/0
PE4#show ip route vrf B

Routing Table: B

Gateway of last resort is not set

B     192.168.3.0/24 [200/0] via 10.255.255.3, 00:02:21
      192.168.19.0/24 is variably subnetted, 2 subnets, 2 masks
C        192.168.19.0/24 is directly connected, FastEthernet1/0
L        192.168.19.1/32 is directly connected, FastEthernet1/0
PE4#

Now, let's introduce an additional subnet on VRF A. It uses the same address space as VRF B but they are completely separate so that should be fine (right?!).

PE1(config)#ip route vrf A 192.168.3.0 255.255.255.0 192.168.1.10
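For that static route to appear in MP-BGP at all it has to be redistributed into the VRF's BGP address family; the "incomplete" origin in the route reflector output below suggests something along these lines (a minimal sketch, assuming AS 100 from the RD/RT administrator field):

router bgp 100
 address-family ipv4 vrf A
  ! advertise the new static route into MP-BGP for VRF A
  redistribute static
 exit-address-family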

Customer A is now happy as their new network is reachable over the VPN, but all of a sudden we have customer B on the phone, complaining that their site (which used to work) is off the air. Looking at PE4 we can see why:

PE4#show ip route vrf B

Routing Table: B

Gateway of last resort is not set

      192.168.19.0/24 is variably subnetted, 2 subnets, 2 masks
C        192.168.19.0/24 is directly connected, FastEthernet1/0
L        192.168.19.1/32 is directly connected, FastEthernet1/0
PE4#

The route to 192.168.3.0/24 has disappeared! Why is that? Looking at the route reflector gives us the answer:


RR#show bgp vpnv4 uni all 192.168.3.0/24
BGP routing table entry for 100:2439:192.168.3.0/24, version 14
Paths: (2 available, best #1, no table)
  Advertised to update-groups:
     3
  Refresh Epoch 1
  Local, (Received from a RR-client)
    10.255.255.1 (metric 3) from 10.255.255.1 (10.255.255.1)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:100
      mpls labels in/out nolabel/25
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 1
  Local, (Received from a RR-client)
    10.255.255.3 (metric 3) from 10.255.255.3 (10.255.255.3)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:100:200
      mpls labels in/out nolabel/28
      rx pathid: 0, tx pathid: 0
RR#

Both 192.168.3.0/24 prefixes are being advertised with the same RD but different route targets. The route reflector has therefore seen two "identical" prefixes and has chosen a best path between them - with nothing more meaningful to compare, the tie is broken on arbitrary criteria (lowest router ID / neighbour address), which favours the route from PE1's VRF A:



Since the route reflector only advertises its best paths to its clients, nobody gets to hear about the route from PE3's VRF B. The route advertised from PE1 carries route target 100:100, which doesn't match any VRF on PE4, so PE4 simply discards it, leaving VRF B with no way to reach the 192.168.3.0/24 network.

This proves that:

If you fail to keep route distinguishers globally unique on at least a per-VRF basis, changes in one VRF can impact another. This is true irrespective of whether the two VRFs share any devices, and it happens even when their route targets are completely different.
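The fix is simply to give one of the clashing VRFs an RD that isn't used anywhere else in the network. A sketch of the intended end state is below - the value 100:2440 is arbitrary, and bear in mind that on most IOS versions an RD cannot simply be edited in place, so re-keying a VRF is a disruptive change:

! PE3 - VRF B re-keyed with a globally unique RD
ip vrf B
 rd 100:2440
 route-target export 100:200
 route-target import 100:200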

Policy at the PE

A similar but more subtle example of where globally unique route distinguishers are a benefit is the case where a multi-homed customer network (192.168.200.0/24 in this lab) is connected, or routed, via two PEs for resilience.
For administrative reasons (say the purple link is cheaper, or faster) we want to use the purple link to reach this prefix whenever it is available. Routes learned over the purple link are tagged with community 100:123 so that upstream PEs can recognise them. Let's compare the case where both PEs use the same RD against the case where each PE uses a unique RD for the same VRF. Firstly, the same RD:



PE1 and PE2 are set to use the same RD. PE3 wants to prefer the purple routes, so it is set up with a policy to favour anything with the 100:123 community attached, as follows:

ip vrf A
 rd 100:2512
 import map A-import-map
 route-target export 100:100
 route-target import 100:100
!
route-map A-import-map permit 10
 match community purple
 set local-preference 200
!
route-map A-import-map permit 20
 set local-preference 100
!
ip community-list standard purple permit 100:123
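For completeness, the 100:123 community has to be attached somewhere. The outputs below suggest the purple link terminates on PE2 (its route carries community 6553723, which is 100:123 in decimal), so a plausible sketch of the ingress tagging is shown here - the CE neighbour address is assumed, the customer AS 200 is taken from the AS path in the outputs, and note that standard communities have to be explicitly sent towards the route reflector:

route-map FROM-PURPLE-CE permit 10
 set community 100:123
!
router bgp 100
 address-family ipv4 vrf A
  ! 192.168.200.2 is a hypothetical purple CE address
  neighbor 192.168.200.2 remote-as 200
  neighbor 192.168.200.2 activate
  neighbor 192.168.200.2 route-map FROM-PURPLE-CE in
 exit-address-family
 address-family vpnv4
  ! send standard as well as extended communities to the route reflector
  neighbor 10.255.255.200 send-community both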

For some reason, though, all of our traffic goes out via the orange link. What is happening is that the route reflector is again receiving two identical VPNv4 prefixes. This doesn't cause a reachability problem, since both copies belong to the same VRF, but the route reflector still makes a best path determination and only advertises the winner onwards. PE3 receives just one route, so its policy has to take what it can get:

PE3#show bgp vpnv4 uni all 192.168.200.0/24
BGP routing table entry for 100:2439:192.168.200.0/24, version 61
Paths: (1 available, best #1, no table)
  Not advertised to any peer
  Refresh Epoch 4
  200
    10.255.255.1 (metric 4) from 10.255.255.200 (10.255.255.200)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:100
      Originator: 10.255.255.1, Cluster list: 10.255.255.200
      mpls labels in/out nolabel/30
      rx pathid: 0, tx pathid: 0x0
BGP routing table entry for 100:2512:192.168.200.0/24, version 68
Paths: (1 available, best #1, table A)
  Not advertised to any peer
  Refresh Epoch 4
  200, imported path from 100:2439:192.168.200.0/24 (global)
    10.255.255.1 (metric 4) from 10.255.255.200 (10.255.255.200)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:100
      Originator: 10.255.255.1, Cluster list: 10.255.255.200
      mpls labels in/out nolabel/30
      rx pathid: 0, tx pathid: 0x0
PE3#

Clearly this is not doing what we want. The local VRF table only has one option, and that's the orange route. Let's try the same thing but with unique RDs per VRF per PE:
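Concretely, the only change needed in the lab is for PE2's copy of VRF A to advertise with an RD of its own (values reconstructed from the outputs that follow):

! PE1 - VRF A keeps its original RD
ip vrf A
 rd 100:2439
 route-target export 100:100
 route-target import 100:100
!
! PE2 - the same VRF, but now with a unique RD
ip vrf A
 rd 100:2458
 route-target export 100:100
 route-target import 100:100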



Now we see this at PE3:

PE3#show bgp vpnv4 uni all 192.168.200.0/24
BGP routing table entry for 100:2439:192.168.200.0/24, version 61
Paths: (1 available, best #1, no table)
  Not advertised to any peer
  Refresh Epoch 4
  200
    10.255.255.1 (metric 4) from 10.255.255.200 (10.255.255.200)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:100:100
      Originator: 10.255.255.1, Cluster list: 10.255.255.200
      mpls labels in/out nolabel/30
      rx pathid: 0, tx pathid: 0x0
BGP routing table entry for 100:2458:192.168.200.0/24, version 62
Paths: (1 available, best #1, no table)
  Not advertised to any peer
  Refresh Epoch 4
  200
    10.255.255.2 (metric 4) from 10.255.255.200 (10.255.255.200)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Community: 6553723
      Extended Community: RT:100:100
      Originator: 10.255.255.2, Cluster list: 10.255.255.200
      mpls labels in/out nolabel/29
      rx pathid: 0, tx pathid: 0x0
BGP routing table entry for 100:2512:192.168.200.0/24, version 64
Paths: (2 available, best #1, table A)
  Not advertised to any peer
  Refresh Epoch 4
  200, imported path from 100:2458:192.168.200.0/24 (global)
    10.255.255.2 (metric 4) from 10.255.255.200 (10.255.255.200)
      Origin incomplete, metric 0, localpref 200, valid, internal, best
      Community: 6553723
      Extended Community: RT:100:100
      Originator: 10.255.255.2, Cluster list: 10.255.255.200
      mpls labels in/out nolabel/29
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 4
  200, imported path from 100:2439:192.168.200.0/24 (global)
    10.255.255.1 (metric 4) from 10.255.255.200 (10.255.255.200)
      Origin incomplete, metric 0, localpref 100, valid, internal
      Extended Community: RT:100:100
      Originator: 10.255.255.1, Cluster list: 10.255.255.200
      mpls labels in/out nolabel/30
      rx pathid: 0, tx pathid: 0
PE3#

Now we can see that both routes are received (purple and orange) and our route-map has taken effect, pushing the purple route up to a better local preference and causing it to be selected into the VRF A table.

In summary, if you use the same route distinguisher at more than one point where the same IP prefix is learned, the best path determination happens at the route reflector rather than at the receiving PE. The route reflector's decision is necessarily coarse, and applying per-VRF policy on a route reflector is neither practical nor appropriate. Using unique RDs ensures that multiple copies of the same IP prefix reach the other PEs, allowing each receiving PE to make its own best path determination using arbitrary per-VRF local policy.

Fast Failover


The final example is one of the most widely seen use cases for unique RD per VRF per PE. Let's take a look at the failover times for a route to move between PE1 and PE2 in the following scenario:



In the case of matching RDs, only one route for the destination is learned throughout the network, so when a failure occurs a chain of BGP updates has to ripple through before traffic can switch paths. In a real environment this chain of updates takes time, and in a scaled environment (and, for illustrative purposes, in this lab) there may be hierarchical route reflectors, each of which batches and delays the updates further.
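One way such a delay could be simulated in a lab (a purely hypothetical sketch - the exact configuration used here isn't shown) is to raise the minimum route advertisement interval towards each route reflector client:

! Hypothetical: hold VPNv4 updates towards an RR client for 10 seconds
! (10.255.255.10 is an assumed client address, not taken from the lab)
router bgp 100
 address-family vpnv4
  neighbor 10.255.255.10 advertisement-interval 10

Here is an example failover with two tiers of route reflectors and an update delay of 10 seconds: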

CE2#ping 1.1.1.1 repeat 100000

Type escape sequence to abort.
Sending 100000, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.......UUUUUUUUUUUUUUUUU
UUUUU.UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU.UUUUUUUUUUUUUUUU
UUUUUUUUUUUUUUUUUUUUUUUUUUU.UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
UUUUU.UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU.!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.
Success rate is 37 percent (130/344), round-trip min/avg/max = 4/8/152 ms
CE2#

This failover takes around 30 seconds because the cascading updates are batched and delayed multiple times on their way through the route reflector hierarchy. The "U" marks above show that the PE serving CE2 has no route at all to the destination: it has received the withdraw for the primary path but has not yet received the advertisement of the standby. The diagram below shows the BGP updates which need to take place before routing converges onto the standby path:


Compare this to the output when unique RDs are used: the alternate path is already known throughout the network and is simply not selected by the PE serving CE2 until the primary is withdrawn:


 
CE2#ping 1.1.1.1 repeat 100000

Type escape sequence to abort.
Sending 100000, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.......!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.
Success rate is 94 percent (131/139), round-trip min/avg/max = 5/9/16 ms
CE2#

As you can see, the failover is much faster, around 10-15 seconds, and there are no unreachables because the PE always has a route in its table (even if, for a short while, it is one that points at the failed path).

Super Fast Failover


This can be improved further by using label per VRF mode in addition to unique RDs. Without going into too much detail, the default behaviour for Cisco IOS is to allocate a VPN label per prefix: the LFIB of the advertising PE ends up with an entry that effectively says "if I receive label X, I will stick encapsulation Y on the packet and throw it out of interface Z". This can be changed as follows:

PE2(config)#mpls label mode vrf A protocol bgp-vpnv4 per-vrf 
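If you want to confirm which label allocation mode is in effect, the VRF's label information can be inspected with commands like these (output omitted here, as it varies by platform and software release):

! In the default per-prefix mode each VRF prefix gets its own local label;
! in label per VRF mode all of the VRF's prefixes share one aggregate label
show ip bgp vpnv4 vrf A labels
show mpls forwarding-table vrf A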

In label per VRF mode, the same label is advertised for all prefixes within a particular VRF, and the corresponding LFIB entry essentially says "pop the label and do an IP route lookup, in this VRF, on the packet that follows". That IP lookup is what makes the difference: when the primary link fails, the egress PE can instantly switch to the standby route, which it already knows about thanks to the unique RDs, without waiting for any BGP updates at all. Traffic gets U-turned back into the MPLS network while the BGP convergence occurs, but at least it arrives:



Traffic temporarily hops via the primary PE into the secondary, restoring connectivity while BGP takes its sweet time to converge. Once the routing updates have propagated, traffic will go directly to the secondary PE. Failover times here are much more impressive:

CE2#ping 1.1.1.1 repeat 100000

Type escape sequence to abort.
Sending 100000, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 99 percent (200/201), round-trip min/avg/max = 5/7/12 ms
CE2#

One ping lost / two seconds to fail over. Much better, and only possible with unique RDs!

Using unique RDs allows for much faster failover, because fewer BGP updates are needed to converge after a failure. This is particularly true when using label per VRF mode, since egress PEs can U-turn traffic without waiting for any BGP convergence at all.