Sunday, 7 May 2017

Building Chassis Cluster on Juniper SRX

For a while I've wanted to post about Juniper SRX chassis cluster - I had to do some in-depth troubleshooting on it once and found that the information I needed was scattered across several documents and proved tricky to bring together.

Anyway, a couple of weeks ago I found the time to create a YouTube video showing how to do the basic setup of a chassis cluster, how to tell it is working, how to manually fail over, how to recover a disabled node and finally how to remove the chassis cluster configuration. The video is here:


However, at just over half an hour long it's a bit unwieldy if you only want some of the information! For easy searching, I've decided to write this accompanying blog post.

Chassis Cluster Concepts


Chassis cluster is Juniper's hardware HA mechanism for the SRX series. The chassis cluster mechanism keeps configurations and connection state replicated between devices so that, in the event of a failure, a standby device can take over the functions of the cluster with little downtime.

When a pair of devices is configured for chassis cluster, a raft of new port types comes into play:

  • Control (CTRL) - The key to chassis cluster - responsible for replicating configuration and state, liveness checks and other housekeeping tasks. The CTRL port is usually chosen by JunOS and is fixed for a particular hardware type.
  • Fabric (fab0 / fab1) - Used to carry traffic between devices when a port goes down on the active device and traffic enters the standby (or for ports configured only on one member device). The fabric port is also used to avoid split brain in the event of a control link failure - if CTRL goes away but fabric remains, the two devices know not to both go active. Fabric ports are configured manually and there may be up to 2 pairs.
  • Out of Band Management (fxp0) - Used to manage the individual devices. The fxp0 interfaces do not fail over when there is a mastership change and always belong to the specific member, allowing management access to both devices irrespective of which is master. The fxp0 port is usually chosen by JunOS and will be fixed for a given hardware platform. (On branch platforms such as the SRX 100, fxp1 is the name JunOS gives to the control link rather than a management port.)
  • Redundant Ethernet (reth) - When a port on each device is configured for the same purpose, it is called a redundant Ethernet or "reth" interface. The active member of a reth moves as mastership changes and when there are connectivity failures. An arbitrary number of reth interfaces may be configured, depending on how many ports are available.

The diagram below shows our topology:


Preparing for Chassis Cluster


When setting up a chassis cluster I strongly recommend completely blowing away the configuration on your devices first. The SRX 100 (and potentially other platforms) does not let you have Ethernet switching configuration while in chassis cluster mode, so even the vanilla factory default configuration can upset chassis cluster. Just clear everything, set the root password, then commit:

root> configure
Entering configuration mode

[edit]
root# delete
This will delete the entire configuration
Delete everything under this level? [yes,no] (no) yes

[edit]
root# set system root-authentication plain-text-password
New password:
Retype new password:


root# commit and-quit
commit complete
Exiting configuration mode


root>

Once this is done on both devices, we can begin the chassis cluster configuration.

Enabling Chassis Cluster


The first task with chassis cluster is to choose a cluster ID (from 0 to 255, although 0 means "no chassis cluster"). The cluster ID must match between members. In most cases you can just pick 1, but when there are multiple clusters on the same layer 2 network you will need to use a different cluster ID for each cluster.

At the same time as the cluster ID is applied, we must also apply the node ID. Each device must have a unique node ID within the cluster. In fact, 0 and 1 are the only valid node IDs, so in essence one device must be node 0 and the other node 1.

For these settings to take effect the devices must be rebooted, which can be requested by adding the "reboot" keyword:

root> set chassis cluster cluster-id 1 node 0 reboot
Successfully enabled chassis cluster. Going to reboot now.

Note: this is done from the exec CLI context, not the configuration context. Perform the same command on the second device but using node 1 in place of node 0.
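
For completeness, the matching command on the second device would look like the sketch below. It is simply the command above with the node ID swapped, as described, and is still run from the exec context:

```
root> set chassis cluster cluster-id 1 node 1 reboot
```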

Once the devices have rebooted, you will notice that the prompt has changed to indicate the node number and activity status ("primary" or "secondary"). For a brief period following boot, the status on both devices will show as "hold" - during this time the device refrains from becoming active while it checks to see if there is another cluster member already serving.

{primary:node0}
root>


and

{secondary:node1}
root>


Once the devices have entered this state, configuration applied to one device will replicate to the other (in either direction). From here the rest of the chassis cluster configuration can be applied.

Note: If the devices both become active, check that their control link is up.
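
One way to check this is "show chassis cluster interfaces", which reports the state of the control and fabric links (and, once they are configured, the reth interfaces). Only the command is sketched here, as the output layout varies between JunOS releases:

```
{primary:node0}
root> show chassis cluster interfaces
```

On a healthy pair you would expect the control link status to be reported as up on both nodes.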

Redundancy Groups


Redundancy groups are primarily used to bundle resources that need to fail over together. Resources in the same redundancy group are always "live" on the same member firewall as each other, while different redundancy groups can be active on the same or different devices to one another. This allows some resources to be active on one cluster member while others are active on the opposite device.

To begin with there is only one redundancy group, group 0, which decides which firewall is the active routing engine. As a best practice you should always influence which device will become master in the event of a simultaneous reboot. This is done by setting the priority as follows:

{primary:node0}[edit]
root# set chassis cluster redundancy-group 0 node 0 priority 100
{primary:node0}[edit]
root# set chassis cluster redundancy-group 0 node 1 priority 50


Note that group 0 is not and cannot be pre-emptive: a higher priority only takes effect when there is an election (i.e. at boot time), not when a higher priority device appears while a lower priority device is active.

Later on we will configure redundant Ethernet interfaces, which is where groups 1 and upwards come into play. We will create group 1 ready for the interfaces - note this can be set to pre-empt if you like:

{primary:node0}[edit]
root# set chassis cluster redundancy-group 1 node 0 priority 100
{primary:node0}[edit]
root# set chassis cluster redundancy-group 1 node 1 priority 50


Here I've set the priorities the same between the groups. It's very possible to have group 0 active on one device and group 1 active on the other, however it's messy: lots of traffic has to traverse the fabric link(s) and it increases your exposure. In that state, failure of either firewall would have some impact on connectivity, whereas when they are aligned you could lose the standby with no impact to service.

For this reason, I don't configure pre-empt - that way all groups should be active on the same device unless manually tweaked. If you'd rather it be revertive, use this command:

{primary:node0}[edit]
root@SRX-top# set chassis cluster redundancy-group 1 preempt


Applying Configuration to Single Devices


While it's useful to have exactly the same configuration on the firewalls for most things, it is very useful to be able to keep some configuration unique per device. Good examples of this are the device hostname and management IP address.

This is achieved using groups called "node0" and "node1" which are applied per device using a special macro:

{primary:node0}[edit]
root# set groups node0 system host-name SRX-top
{primary:node0}[edit]
root# set groups node0 interfaces fxp0 unit 0 family inet address 172.16.1.1/24
{primary:node0}[edit]
root# set groups node1 system host-name SRX-bottom
{primary:node0}[edit]
root# set groups node1 interfaces fxp0 unit 0 family inet address 172.16.1.2/24

{primary:node0}[edit]
root# set apply-groups ${node}


What we've done here is to define a node0 group which defines the hostname and out of band IP for node 0, then the same for node 1. Finally, the apply group uses the "${node}" macro to apply the node0 group to node 0 and node1 to node 1.

Defining Fabric Interfaces


In order to allow traffic to traverse between the clustered devices, at least one fabric interface per node must be configured (up to two per node are allowed). In our case we will configure this on port 5 so that it is adjacent to the other special purpose ports. There are two "fab" virtual interfaces on the cluster, fab0 associated with node 0 and fab1 associated with node 1:

{primary:node0}[edit]
root# set interfaces fab0 fabric-options member-interfaces fe-0/0/5
{primary:node0}[edit]
root# set interfaces fab1 fabric-options member-interfaces fe-1/0/5


Note that ports on node 0 are denoted by fe-0/x/x while ports on node 1 are denoted by fe-1/x/x. If you want a dual fabric (either for resilience or to cope with a lot of inter-chassis traffic in the event of a failover) then just add the second interface on each side in exactly the same way.
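
As a sketch of what a dual fabric might look like, the second member is added to each fab interface alongside the first. Port 4 is purely an assumption for illustration (it is not used in the video), so substitute whichever spare port you have on each node:

```
{primary:node0}[edit]
root# set interfaces fab0 fabric-options member-interfaces fe-0/0/4
{primary:node0}[edit]
root# set interfaces fab1 fabric-options member-interfaces fe-1/0/4
```

Because member-interfaces takes a list, these lines add to the existing fe-0/0/5 and fe-1/0/5 members rather than replacing them.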

Redundant Ethernet (reth) Interfaces


In order to be highly available, each traffic interface needs to have a presence on node 0 and node 1 (otherwise interfaces would be lost when a failover occurs). In the SRX chassis cluster world, this pairing of interfaces is done using a redundant Ethernet or "reth" virtual interface.

The first step in configuring redundant Ethernet interfaces is to decide how many are allowed (similar to ae interfaces):

{primary:node0}[edit]
root# set chassis cluster reth-count 5


Next we configure the member interfaces that will belong to each reth (note: on higher end SRX this will be gigether-options or ether-options rather than fastether-options):

{primary:node0}[edit]
root# set interfaces fe-0/0/0 fastether-options redundant-parent reth0
{primary:node0}[edit]
root# set interfaces fe-1/0/0 fastether-options redundant-parent reth0


...and assign the reth to a redundancy group (mentioned earlier):

{primary:node0}[edit]
root# set interfaces reth0 redundant-ether-options redundancy-group 1


At this point the reth can be configured like any other routed interface - units, address families, security zones, etc. are all used in exactly the same way as a normal port.
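
As a minimal illustration, giving reth0 an address and placing it in a security zone looks just like it would for a physical port. The address and zone name here are the ones from the lab configuration used in the video, so adjust them to suit your own network:

```
{primary:node0}[edit]
root# set interfaces reth0 unit 0 family inet address 10.10.10.10/24
{primary:node0}[edit]
root# set security zones security-zone untrust interfaces reth0.0
```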

For reference, here's the full configuration as used in the video:

set groups node0 system host-name SRX-top
set groups node0 interfaces fxp0 unit 0 family inet address 172.16.1.1/24
set groups node1 system host-name SRX-bottom
set groups node1 interfaces fxp0 unit 0 family inet address 172.16.1.2/24
set apply-groups "${node}"
set system root-authentication encrypted-password "$1$X8eRYomW$Wbxj8V0ySW/5dQCXrkYD70"
set chassis cluster reth-count 5
set chassis cluster redundancy-group 0 node 0 priority 100
set chassis cluster redundancy-group 0 node 1 priority 50
set chassis cluster redundancy-group 1 node 0 priority 100
set chassis cluster redundancy-group 1 node 1 priority 50
set interfaces fe-0/0/0 fastether-options redundant-parent reth0
set interfaces fe-0/0/1 fastether-options redundant-parent reth1
set interfaces fe-1/0/0 fastether-options redundant-parent reth0
set interfaces fe-1/0/1 fastether-options redundant-parent reth1
set interfaces fab0 fabric-options member-interfaces fe-0/0/5
set interfaces fab1 fabric-options member-interfaces fe-1/0/5
set interfaces reth0 redundant-ether-options redundancy-group 1
set interfaces reth0 unit 0 family inet address 10.10.10.10/24
set interfaces reth1 redundant-ether-options redundancy-group 1
set interfaces reth1 unit 0 family inet address 192.168.0.1/24
set security nat source rule-set trust-to-untrust from zone trust
set security nat source rule-set trust-to-untrust to zone untrust
set security nat source rule-set trust-to-untrust rule nat-all match source-address 0.0.0.0/0
set security nat source rule-set trust-to-untrust rule nat-all then source-nat interface
set security policies from-zone trust to-zone untrust policy allow-all match source-address any
set security policies from-zone trust to-zone untrust policy allow-all match destination-address any
set security policies from-zone trust to-zone untrust policy allow-all match application any
set security policies from-zone trust to-zone untrust policy allow-all then permit
set security zones security-zone untrust interfaces reth0.0
set security zones security-zone trust interfaces reth1.0


Checking and Troubleshooting


Now that the configuration is in place, we should verify its status. There are a number of commands we can use to check the operation of chassis cluster; probably the most frequently used is "show chassis cluster status":

{primary:node0}
root@SRX-top> show chassis cluster status
Monitor Failure codes:
    CS  Cold Sync monitoring        FL  Fabric Connection monitoring
    GR  GRES monitoring             HW  Hardware monitoring
    IF  Interface monitoring        IP  IP monitoring
    LB  Loopback monitoring         MB  Mbuf monitoring
    NH  Nexthop monitoring          NP  NPC monitoring
    SP  SPU monitoring              SM  Schedule monitoring

Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 1
node0  100      primary        no      no       None
node1  50       secondary      no      no       None

Redundancy group: 1 , Failover count: 1
node0  100      primary        no      no       None
node1  50       secondary      no      no       None


From this you can see the configured priority for each node against each redundancy group, along with its operational status (i.e. whether it is acting as the primary or secondary).

Another useful command to verify proper operation is "show chassis cluster statistics":

{primary:node0}
root@SRX-top> show chassis cluster statistics
Control link statistics:
    Control link 0:
        Heartbeat packets sent: 424101
        Heartbeat packets received: 424108
        Heartbeat packet errors: 0
Fabric link statistics:
    Child link 0
        Probes sent: 834746
        Probes received: 834751
    Child link 1
        Probes sent: 0
        Probes received: 0
Services Synchronized:
    Service name                              RTOs sent    RTOs received
    Translation context                       0            0
    Incoming NAT                              0            0
<snip>

In this output we would expect the control link heartbeat counters to increase steadily over time (by more than 1 per second), and the same for the fabric probes. Usefully, if you have dual fabric links then activity is shown for each separately so that you can determine the health of both.
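
A simple way to confirm the counters are moving is to run the command twice a few seconds apart and compare the numbers. As a sketch, the narrower "control-plane statistics" variant keeps the output to the heartbeat and probe counters, without the long list of synchronized services:

```
{primary:node0}
root@SRX-top> show chassis cluster control-plane statistics
```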

One of the most useful commands available (which sometimes in older versions was not visible in the CLI help and would not auto-complete but would still run if typed completely) is "show chassis cluster information":

{primary:node0}
root@SRX-top> show chassis cluster information
node0:
--------------------------------------------------------------------------
Redundancy Group Information:

    Redundancy Group 0 , Current State: primary, Weight: 255

        Time            From           To             Reason
        Apr 28 15:06:58 hold           secondary      Hold timer expired
        Apr 28 15:07:01 secondary      primary        Better priority (1/1)

    Redundancy Group 1 , Current State: primary, Weight: 255

        Time            From           To             Reason
        May  3 12:59:58 hold           secondary      Hold timer expired
        May  3 12:59:59 secondary      primary        Better priority (100/50)

Chassis cluster LED information:
    Current LED color: Green
    Last LED change reason: No failures
Control port tagging:
    Disabled

node1:
--------------------------------------------------------------------------
Redundancy Group Information:

    Redundancy Group 0 , Current State: secondary, Weight: 255

        Time            From           To             Reason
        Apr 28 15:06:34 hold           secondary      Hold timer expired

    Redundancy Group 1 , Current State: secondary, Weight: 255

        Time            From           To             Reason
        May  3 12:59:54 hold           secondary      Hold timer expired

Chassis cluster LED information:
    Current LED color: Green
    Last LED change reason: No failures
Control port tagging:
    Disabled


The brilliant part about this command is that it shows you a history of exactly when and why the firewalls last changed state. The "Reason" field is really quite explanatory, giving reasons such as "Manual failover".

Manual Failover


Failing over the SRX chassis cluster is not quite as straightforward as with some other vendors' firewalls - for a start there are at least 2 redundancy groups to fail over, but in addition to that the forced activity is 'sticky', i.e. you have to clear out the forced mastership to put the cluster back to normal.

So let's say we have node0 active on both redundancy groups:

{secondary:node1}
root> show chassis cluster status 
<snip> 
Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 0
node0  100      primary        no      no       None           
node1  50       secondary      no      no       None           

Redundancy group: 1 , Failover count: 2
node0  100      primary        no      no       None           
node1  50       secondary      no      no       None          

We can fail over redundancy group 1 as follows:

{secondary:node1}
root> request chassis cluster failover redundancy-group 1 node 1 
node1:
--------------------------------------------------------------------------
Initiated manual failover for redundancy group 1

Now when we check, we can see that node1 is the primary as expected but also its priority has changed to 255 and the "Manual" column shows "yes" for both devices. This indicates that node1 is forced primary and, effectively, can't be pre-empted even if that is set up on the group:

{secondary:node1}
root> show chassis cluster status 
<snip> 
Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 0
node0  100      primary        no      no       None           
node1  50       secondary      no      no       None           

Redundancy group: 1 , Failover count: 3
node0  100      secondary      no      yes      None           
node1  255      primary        no      yes      None           

If you have pre-empt enabled on the redundancy-group then you will need to leave it like this for as long as you want node1 to remain active. If not then you can clear the forced mastership out immediately:

root> request chassis cluster failover reset redundancy-group 1 
node0:
--------------------------------------------------------------------------
No reset required for redundancy group 1.

node1:
--------------------------------------------------------------------------
Successfully reset manual failover for redundancy group 1

Just remember to do this for both (or all) redundancy groups if you want to take node0 out of service for maintenance.
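
As a sketch, the equivalent pair of commands for redundancy group 0 would be as follows (output omitted, since it was not captured for this post). The first moves the routing engine mastership to node 1:

```
root> request chassis cluster failover redundancy-group 0 node 1
```

The second clears the forced mastership afterwards, in the same way as for group 1:

```
root> request chassis cluster failover reset redundancy-group 0
```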

Fabric Links and Split Brain


In addition to transporting traffic between cluster members when redundancy-groups are active on different members, the fabric link or links carry keepalive messages. This not only ensures that the fabric links are usable but is also used as a method to prevent "split brain" in the event that the single control link goes down.

The logic that the SRX uses is as follows:

If the control link is lost but fabric is still reachable, the secondary node is immediately put into an "ineligible" state:

{secondary:node1}
root> show chassis cluster status    
<snip>
Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 0
node0  0        lost           n/a     n/a      n/a            
node1  50       ineligible     no      no       None           

Redundancy group: 1 , Failover count: 4
node0  0        lost           n/a     n/a      n/a            
node1  50       ineligible     no      no       None           

If the fabric link is also lost during the next 180s then the primary is considered to be dead and the secondary node becomes primary. If the fabric link is not lost during the 180s window then the standby device switches from "ineligible" to "disabled". This happens even if the control link recovers in the meantime, as shown here (the partner node changes from "lost" to "primary"):

{ineligible:node1}
root> show chassis cluster status    
<snip>
Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 0
node0  100      primary        no      no       None           
node1  50       ineligible     no      no       None           

Redundancy group: 1 , Failover count: 4
node0  100      primary        no      no       None           
node1  50       ineligible     no      no       None           

Once 180s passes, the device will still go into a "disabled" state:

{ineligible:node1}
root> show chassis cluster status         
<snip> 
Cluster ID: 1
Node   Priority Status         Preempt Manual   Monitor-failures

Redundancy group: 0 , Failover count: 0
node0  100      primary        no      no       None           
node1  50       disabled       no      no       None           

Redundancy group: 1 , Failover count: 4
node0  100      primary        no      no       None           
node1  50       disabled       no      no       None           

The output of "show chassis cluster information" makes it quite clear what happened:

        May  7 16:46:38 secondary      ineligible     Control link failure
        May  7 16:49:38 ineligible     disabled       Ineligible timer expired

From the disabled state, the node can never become active. To recover from becoming "disabled", the affected node must be rebooted (later releases allow auto recovery, but this seems to just reboot the standby device anyway and that idea rubs me up the wrong way).
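
As a sketch of the two options: the manual recovery is an ordinary reboot issued on the disabled node:

```
root> request system reboot
```

The automatic recovery is, to the best of my knowledge, enabled with the statement below. Treat it as an assumption to check against the documentation for your release, as it is not part of the configuration used in the video:

```
{primary:node0}[edit]
root# set chassis cluster control-link-recovery
```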

Removing Chassis Cluster


In order to remove chassis cluster from your devices, just go onto each node and run:

root@SRX-bottom> set chassis cluster disable    

For cluster-ids greater than 15 and when deploying more than one
cluster in a single Layer 2 BROADCAST domain, it is mandatory that
fabric and control links are either connected back-to-back or
are connected on separate private VLANS.
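
The devices need a reboot before they actually leave cluster mode, so the "reboot" keyword seen earlier can be added here too. Since cluster ID 0 means "no chassis cluster", setting it should have the same effect. Both forms below are a sketch rather than something shown in the video, and the second is run with the node ID of the device you are on:

```
root@SRX-bottom> set chassis cluster disable reboot
```

```
root@SRX-bottom> set chassis cluster cluster-id 0 node 1 reboot
```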

Also, while not absolutely required, I strongly recommend:

{secondary:node1}
root@SRX-bottom> request system zeroize 
warning: System will be rebooted and may not boot without configuration
Erase all data, including configuration and log files? [yes,no] (no) yes 

error: the ipsec-key-management subsystem is not responding to management requests
warning: zeroizing node1

Bye, bye, chassis cluster!