Saturday 11 July 2015

Testing the Impact of Local Packet Capture on the Cisco 6500 Series

For a while now, many of the larger Cisco devices (such as 6500 and 7600s) have supported local packet capture. I've always been hesitant to use these in a production environment, primarily due to concerns about the potential performance hit it could cause. Anyway, I decided to test out the potential impact in the lab to see whether it was sometimes / always / never safe to run local captures.

Test Setup

For my test setup I used a 6504-E switch with a modest SUP32-GE-3B supervisor and a 12.2 Advanced IP Services IOS - if it works on that, it should be safe anywhere. For traffic, I used a spare server running Ubuntu with a combination of Ostinato and Scapy.
The configuration was as follows:
Running at full tilt, Ostinato was happily producing 1 Gbps of traffic which went into my node on VLAN 100, out around the loop cable, back into the node and then out of the same interface on VLAN 101. Basically the port is running at 1 Gbps in each direction, so the worst possible case for mirroring a gig port.

Default Settings

The default settings for the capture are pretty conservative - a tiny 2 MB linear capture buffer with a rate limit of 10,000 frames per second. With this config, a 1 Gbps stream of 1500 byte packets fills the buffer in ~ 2.5s, triggering the capture to end. The impact of this is almost impossible to measure at all, with the capture being over so quickly you may not actually see any change in CPU on the 5 second roll-ups.
Lab-6503E#monitor capture start
*Jul 11 14:30:59.205: %SPAN-5-PKTCAP_START: Packet capture session 1 started
Lab-6503E#
*Jul 11 14:31:01.449: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended as the buffer is full, 21845 packets captured
Lab-6503E#show proc cpu hist
                                                              
                                                              
                         222223333322222               8888811
100                                                           
 90                                                           
 80                                                           
 70                                                           
 60                                                           
 50                                                           
 40                                                           
 30                                                           
 20                                                           
 10                                                    *****  
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5    
               CPU% per second (last 60 seconds)

Worst Case

OK, so the world didn't end. The next step was to see how bad it could be so I made the following changes:
  • Increased the rate limit to 100,000 fps (max)
  • Increased the packet buffer to 64 MB (max)
  • Enabled a circular buffer (why?!)
This time I set the capture to run for 60 seconds and checked the impact, which was much more severe:
Lab-6503E#monitor capture circular buffer size 65535 start for 60 sec
*Jul 11 14:45:02.953: %SPAN-5-PKTCAP_START: Packet capture session 1 started
*Jul 11 14:46:02.945: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 699040 packets captured
Lab-6503E#show proc cpu hist

    2222244444444444444455555444444444444444444444444444444444
    9999999999888889999966666999999999999999999998888899999999
100
 90
 80
 70
 60                     *****
 50      *****************************************************
 40      *****************************************************
 30 **********************************************************
 20 **********************************************************
 10 **********************************************************
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5

               CPU% per second (last 60 seconds)

The impact to the supervisor in this case was much more noticeable - up to 60% CPU utilisation. The scenario is pretty unrealistic (forget circular buffers!) but suffice to say I wouldn't want to do that on a production device.
Now we have a the best and the worst cases, let's look at some realistic use cases and explore some of the other capture parameters that might help us capture what we need without causing havoc on the network.

Narrowing Down the Capture

It's possible to define criteria to decide what gets captured - as the CLI points out, some of these criteria are processed in hardware while others are handled in software:
Lab-6503E(config-mon-capture)#filter ?
  access-group  Filter access-list (hardware based)
  ethertype     Matching ethertype (software based)
  length        Matching L2-packet length (software based)
  mac-address   Matching mac-address (software-based)
  vlan          Filter vlan (hardware based)
Our test traffic consists of nearly a gig of junk run alongside a small 1-per-second ping, which we will decide is "interesting" to us and we want to capture. This traffic profile makes it easy to test ACL, MAC address and length filters and their relative performances.

ACL Filter

As the CLI says, ACL filters are applied in hardware so the junk is discarded at source before it hits the CPU. In this example I set up an ACL as follows:
Lab-6503E(config)#ip access-list extended icmp-only
Lab-6503E(config-ext-nacl)#permit icmp any any
Lab-6503E(config-ext-nacl)#deny ip any any
... and applied it to my capture as follows:
Lab-6503E(config)#monitor session 1
Lab-6503E(config-mon-capture)#filter access-group icmp-only
Now, I re-ran the "worst case" test, with massively different results:
Lab-6503E#monitor capture circular buffer size 65535 start for 60 sec
*Jul 11 14:49:00.345: %SPAN-5-PKTCAP_START: Packet capture session 1 started
*Jul 11 14:50:00.337: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 60 packets captured
Lab-6503E#show proc cpu hist

         22222222223333355555          11111
100
 90
 80
 70
 60
 50
 40
 30
 20
 10                     *****
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5
               CPU% per second (last 60 seconds)
Two things to note here - because this is a hardware filter, applied in the ASICs:
  • Only 1 packet per second was punted to the CPU, resulting in essentially no hit at all
  • All 60 of the ping packets were received and nothing else
In summary, ACL filters are ideal when you know what you want to pull from a big stream as you only pay a CPU penalty for the "good stuff".

MAC Filter

In contrast, the MAC filter runs in software. It's also pretty limited, only matching on source MAC. I repeated the above test but instead of an ACL match, applied a MAC filter as follows:
Lab-6503E(config)#monitor session 1
Lab-6503E(config-mon-capture)#filter mac-address 0011.2233.4455
This took us more-or-less back to the worst case:
Lab-6503E#monitor capture circular buffer size 65535 start for 60 sec
*Jul 11 15:05:55.197: %SPAN-5-PKTCAP_START: Packet capture session 1 started
*Jul 11 15:06:55.189: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 5 packets captured
Lab-6503E#show proc cpu hist

      44444444443333344444333334444444444444444444444444333333
    4422222111119999900000999990000011111333331111111111888889
100
 90
 80
 70
 60
 50
 40   ********************************************************
 30   ********************************************************
 20   ********************************************************
 10   ********************************************************
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5
               CPU% per second (last 60 seconds)

Yuck. I wouldn't do that in production. This is basically because the software filters are applied after the packets have been punted to the CPU, so you pay a penalty for the garbage as well as the good stuff. You'll notice that it only captured 5 packets as well, more on this later but that's another side effect of software filters.

Length Filter

The frame length filter is another software-based mechanism, which means it's pretty terrible under load, too. Our junk traffic consists of large frames, our interesting traffic is small, so let's configure the capture to only catch the short frames:
Lab-6503E(config)#monitor session 1
Lab-6503E(config-mon-capture)#filter length 0 100 
Again, the output is pretty miserable:
Lab-6503E#monitor capture circular buffer size 65535 start for 60 sec
*Jul 11 15:15:12.145: %SPAN-5-PKTCAP_START: Packet capture session 1 started
*Jul 11 15:16:12.137: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 17 packets captured
Lab-6503E#show proc cpu hist

    1111444443333344444444443333333333333333333333333444443333
    4444000009999911111222229999999999999999999999999222229999
100
 90
 80
 70
 60
 50
 40     ******************************************************
 30     ******************************************************
 20     ******************************************************
 10 **********************************************************
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5
               CPU% per second (last 60 seconds)
Again, the CPU took a hammering and we only captured a few of the ping packets - 17 out of 60.

Quirks / Order of Operations

Now you may think that software filters might be OK if we just reduce the rate-limit configured on the capture:
Lab-6503E(config)#monitor session 1
Lab-6503E(config-mon-capture)#rate-limit 100
This *does* do what we want for the CPU load - here's an example with a MAC filter:
Lab-6503E#monitor capture circular buffer size 65535 start for 60 sec
*Jul 11 15:26:43.793: %SPAN-5-PKTCAP_START: Packet capture session 1 started
*Jul 11 15:27:43.785: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 0 packets captured
Lab-6503E#show proc cpu hist

                    11111          11111          6666633333
100
 90
 80
 70
 60
 50
 40
 30
 20
 10                                               *****
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5
               CPU% per second (last 60 seconds)
Great - nothing to see here. But also nothing to see in the capture buffer - 0 packets captured!
Just for fun let's try the same with a hardware ACL filter:
*Jul 11 15:25:26.921: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 60 packets captured
Why is this? Well, it's the order of operations. Basically the flow for hardware ACL filters is:
So the filter throws out the junk before the rate limiter, meaning that the rate limiter only counts the good stuff. If the good stuff exceeds the rate limit then you'll lose some of it but the junk doesn't count.
Compare that to the flow for the software filters:
The software filters are applied after the rate limiter, so clearly when the rate limit is exceeded you throw out a mix of good and bad traffic, then pick out what's left of the good. If your traffic is overwhelmingly garbage, you may not get any of the good stuff at all!

Summary - Play it Safe

So in answer to the question "is it safe to run a local capture on a production 6500" - packet capture on even a relatively modest SUP32-3B supervisor is pretty safe provided you are cautious. If you want to do this in a busy production environment then my message to you is:
  • Use ACLs where at all possible
  • Set the rate limit to a sensible value (the default 1000 fps is fine for most cases)
  • Use linear buffers of a sensible size (do you really need 64MB of capture?)
  • Limit the frame count or capture duration at first (it may turn out there is a lot more "interesting" traffic than you thought!)
I'm convinced enough to do this in production but obviously I'm only testing one device, on one release of code, so don't blame me if you try it on something different and encounter a bug!

No comments:

Post a Comment