Test Setup
For my test setup I used a 6504-E switch with a modest SUP32-GE-3B supervisor running a 12.2 Advanced IP Services IOS image. If it works on that, it should be safe anywhere. For traffic, I used a spare server running Ubuntu with a combination of Ostinato and Scapy. The configuration was as follows:
Running at full tilt, Ostinato was happily producing 1 Gbps of traffic which went into my node on VLAN 100, out around the loop cable, back into the node and then out of the same interface on VLAN 101. Basically the port is running at 1 Gbps in each direction, so the worst possible case for mirroring a gig port.
Default Settings
The default settings for the capture are pretty conservative - a tiny 2 MB linear capture buffer with a rate limit of 10,000 frames per second. With this config, a 1 Gbps stream of 1500 byte packets fills the buffer in a little over 2 seconds, triggering the capture to end. The impact of this is almost impossible to measure: the capture is over so quickly that you may not actually see any change in CPU on the 5 second roll-ups.
    Lab-6503E#monitor capture start
    *Jul 11 14:30:59.205: %SPAN-5-PKTCAP_START: Packet capture session 1 started
    Lab-6503E#
    *Jul 11 14:31:01.449: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended as the buffer is full, 21845 packets captured
    Lab-6503E#show proc cpu hist
    [CPU history graph, "CPU% per second (last 60 seconds)": one 5-second bar reaches the 10% row, nothing higher]
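The numbers in that log line can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope check, assuming the buffer fills at exactly the rate limit (the offered load is far above 10,000 fps) and that every stored packet costs the same amount of buffer space. The per-packet figure it derives is an inference from the packet count, not something the switch reports.

```python
# Figures taken from the text and the PKTCAP_STOP log line above.
BUFFER_BYTES = 2 * 1024 * 1024   # default 2 MB linear buffer
RATE_LIMIT_FPS = 10_000          # default rate limit
PACKETS_CAPTURED = 21_845        # reported when the buffer filled


def bytes_per_stored_packet(buffer_bytes: int, packets: int) -> float:
    """Average buffer space used per packet (inferred from the counts)."""
    return buffer_bytes / packets


def fill_time_seconds(packets: int, rate_limit_fps: int) -> float:
    """Time to fill the buffer if frames arrive at exactly the rate limit."""
    return packets / rate_limit_fps


if __name__ == "__main__":
    per_packet = bytes_per_stored_packet(BUFFER_BYTES, PACKETS_CAPTURED)
    duration = fill_time_seconds(PACKETS_CAPTURED, RATE_LIMIT_FPS)
    print(f"{per_packet:.1f} bytes of buffer per packet")
    print(f"{duration:.2f} s to fill the buffer")
```

This works out to about 96 bytes of buffer per packet and roughly 2.2 seconds to fill, which lines up with the gap between the start and stop timestamps.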
Worst Case
OK, so the world didn't end. The next step was to see how bad it could be, so I made the following changes:
- Increased the rate limit to 100,000 fps (max)
- Increased the packet buffer to 64 MB (max)
- Enabled a circular buffer (why?!)
    Lab-6503E#monitor capture circular buffer size 65535 start for 60 sec
    *Jul 11 14:45:02.953: %SPAN-5-PKTCAP_START: Packet capture session 1 started
    *Jul 11 14:46:02.945: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 699040 packets captured
    Lab-6503E#show proc cpu hist
    [CPU history graph, "CPU% per second (last 60 seconds)": mostly 48-49%, with one 5-second peak of 56% and one 5-second interval at 29%]
The impact to the supervisor in this case was much more noticeable - up to 60% CPU utilisation. The scenario is pretty unrealistic (forget circular buffers!) but suffice to say I wouldn't want to do that on a production device.
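To make the linear versus circular difference concrete, here is a small Python sketch. It is a toy model rather than a description of how the supervisor implements its buffer: `capture` is a hypothetical helper, and the per-packet slot size is inferred from the log line above by dividing the buffer size by the reported packet count.

```python
from collections import deque

BUFFER_KB = 65_535          # buffer size from the command above
PACKETS_REPORTED = 699_040  # from the PKTCAP_STOP log line
RATE_LIMIT_FPS = 100_000    # maximum rate limit
DURATION_S = 60             # "start for 60 sec"

# Inferred buffer cost of one stored packet, in bytes.
slot_bytes = BUFFER_KB * 1024 // PACKETS_REPORTED
# Upper bound on frames the rate limiter could admit during the run.
max_admitted = RATE_LIMIT_FPS * DURATION_S


def capture(frames, capacity, circular):
    """Return what is left in a buffer of `capacity` slots after `frames`."""
    if circular:
        buf = deque(maxlen=capacity)  # oldest frames are silently overwritten
        buf.extend(frames)
        return list(buf)
    buf = []
    for frame in frames:
        if len(buf) == capacity:
            break  # linear buffer: the capture stops once it is full
        buf.append(frame)
    return buf


if __name__ == "__main__":
    print(slot_bytes, max_admitted)
    print(capture(range(10), 4, circular=False))
    print(capture(range(10), 4, circular=True))
```

With ten frames and four slots, the linear buffer keeps the first four frames and stops, while the circular one ends up holding the last four. On the switch that means the CPU keeps processing frames for the full 60 seconds (up to 6,000,000 of them at the maximum rate limit) even though only the final buffer-full is retained.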
Now that we have the best and the worst cases, let's look at some realistic use cases and explore some of the other capture parameters that might help us capture what we need without causing havoc on the network.
Narrowing Down the Capture
It's possible to define criteria to decide what gets captured - as the CLI points out, some of these criteria are processed in hardware while others are handled in software:
    Lab-6503E(config-mon-capture)#filter ?
      access-group  Filter access-list (hardware based)
      ethertype     Matching ethertype (software based)
      length        Matching L2-packet length (software based)
      mac-address   Matching mac-address (software-based)
      vlan          Filter vlan (hardware based)
Our test traffic consists of nearly a gig of junk run alongside a small 1-per-second ping, which we will decide is "interesting" to us and we want to capture. This traffic profile makes it easy to test ACL, MAC address and length filters and their relative performances.
ACL Filter
As the CLI says, ACL filters are applied in hardware so the junk is discarded at source before it hits the CPU. In this example I set up an ACL as follows:
    Lab-6503E(config)#ip access-list extended icmp-only
    Lab-6503E(config-ext-nacl)#permit icmp any any
    Lab-6503E(config-ext-nacl)#deny ip any any
... and applied it to my capture as follows:
    Lab-6503E(config)#monitor session 1
    Lab-6503E(config-mon-capture)#filter access-group icmp-only
Now, I re-ran the "worst case" test, with massively different results:
    Lab-6503E#monitor capture circular buffer size 65535 start for 60 sec
    *Jul 11 14:49:00.345: %SPAN-5-PKTCAP_START: Packet capture session 1 started
    *Jul 11 14:50:00.337: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 60 packets captured
    Lab-6503E#show proc cpu hist
    [CPU history graph, "CPU% per second (last 60 seconds)": one 5-second bar reaches the 10% row, nothing higher]
Two things to note here - because this is a hardware filter, applied in the ASICs:
- Only 1 packet per second was punted to the CPU, resulting in essentially no hit at all
- All 60 of the ping packets were received and nothing else
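For anyone less familiar with how the two-line ACL selects traffic, the sketch below models the usual first-match evaluation in Python. It is a deliberate simplification: it only looks at the protocol field (source and destination are "any" in this ACL), `acl_action` is a hypothetical helper, and treating "permit" as "captured" is an assumption based on the results above, where only the pings ended up in the buffer.

```python
from typing import List, Tuple

# (action, protocol) pairs mirroring the icmp-only ACL above.
# "ip" matches every IP protocol; the any/any address fields are omitted.
ICMP_ONLY: List[Tuple[str, str]] = [
    ("permit", "icmp"),
    ("deny", "ip"),
]


def acl_action(acl: List[Tuple[str, str]], protocol: str) -> str:
    """Return the action of the first entry that matches `protocol`."""
    for action, proto in acl:
        if proto == "ip" or proto == protocol:
            return action
    return "deny"  # implicit deny at the end of every ACL


if __name__ == "__main__":
    for proto in ("icmp", "tcp", "udp"):
        print(proto, acl_action(ICMP_ONLY, proto))
```

The ping traffic matches the first entry and is permitted. Everything else falls through to the second entry and is denied, so it never makes it to the CPU.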
MAC Filter
In contrast, the MAC filter runs in software. It's also pretty limited, only matching on source MAC. I repeated the above test but instead of an ACL match, applied a MAC filter as follows:
    Lab-6503E(config)#monitor session 1
    Lab-6503E(config-mon-capture)#filter mac-address 0011.2233.4455
This took us more-or-less back to the worst case:
    Lab-6503E#monitor capture circular buffer size 65535 start for 60 sec
    *Jul 11 15:05:55.197: %SPAN-5-PKTCAP_START: Packet capture session 1 started
    *Jul 11 15:06:55.189: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 5 packets captured
    Lab-6503E#show proc cpu hist
    [CPU history graph, "CPU% per second (last 60 seconds)": sustained at roughly 40% for almost the whole minute]
Yuck. I wouldn't do that in production. This is basically because the software filters are applied after the packets have been punted to the CPU, so you pay a penalty for the garbage as well as the good stuff. You'll notice that it only captured 5 packets as well. More on this later, but it's another side effect of software filters.
Length Filter
The frame length filter is another software-based mechanism, which means it's pretty terrible under load, too. Our junk traffic consists of large frames, our interesting traffic is small, so let's configure the capture to only catch the short frames:
    Lab-6503E(config)#monitor session 1
    Lab-6503E(config-mon-capture)#filter length 0 100
Again, the output is pretty miserable:
    Lab-6503E#monitor capture circular buffer size 65535 start for 60 sec
    *Jul 11 15:15:12.145: %SPAN-5-PKTCAP_START: Packet capture session 1 started
    *Jul 11 15:16:12.137: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 17 packets captured
    Lab-6503E#show proc cpu hist
    [CPU history graph, "CPU% per second (last 60 seconds)": 39-42% for almost the whole minute]
Again, the CPU took a hammering and we only captured a few of the ping packets - 17 out of 60.
Quirks / Order of Operations
Now you may think that software filters might be OK if we just reduce the rate-limit configured on the capture:
    Lab-6503E(config)#monitor session 1
    Lab-6503E(config-mon-capture)#rate-limit 100
This *does* do what we want for the CPU load - here's an example with a MAC filter:
    Lab-6503E#monitor capture circular buffer size 65535 start for 60 sec
    *Jul 11 15:26:43.793: %SPAN-5-PKTCAP_START: Packet capture session 1 started
    *Jul 11 15:27:43.785: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 0 packets captured
    Lab-6503E#show proc cpu hist
    [CPU history graph, "CPU% per second (last 60 seconds)": one 5-second bar reaches the 10% row, nothing higher]
Great - nothing to see here. But also nothing to see in the capture buffer - 0 packets captured!
Just for fun let's try the same with a hardware ACL filter:
    *Jul 11 15:25:26.921: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 60 packets captured
Why is this? Well, it's the order of operations. Basically the flow for hardware ACL filters is:
So the filter throws out the junk before the rate limiter, meaning that the rate limiter only counts the good stuff. If the good stuff exceeds the rate limit then you'll lose some of it but the junk doesn't count.
Compare that to the flow for the software filters:
The software filters are applied after the rate limiter, so clearly when the rate limit is exceeded you throw out a mix of good and bad traffic, then pick out what's left of the good. If your traffic is overwhelmingly garbage, you may not get any of the good stuff at all!
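The two orderings can be compared with a toy model in Python. This is a sketch under stated assumptions, not a description of the real limiter: it assumes the rate limiter drops frames without regard to their content, it works with expected values rather than individual packets, and the junk rate of 80,000 fps is an assumed round figure for a gigabit of 1500 byte frames. Both function names are hypothetical.

```python
def captured_with_hardware_filter(good_fps, junk_fps, rate_limit_fps, seconds):
    """Filter first, then rate limit: junk never reaches the limiter.

    junk_fps is accepted only for symmetry; the ACL discards it in hardware.
    """
    after_filter = good_fps
    after_limiter = min(after_filter, rate_limit_fps)
    return after_limiter * seconds


def captured_with_software_filter(good_fps, junk_fps, rate_limit_fps, seconds):
    """Rate limit first, then filter: good and junk compete for the limiter."""
    offered = good_fps + junk_fps
    # Assumption: drops are content-blind, so good frames survive in
    # proportion to the share of offered traffic the limiter lets through.
    admitted_share = min(1.0, rate_limit_fps / offered)
    return good_fps * admitted_share * seconds


if __name__ == "__main__":
    GOOD, JUNK, LIMIT, SECS = 1, 80_000, 100, 60  # JUNK is an assumed figure
    print(captured_with_hardware_filter(GOOD, JUNK, LIMIT, SECS))
    print(round(captured_with_software_filter(GOOD, JUNK, LIMIT, SECS), 2))
```

With a rate limit of 100 fps the hardware path keeps all 60 pings, while the software path expects well under one packet in the whole minute, which is consistent with the 0 packets captured above. The model is too crude to reproduce the 5 and 17 packet counts from the earlier software-filter runs, so treat it as an illustration of the ordering only.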
Summary - Play it Safe
So in answer to the question "is it safe to run a local capture on a production 6500" - packet capture on even a relatively modest SUP32-3B supervisor is pretty safe provided you are cautious. If you want to do this in a busy production environment then my message to you is:
- Use ACLs where at all possible
- Set the rate limit to a sensible value (the default 10,000 fps is fine for most cases)
- Use linear buffers of a sensible size (do you really need 64MB of capture?)
- Limit the frame count or capture duration at first (it may turn out there is a lot more "interesting" traffic than you thought!)