Saturday 23 January 2016

Cisco Nexus Output Errors

A little while ago I was asked to investigate an IP-based storage problem which had been traced back to a large number of output errors on the port facing a particular compute node. The port was on a Cisco Nexus 5000 series device and I could see that, while output errors were clocking up at a massive rate, the switch was giving me nothing to go on as to what kind of errors they were. Every one of the usual suspects (collisions, etc.) on the port showed nothing, and yet the output errors kept clocking up.

The ultimate answer turned out to be related to the fact that the Nexus 5k aims for low latency and as such performs cut-through switching. If you're not familiar with this term, please refer to this reasonably decent Cisco explanation. At a high level, however, there are two possible modes of transmission in switched networks:

1 - Store and Forward, where the entire frame is buffered into memory, the FCS is validated and then the frame is passed on. This mode can handle ports of differing speeds but obviously for large frames the serialisation delay becomes significant.
2 - Cut-through, where just the header is checked for source / destination, plus any fields required for QoS / ACLs, then the rest of the frame is "cut through" onto the appropriate output port without buffering. This requires ports of an identical speed but offers lower latency.
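To put rough numbers on the latency difference, here is a back-of-the-envelope sketch. It only models how long the switch has to wait for bytes to arrive before it can start transmitting, and ignores lookup time, queueing and everything else. The 10 Gb/s link speed and the 64-byte "header" figure are my own assumptions for illustration, not values taken from the Nexus documentation.

```python
LINK_BPS = 10e9      # assumption: 10 Gb/s ports
HEADER_BYTES = 64    # assumption: bytes a cut-through switch reads before forwarding


def serialisation_delay_us(num_bytes: int, link_bps: float = LINK_BPS) -> float:
    """Time in microseconds for num_bytes to arrive off a link of the given speed."""
    return num_bytes * 8 / link_bps * 1e6


def store_and_forward_wait_us(frame_bytes: int, link_bps: float = LINK_BPS) -> float:
    """Store and forward must receive the whole frame before it can send any of it."""
    return serialisation_delay_us(frame_bytes, link_bps)


def cut_through_wait_us(frame_bytes: int, link_bps: float = LINK_BPS,
                        header_bytes: int = HEADER_BYTES) -> float:
    """Cut-through only waits for the header, however long the frame is."""
    return serialisation_delay_us(min(frame_bytes, header_bytes), link_bps)


if __name__ == "__main__":
    for size in (64, 1500, 9000):
        print(f"{size:>5} bytes: store and forward waits "
              f"{store_and_forward_wait_us(size):.4f} us, "
              f"cut-through waits {cut_through_wait_us(size):.4f} us")
```

Under these assumptions the store and forward wait grows with frame size, while the cut-through wait stays flat once the frame is longer than the header, which is why jumbo frames make the difference most visible.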

One of the not-immediately-obvious side effects of cut-through switching is that the FCS, which sits at the very end of the frame, can only be validated once the frame has already been passed on, by which point it is too late to take any corrective action. Essentially, the forwarding switch has already passed a broken frame on and, although it knows this, it can do nothing about it in retrospect, so it just says "oh, well" and increments its error counters on the ingress and egress ports.
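The behaviour is easier to see in a toy model. The sketch below is emphatically not how the Nexus hardware works internally: ToySwitch, add_fcs and fcs_ok are names I have made up, and the FCS is just a CRC-32 from Python's zlib appended to the payload. It only illustrates the order of events: in cut-through mode the frame is already out of the door before the checksum can be tested.

```python
import zlib
from collections import Counter


def add_fcs(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the payload (a stand-in for the Ethernet FCS)."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def fcs_ok(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the trailing 4 bytes."""
    payload, fcs = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "little") == fcs


class ToySwitch:
    """Hypothetical two-mode switch that only tracks error counters and deliveries."""

    def __init__(self, cut_through: bool):
        self.cut_through = cut_through
        self.input_errors = Counter()   # per-port input error count
        self.output_errors = Counter()  # per-port output error count
        self.delivered = []             # (egress port, frame) actually sent

    def forward(self, ingress: str, egress: str, frame: bytes) -> None:
        if self.cut_through:
            # The frame is sent before its FCS has been seen...
            self.delivered.append((egress, frame))
            # ...so a bad FCS can only be counted, on both ports, after the fact.
            if not fcs_ok(frame):
                self.input_errors[ingress] += 1
                self.output_errors[egress] += 1
        else:
            # Store and forward checks first and drops a bad frame at ingress.
            if not fcs_ok(frame):
                self.input_errors[ingress] += 1
                return
            self.delivered.append((egress, frame))


if __name__ == "__main__":
    good = add_fcs(b"storage traffic")
    bad = bytes([good[0] ^ 0x01]) + good[1:]  # flip one bit, as a fibre fault might

    for mode in (True, False):
        sw = ToySwitch(cut_through=mode)
        sw.forward("Eth1/1", "Eth1/2", bad)
        print("cut-through" if mode else "store and forward",
              dict(sw.input_errors), dict(sw.output_errors), len(sw.delivered))
```

In the cut-through case the corrupted frame is delivered and both counters go up; in the store and forward case only the ingress counter moves and nothing reaches the egress port.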

If you are seeing output errors on a port with no other real explanation of how they got there, check other ports of the same speed for input errors. In my case it was due to a fibre fault: corrupted frames were entering one port, being cut through to another and causing errors to clock up on both.
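If you have a lot of ports to wade through, the check can be written down as a small filter. This is only a sketch: PortCounters and suspect_ingress_ports are names I have invented, and the counter values are made-up examples that you would replace with figures copied from your own switch.

```python
from dataclasses import dataclass


@dataclass
class PortCounters:
    """Counters for one port, entered by hand (hypothetical structure)."""
    name: str
    speed_gbps: int
    input_errors: int
    output_errors: int


def suspect_ingress_ports(ports, victim_name):
    """Return same-speed ports with input errors, worst first.

    These are the candidates for where corrupted frames are entering
    before being cut through to the victim port.
    """
    victim = next(p for p in ports if p.name == victim_name)
    candidates = [
        p for p in ports
        if p.name != victim.name
        and p.speed_gbps == victim.speed_gbps
        and p.input_errors > 0
    ]
    return sorted(candidates, key=lambda p: p.input_errors, reverse=True)


if __name__ == "__main__":
    # Made-up example values, not output from a real device.
    ports = [
        PortCounters("Eth1/1", 10, 0, 48213),  # the port with unexplained output errors
        PortCounters("Eth1/2", 10, 0, 0),
        PortCounters("Eth1/7", 10, 51990, 0),  # likely entry point for the bad frames
        PortCounters("Eth1/20", 1, 12, 0),     # different speed, so not a candidate
    ]
    for p in suspect_ingress_ports(ports, "Eth1/1"):
        print(p.name, p.input_errors)
```

Whatever turns up at the top of that list is where I would start checking optics and cabling.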
