Troubleshooting Cisco Network Elements with the USE Method

The USE Method is a model for troubleshooting a system that is in distress when you don’t know the exact nature of the problem.

For example, if users within a specific part of your network are complaining of slowness, disconnects and poor application performance, you can probably isolate your troubleshooting to 2-3 switches or routers. However, since the problem description is so vague (we all love the “it’s slow!” report, right? 🙄), it’s hard to know where to start with detailed troubleshooting on those specific switches/routers.

That’s where the USE Method comes in.

I learned about the USE Method while reading Brendan Gregg’s blog (http://www.brendangregg.com/usemethod.html). Brendan is a very skilled performance engineer specializing in UNIX systems. To quote Brendan:

The USE Method can be summarized as: For every resource, check utilization, saturation, and errors.

  • Resource: all physical [network element] functional components (eg, CPU, memory)
  • Utilization: the average time that the resource was busy servicing work
  • Saturation: the degree to which the resource has extra work which it can’t service, often queued
  • Errors: the count of error events
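
As a rough illustration of “for every resource, check utilization, saturation, and errors”, the checklist can be generated as the cross product of resources and dimensions. This is only a sketch: the resource names in RESOURCES and the function name use_checklist are made up for the example and are not part of Brendan’s method.

```python
from itertools import product

# Hypothetical resource names; the USE Method only says "every resource".
RESOURCES = ["control plane CPU", "control plane memory", "data plane"]
DIMENSIONS = ["utilization", "saturation", "errors"]


def use_checklist(resources, dimensions=DIMENSIONS):
    """Return one (resource, dimension) pair per check, grouped by resource."""
    return list(product(resources, dimensions))


for resource, dimension in use_checklist(RESOURCES):
    print(f"[ ] {resource}: {dimension}")
```

Working through the pairs in order keeps you from skipping a dimension for a resource just because another one looked healthy.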

In this post, I adapt the USE Method to Cisco network devices and show how their physical resources (CPUs, different areas and types of memory, interfaces, and more) can be methodically examined in the three dimensions of utilization, saturation, and errors. Be sure to read Brendan’s blog post to understand the logic behind the USE Method and to gain insight into how to apply it.

This is a living document and will be updated over time. Please get in touch and share your own methods on these and other platforms! (via the comments below or the contact page)

Routers (ISR G2, ISR 4000, ASR 1000, CSR 1000V) running IOS or IOS-XE

Component / Type, then the metrics to check:

Control Plane CPU / Utilization
  • show proc cpu: “CPU utilization for five seconds: 5%/1%; one minute: 4%; five minutes: 2%”
      • General CPU busyness
      • 5%/1%: 5% total, 1% time spent in interrupt context (CEF switching)
  • show proc cpu history: histograms; where was the CPU X seconds/minutes ago? What was the peak vs. the average?

Control Plane CPU / Saturation
  • show proc cpu extended: run queue lengths, response times
  • show logg | inc CPUHOG: is a process sitting on the CPU too long, not willingly giving it up?

Control Plane CPU / Errors
  • ?

Control Plane Memory / Utilization
  • show proc memory: used/free
      • “Processor” memory is for IOS, processes, etc.
      • “I/O” memory is for storing packets while they’re being switched through the box
  • show proc memory sorted: is a process hogging memory? Is a process leaking memory?

Control Plane Memory / Saturation
  • show buffers, show buffers failures: allocation failures, “no memory” failures
  • show logg: malloc errors

Control Plane Memory / Errors
  • show logg
  • Execute Generic Online Diagnostics (GOLD) tests:
      • diagnostic start … (run memory test)
      • show diagnostic events
      • show diagnostic result …

Data Plane (ISR 4k, ASR 1k, CSR1000V) / Utilization
  • show platform hardware qfp active datapath utilization: the aggregate in/out data plane utilization

Data Plane (ISR 4k, ASR 1k, CSR1000V) / Errors
  • show platform hardware qfp active statistics drop detail: packet drop reasons and counters
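
If you collect the show proc cpu summary line from several routers, a small script can split it into its parts. The sketch below parses only the one line format quoted above. The function name parse_cpu_line and the dictionary keys are my own invention, and the “process-level” figure is an assumption derived from reading 5%/1% as total/interrupt.

```python
import re

# Matches the summary line quoted in this post; other IOS releases may differ.
CPU_LINE = re.compile(
    r"CPU utilization for five seconds: (\d+)%/(\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)


def parse_cpu_line(line):
    """Return the CPU percentages from the summary line, or None if absent."""
    match = CPU_LINE.search(line)
    if match is None:
        return None
    total, interrupt, one_min, five_min = map(int, match.groups())
    return {
        "five_sec_total": total,
        "five_sec_interrupt": interrupt,
        # Assumption: whatever is not interrupt time is process-level time.
        "five_sec_process": total - interrupt,
        "one_min": one_min,
        "five_min": five_min,
    }


sample = "CPU utilization for five seconds: 5%/1%; one minute: 4%; five minutes: 2%"
print(parse_cpu_line(sample))
```

A high interrupt share points toward traffic being switched by the CPU, while a high process share points toward a specific process, which is where show proc cpu sorted output becomes useful.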

UADP-Based Catalyst Switches (3650, 3850, 4500E Sup8E) running IOS-XE

Component / Type, then the metrics to check:

Control Plane CPU (IOS-XE/Linux processes) / Utilization
  • show processes cpu detailed
  • show processes cpu detailed | exclude 0.00 (processes with non-zero CPU utilization)

Control Plane CPU (iosd threads) / Utilization
  • show processes cpu detailed process iosd sorted

Control Plane CPU / Saturation
  • show platform punt statistics port-asic 0 cpuq -1 direction rx
      • The number of port-asics depends on platform type and model
      • “cpuq -1” lists all queues; if you know the specific queue you want to view, substitute its value
      • Look at the “dropped” counters
      • Look for a high packet rate
    [Figure: CPU Punt Path Architecture on UADP-Based Switches (Cisco Live BRKCRS-3146)]
  • show platform punt client
      • Look for a high number of packets in a queue over multiple runs of the command
      • Look for incrementing counters in the “failures” columns
  • show pds tag all | include Active|Tags|<queue#> (reveals some stats and the name of the queue)
    [Figure: Decoding CPU Queues on UADP-Based Switches (Cisco Live BRKCRS-3146)]
  • show platform punt tx
  • show logg | inc CPUHOG: is a process sitting on the CPU too long, not willingly giving it up?

Control Plane CPU / Errors
  • ?

Control Plane Memory (IOS-XE/Linux processes) / Utilization
  • show processes memory sorted (sorts by RSS, descending)

Control Plane Memory (iosd process memory) / Utilization
  • show processes memory detailed process iosd sorted

Control Plane Memory / Saturation
  • show buffers: allocation failures, “no memory” failures
  • show logg: malloc errors

Control Plane Memory / Errors
  • show logg
  • Execute Generic Online Diagnostics (GOLD) tests:
      • diagnostic start … (run memory test)
      • show diagnostic events
      • show diagnostic result …

Data Plane TCAM / Utilization
  • show platform tcam utilization asic all

Data Plane TCAM / Saturation
  • show logg (look for messages indicating TCAM is full: MAC addresses can’t be learned; ACEs cannot be installed in hardware)

Data Plane TCAM / Errors
  • ?
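
Several of the checks above come down to running a command more than once and seeing which counters moved (“dropped”, “failures”). Here is a minimal sketch of that comparison. It assumes you have already pulled the counters out of each run into a dictionary; the counter names in the sample data are hypothetical and do not reflect real command output.

```python
def incrementing_counters(before, after):
    """Return {name: delta} for counters that grew between two snapshots."""
    deltas = {}
    for name, new_value in after.items():
        # A counter missing from the first snapshot is treated as starting at 0.
        old_value = before.get(name, 0)
        if new_value > old_value:
            deltas[name] = new_value - old_value
    return deltas


# Hypothetical snapshots taken some seconds apart.
run1 = {"queue 7 dropped": 0, "queue 16 dropped": 120}
run2 = {"queue 7 dropped": 0, "queue 16 dropped": 185}
print(incrementing_counters(run1, run2))
```

A counter that is large but not moving is usually history; one that keeps incrementing between runs is the one to chase.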

Catalyst 6500/6800 Series

Component / Type, then the metrics to check:

Control Plane CPU / Utilization
  • show proc cpu: “CPU utilization for five seconds: 5%/1%; one minute: 4%; five minutes: 2%”
      • General CPU busyness
      • 5%/1%: 5% total, 1% time spent in interrupt context (CEF switching)
  • show proc cpu history: histograms; where was the CPU X seconds/minutes ago? What was the peak vs. the average?

Data Plane (Switch Fabric) / Utilization
  • show fabric utilization all

Data Plane (Switch Fabric) / Errors
  • show fabric channel-counters

Data Plane (TCAM) / Utilization
  • show platform hardware capacity pfc
  • show tcam counts

Data Plane (TCAM) / Saturation
  • show platform hardware capacity pfc
  • show tcam counts

Disclaimer: The opinions and information expressed in this blog article are my own and not necessarily those of Cisco Systems.