Tag Archives: cisco

On Why I’m Shifting my Career Focus to Software

For the past few months I’ve been involved in a case study project with some colleagues at Cisco where we’ve been researching what the most relevant software skills are that Cisco’s pre-sales engineers could benefit from. We’re all freaking experts at Outlook of course (that’s a joke 🤬) but we were interested in the areas of programming, automation, orchestration, databases, analytics, and so on. The end goal of the project was to identify what those relevant skills are, have a plan to identify the current skillset in the field, do that gap analysis and then put forward recommendations on how to close the gap.

This probably sounds really boring and dry, and I don’t blame you for thinking that, but I actually chose this case study topic from a list of 8 or so. My motivation was largely selfish: I wanted to see first-hand the outcome of this project because I wanted to know how best to align my own training, study, and career in the software arena. I already believed that software skills would be essential to staying relevant as my career moves along. It was just a question of what type of skills and in which specific areas.


The Anatomy of a Cisco Spark Bot

I spent a long time creating my first Spark bot, Zpark. The first commit was in August and the first release was posted in January. So, six months elapsed time. It’s also over-engineered. I mean, all it does is post messages back and forth between a back-end system and some Spark spaces and I ended up with something so complex that I had to draw a damn block diagram in the user guide to give people a fighting chance at comprehending how it works.

Its internals could’ve been much simpler. But that was part of the point of creating the bot: examining the proper architecture for a scalable application, learning about new technologies for building my own API, learning about message brokers, pulling my hair out over git’s eccentricities and ultimately, having enough material to write this blog post.

In this post I’m going to break down the different functional components of Zpark, discuss what each does, and explain why (or why not) that component is necessary. If I can achieve one goal, it will be to retire to a tropical island ASAP. If I can achieve a second goal, it will be to give aspiring bot creators (like yourself, presumably) a strong mental model of a Spark bot to aid their development.


Explain Cisco ETA to Me in a Way That Even My Neighbor Can Understand It

Cisco Encrypted Traffic Analytics (ETA) sounds just a little bit like magic the first time you hear about it. Cisco is basically proposing that when you turn on ETA, your network can (magically!) detect malicious traffic (ie, malware, trojans, ransomware, etc) inside encrypted flows. Further, Cisco proposes that ETA can differentiate legitimate encrypted traffic from malicious encrypted traffic.

Uhmm, how?

The immediate mental model that springs to mind is that of a web proxy that intercepts HTTP traffic. In order to intercept TLS-encrypted HTTPS traffic, there’s a complicated dance that has to happen around building a Certificate Authority, distributing the CA’s public certificate to every device that will connect through the proxy and then actually configuring the endpoints and/or network to push the HTTPS traffic to the proxy. This is often referred to as “man-in-the-middle” (MiTM) because the proxy actually breaks into the encrypted session between the client and the server. In the end, the proxy has access to the clear-text communication.

Is ETA using a similar method and breaking into the encrypted session?

In this article, I’m going to use an analogy to describe how ETA does what it does. Afterwards, you should feel more comfortable about how ETA works and not be worried about any magic taking place in your network. 🧙


Say Hello to Zpark, my Cisco Spark Bot

For a long while now I’ve been brainstorming how I could leverage the API that’s present in the Cisco Spark collaboration platform to create a bot. There are lots of goofy and fun examples of bots (ie, Gifbot) that I might be able to draw inspiration from, but I wanted to create something that would provide high value to myself and anyone else who chooses to download and use it. The idea finally hit me after I started using Zabbix for system monitoring. Since Zabbix also has a feature-rich API, all the pieces were in place to create a bot that would act as a bit of middleware between Zabbix and Spark. I call the bot: Zpark.


Lifting the Hood on Cisco Software Defined Access

If you’re an IT professional and you have at least a minimal awareness of what Cisco is doing in the market and you don’t live under a rock, you would’ve heard about the major launch that took place in June: “The network. Intuitive.” The anchor solution to this launch is Cisco’s Software Defined Access (SDA) in which the campus network becomes automated, highly secure, and highly scalable.

The launch of SDA is what’s called a “Tier 1” launch where Cisco’s corporate marketing muscle is fully exercised in order to generate as much attention and interest as possible. As a result, there’s a lot of good high-level material floating around right now around SDA. What I’m going to do in this post is lift the hood on the solution and explain what makes the SDA network fabric actually work.


Troubleshooting Cisco Network Elements with the USE Method

I want to draw some attention to a new document I’ve written titled “Troubleshooting Cisco Network Elements with the USE Method”. In it, I explain how I’ve taken a model for troubleshooting a complex system (the USE Method, by Brendan Gregg) and applied it to Cisco network devices. By applying the USE Method, a network engineer can methodically troubleshoot a network element (NE) to determine why it is not performing or functioning as it should.

I ask that if you’re familiar with a given Cisco network platform (or platforms), that you please contribute commands that would also fit into the USE Method! My list is just a start and I welcome contributions from others in order to make it a stronger, more valuable reference.

Please check out the guide: Troubleshooting Cisco Network Elements with the USE Method

Troubleshooting Cisco Network Elements with the USE Method

The USE Method is a model for troubleshooting a system that is in distress when you don’t know exactly what the nature of the problem is.

For example, if users within a specific part of your network are complaining of slowness, disconnects and poor application performance, you can probably isolate your troubleshooting to 2-3 switches or routers. However, since the problem description is so vague (we all love the “it’s slow!” report, right? 🙄), it’s hard to know where to start with detailed troubleshooting on those specific switches/routers.

That’s where the USE Method comes in.

I learned about the USE method while reading Brendan Gregg’s blog (http://www.brendangregg.com/usemethod.html). Brendan is a very skilled performance engineer specializing in UNIX systems.  To quote Brendan:

The USE Method can be summarized as: For every resource, check utilization, saturation, and errors.

  • Resource: all physical [network element] functional components (eg, CPU, memory)
  • Utilization: the average time that the resource was busy servicing work
  • Saturation: the degree to which the resource has extra work which it can’t service, often queued
  • Errors: the count of error events

In this post, I adapt the USE Method to Cisco network devices and show how their physical resources (CPUs, different areas and types of memory, interfaces, and more) can be methodically examined in the three dimensions of utilization, saturation, and errors. Be sure to read Brendan’s blog post to understand the logic behind the USE Method and to gain insight into how to apply it.
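The "for every resource, check utilization, saturation, and errors" rule can be pictured as a simple checklist walk. The following is a minimal sketch, not part of the original guide: the `Resource` class and `checklist` function are hypothetical names made up for illustration, and the commands are copied from the router table further down. An empty entry marks a gap that still needs a command.

```python
from dataclasses import dataclass, field

# The three dimensions the USE Method checks for every resource.
DIMENSIONS = ("utilization", "saturation", "errors")


@dataclass
class Resource:
    """Hypothetical container: one physical resource and its known checks."""
    name: str
    checks: dict = field(default_factory=dict)  # dimension -> list of CLI commands


def checklist(resources):
    """Yield (resource, dimension, commands) for every resource/dimension pair.

    An empty command list means no check is known yet for that pair.
    """
    for resource in resources:
        for dimension in DIMENSIONS:
            yield resource.name, dimension, resource.checks.get(dimension, [])


# Commands taken from the router table in this post.
ROUTER = [
    Resource("Control Plane CPU", {
        "utilization": ["show proc cpu", "show proc cpu history"],
        "saturation": ["show proc cpu extended", "show logg | inc CPUHOG"],
    }),
    Resource("Control Plane Memory", {
        "utilization": ["show proc memory", "show proc memory sorted"],
        "saturation": ["show buffers", "show buffers failures", "show logg"],
        "errors": ["show logg"],
    }),
]

for name, dimension, commands in checklist(ROUTER):
    print(f"{name:22} {dimension:12} {', '.join(commands) or '?'}")
```

Walking the list in a fixed order is the point: every resource gets all three questions asked of it, and the "?" rows make the unknowns visible instead of silently skipped.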

This is a living document and will be updated over time. Please get in touch and share your own methods on these and other platforms! (via the comments below or the contact page)


Routers (ISRg2, ISR 4000, ASR 1000, CSR 1000V) running IOS or IOS-XE

 * the different colors denote the grouping of physical components
Component / Type / Metric

Control Plane CPU / Utilization
  • show proc cpu: “CPU utilization for five seconds: 5%/1%; one minute: 4%; five minutes: 2%”
      • General CPU busyness
      • 5%/1%: 5% total, 1% time spent in interrupt context (CEF switching)
  • show proc cpu history: histograms; where was the CPU X seconds/minutes ago? What was peak vs average?

Control Plane CPU / Saturation
  • show proc cpu extended: run queue lengths, response times
  • show logg | inc CPUHOG: is a process sitting on the CPU too long, not willingly giving it up?

Control Plane CPU / Errors
  • ?

Control Plane Memory / Utilization
  • show proc memory: used/free
      • “Processor” memory is for IOS, processes, etc
      • “I/O” memory is for storing packets while they’re being switched through the box
  • show proc memory sorted: process hogging memory? process memory leak?

Control Plane Memory / Saturation
  • show buffers, show buffers failures: allocation failures, “no memory” failures
  • show logg: malloc errors

Control Plane Memory / Errors
  • show logg
  • Execute Generic Online Diagnostics (GOLD) tests:
      • diagnostic start … (run memory test)
      • show diagnostic events
      • show diagnostic result …

Data Plane (ISR 4k, ASR 1k, CSR1000V) / Utilization
  • show platform hardware qfp active datapath utilization: the aggregate in/out data plane utilization

Data Plane (ISR 4k, ASR 1k, CSR1000V) / Errors
  • show platform hardware qfp active statistics drop detail: packet drop reasons and counters
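
If you collect the "show proc cpu" summary line over time, the total/interrupt split is easy to pull apart in a script. Here is a minimal sketch that parses exactly the sample line quoted in the table above; the function name and dictionary keys are my own invention, and the process-level figure is derived by subtracting the interrupt share from the total. Output wording can vary across platforms and releases, so treat the pattern as an assumption to verify against your own devices.

```python
import re

# Pattern for the summary line as quoted in this post. Other releases may
# word it differently, so this regex is an assumption, not a guarantee.
CPU_LINE = re.compile(
    r"CPU utilization for five seconds: (\d+)%/(\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)


def parse_cpu_line(line):
    """Return the CPU percentages from the summary line, or None if no match."""
    match = CPU_LINE.search(line)
    if match is None:
        return None
    total, interrupt, one_min, five_min = map(int, match.groups())
    return {
        "five_sec_total": total,            # first number: total CPU
        "five_sec_interrupt": interrupt,    # second number: interrupt context
        "five_sec_process": total - interrupt,  # derived: process-level share
        "one_min": one_min,
        "five_min": five_min,
    }


sample = "CPU utilization for five seconds: 5%/1%; one minute: 4%; five minutes: 2%"
print(parse_cpu_line(sample))
```

A large interrupt share relative to the total points at traffic being switched by the CPU, while a large process share points at a control plane process, which is why it is worth keeping the two numbers separate.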

 UADP-Based Catalyst Switches (3650, 3850, 4500E Sup8E) running IOS-XE

  * the different colors denote the grouping of physical components
Component / Type / Metric

Control Plane CPU (IOS-XE/Linux processes) / Utilization
  • show processes cpu detailed
  • show processes cpu detailed | exclude 0.00 (processes with non-zero CPU utilization)

Control Plane CPU (iosd threads) / Utilization
  • show processes cpu detailed process iosd sorted

Control Plane CPU / Saturation
  • show platform punt statistics port-asic 0 cpuq -1 direction rx
      • Number of port-asics depends on platform type and model
      • “cpuq -1” will list all queues; if you know the specific queue you want to view, substitute its value
      • Look at “dropped” counters
      • Look for high packet rate
      [Figure: CPU Punt Path Architecture on UADP-Based Switches (Cisco Live BRKCRS-3146)]
  • show platform punt client
      • Look for high number of packets in a queue over multiple runs of the command
      • Look for incrementing counters in the “failures” columns
  • show pds tag all | include Active|Tags|<queue#> (reveals some stats and the name of the queue)
      [Figure: Decoding CPU Queues on UADP-Based Switches (Cisco Live BRKCRS-3146)]
  • show platform punt tx
  • show logg | inc CPUHOG: is a process sitting on the CPU too long, not willingly giving it up?

Control Plane CPU / Errors
  • ?

Control Plane Memory (IOS-XE/Linux processes) / Utilization
  • show processes memory sorted (sorts by RSS, descending)

Control Plane Memory (iosd process memory) / Utilization
  • show processes memory detailed process iosd sorted

Control Plane Memory / Saturation
  • show buffers: allocation failures, “no memory” failures
  • show logg: malloc errors

Control Plane Memory / Errors
  • show logg
  • Execute Generic Online Diagnostics (GOLD) tests:
      • diagnostic start … (run memory test)
      • show diagnostic events
      • show diagnostic result …

Data Plane TCAM / Utilization
  • show platform tcam utilization asic all

Data Plane TCAM / Saturation
  • show logg (look for messages indicating TCAM is full: MAC addresses can’t be learned; ACEs cannot be installed in hardware)

Data Plane TCAM / Errors
  • ?
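
Several of the checks above come down to "run the command more than once and see which counters moved". A minimal sketch of that comparison follows, assuming you have already extracted the counters from two runs into plain dictionaries. The counter names and values below are hypothetical placeholders, not real output from any of the commands in the table.

```python
def incrementing(before, after):
    """Return {counter: delta} for counters that grew between two snapshots.

    Counters present in only one snapshot are ignored.
    """
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }


# Hypothetical counter names and values, for illustration only.
run1 = {"queue 5 dropped": 0, "queue 12 dropped": 40, "queue 20 dropped": 7}
run2 = {"queue 5 dropped": 0, "queue 12 dropped": 95, "queue 20 dropped": 7}

print(incrementing(run1, run2))
```

A counter that is non-zero but static (like the third one here) tells you about the past; a counter that is still climbing tells you the problem is happening right now, which is usually the one to chase first.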

Catalyst 6500/6800 Series

  * the different colors denote the grouping of physical components
Component / Type / Metric

Control Plane CPU / Utilization
  • show proc cpu: “CPU utilization for five seconds: 5%/1%; one minute: 4%; five minutes: 2%”
      • General CPU busyness
      • 5%/1%: 5% total, 1% time spent in interrupt context (CEF switching)
  • show proc cpu history: histograms; where was the CPU X seconds/minutes ago? What was peak vs average?

Data Plane (Switch Fabric) / Utilization
  • show fabric utilization all

Data Plane (Switch Fabric) / Errors
  • show fabric channel-counters

Data Plane (TCAM) / Utilization
  • show platform hardware capacity pfc
  • show tcam counts

Data Plane (TCAM) / Saturation
  • show platform hardware capacity pfc
  • show tcam counts

L3 vPC Support on Nexus 5k

So… I’m a little embarrassed to admit this but I only very recently found out that there are significant differences in how Virtual Port Channels (vPC) behave on the Nexus 5k vs the Nexus 7k when it comes to forming routing adjacencies over the vPC.

Take the title literally!

I’ve read the vPC Best Practice whitepaper and have often referred others to it and also referred back to it myself from time to time. What I failed to realize is that I should’ve been taking the title of this paper more literally: it is 100% specific to the Nexus 7k. The behaviors the paper describes, particularly around the data plane loop prevention protections for packets crossing the vPC peer-link, are specific to the n7k and are not necessarily repeated on the n5k.


Cisco DevNet Scavenger Hunt at GSX 17

At Cisco’s GSX conference at the start of FY17, the DevNet team made a programming scavenger hunt by posting daily challenges that required using things like containers, Cisco Shipped, Python, and RESTful APIs in Cisco software in order to solve puzzles. In order to submit an answer, the team created an API that contestants had to use (in effect creating another challenge that contestants had to solve).

This post contains the artifacts I created while solving some of the challenges.


NSF and GR on Nexus 5000

NSF and GR are two features in Layer 3 network elements (NEs) that allow two adjacent elements to work together when one of them undergoes a control plane switchover or control plane restart.

The benefit is that when a control plane switchover/restart occurs, the impact to network traffic is kept to a minimum and in most cases, to zero.
