BRKDCT-2333 -- Data Center Network Failure Detection
Presenter: Arkadiy Shapiro, Manager Technical Marketing (Nexus 2000 - 7000) @ArkadiyShapiro
"You could say I'm obsessed with BFD" -Arkadiy
Fast failure detection is the key to fast convergence.
Routing convergence steps:
- Detect
- Propagate (tell my neighbors)
- Process (routing recalculation: SPF for OSPF/IS-IS, DUAL for EIGRP, etc)
- Update (update RIB/FIB, program hardware tables)
Failure detection tools are a layered approach: Layer 1, Layer 2, MPLS, Layer 3, and the application.
Interconnect options:
- Point to point - failure detection is really easy here; event driven; fast
- Layer 3 with Layer 1 (DWDM) bump in the wire
- Layer 3 with Layer 2 (ethernet) bump in the wire
- Layer 3 with Layer 3 (firewall/router) bump in the wire
Think about this: moving to higher speeds (1G -> 10G -> 40G and beyond) means more data is lost during an outage if the failure detection/reconvergence characteristics of the network stay the same. One second of reconvergence at 1G is very different from one second at 40G.
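Back-of-the-envelope math (my numbers, not the slides'): one second of blackholing at 1 Gb/s drops up to 10^9 bits, roughly 125 MB; at 40 Gb/s it is 4 x 10^10 bits, roughly 5 GB. Same one-second detection time, 40x the loss.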
Be aware: ISSU may not support aggressive timers on various protocols. Another reason to be wary of timer cranking.
Side note: about 1/3 of the room is running FabricPath
Layer 1:
- One-way failure on a fiber link: this used to require UDLD to detect. 1G, 10G, 40G and beyond no longer need it; the link-level protocol will drop Tx if the Rx side goes down (for example).
- Carrier delay - a timer running in software on the routing platforms. Filters link up/down notifications. This behavior is not desirable for fast convergence, so set it to zero (config sketch after this list).
- Debounce timer - delays the link-down notification; runs in firmware; standard switching platform feature; defaults to 100 msec on NX-OS
- Debounce is typically the one you're more likely to work with in the data center
- Good slide in the deck comparing carrier delay and debounce timer
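A minimal sketch of zeroing both timers, assuming an IOS routing platform for carrier delay and an NX-OS switch for debounce (interface names are illustrative):

    ! IOS routing platform: remove the software filter on link up/down events
    interface TenGigabitEthernet0/0/0
     carrier-delay msec 0

    ! NX-OS switch: disable the firmware debounce delay (default 100 msec)
    interface Ethernet1/1
      link debounce time 0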
Layer 2:
- Spanning Tree Bridge Assurance - turns STP from a unidirectional into a bidirectional protocol; fail closed rather than fail open; an absence of BPDUs causes the port to be blocked (BA-inconsistent state).
- LACP - not just for configuration consistency, but also for failure detection (LACPDUs are used as keepalives); also detects unidirectional links; capable of fast hellos, though the fast rate is not supported with ISSU (BA + LACP config sketch after this list)
- UDLD - original use cases: STP loop prevention (now handled by Rapid STP, BA); STP fast convergence (now handled by BA); etherchannel misconfiguration (now handled by LACP). "UDLD is nearly useless in the data center today"
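A hedged NX-OS sketch of both Layer 2 tools, with illustrative interface and channel numbers (Bridge Assurance itself is globally on by default in NX-OS; it takes effect on "network" ports):

    ! Bridge Assurance: BPDUs become bidirectional keepalives on this port
    interface Ethernet1/1
      spanning-tree port type network

    ! LACP with fast (1-second) hellos; remember: fast rate breaks ISSU
    feature lacp
    interface Ethernet1/2
      channel-group 10 mode active
      lacp rate fast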
How much do you really need UDLD?
- Layer 1 - handled by autonegotiation and event-driven failure detection
- Layer 2/soft failures - STP BA, RSTP
- Etherchannel misconfig/failure: LACP
- Chance of miswiring is small
- Layer 3 - point to point links; IGP hello timeouts
Link OAM
- IEEE 802.3ah
- Provides mechanisms for "monitoring link operation"
- Can continuously monitor link health (CRCs and so on) and take some action
- Not supported on Nexus today; available on the ASR 9000
Layer 3:
- Is Layer 3 failure detection tuning necessary? It depends.
- Needed when: there is an intermediate Layer 2 hop between the Layer 3 neighbors; there are concerns over software protocol failures; there are concerns over unidirectional failures
- May not be needed when: point-to-point physical L3 links with no concern about unidirectional failures; FHRPs are running in active/active mode (vPC, Anycast HSRP); there is enough software redundancy to account for protocol failures
- Cranking down L3 timers is not recommended: it makes configs complex (many protocols at the aggregation layer); it adds CPU load and is very dangerous; it is not supported by ISSU; and sub-second detection is hard to achieve this way (see the sketch below)
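For reference, this is the kind of IGP timer cranking being discouraged; a sketch in NX-OS syntax with an illustrative interface. OSPF hellos bottom out at 1 second, so sub-second detection is out of reach this way:

    interface Ethernet1/1
      ip ospf hello-interval 1   ! fastest allowed; steady CPU cost on every neighbor
      ip ospf dead-interval 3    ! still 3 full seconds to declare the neighbor down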
BFD:
- Bidirectional Forwarding Detection (BFD): lightweight, designed from the ground up for sub-second convergence; lets you run one detection protocol (BFD) that other protocols subscribe to (HSRP, OSPF, BGP, PIM, and so on); on NX-OS, supported with stateful restart, SSO, and ISSU; can run in hardware; runs in interrupt context; Nexus 2000 ports do not support BFD (basic config sketched below, after this list)
- BFD is offloaded to the line card CPU on N7K and N9500. In NX-OS 7.2, BFD will offload to the FSA hardware accelerator on the F3 line card, allowing even faster failure detection.
- Bundles: how do you test every single link in a port channel? BFD rides over UDP, so it gets hashed onto just a single member link. BFD Logical Mode: sprays the transmitted packets across all links in the bundle; runs a single BFD session per L3 link. BFD Per-link Mode: one BFD session per port-channel member; N5K/7K/9K; proprietary feature; Nexus-to-Nexus links only (today); a leader session on the Sup consolidates member states and communicates with clients (per-link sketch below)
- BFD for FabricPath: FP IS-IS as BFD client
- BFD for OTV
- BFD for static routes
- BFD multihop - for when BFD peers are not L2 adjacent; not on Nexus today; the workaround is to use IP SLA and hook it to policy-based routing rules; PBR is done in hardware on N7K (tracking sketch below)
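A minimal NX-OS sketch of the one-detection-protocol idea: a BFD session on the link with OSPF subscribing as a client. The interface, timers, and process tag are illustrative:

    feature bfd
    feature ospf
    interface Ethernet1/1
      no ip redirects                            ! NX-OS requires this on BFD interfaces
      bfd interval 250 min_rx 250 multiplier 3   ! 250 ms x 3 = ~750 ms detection
    router ospf 1
      bfd                                        ! OSPF registers as a BFD client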
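Per-link mode, sketched on a port channel so every member link gets its own session (Nexus-to-Nexus only; numbers are illustrative):

    interface port-channel10
      bfd per-link
      ip ospf bfd        ! per-interface way to subscribe OSPF to BFD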
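For the multihop gap, a sketch of the IP SLA piece: an ICMP probe plus a track object. The session mentioned hooking this to PBR; a tracked static route is shown here instead for brevity. Addresses and object numbers are made up:

    feature sla sender
    ip sla 10
      icmp-echo 192.0.2.1
      frequency 5
    ip sla schedule 10 life forever start-time now
    track 1 ip sla 10 reachability
    ip route 10.10.10.0/24 192.0.2.1 track 1   ! withdrawn when the probe fails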
Fabric Extenders
- "Satellite Discovery Protocol" runs between parent switch and FEX
- Doesn't run BFD
If one protocol can do the job, then one protocol might be all you need! (think: BFD). Keep it simple.