BRKDCT-2333 -- Data Center Network Failure Detection

Presenter: Arkadiy Shapiro, Manager Technical Marketing (Nexus 2000 - 7000) @ArkadiyShapiro

"You could say I'm obsessed with BFD" -Arkadiy

Fast failure detection is the key to fast convergence.

Routing convergence steps:

Detect
Propagate (tell my neighbors)
Process (routing recalc, SPF, DUAL, etc)
Update (update RIB/FIB, program hardware tables)
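
To make the four steps concrete, here is a minimal sketch that treats total convergence time as the sum of the steps. The class name and every number are made up for illustration (none of them come from the session); the point is only that when detection relies on timers, it tends to dominate the total.

```python
from dataclasses import dataclass


@dataclass
class ConvergenceBudget:
    """Time spent in each convergence step, in milliseconds (illustrative values only)."""
    detect_ms: float
    propagate_ms: float
    process_ms: float
    update_ms: float

    def total_ms(self) -> float:
        # Steps happen one after another, so the total is a plain sum.
        return self.detect_ms + self.propagate_ms + self.process_ms + self.update_ms

    def detect_share(self) -> float:
        """Fraction of total convergence time spent just noticing the failure."""
        return self.detect_ms / self.total_ms()


# Hypothetical numbers: only the detection step differs between the two cases.
hello_based = ConvergenceBudget(detect_ms=3000, propagate_ms=50, process_ms=100, update_ms=150)
event_driven = ConvergenceBudget(detect_ms=10, propagate_ms=50, process_ms=100, update_ms=150)

for name, budget in (("hello-based", hello_based), ("event-driven", event_driven)):
    print(f"{name}: {budget.total_ms():.0f} ms total, "
          f"{budget.detect_share():.0%} spent detecting")
```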

Failure detection tools follow a layered approach: Layer 1, Layer 2, MPLS, Layer 3, application.

Interconnect options:

Point to point - failure detection is really easy here; event driven; fast
Layer 3 with Layer 1 (DWDM) bump in the wire
Layer 3 with Layer 2 (ethernet) bump in the wire
Layer 3 with Layer 3 (firewall/router) bump in the wire

Think about this: as link speeds increase (1G -> 10G -> 40G -> beyond), the same failure detection/reconvergence time means more data is lost. One second of reconvergence at 1G is very different from one second at 40G.
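
The arithmetic behind that point is easy to work out. This sketch assumes the link is fully utilized for the whole outage (the `utilization` parameter is my addition), so the results are upper bounds rather than measurements.

```python
def bytes_lost(link_gbps: float, outage_s: float, utilization: float = 1.0) -> float:
    """Upper bound on bytes lost while traffic is black-holed on a failed link."""
    bits = link_gbps * 1e9 * utilization * outage_s
    return bits / 8  # bits -> bytes


for speed_gbps in (1, 10, 40):
    lost_gb = bytes_lost(speed_gbps, outage_s=1.0) / 1e9
    print(f"{speed_gbps}G link, 1 s outage: up to {lost_gb:g} GB lost")
```

Same one-second outage, 40 times the data lost at 40G compared to 1G.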

Be aware: ISSU may not support aggressive timers on various protocols. Another reason to be wary of timer cranking.

Side note: about 1/3 of the room is running FabricPath

Layer 1:

One-way failure on a fiber link: this used to need UDLD. At 1G, 10G, 40G and so on it is no longer needed; the protocol itself will (for example) drop Tx if the Rx side goes down.
Carrier delay - timer running in software on the routing platforms. Filters link up/down notifications. This behavior is not desirable for fast convergence (set it to zero).
Debounce timer - delays the link down notification; runs in firmware; standard switching platform feature; defaults to 100 msec on NX-OS
Debounce is typically the one you're more likely to work with in the data center
Good slide in the deck comparing carrier delay and debounce timer
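
Both timers do the same basic thing: hold a link-down event for a while and only report it if the link is still down when the timer expires. The sketch below is a toy model of that behavior, not how any platform implements it; the function name and the event format are assumptions of mine.

```python
def reported_downs(events, debounce_ms):
    """Return the times (ms) at which link-down is reported to upper layers.

    events: list of (time_ms, "up" | "down") tuples, sorted by time.
    A down event starts a timer; an up event before expiry cancels it.
    """
    reported = []
    pending_since = None  # time of the down event we are currently holding

    for t, state in events:
        # Did the held down event survive the full debounce interval?
        if pending_since is not None and t - pending_since >= debounce_ms:
            reported.append(pending_since + debounce_ms)
            pending_since = None
        if state == "down":
            if pending_since is None:
                pending_since = t
        else:  # "up" cancels any timer still running
            pending_since = None

    # A down event still pending at the end of the trace eventually expires.
    if pending_since is not None:
        reported.append(pending_since + debounce_ms)
    return reported


# A 40 ms flap followed by a real 400 ms outage.
trace = [(0, "down"), (40, "up"), (500, "down"), (900, "up")]
print(reported_downs(trace, debounce_ms=100))
print(reported_downs(trace, debounce_ms=0))
```

With a 100 msec timer the short flap is filtered, but the real outage is reported 100 msec late. With the timer at zero both events are reported immediately, which is the trade-off behind the "set it to zero" advice.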

Layer 2:

Spanning Tree Bridge Assurance - turns STP from unidirectional to bidirectional; fail closed rather than fail open; an absence of BPDUs will cause the port to be disabled.
LACP - not just for configuration consistency, but also for failure detection (LACPDUs are used as keepalives); also detects unidirectional links; capable of fast hellos, but fast hellos are not supported with ISSU
UDLD - original use cases: STP loop prevention (now handled by Rapid STP, BA); STP fast convergence (now handled by BA); etherchannel misconfiguration (now handled by LACP). "UDLD is nearly useless in the data center today"
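
For a sense of what LACP keepalives buy you as a failure detector, here is a small sketch of the timeout arithmetic. The interval values are the standard LACP ones (1 second fast, 30 seconds slow, timeout after three missed intervals); the function names are mine.

```python
FAST_PERIODIC_S = 1     # LACPDU transmit interval at the fast rate
SLOW_PERIODIC_S = 30    # LACPDU transmit interval at the slow rate
TIMEOUT_MULTIPLIER = 3  # partner info expires after three missed intervals


def lacp_timeout_s(fast: bool) -> int:
    """Seconds without a received LACPDU before the partner is considered gone."""
    interval = FAST_PERIODIC_S if fast else SLOW_PERIODIC_S
    return TIMEOUT_MULTIPLIER * interval


def partner_expired(last_rx_s: float, now_s: float, fast: bool) -> bool:
    """True once the silence since the last LACPDU reaches the timeout."""
    return now_s - last_rx_s >= lacp_timeout_s(fast)


print(lacp_timeout_s(fast=True), lacp_timeout_s(fast=False))
print(partner_expired(last_rx_s=0, now_s=5, fast=True))
print(partner_expired(last_rx_s=0, now_s=5, fast=False))
```

Even at the fast rate this is a multi-second detector, so it complements event-driven Layer 1 detection rather than replacing it.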

How much do you really need UDLD?

Layer 1 - handled by auto neg, event driven failure detection
Layer 2/soft failures - STP BA, RSTP
Etherchannel misconfig/failure: LACP
Chance of miswiring is small
Layer 3 - point to point links; IGP hello timeouts

Link OAM

IEEE 802.3ah
Provides mechanisms for "monitoring link operation"
Can continuously monitor link health (CRCs and so on) and take some action
Not supported on Nexus today; supported on ASR 9k

Layer 3:

Is Layer 3 failure detection tuning necessary? It depends.
Needed when: there is an intermediate Layer 2 hop on a Layer 3 link; there are concerns over software protocol failures; there are concerns over unidirectional failures
May not be needed when: p2p physical L3 links with no concern about unidirectional links; FHRPs are running in active/active mode (vPC, Anycast HSRP); enough software redundancy to account for protocol failures
Tuning down L3 timers is not recommended: it makes configs complex (many protocols at the aggregation layer); it adds CPU load, which is very dangerous; it is not supported by ISSU; and it is still a challenge to achieve sub-second detection
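
A rough way to see the CPU load concern is to count hellos. In this sketch the adjacency counts are hypothetical, the "default" hello intervals are the usual protocol defaults (OSPF 10 s, PIM 30 s, HSRP 3 s), and the 250 ms tuned value is just an example of an aggressive setting, not a recommendation from the session.

```python
def hellos_per_second(sessions):
    """Hellos sent per second, summed over all protocols.

    sessions maps protocol name -> (adjacency count, hello interval in seconds).
    The box also receives roughly as many, so real load is about double.
    """
    return sum(count / hello_s for count, hello_s in sessions.values())


# Hypothetical aggregation switch with common default hello intervals.
default_timers = {"OSPF": (40, 10.0), "PIM": (40, 30.0), "HSRP": (200, 3.0)}

# Same adjacencies with every protocol cranked down to 250 ms hellos.
tuned_timers = {name: (count, 0.25) for name, (count, _) in default_timers.items()}

print(f"default: {hellos_per_second(default_timers):.0f} hellos/s")
print(f"tuned:   {hellos_per_second(tuned_timers):.0f} hellos/s")
```

Every one of those packets is handled by the control plane, and each protocol needs its own timer config to get there.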

BFD:

Bidirectional Forwarding Detection (BFD): lightweight, designed from the ground up for sub-second convergence; allows running one detection protocol (BFD) which other protocols subscribe to (HSRP, OSPF, BGP, PIM, and so on); on NX-OS, supported with stateful restart, SSO, ISSU; can run in hardware; runs in interrupt context; Nexus 2000 ports do not support BFD
BFD is offloaded to the line card CPU on n7k, n9500. In NX-OS 7.2, BFD will offload to the FSA hardware accelerator on the F3 line card (will allow for even faster failure detection)
Bundles: how do you test every single link in a port channel? BFD runs over UDP, so a session's packets get hashed to just a single link in the channel. BFD logical mode: sprays the transmitted packets across all links in the bundle; runs a single BFD session per L3 link. BFD per-link mode: one BFD session per port-channel member; n5k/7k/9k; proprietary feature; only Nexus-to-Nexus links (today); a leader session on the Sup consolidates member states and communicates with clients
BFD for FabricPath: FP IS-IS as BFD client
BFD for OTV
BFD for static routes
BFD multihop - for when BFD peers are not L2 adjacent; not on Nexus today; the workaround is to use IP SLA and hook it to some policy-based routing rules (PBR is done in hardware on n7k)
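
The port-channel problem mentioned above is easier to see with a toy hash. This sketch uses a CRC32 over the 5-tuple as a stand-in for the switch's load-balancing hash (real hashes are platform specific), and the addresses and interface names are made up. UDP port 3784 is the standard single-hop BFD control port.

```python
import zlib


def pick_member(flow, members):
    """Toy load-balancing hash: the same 5-tuple always lands on the same member."""
    key = "|".join(str(field) for field in flow).encode()
    return members[zlib.crc32(key) % len(members)]


members = ["Eth1/1", "Eth1/2", "Eth1/3", "Eth1/4"]
# (src IP, dst IP, IP protocol, src port, dst port) for one BFD session.
bfd_flow = ("10.0.0.1", "10.0.0.2", 17, 49152, 3784)

# One ordinary session: every packet takes the same link, so only that link is tested.
links_tested = {pick_member(bfd_flow, members) for _ in range(1000)}

# How many single-member failures would each approach notice?
covered_single = sum(1 for failed in members if pick_member(bfd_flow, members) == failed)
covered_per_link = len(members)  # per-link mode pins one session to every member

print(f"single hashed session exercises: {sorted(links_tested)}")
print(f"member failures detected: {covered_single} of {len(members)} (single session), "
      f"{covered_per_link} of {len(members)} (per-link)")
```

Logical mode attacks the same problem from the other direction: keep one session, but stop letting the hash pin it to one member.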

Fabric Extenders

"Satellite Discovery Protocol" runs between parent switch and FEX
Doesn't run BFD

If one protocol can do the job, then one protocol might be all you need! (think: BFD). Keep it simple.
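
A minimal sketch of that "one detector, many clients" idea: a single BFD session watches the peer, and the routing protocols simply register for notifications. The class and method names are hypothetical and this is not the NX-OS API. The detection time here assumes both sides use the same interval and multiplier, in which case it is just interval times multiplier; 50 ms x 3 is used as an example setting.

```python
from typing import Callable, Dict, List, Tuple


class BfdSession:
    """One failure detector per peer that any number of client protocols share."""

    def __init__(self, peer: str, interval_ms: int = 50, multiplier: int = 3) -> None:
        self.peer = peer
        self.interval_ms = interval_ms
        self.multiplier = multiplier
        self.state = "Up"
        self._clients: Dict[str, Callable[[str, str], None]] = {}

    @property
    def detection_time_ms(self) -> int:
        # Symmetric timers assumed: miss `multiplier` packets in a row and the peer is down.
        return self.interval_ms * self.multiplier

    def subscribe(self, client: str, callback: Callable[[str, str], None]) -> None:
        self._clients[client] = callback

    def peer_lost(self) -> None:
        """Called when the detection timer expires; fan the event out to every client."""
        self.state = "Down"
        for callback in self._clients.values():
            callback(self.peer, self.state)


events: List[Tuple[str, str, str]] = []
session = BfdSession("10.0.0.2")
for client in ("OSPF", "BGP", "HSRP", "PIM"):
    # c=client binds the current name so each callback reports its own protocol.
    session.subscribe(client, lambda peer, state, c=client: events.append((c, peer, state)))

session.peer_lost()
print(f"detected in ~{session.detection_time_ms} ms, notified: {[e[0] for e in events]}")
```

One set of timers to tune, and every client hears about the failure at the same moment.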
