BRKDCT-2333 – Data Center Network Failure Detection

Presenter: Arkadiy Shapiro, Manager Technical Marketing (Nexus 2000 – 7000) @ArkadiyShapiro

You could say I’m obsessed with BFD –Arkadiy

The focus of this session is failure detection (not reconvergence, protocol tuning, etc.). It does not cover user-driven failure detection methods (ping, traceroute, etc.).

Fast failure detection is the key to fast convergence.

Routing convergence steps:

  1. Detect
  2. Propagate (tell my neighbors)
  3. Process (routing recalc, SPF, DUAL, etc)
  4. Update (update RIB/FIB, program hardware tables)
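
Since the four steps happen in sequence, total convergence time is roughly their sum. The sketch below is only an illustration of why detection tends to dominate: every timing in it is a made-up placeholder, not a measurement from the session.

```python
# Hypothetical per-phase timings in milliseconds (illustrative placeholders only).
def convergence_ms(detect, propagate, process, update):
    """Total convergence time is the sum of the four sequential phases."""
    return detect + propagate + process + update

# Same propagate/process/update costs, two different detection times.
OTHER_PHASES = dict(propagate=10, process=50, update=100)

slow = convergence_ms(detect=40_000, **OTHER_PHASES)  # e.g. waiting out a 40 s hello timeout
fast = convergence_ms(detect=150, **OTHER_PHASES)     # e.g. sub-second detection

print(f"slow detection: {slow} ms total")
print(f"fast detection: {fast} ms total")
```

With these placeholder numbers, shaving the other three phases barely matters until detection itself is fast.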

Failure detection tools: a layered approach: Layer 1, 2, MPLS, 3, application.

Interconnect options:

  • Point to point – failure detection is really easy here; event driven; fast
  • Layer 3 with Layer 1 (DWDM) bump in the wire
  • Layer 3 with Layer 2 (ethernet) bump in the wire
  • Layer 3 with Layer 3 (firewall/router) bump in the wire

Think about this: if the failure detection/reconvergence characteristics of the network stay the same, moving to higher speeds (1G -> 10G -> 40G -> beyond) means more data is lost during every outage. A 1 second reconvergence time at 1G is very different from 1 second at 40G.
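
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a fully loaded link that blackholes all traffic for the whole outage, so it is an upper bound; the function name is mine, not from the session.

```python
def max_bytes_lost(link_gbps, outage_seconds):
    """Upper bound on data lost if a fully loaded link drops everything during an outage."""
    bits = link_gbps * 1e9 * outage_seconds
    return bits / 8

for speed in (1, 10, 40):
    lost_mb = max_bytes_lost(speed, 1.0) / 1e6
    print(f"{speed:>2}G link, 1 s outage: up to {lost_mb:,.0f} MB lost")
```

The same 1 second of blindness costs up to 125 MB at 1G and up to 5,000 MB at 40G.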

Be aware: ISSU may not support aggressive timers on various protocols. Another reason to be wary of timer cranking.

Side note: about 1/3 of the room is running FabricPath

Layer 1:

  • One-way failure on a fiber link: this used to require UDLD. 1G, 10G, 40G and so on no longer need it; the link-layer protocol itself will, for example, drop Tx if the Rx side goes down.
  • Carrier delay – timer running in software on the routing platforms. Filters link up/down notifications. This behavior is not desirable for fast convergence (set it to zero).
  • Debounce timer – delays the link-down notification; runs in firmware; standard switching platform feature; defaults to 100 ms on NX-OS
  • Debounce is typically the one you’re more likely to work with in the data center
  • Good slide in the deck comparing carrier delay and debounce timer
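
The trade-off behind both timers can be sketched as a simple filter: a link-down is reported only if the link is still down once the timer expires. This is a toy model under my own assumptions (raw events strictly alternate between down and up, times in milliseconds), not how any platform actually implements it.

```python
def debounced_down_events(events, debounce_ms, end_ms):
    """Return the times at which a link-down is actually reported.

    events: time-ordered list of (time_ms, is_up) raw link transitions.
    A down transition is reported only if the link stays down for at
    least debounce_ms; shorter flaps are filtered out.
    end_ms: time at which observation stops.
    """
    reported = []
    for i, (t, is_up) in enumerate(events):
        if is_up:
            continue
        # Time of the next transition (link coming back), or end of observation.
        next_t = events[i + 1][0] if i + 1 < len(events) else end_ms
        if next_t - t >= debounce_ms:
            reported.append(t + debounce_ms)
    return reported

raw = [(0, False), (30, True), (500, False)]  # a 30 ms flap, then a real failure
print(debounced_down_events(raw, debounce_ms=100, end_ms=1000))
print(debounced_down_events(raw, debounce_ms=0, end_ms=1000))
```

With a 100 ms timer the flap is hidden but the real failure is reported 100 ms late; with the timer at zero both events are reported immediately.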

Layer 2:

  • Spanning Tree Bridge Assurance – turns STP from unidirectional to bidirectional; fail closed rather than fail open; an absence of BPDUs will cause the port to be disabled.
  • LACP – not just for configuration consistency, but also for failure detection (LACPDUs are used as keepalives); also detects unidirectional links; capable of fast hellos, however fast hellos are not supported with ISSU
  • UDLD – original use cases: STP loop prevention (now handled by Rapid STP, BA); STP fast convergence (now handled by BA); etherchannel misconfiguration (now handled by LACP). “UDLD is nearly useless in the data center today”
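
For a sense of scale on LACP as a failure detector, here is a small sketch using the standard LACP timer values (LACPDUs every 1 s at the fast rate or every 30 s at the slow rate, with a timeout of three missed intervals). The names are mine, and platform behavior may differ.

```python
# Standard LACP periodic transmit intervals, in seconds.
LACP_PERIODIC_S = {"fast": 1, "slow": 30}
TIMEOUT_MULTIPLIER = 3  # a partner is timed out after three missed intervals

def lacp_timeout_s(rate):
    """Worst-case time for LACP to declare a silent partner down."""
    return LACP_PERIODIC_S[rate] * TIMEOUT_MULTIPLIER

for rate in ("fast", "slow"):
    print(f"LACP {rate} rate: partner timed out after {lacp_timeout_s(rate)} s")
```

Even at the fast rate, detection takes seconds, which is why LACP complements rather than replaces sub-second mechanisms.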

How much do you really need UDLD?

  • Layer 1 – handled by autonegotiation and event-driven failure detection
  • Layer 2/soft failures – STP BA, RSTP
  • Etherchannel misconfig/failure: LACP
  • Chance of miswiring is small
  • Layer 3 – point to point links; IGP hello timeouts

Link OAM

  • IEEE 802.3ah
  • Provides mechanisms for “monitoring link operation”
  • Can continuously monitor link health (CRCs and so on) and take some action
  • Not supported on Nexus today; available on ASR 9k

Layer 3:

  • Is Layer 3 failure detection tuning necessary? It depends.
  • Needed when: intermediate Layer 2 hop over Layer 3 hop; concerns over software protocol failures; concerns over unidirectional failures
  • May not be needed when: p2p physical L3 links with no concern about unidirectional links; FHRPs are running in active/active mode (VPC, Anycast HSRP); enough software redundancy to account for protocol failures
  • Tuning down L3 timers is not recommended: it makes configs complex (many protocols at the aggregation layer); adds CPU load, which is very dangerous; is not supported by ISSU; and still makes sub-second detection a challenge

BFD:

  • Bidirectional Forwarding Detection (BFD): lightweight, designed from the ground up for sub-second convergence; allows running one detection protocol (BFD) which other protocols subscribe to (HSRP, OSPF, BGP, PIM and so on); on NX-OS, supported with stateful restart, SSO, ISSU; can run in hardware; runs in interrupt context; Nexus 2000 ports do not support BFD
  • BFD is offloaded to the line card CPU on n7k, n9500. In NX-OS 7.2, BFD will offload to the FSA hardware accelerator on the F3 line card (will allow for even faster failure detection)
  • Bundles: how do you test every single link in a port channel? BFD is sent over UDP, so a session gets hashed to just a single link in the channel. BFD logical mode: sprays the transmitted packets across all links in the bundle; runs a single BFD session per L3 link. BFD per-link mode: one BFD session per port-channel member; n5k/7k/9k; proprietary feature; only Nexus-to-Nexus links (today); a master session on the Sup consolidates member states and communicates with clients
  • BFD for FabricPath: FP IS-IS as BFD client
  • BFD for OTV
  • BFD for static routes
  • BFD multihop – for when BFD peers are not L2 adjacent; not supported on Nexus today; the workaround is to use IP SLA and hook it to policy-based routing rules (PBR is done in hardware on n7k)
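
To make "sub-second" concrete, here is a small sketch of how BFD asynchronous mode arrives at its detection time, following the rule in RFC 5880 (the remote detect multiplier times the greater of the local required minimum receive interval and the remote desired minimum transmit interval). The timer values are illustrative only and the function name is mine.

```python
def bfd_detection_time_ms(local_required_min_rx_ms, remote_desired_min_tx_ms, remote_detect_mult):
    """Asynchronous-mode detection time per RFC 5880, section 6.8.4."""
    # The remote side transmits no faster than we are willing to receive.
    agreed_remote_tx_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_remote_tx_ms

# Illustrative values: both sides at 50 ms with a multiplier of 3.
symmetric = bfd_detection_time_ms(50, 50, 3)
# If the local side can only accept a packet every 250 ms, that interval wins.
asymmetric = bfd_detection_time_ms(250, 50, 3)

print(f"50/50 x3 : {symmetric} ms")
print(f"250/50 x3: {asymmetric} ms")
```

Compare 150 ms with a 40 s OSPF dead interval (the protocol default on broadcast and point-to-point networks) to see why the session favors BFD over tuned-down IGP hellos.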

Fabric Extenders

  • “Satellite Discovery Protocol” runs between parent switch and FEX
  • Doesn’t run BFD

If one protocol can do the job, then one protocol might be all you need! (think: BFD). Keep it simple.
