The Importance of BGP NEXT_HOP in L3VPNs

In an MPLS network with L3VPNs, it's very easy for the NEXT_HOP attribute of a VPN route to look absolutely correct but be very wrong at the same time. In a vanilla IP network, the NEXT_HOP can point to any IP address that gets the packets moving in the right direction towards the ultimate destination. In an MPLS network, the NEXT_HOP must get the packets moving in the right direction but it must also point to the exact right address in order for traffic to successfully reach the destination.

It has to be exact because IOS binds the outer (transport) MPLS label to the NEXT_HOP address, not to each individual VPN route. So when an ingress PE needs to forward a packet from a CE across the MPLS network, the PE finds the label associated with the NEXT_HOP address and uses that as the outer label to get the packet to the egress PE.

Since each NEXT_HOP has a different label, that means each NEXT_HOP is reachable through a different Label Switched Path (LSP). Different LSPs can, and likely will, forward traffic differently through the network.

An MPLS label identifies a Forwarding Equivalence Class (FEC). A FEC is a group of packets that all receive the same treatment from the network and are all forwarded along the same path. By definition, then, if two PEs have two LSPs between them, packets can be treated and forwarded differently on each one. What I'll show below is that only one of those paths gets the packets to the ultimate destination.

So what is the correct NEXT_HOP value? It's the loopback address of the egress PE.

Here's an example:

The user attached to R2 is sending traffic to the server (192.168.100.8) attached to R7. R2 and R7 are MP-BGP peers; R2 is using its loopback address and R7 is using its ethernet0/0 address as the peering addresses. The BGP peers are established, LDP is operational on each link, and routes are being learned successfully between R2 and R7.

Just so there are no assumptions: this is not a scenario from the CCIE lab exam. It might've been a scenario from a practice exam; I don't actually remember anymore where this came from. I documented it very precisely in my study notes because I thought it would be a devilishly clever troubleshooting task.

By looking in the BGP table of R2, we can see the NEXT_HOP for 192.168.100.8 points to R7's ethernet0/0 address, 10.2.67.7 (in this scenario the last octet of an address is the router number):

R2# show bgp vpnv4 unicast vrf BRANCHES 192.168.100.8
BGP routing table entry for 200:1:192.168.100.0/24, version 32
Paths: (1 available, best #1, table BRANCHES)
  Advertised to update-groups:
     3
  Refresh Epoch 1
  Local
    10.2.67.7 (metric 30) from 10.2.67.7 (10.1.7.7)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:200:1
      mpls labels in/out nolabel/26
      rx pathid: 0, tx pathid: 0x0

Since R2 is the ingress PE, we know it will put the user's traffic onto an LSP so that it can be label switched through the network core (i.e., R3 and R6). Let's check the Label Forwarding Information Base (LFIB) to see the value of the outer label that R2 will apply. As stated above, the outer label is bound to the NEXT_HOP address and not to the VPN route, so the lookup is done on 10.2.67.7:

R2# show mpls forwarding-table 10.2.67.7
Local      Outgoing   Prefix         Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id   Switched      interface
None       19         10.2.67.7/32   0             Et0/0      10.2.23.3

R2 will apply a label value of 19.
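
To make the imposition step concrete, here is a minimal Python sketch of what R2 does with the values shown in the output above. It is only an illustration, not how IOS is implemented: the dictionary and function names are made up for this example, and the tables hold just the single route and single NEXT_HOP from this scenario.

```python
# Hypothetical model of label imposition on the ingress PE (R2).
# All names are illustrative; the values come from the show output above.

# VPN route -> (BGP NEXT_HOP, VPN label received via MP-BGP),
# taken from "mpls labels in/out nolabel/26".
vpn_routes = {"192.168.100.0/24": ("10.2.67.7", 26)}

# NEXT_HOP address -> outgoing transport label from R2's LFIB.
transport_labels = {"10.2.67.7": 19}


def impose(prefix):
    """Return the label stack R2 pushes, top of stack first."""
    next_hop, vpn_label = vpn_routes[prefix]
    outer_label = transport_labels[next_hop]  # looked up by NEXT_HOP, not by VPN route
    return [outer_label, vpn_label]


print(impose("192.168.100.0/24"))  # [19, 26]
```

The point of the sketch is that the outer label is chosen purely from the NEXT_HOP address, so a different NEXT_HOP means a different outer label and therefore a different LSP.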

So far nothing appears amiss and we have no indication one way or the other if the user is able to reach the server.

If we move to R3 and look at the LFIB for label 19, we start to see something interesting.

R3# show mpls forwarding-table labels 19
Local      Outgoing   Prefix         Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id   Switched      interface
19         Pop Label  10.2.67.0/24   4369          Et0/0      10.2.36.6

R3 wants to perform a pop on this traffic. Why? Because R6 advertised an "implicit-null" label to R3 for the 10.2.67.0/24 network because it's a directly connected network for R6. This is standard Penultimate Hop Popping (PHP) mechanics. What ends up happening is R3 pops the outer label and forwards the packet to R6 with just the inner, VPN label attached. In this case the VPN label is 26 (seen above on R2). R6 has no information for label 26 and drops the packet.

R6# show mpls forwarding-table labels 26
Local      Outgoing   Prefix         Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id   Switched      interface

It's possible that R6 could have an LFIB entry for label 26 that it allocated for some unrelated prefix, but even then the packets we're talking about would not get to the proper destination because they'd be forwarded towards that unrelated prefix.

The root cause of the issue in this example is the choice of peering address on R7. Because the peering address used on R7 is on a network that is directly attached to R6, R6 does what it's supposed to do in that situation and tells R3 to pop the outer label. In other words, R6 initiates the PHP process.

If the peering address on R7 is moved to the loopback, then everything changes.

R2# show bgp vpnv4 unicast vrf BRANCHES 192.168.100.8
BGP routing table entry for 200:1:192.168.100.0/24, version 38
Paths: (1 available, best #1, table BRANCHES)
  Advertised to update-groups:
     3
  Refresh Epoch 1
  Local
    10.1.7.7 (metric 31) from 10.1.7.7 (10.1.7.7)
    <snip>

NEXT_HOP on R2 is now R7's loopback.

R2# show mpls forwarding-table 10.1.7.7
Local      Outgoing   Prefix         Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id   Switched      interface
None       22         10.1.7.7/32    0             Et0/0      10.2.23.3

There's a different outgoing label on R2, which means a different LSP is now being used.

R3# show mpls forwarding-table labels 22
Local      Outgoing   Prefix         Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id   Switched      interface
22         17         10.1.7.0/24    7697          Et0/0      10.2.36.6

R3 does not perform a pop on this LSP! It does a swap to label 17 and shoots the fully labeled packet onwards.

R6# show mpls forwarding-table labels 17
Local      Outgoing   Prefix         Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id   Switched      interface
17         Pop Label  10.1.7.0/24    8324          Et0/1      10.2.67.7

Now it's R6 that performs the pop action. In this case that's perfectly OK because R7, being the egress PE, knows what to do with a packet labeled with just the VPN label of 26. R7 originated label 26 and advertised it to R2 via MP-BGP, so of course it knows what to do with it.

Now that the NEXT_HOP is proper and the ingress PE is using the right LSP, there is end-to-end connectivity.
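
The difference between the two LSPs can be replayed with a small Python sketch. This is a toy model under stated assumptions, not a description of IOS internals: the `LFIB` dictionary, the `POP` and `DELIVER` markers, and the `walk` function are all hypothetical names, and the tables contain only the entries shown in the outputs above. R7's handling of VPN label 26 is modeled as a simple "deliver" action.

```python
# Toy model of the two LSPs from R2 towards R7 (all names are illustrative).
POP = "pop"          # stands in for "Pop Label" in the LFIB output
DELIVER = "deliver"  # stands in for R7 forwarding into the VRF

# Per-router LFIB: incoming top label -> (action or swap label, next router).
LFIB = {
    "R3": {19: (POP, "R6"), 22: (17, "R6")},
    "R6": {17: (POP, "R7")},      # R6 has no entry for label 26
    "R7": {26: (DELIVER, None)},  # R7 allocated VPN label 26
}


def walk(first_hop, stack):
    """Follow a label stack (top first) hop by hop; return (path, result)."""
    router, stack, path = first_hop, list(stack), []
    while True:
        path.append(router)
        if not stack:
            return path, "dropped"
        entry = LFIB[router].get(stack[0])
        if entry is None:
            return path, "dropped"   # no LFIB entry for the top label
        action, next_router = entry
        if action == DELIVER:
            return path, "delivered"
        if action == POP:
            stack.pop(0)             # penultimate hop popping
        else:
            stack[0] = action        # swap the top label
        router = next_router


# NEXT_HOP = R7's ethernet0/0: R2 imposes [19, 26]
print(walk("R3", [19, 26]))  # (['R3', 'R6'], 'dropped')

# NEXT_HOP = R7's loopback: R2 imposes [22, 26]
print(walk("R3", [22, 26]))  # (['R3', 'R6', 'R7'], 'delivered')
```

In both cases the VPN label is 26. The only thing that changes is which router performs the pop, and that alone decides whether the VPN label arrives at a router that understands it.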

Incidentally, when R7 uses ethernet0/0 to peer with R2, IOS logs a message on R7 to warn the network operator of this very situation:

%BGP-4-VPN_NH_IF: Nexthop 10.2.67.7 may not be reachable from neigbor 10.1.2.2 - not a loopback

An alternative to changing the peering address on R7 is to configure a route-map, either on R7 or on R2, that rewrites the NEXT_HOP attribute in BGP updates sent from R7 to R2 so that it points to R7's loopback.

Disclaimer: The opinions and information expressed in this blog article are my own and not necessarily those of Cisco Systems.
