Walking with Packets: Traceroute Through MPLS Cloud

Think about this for a minute: an MPLS network with two Provider Edge (PE) routers and some Provider (P) routers. The P routers have no VRFs configured on them and therefore have no routes whatsoever for any of the customer networks. A customer then does a traceroute from one of their sites, across the MPLS cloud, and into one of their other sites. The traceroute output shows the P routers as hops along the path.

How is it possible for the P routers to reply to the traceroute if they don't have routes back to the customer network?

The Lab Setup

Here's the network:

Here's the traceroute output from R21's loopback0 to R8's loopback0 (the last octet of each IP address corresponds to the name of each router):

R21#traceroute 10.1.8.8 source loopback0
Type escape sequence to abort.
Tracing the route to 10.1.8.8
VRF info: (vrf in name/id, vrf out name/id)
  1 10.4.4.4 21 msec 18 msec 17 msec
  2 10.2.45.5 [MPLS: Labels 21/24 Exp 0] 19 msec 18 msec 18 msec
  3 10.2.15.1 [MPLS: Labels 21/24 Exp 0] 19 msec 17 msec 18 msec
  4 10.2.16.6 [MPLS: Labels 21/24 Exp 0] 16 msec 19 msec 19 msec
  5 192.168.100.7 [MPLS: Label 24 Exp 0] 18 msec 18 msec 19 msec
  6 192.168.100.8 16 msec *  18 msec

Hops 2, 3, and 4 are the P routers.
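As a rough aid for reading output like this, here is a minimal Python sketch that pulls the hop number, responding IP, and MPLS labels out of each line. The regular expression and the `parse_hops` helper are my own illustration and only assume the line format shown above; they are not part of any Cisco tooling. In this lab, the hops that still carry two labels are the ones inside the MPLS core.

```python
import re

# Matches lines such as "  2 10.2.45.5 [MPLS: Labels 21/24 Exp 0] 19 msec ..."
HOP_RE = re.compile(
    r"^\s*(?P<hop>\d+)\s+(?P<ip>\d+\.\d+\.\d+\.\d+)"
    r"(?:\s+\[MPLS: Labels? (?P<labels>[\d/]+) Exp \d+\])?"
)

def parse_hops(output):
    """Return a list of (hop number, responder IP, [labels]) tuples."""
    hops = []
    for line in output.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # skip banner lines such as "Tracing the route to ..."
        raw = match.group("labels")
        labels = [int(x) for x in raw.split("/")] if raw else []
        hops.append((int(match.group("hop")), match.group("ip"), labels))
    return hops

SAMPLE = """\
R21#traceroute 10.1.8.8 source loopback0
Type escape sequence to abort.
Tracing the route to 10.1.8.8
VRF info: (vrf in name/id, vrf out name/id)
  1 10.4.4.4 21 msec 18 msec 17 msec
  2 10.2.45.5 [MPLS: Labels 21/24 Exp 0] 19 msec 18 msec 18 msec
  3 10.2.15.1 [MPLS: Labels 21/24 Exp 0] 19 msec 17 msec 18 msec
  4 10.2.16.6 [MPLS: Labels 21/24 Exp 0] 16 msec 19 msec 19 msec
  5 192.168.100.7 [MPLS: Label 24 Exp 0] 18 msec 18 msec 19 msec
  6 192.168.100.8 16 msec *  18 msec
"""

hops = parse_hops(SAMPLE)
# Hops whose probe arrived with both an outer and an inner label.
two_label_hops = [hop for hop, _ip, labels in hops if len(labels) == 2]
print(two_label_hops)
```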

The Investigation

Let's start by looking at R5. R5 definitely doesn't have any routes back to R21's loopback:

R5#show ip vrf
R5#
R5#show ip route 10.1.21.21
% Subnet not in table

However, we also know that the traceroute packets are being sent between R4 and R5 as MPLS label switched packets (i.e., they have an MPLS label in the header). We can find out precisely which label(s) by looking in the FIB on R4 to see which labels it's imposing on the packet before sending it:

R4#show ip cef vrf BRANCHES 10.1.8.8
10.1.8.0/24
  nexthop 10.2.45.5 Ethernet0/1 label 21 24

Since this is an MPLS L3VPN, there are two labels:

24 is the inner label, advertised by R7 to R4. R4 imposes it on the packet and it is carried unchanged through the network to R7, which uses it to look up which VRF the packet should be forwarded in.
21 is the outer label, advertised by R5 to R4. R4 uses it to get the packet through the network to R7.

This post assumes some familiarity with MPLS L3VPN which is why I'm only giving short bullet points to explain what each label is used for.
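For readers who prefer code to bullet points, here is a minimal Python sketch of the label imposition R4 performs. The `Packet` class and the `impose_l3vpn_labels` function are hypothetical names made up for this illustration; only the label values (21 and 24) and the addresses come from the lab.

```python
from dataclasses import dataclass, field

@dataclass
class Packet:
    src: str
    dst: str
    ttl: int
    # Label stack; index 0 is the outer (top) label.
    labels: list = field(default_factory=list)

def impose_l3vpn_labels(packet, vpn_label, transport_label):
    """Push the inner VPN label first, then the outer transport label on top."""
    packet.labels.insert(0, vpn_label)
    packet.labels.insert(0, transport_label)
    return packet

# The second traceroute probe from R21 as it leaves R4 towards R5.
probe = Packet(src="10.1.21.21", dst="10.1.8.8", ttl=2)
impose_l3vpn_labels(probe, vpn_label=24, transport_label=21)
print(probe.labels)  # [21, 24]
```

The outer label is the only one the P routers ever look at; the inner label just rides along until it reaches R7.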

From the FIB output we know that R5 is getting an MPLS packet from R4 with an outer label of 21. This is the only information R5 has with which to send a reply packet. However, because MPLS Label Switched Paths (LSPs) are unidirectional, R5 cannot simply send a packet back to R4: label 21 tells R5 nothing about how to reach R4. R5 must send its response forwards, in the direction of R7, along the LSP. We can test this theory by breaking out the sniffer and looking at the eth0/1 interface on R5.

You don't see the first three UDP traceroute probes in amongst packets 34, 36, and 38 because this capture is taken on eth0/1 of R5; the UDP probes come in on eth0/0 and are dropped before being switched out on eth0/1.

Starting at packet number 34: there's a time-to-live exceeded response coming from R5 destined to R21. I know it's coming from R5 because the source MAC address belongs to R5. The MPLS label stack is {21, 24} (it's a complete coincidence that the outer label is 21 on both R4 and R5). Where does label 21 lead to on R5?

R5#show mpls forwarding-table labels 21
Local      Outgoing   Prefix         Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id   Switched      interface
21         21         10.1.7.0/24    6458639       Et0/1      10.2.15.1

It leads to 10.1.7.0/24, which is R7's loopback and the BGP NEXT_HOP address on R4 for 10.1.8.0/24.

Everything checks out so far. R5 is sending its response to the traceroute forward along the LSP. But then how does the reply actually get to R21 if R5 is sending its response away from R21?

Look at packet 35 in the sniffer capture: it also shows a time-to-live exceeded packet sourced from R5's IP and destined to R21. Looking at the source MAC address, I know this packet is actually being sent by R1 towards R5. Its label stack is {16, 19}. Where does label 16 lead to on R5?

R5#show mpls forwarding-table labels 16
Local      Outgoing   Prefix         Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id   Switched      interface
16         Pop Label  10.1.4.0/24    6194007       Et0/0      10.2.45.4

It leads to 10.1.4.0/24 which is R4's loopback. Now the time-to-live exceeded packet is heading towards R21.

Conclusion

Here's the sequence of events:

R5 has no idea how to get a packet back to R21; all it has is the outer label (21) on the traceroute probe.
R5 uses label 21 to do a lookup in the LFIB and sends its time-to-live exceeded message onwards along the LSP towards R7.
R7 receives the packet, does a lookup in the appropriate VRF (based on the inner label, 24), and sees that 10.1.21.21 is reachable back through the MPLS cloud. R7 puts the right label stack onto the packet and pushes it back into the MPLS cloud.
By the time this packet arrives back at R5 it has a label stack of {16, 19}. R5 looks up the outer label (16), pops it, and forwards the packet on towards R4.
R4 uses the remaining inner label (19) to do a lookup in the LFIB and forwards the packet to R21.
As shown in the packet capture, R5, which originated the time-to-live exceeded message, put its own IP address in the "source IP" header field and that value is maintained throughout the packet's journey and finally ends up being displayed in the traceroute output on R21.
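The sequence above can be sketched as a small Python simulation. This is only a toy model under stated assumptions, not how IOS is implemented. R5's two LFIB rows come straight from the `show mpls forwarding-table` output. The label 21 entries on R1 and R6 are inferred from the traceroute output. The return-path labels 17 and 18 are hypothetical placeholders, because the post never shows the LFIBs of R1 and R6.

```python
# LFIB per P router: incoming label -> (outgoing label, next hop). None means pop.
LFIB = {
    "R5": {21: (21, "R1"), 16: (None, "R4")},  # both rows from R5's show output
    "R1": {21: (21, "R6"), 17: (16, "R5")},    # 21 inferred; 17 is hypothetical
    "R6": {21: (None, "R7"), 18: (17, "R1")},  # pop inferred; 18 is hypothetical
}

VPN_LABEL_R8_SITE = 24    # inner label on the probe (R4's CEF output)
VPN_LABEL_R21_SITE = 19   # inner label on the reply (packet capture)
R7_RETURN_TRANSPORT = 18  # hypothetical outer label R7 imposes towards R4

def walk_lsp(router, labels, hops):
    """Label-switch across the P routers until the packet reaches a PE."""
    while router in LFIB:  # only the P routers are modelled in LFIB
        out_label, next_hop = LFIB[router][labels[0]]
        # Swap the outer label, or pop it when the entry says so.
        labels = labels[1:] if out_label is None else [out_label] + labels[1:]
        router = next_hop
        hops.append((router, tuple(labels)))
    return router, labels

def ttl_exceeded_path(origin, probe_labels):
    """Trace a TTL-exceeded reply that reuses the probe's label stack."""
    hops = [(origin, tuple(probe_labels))]
    pe, labels = walk_lsp(origin, list(probe_labels), hops)
    if pe != "R7" or labels != [VPN_LABEL_R8_SITE]:
        raise ValueError("reply did not reach R7 carrying the VPN label")
    # R7 does a VRF lookup for 10.1.21.21 and imposes a new stack towards R4.
    labels = [R7_RETURN_TRANSPORT, VPN_LABEL_R21_SITE]
    hops.append(("R6", tuple(labels)))
    walk_lsp("R6", labels, hops)
    # R4 uses the inner label to hand the packet to R21 unlabelled.
    hops.append(("R21", ()))
    return hops

hops = ttl_exceeded_path("R5", [21, 24])
print(" -> ".join(name for name, _labels in hops))
```

Running it prints `R5 -> R1 -> R6 -> R7 -> R6 -> R1 -> R5 -> R4 -> R21`, and the second time the packet reaches R5 it carries the stack (16, 19), matching packet 35 in the capture.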

Bonus Points

Why does the traceroute output on R21 show R7's IP address (hop #5) as 192.168.100.7 and not 10.2.67.7 (the interface which faces towards R21)?

R21#traceroute 10.1.8.8 source loopback0
Type escape sequence to abort.
Tracing the route to 10.1.8.8
VRF info: (vrf in name/id, vrf out name/id)
  1 10.4.4.4 21 msec 18 msec 17 msec
  2 10.2.45.5 [MPLS: Labels 21/24 Exp 0] 19 msec 18 msec 18 msec
  3 10.2.15.1 [MPLS: Labels 21/24 Exp 0] 19 msec 17 msec 18 msec
  4 10.2.16.6 [MPLS: Labels 21/24 Exp 0] 16 msec 19 msec 19 msec
  5 192.168.100.7 [MPLS: Label 24 Exp 0] 18 msec 18 msec 19 msec
  6 192.168.100.8 16 msec *  18 msec

The answer is an optimization in IOS: time-to-live exceeded messages are sourced from an IP address that belongs to an interface in the same VRF as the source of the traceroute probe. Put another way: because R21 is in the BRANCHES VRF, R7 replies from its 192.168.100.7 address, which also belongs to the BRANCHES VRF.

While I can't say for sure why IOS does this, it makes sense for troubleshooting. You expect the interfaces in a traceroute to be reachable for ping tests or other reachability tests. However, R21 can't reach the IP addresses from hops 2, 3, or 4 (the P routers) because those addresses exist in the MPLS core and aren't part of the BRANCHES VRF.

    R21#show ip route 10.2.0.0 255.255.0.0 longer-prefixes
    [No output]

By having R7 reply from an IP address that is reachable from R21, it provides a test point at the far edge of the MPLS cloud where we can do reachability tests and so on during troubleshooting.

R21#ping 192.168.100.7 source loopback0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.100.7, timeout is 2 seconds:
Packet sent with a source address of 10.1.21.21
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 18/18/20 ms

Extra Bonus Points

If you're curious why the last hop in the traceroute always shows a "*" for the 2nd probe, check out this other post: What the *, traceroute?

Disclaimer: The opinions and information expressed in this blog article are my own and not necessarily those of Cisco Systems.