Five Functional Facts about OTV

Following on from my previous “triple-F” article (Five Functional Facts about FabricPath), I thought I would apply the same concept to the topic of Overlay Transport Virtualization (OTV). This post won't cover much of OTV's foundational concepts; instead, it dives right into how the protocol actually functions in practice. A reasonable introduction to OTV can be found in my series on Data Center Interconnects.

So without any more preamble, here are five functional facts about OTV.

#1 – OTV Adds 42 Bytes of Overhead

OTV, being an encapsulation technology, adds additional headers to the encapsulated payload. Without rehashing too much of the basics, OTV extends a Layer 2 domain across a Layer 3 cloud. In order to preserve the Layer 2 semantics on either side of the cloud, OTV scoops up the entire Layer 2 frame on one side, transports it across the cloud in the middle, and puts it on the LAN on the other side. This preserves the entire Ethernet header including the original source/destination MACs, and even the CoS bits and VLAN tag.

So to begin with, we’re putting a (potentially) full-sized Ethernet frame – with headers – inside another Ethernet frame. That alone will grow the size of the packets that get sent across the cloud.

But on top of that, OTV needs to add some of its own information to the packet so that the remote OTV edge device knows, among other things, which VLAN the encapsulated packet should be put onto (OTV strips the 802.1Q bits from the original packet, if present). There also needs to be an IP header put on the front of the whole packet that will get it safely across the Layer 3 cloud.

Here’s where we muddy the waters a little bit. There are actually two encapsulation formats used by OTV on the wire. First, there’s the UDP-based format as outlined in the “hasmit” draft. You can think of this as the “standards track” format. Second, there’s the actual-in-use-today format of GRE over MPLS. If you haven’t read the analysis of OTV by Brian McGahan at INE, go and read it now. It very concisely breaks down what this format looks like on the wire.

Why two formats? Well, native OTV encapsulation support has not been baked into the current generation of ASICs found on the M1/M2 line cards on the Nexus 7000 (the Nexus 7k is being heavily positioned as the box where you put your Layer 2/Layer 3 boundary, which is exactly where you do OTV). Cisco Engineering had a choice: wait until new ASICs were developed and shipped, or bring OTV to life using encapsulation formats that the currently shipping ASICs already support. They chose the latter and came up with the GRE over MPLS format.

At least when it comes to the amount of overhead imposed, there’s no ambiguity: both formats add a total of 42 bytes of overhead to the original packet as it’s sent between the OTV edge devices. Here’s the breakdown for both formats:

GREoMPLS:

[OuterEth=14] + [IP=20] + [GRE=4] + [MPLS=4] + [InnerEth=variable] = 42 bytes of overhead

UDP:

[OuterEth=14] + [IP=20] + [UDP/OTV shim=8] + [InnerEth=variable] = 42 bytes of overhead

Here’s the point. OTV adds 42 bytes on top of the original frame, but who cares? Well, here are two very important data points:

  1. The hasmit draft says, “The addition of OTV encapsulation headers increases the size of an L2 packet received on an internal interface such that the core uplinks on the Edge Device as well as the routers in the core need to support an appropriately larger MTU. OTV encapsulated packets must not get fragmented as they traverse the core, and hence the IP header is marked to not fragment by the Edge Device.”
  2. Fragmentation and reassembly are not available on the Nexus 7000.

In other words, if OTV scoops up a full-sized, 1500-byte frame at one site and adds 42 bytes of overhead to make a 1542-byte frame, then the network between that OTV edge device and the edge device at the far site must support an MTU of at least 1542 bytes.

Now as with most things, there is an exception. The ASR 1000 has support for fragmenting OTV packets. Enabling this capability should be done with caution: it violates the OTV hasmit draft and will immediately break interop with Nexus 7000 edge devices.

Keep in mind that when I say “the network between the edge devices”, that could include part of the data center network too and not just the WAN/DCI. Every port along every possible path between the edge devices must support a minimum MTU of 1542 bytes.
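To make that concrete, here is a minimal sketch of what the edge device side might look like on a Nexus 7000, assuming a hypothetical M-series join interface named Ethernet2/1 (the interface name and the exact MTU value are placeholders; many shops simply run jumbo frames end to end rather than sizing to exactly 1542):

  ! Hypothetical NX-OS sketch. The same minimum MTU must also be
  ! configured on every core-facing port along every possible path
  ! between the OTV edge devices.
  interface Ethernet2/1
    description OTV join interface
    mtu 1542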

#2 – Address Learning is a Control Plane Function

Since we’re talking about Layer 2/Ethernet networks when it comes to OTV, it would be understandable to assume that the OTV Edge Devices perform address learning the same way as Ethernet switches. However, that’s not the case.

The Edge Device that first sees a packet from a local device in an extended VLAN will advertise that device’s MAC address to other OTV devices by using the OTV control plane protocol, IS-IS for Layer 2. This works just like a routing protocol such as EIGRP or OSPF picking up a locally attached subnet and advertising it throughout the network. The other devices participating in the protocol learn of the presence of the subnet (or MAC) and can then direct traffic towards it.

This is an important point: it’s only once the local Edge Device has advertised the learned MAC address to the other OTV devices that those other devices can start directing traffic towards it. Think of it like this: a router running OSPF cannot forward a packet unless there’s a match for the destination address in its forwarding table. Just like in that example, OTV does not forward frames across the overlay unless there’s a match for the destination MAC in its forwarding table (i.e., it has learned the MAC via Layer 2 IS-IS). Unlike Ethernet bridging, OTV does not flood frames for which it doesn’t have a forwarding table entry.

This has some implications. Namely, if an end device is silent and does not emit any traffic, the local Edge Device will not learn its MAC and will not advertise it to other OTV nodes. This results in the device being unreachable from remote OTV-enabled locations. Cases of this happening are rare (syslog servers and perhaps NetFlow collectors are two examples I can think of) but it’s important to understand.

Now, having said that, there is, of course, an exception. In NX-OS 6.2(2) and up on the Nexus 7000 there is support for selectively flooding unknown unicast MACs. This is meant to address the case I explained above where there are silent hosts in the network.
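The knob takes the form of a per-MAC, per-VLAN flood entry configured on the edge device. A sketch, with a placeholder MAC address and VLAN:

  ! Hypothetical example: always flood frames destined to this
  ! silent host's MAC across the overlay in VLAN 100.
  otv flood mac 0050.56aa.0001 vlan 100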

Lastly, with respect to Layer 2 IS-IS, it’s a native part of OTV and is automatically enabled as part of enabling OTV. There’s no explicit configuration that you need to perform. Really, you don’t even have to understand how IS-IS works if you don’t want to as its operation is really “under the covers” here.
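To illustrate just how hands-off it is, here is a bare-bones sketch of a Nexus 7000 OTV configuration over a multicast-enabled core (the site identifier, interface names, VLANs, and group addresses are all made up). Notice that not a single line of it mentions IS-IS:

  feature otv
  otv site-identifier 0x1
  otv site-vlan 99
  interface Overlay1
    otv join-interface Ethernet2/1
    otv control-group 239.1.1.1
    otv data-group 232.1.1.0/28
    otv extend-vlan 100-110
    no shutdown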

#3 – STP is Filtered

I’ve written before about the risks of fate sharing between data centers. One way to keep the failure domains isolated is to ensure the data centers are not part of one giant Spanning Tree domain. Such a shared domain has implications for traffic forwarding (traffic has to be forwarded along the tree, and that tree now extends between DCs), but also for network convergence and scalability.

In order to address these issues, OTV natively blocks STP BPDUs from crossing the overlay. This effectively partitions the STP domain as it prevents one DC from hearing BPDUs from another DC. Each DC then elects its own root switch and has its own locally constrained tree topology. Additionally, any topology changes or reconvergence events are isolated to that DC.

#4 – Inter-DC Multicast is Supported

Most of the time when it comes to inter-DC traffic, the focus is on unicast traffic. However, multicast plays a big role in some data centers. OTV is able to handle inter-DC multicast traffic in much the same way as unicast traffic. The main difference is that instead of learning end station MAC addresses, it has to learn whether or not there are receivers on the local network for a particular multicast group. The OTV Edge Device does this by snooping IGMP Report messages. When the Edge Device sees that a specific end station is interested in a particular group, it advertises the {VLAN/multicast MAC/multicast group IP} triplet across the overlay, thus informing the other OTV devices that there are receivers in the given VLAN for that multicast group.

Now, regardless of where the source is, receivers in any of the data centers will be able to receive messages sent to the group.

Just to be clear: In the local site, the OTV Edge Device is directly connected to the VLAN(s) that are being extended. In this way, it’s able to snoop the IGMP Reports being sent by end stations. OTV is only interested in what the end stations are doing. It is not playing a part in extending a multicast core infrastructure between the DCs; it’s not replacing the (potential) need for PIM in the underlay network.
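One concrete manifestation of this: with a multicast-enabled core, the multicast traffic from the extended VLANs rides across the overlay in the SSM data groups defined on the overlay interface (the otv data-group range in the sketch back in fact #2), and the join interface joins those groups as an ordinary IGMPv3 host. A hedged sketch of the underlay side of that, with a placeholder interface name:

  ! Hypothetical underlay config on the join interface. PIM still
  ! runs in the Layer 3 core; OTV does not replace it.
  interface Ethernet2/1
    ip pim sparse-mode
    ip igmp version 3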

#5 – OTV Has Hardware Dependencies

OTV, being an encapsulation technology, must be done in hardware in order to get any reasonable sort of performance. That means there must be support in the switching ASICs for the protocol. And consequently, it means that you must be using switches/routers/line cards that have these ASICs (or the right kind of internal hardware architecture) to make it work.

Here’s a summarized list of what does and does not have OTV support:

  • Nexus 5500
    • No, and never will. Switching ASICs do not support OTV.
  • Nexus 6000
    • No, and never will. Switching ASICs do not support OTV.
  • Nexus 7000/7700
    • Yes.
    • M1/M2 series cards for the join interface
    • The forthcoming F3 series cards will also support OTV encap/decap and can be used as the join interface
  • ASR 1000
    • Yes. The Quantum Flow Processor in the ASR allows it to support OTV. Requires IOS-XE 3.5.0S and higher.

Take note that there are scale differences between the ASR and the n7k. Refer to the software release notes for the details.

12 thoughts on “Five Functional Facts about OTV”

  1. We are having problems with slow OTV performance. The topology is set up as:

    Main DC: Servers –> Nexus 5K –> ASR1K –> (Core) –> (WAN) –> DR site: ASR1K –> Nexus 5K –> servers (for replication)

    We have a 1 Gbps WAN connection, but when our server group does replication, the maximum speed over the OTV link is 60-70 Mbps on the WAN.

    Other configurations are:

    No Jumbo frames are allowed.

    Fragmentation is allowed at the join interface of both ASRs because of the 42-byte OTV header.

    We have full connectivity but the link is slow. What could be the possible solution?

    A response will be really appreciated.

    Thanks

    1. Hi arsalan,

      Just to get the basics out of the way: what’s the utilization like on your WAN between the sites? How about on your ASR and 5k interfaces, anything running hot? Any QoS in there anywhere that’s dropping things or shaping down this traffic?

      Just out of curiosity, what’s the CPU load on your ASRs when the replication is running?

  2. Hi, I have hopefully a simple question also. We are planning to use OTV between two ASR 1K routers at two of our sites. Our service provider uses MPLS to link our two sites, so we use VLAN tag encapsulation and subinterfaces on the WAN side of our ASRs. In all the examples of OTV configuration I have yet to see any using subinterfaces in the config with ASR routers for the WAN interface. Does this mean that my plan to use OTV across such a WAN network is not going to work?
    Thanks in Advance…Simon

    1. Hey Simon,

      I think the reason you don’t see subints in examples is because the authors just chose not to do it that way :)

      I did do a quick check and although I can’t find anywhere that explicitly says this is supported, I see config examples of it being done. If your local Cisco office has a lab, ask your SE if you can go in for a few hours to play on one of their ASRs. Then you’ll know for sure.

  3. http://www.netcraftsmen.com/otv-best-practices/

    I have a requirement where I have my OTV device (ASR 1004) connected to my aggregation layer (N7k) with a join interface and internal interface. The agg layer is connected to the core layer (N7k) via L3, and the core layer (N7k) is further connected to another core layer (VSS) via L2 over L3 (via trunk link). I have to extend my Layer 2 between the core VSS and the aggregation layer, so I have two options:

    1) Extending a direct trunk connection between VSS and Agg layer.

    2) Connecting the OTV ASR device via an internal interface to the core VSS. I am not sure whether this solution would work, as we already have another internal interface connected to the aggregation layer, hence I need your advice on this.

    1. Hi sunny, I’m having a hard time understanding your topology from reading the description. I’m more of a visual guy. There are a lot of nuances to designing something like this properly, so I recommend you get professional help to come in and work on this with you. You’re not going to get the advice you need via a blog or message board.

    1. Hi Miguel,

      No, there’s no pseudowires involved. The GRE over MPLS format is just using a GRE header and MPLS header in the data plane to convey OTV messages between the sites. There’s no actual MPLS running between the OTV edge devices. GRE and MPLS were just convenient formats in which to transport the data.

      Is that any clearer?

  4. Although you can fragment OTV on ASR, you never want to. Fragmentation is extremely CPU-intensive. On IOS XE, many features like IPSec and NAT have been optimized to use the multiple cores of the router. Fragmentation has not. It works today about the same as it did on a 2500-series router from two decades ago.

    In our tests, turning on fragmentation on an ISR 4451 router dropped OTV performance from >500 Mbps to ~50 Mbps.

    If you must do OTV fragmentation, we recommend the ASR 1001-X for speeds up to 100 Mbps and the ASR 1002-X for anything higher (it has a faster ESP).

    1. All good points. Definitely on the lower-end ISRs you’re going to see a severe performance hit. The ASR data plane is able to handle fragmentation without punting to the CPU; however, that does still result in a mild performance hit.
