Five Functional Facts about OTV

Following on from my previous “triple-F” article (Five Functional Facts about FabricPath), I thought I would apply the same concept to the topic of Overlay Transport Virtualization (OTV). This post won't cover much of OTV's foundational concepts; instead, it dives right into how the protocol actually functions in practice. A reasonable introduction to OTV can be found in my series on Data Center Interconnects.

So without any more preamble, here are five functional facts about OTV.

#1 – OTV Adds 42 Bytes of Overhead

OTV, being an encapsulation technology, adds additional headers to the encapsulated payload. Without rehashing too much of the basics, OTV extends a Layer 2 domain across a Layer 3 cloud. In order to preserve the Layer 2 semantics on either side of the cloud, OTV scoops up the entire Layer 2 frame on one side, transports it across the cloud in the middle, and puts it on the LAN on the other side. This preserves the entire Ethernet header including the original source/destination MACs, and even the CoS bits and VLAN tag.

So to begin with, we’re putting a (potentially) full-sized Ethernet frame – with headers – inside another Ethernet frame. That alone will grow the size of the packets that get sent across the cloud.

But on top of that, OTV needs to add some of its own information to the packet so that the remote OTV edge device knows, among other things, which VLAN the encapsulated packet should be put onto (OTV strips the 802.1Q bits from the original packet, if present). There also needs to be an IP header put on the front of the whole packet that will get it safely across the Layer 3 cloud.

Here’s where we muddy the waters a little bit. There are actually two encapsulation formats used by OTV on the wire. First, there’s the UDP-based format as outlined in the “hasmit” draft. You can think of this as the “standards track” format. Second, there’s the actual-in-use-today format of GRE over MPLS. If you haven’t read the analysis of OTV by Brian McGahan at INE, go and read it now. It very concisely breaks down what this format looks like on the wire.

Why two formats? Well, native OTV encapsulation support has not been baked into the current generation of ASICs found on the M1/M2 line cards on the Nexus 7000 (the Nexus 7k is being heavily positioned as the box where you put your Layer 2/Layer 3 boundary, which is exactly where you do OTV). Cisco Engineering had a choice: wait until new ASICs were developed and shipped, or bring OTV to life using encapsulation formats that the currently shipping ASICs already support. They chose the latter and came up with the GRE over MPLS format.

At least when it comes to the amount of overhead imposed, there’s no ambiguity: both formats add a total of 42 bytes of overhead to the original packet as it’s sent between the OTV edge devices. Here’s the breakdown for both formats:

GREoMPLS:

[OuterEth=14] + [IP=20] + [GRE=4] + [MPLS=4] + [InnerEth=variable] = 42 bytes of overhead

UDP:

[OuterEth=14] + [IP=20] + [UDP/OTV shim=8] + [InnerEth=variable] = 42 bytes of overhead

Here’s the point. OTV adds 42 bytes on top of the original frame, but who cares? Well, here are two very important data points:

  1. The hasmit draft says, “The addition of OTV encapsulation headers increases the size of an L2 packet received on an internal interface such that the core uplinks on the Edge Device as well as the routers in the core need to support an appropriately larger MTU. OTV encapsulated packets must not get fragmented as they traverse the core, and hence the IP header is marked to not fragment by the Edge Device.”
  2. Fragmentation and reassembly are not available on the Nexus 7000.

In other words, if OTV scoops up a full-sized, 1500-byte frame at one site and adds 42 bytes of overhead to make a 1542-byte frame, then the network between that OTV edge device and the edge device at the far site must support an MTU of at least 1542 bytes.

Now as with most things, there is an exception. The ASR 1000 has support for fragmenting OTV packets. Enabling this capability should be done with caution: it violates the OTV hasmit draft and will immediately break interop with Nexus 7000 edge devices.

Keep in mind that when I say “the network between the edge devices”, that could include part of the data center network too and not just the WAN/DCI. Every port along every possible path between the edge devices must support a minimum MTU of 1542 bytes.
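To make that concrete, here is a minimal sketch of what the edge device side might look like on a Nexus 7000, assuming a hypothetical M-series join interface named Ethernet2/1 (the interface name and the exact MTU value are placeholders; many shops simply run jumbo frames end to end rather than sizing to exactly 1542):

  ! Hypothetical NX-OS sketch. The same minimum MTU must also be
  ! configured on every core-facing port along every possible path
  ! between the OTV edge devices.
  interface Ethernet2/1
    description OTV join interface
    mtu 1542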

#2 – Address Learning is a Control Plane Function

Since we’re talking about Layer 2/Ethernet networks when it comes to OTV, it would be understandable to assume that the OTV Edge Devices perform address learning the same way as Ethernet switches. However, that’s not the case.

The Edge Device that first sees a packet from a local device in an extended VLAN will advertise that device’s MAC address to other OTV devices by using the OTV control plane protocol, IS-IS for Layer 2. This works just like a routing protocol such as EIGRP or OSPF picking up a locally attached subnet and advertising it throughout the network. The other devices participating in the protocol learn of the presence of the subnet (or MAC) and can then direct traffic towards it.

This is an important point: it’s only once the local Edge Device has advertised the learned MAC address to the other OTV devices that those other devices can start directing traffic towards it. Think of it like this: a router running OSPF cannot forward a packet unless there’s a match for the destination address in its forwarding table. Just like in that example, OTV does not forward frames across the overlay unless there’s a match for the destination MAC in its forwarding table (i.e., it has learned the MAC via Layer 2 IS-IS). Unlike Ethernet bridging, OTV does not flood frames for which it doesn’t have a forwarding table entry.

This has some implications. Namely, if an end device is silent and does not emit any traffic, the local Edge Device will not learn its MAC and will not advertise it to other OTV nodes. This results in the device being unreachable from remote OTV-enabled locations. Cases of this happening are rare (syslog servers and perhaps NetFlow collectors are two examples I can think of) but it’s important to understand.

Now, having said that, there is, of course, an exception. In NX-OS 6.2(2) and up on the Nexus 7000 there is support for selectively flooding unknown unicast MACs. This is meant to address the case I explained above where there are silent hosts in the network.
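The knob takes the form of a per-MAC, per-VLAN flood entry configured on the edge device. A sketch, with a placeholder MAC address and VLAN:

  ! Hypothetical example: always flood frames destined to this
  ! silent host's MAC across the overlay in VLAN 100.
  otv flood mac 0050.56aa.0001 vlan 100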

Lastly, with respect to Layer 2 IS-IS, it’s a native part of OTV and is automatically enabled as part of enabling OTV. There’s no explicit configuration that you need to perform. Really, you don’t even have to understand how IS-IS works if you don’t want to as its operation is really “under the covers” here.
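To illustrate just how hands-off it is, here is a bare-bones sketch of a Nexus 7000 OTV configuration over a multicast-enabled core (the site identifier, interface names, VLANs, and group addresses are all made up). Notice that not a single line of it mentions IS-IS:

  feature otv
  otv site-identifier 0x1
  otv site-vlan 99
  interface Overlay1
    otv join-interface Ethernet2/1
    otv control-group 239.1.1.1
    otv data-group 232.1.1.0/28
    otv extend-vlan 100-110
    no shutdown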

#3 – STP is Filtered

I’ve written before about the risks of fate sharing between data centers. One way to keep the failure domains isolated is to ensure the data centers are not part of one giant Spanning Tree domain. Such a shared domain has implications for traffic forwarding (traffic has to be forwarded along the tree, and that tree now extends between DCs), but also for network convergence and scalability.

In order to address these issues, OTV natively blocks STP BPDUs from crossing the overlay. This effectively partitions the STP domain as it prevents one DC from hearing BPDUs from another DC. Each DC then elects its own root switch and has its own locally constrained tree topology. Additionally, any topology changes or reconvergence events are isolated to that DC.

#4 – Inter-DC Multicast is Supported

Most of the time when it comes to inter-DC traffic, the focus is on unicast traffic. However, multicast plays a big role in some data centers. OTV is able to handle inter-DC multicast traffic in much the same way as unicast traffic. The main difference is that instead of learning end station MAC addresses, it has to learn whether or not there are receivers on the local network for a particular multicast group. The OTV Edge Device does this by snooping IGMP Report messages. When the Edge Device sees that a specific end station is interested in a particular group, it advertises the {VLAN/multicast MAC/multicast group IP} triplet across the overlay, thus informing the other OTV devices that there are receivers in the given VLAN for that multicast group.

Now, regardless of where the source is, receivers in any of the data centers will be able to receive messages sent to the group.

Just to be clear: In the local site, the OTV Edge Device is directly connected to the VLAN(s) that are being extended. In this way, it’s able to snoop the IGMP Reports being sent by end stations. OTV is only interested in what the end stations are doing. It is not playing a part in extending a multicast core infrastructure between the DCs; it’s not replacing the (potential) need for PIM in the underlay network.
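One concrete manifestation of this: with a multicast-enabled core, the multicast traffic from the extended VLANs rides across the overlay in the SSM data groups defined on the overlay interface (the otv data-group range in the sketch back in fact #2), and the join interface joins those groups as an ordinary IGMPv3 host. A hedged sketch of the underlay side of that, with a placeholder interface name:

  ! Hypothetical underlay config on the join interface. PIM still
  ! runs in the Layer 3 core; OTV does not replace it.
  interface Ethernet2/1
    ip pim sparse-mode
    ip igmp version 3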

#5 – OTV Has Hardware Dependencies

OTV, being an encapsulation technology, must be done in hardware in order to get any reasonable sort of performance. That means there must be support in the switching ASICs for the protocol. And consequently, it means that you must be using switches/routers/line cards that have these ASICs (or the right kind of internal hardware architecture) to make it work.

Here’s a summarized list of what does and does not have OTV support:

  • Nexus 5500
    • No, and never will. Switching ASICs do not support OTV.
  • Nexus 6000
    • No, and never will. Switching ASICs do not support OTV.
  • Nexus 7000/7700
    • Yes.
    • M1/M2 series cards for the join interface
    • The forthcoming F3 series cards will also support OTV encap/decap and can be used as the join interface
  • ASR 1000
    • Yes. The Quantum Flow Processor in the ASR allows it to support OTV. Requires IOS-XE 3.5.0S and higher.

Take note that there are scale differences between the ASR and the n7k. Refer to the software release notes for the details.

12 thoughts on “Five Functional Facts about OTV”

  1. We are having problems with slow OTV performance. The topology is set up as:

    Main DC: Servers –> Nexus 5K –> ASR1K –> (Core) –> (WAN) –> DR site: ASR1K –> Nexus 5K –> servers (for replication)

    We have a 1 Gbps WAN connection, but when our server group does replication, the maximum speed over the OTV link is 60-70 Mbps on the WAN.

    Other configurations are:

    No Jumbo frames are allowed.

    Fragmentation is allowed at the join interface of both ASRs because of the 42-byte OTV header.

    We have full connectivity but the link is slow. What could be the possible solution?

    A response will be really appreciated.

    Thanks

    1. Hi arsalan,

      Just to get the basics out of the way: what’s the utilization like on your WAN between the sites? How about on your ASR and 5k interfaces, anything running hot? Any QoS in there anywhere that’s dropping things or shaping down this traffic?

      Just out of curiosity, what’s the CPU load on your ASRs when the replication is running?

  2. Hi, I have hopefully a simple question also. We are planning to use OTV between two ASR 1K routers at two of our sites. Our service provider uses MPLS to link our two sites, so we use VLAN tag encapsulation and subinterfaces on the WAN side of our ASRs. In all the examples of OTV configuration I have yet to see any using subinterfaces in the config with ASR routers for the WAN interface. Does this mean that my plan to use OTV across such a WAN network is not going to work?
    Thanks in Advance…Simon

    1. Hey Simon,

      I think the reason you don’t see subints in examples is because the authors just chose not to do it that way :)

      I did do a quick check and although I can’t find anywhere that explicitly says this is supported, I see config examples of it being done. If your local Cisco office has a lab, ask your SE if you can go in for a few hours to play on one of their ASRs. Then you’ll know for sure.

  3. http://www.netcraftsmen.com/otv-best-practices/

    I have a requirement where I have my OTV device (ASR 1004) connected to my aggregation layer (N7k) with a join interface and internal interface. The agg layer is connected to the core layer (N7k) via L3, and the core layer (N7k) is further connected to another core layer (VSS) via L2 over L3 (via trunk link). I have to extend my Layer 2 between the core VSS and the aggregation layer, so I have two options:

    1) Extending a direct trunk connection between VSS and Agg layer.

    2) Connecting the OTV ASR device via an internal interface to the core VSS. I am not sure whether this solution would work, as we already have another internal interface connected to the aggregation layer, hence I need your advice on this.

    1. Hi sunny, I’m having a hard time understanding your topology from reading the description. I’m more of a visual guy. There are a lot of nuances to designing something like this properly, so I recommend you get professional help to come in and work on this with you. You’re not going to get the advice you need via a blog or message board.

    1. Hi Miguel,

      No, there’s no pseudowires involved. The GRE over MPLS format is just using a GRE header and MPLS header in the data plane to convey OTV messages between the sites. There’s no actual MPLS running between the OTV edge devices. GRE and MPLS were just convenient formats in which to transport the data.

      Is that any clearer?

  4. Although you can fragment OTV on ASR, you never want to. Fragmentation is extremely CPU-intensive. On IOS XE, many features like IPSec and NAT have been optimized to use the multiple cores of the router. Fragmentation has not. It works today about the same as it did on a 2500-series router from two decades ago.

    In our tests, turning on fragmentation on an ISR 4451 router dropped OTV performance from >500 Mbps to ~50 Mbps.

    If you must do OTV fragmentation, we recommend the ASR 1001-X for speeds up to 100 Mbps and the ASR 1002-X for anything higher (it has a faster ESP).

    1. All good points. Definitely on the lower-end ISRs you’re going to see a severe performance hit. The ASR data plane is able to handle fragmentation without punting to the CPU; however, that does still result in a mild performance hit.
