It seems appropriate to write an FFF post about Virtual Extensible LAN (VXLAN) now, since VXLAN is the new hotness in the data center these days. VMware's NSX uses VXLAN (among other overlays) as a core part of its overall solution, and Cisco recently announced Application Centric Infrastructure (ACI) and the accompanying Nexus 9000 switch, both of which leverage VXLAN to deliver a network fabric. It seems inevitable that network engineers will have to use and understand VXLAN in the not-too-distant future.
As usual, this post is not meant to be an introduction to the technology; I assume you have at least a passing familiarity with VXLAN. Instead, I will jump right into 5 operational/technical/functional aspects of the protocol.
For more information on VXLAN, check out the draft at the IETF.
1 - VXLAN Use Cases
Despite the apparent ubiquity and fervent hype around VXLAN, it's actually been designed to solve specific problems. It has not been designed to be "everything to everyone".
The first, and most often cited, use case is for data center operators that require more than roughly 4000 logical partitions in the network. That figure is the maximum number of VLANs that can be created under the 802.1Q standard, which defines the VLAN ID as a 12-bit number. Typically, only the largest operators run into this limitation. Cloud service providers that have thousands of customers would naturally burn through this limit very easily. Enterprises are more likely to be comfortable with a 4000 VLAN limit as they require significantly fewer partitions in their networks.
VXLAN addresses the 4000 VLAN limitation by using a 24-bit identifier called the VXLAN Network Identifier (VNID). This allows for over 16 million logical segments.
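The arithmetic behind those two numbers is easy to check. Here's a quick sketch in Python; the variable names are my own, and the detail that two VLAN IDs (0 and 4095) are reserved comes from 802.1Q rather than from anything VXLAN-specific.

```python
# Size of the ID space for an 802.1Q VLAN tag versus a VXLAN segment ID.
VLAN_ID_BITS = 12
VNID_BITS = 24

vlan_ids = 2 ** VLAN_ID_BITS       # every value a 12-bit field can hold
usable_vlans = vlan_ids - 2        # IDs 0 and 4095 are reserved by 802.1Q
vxlan_segments = 2 ** VNID_BITS    # every value a 24-bit field can hold

print(f"VLAN IDs:       {vlan_ids} ({usable_vlans} usable)")
print(f"VXLAN segments: {vxlan_segments:,}")
print(f"Growth factor:  {vxlan_segments // vlan_ids}x")
```

Doubling the width of the identifier from 12 to 24 bits multiplies the number of segments by 2^12, which is where the jump from "about 4000" to "over 16 million" comes from.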
The second specific use case that VXLAN addresses is around the scalability of Layer 2 domains. Numerous data center applications and systems are (frustratingly) simple in their view of the world and require Layer 2 adjacency among an associated group of end devices. This necessitates stretching Layer 2 domains all across the data center in order to accommodate growth without breaking the Layer 2 adjacency requirement that these apps/services have.
Big Layer 2 domains come at a cost though. They create big broadcast domains and big failure domains. They also require the use of Spanning Tree. None of these things are desirable in the data center, for multiple reasons.
In a nutshell, what VXLAN does is enable Layer 2 adjacency across a routed (Layer 3) fabric. This has several advantages:
- Broadcast and failure domains are isolated
- The fabric does not depend on STP to converge the topology; instead, Layer 3 routing protocols are used
- No links within the fabric are blocked; all links are active and can carry traffic
- The fabric can load balance traffic across all active links, ensuring no bandwidth is sitting idle
This second use case is a lot like how I've described OTV with the difference being that VXLAN is intended for use inside the data center and OTV is (primarily) used between data centers.
2 - VXLAN Adds 50 Bytes of Overhead
As described before (in DCI Series: Overlay Transport Virtualization), the term overlay is synonymous with encapsulation. VXLAN adds a total of 50 bytes of encapsulation overhead to each packet. Here's the breakdown:
[ Transport_Ethernet (14 bytes) ] [ IP (20 bytes) ] [ UDP (8 bytes) ] [ VXLAN (8 bytes) ] [ Original_Ethernet_Frame ]
Note that the VXLAN header itself is quite small (at 8 bytes), but it requires adding an outer UDP, IP and Ethernet header in order to get the packet from one VXLAN Tunnel Endpoint (VTEP) to another.
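To make those 8 bytes concrete, here's a minimal sketch of packing and unpacking the VXLAN header in Python. The field layout (8 bits of flags, 24 reserved bits, the 24-bit VNID, 8 more reserved bits) is my reading of the IETF draft, and the function names are hypothetical; treat it as an illustration rather than a reference implementation.

```python
import struct

VXLAN_FLAG_VNID_VALID = 0x08  # the "I" flag: the VNID field carries a valid ID


def build_vxlan_header(vnid: int) -> bytes:
    """Pack the 8-byte VXLAN header for a 24-bit segment ID."""
    if not 0 <= vnid < 2 ** 24:
        raise ValueError("VNID must fit in 24 bits")
    # Word 1: 8 bits of flags followed by 24 reserved bits.
    # Word 2: 24 bits of VNID followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNID_VALID << 24, vnid << 8)


def parse_vnid(header: bytes) -> int:
    """Extract the VNID from the first 8 bytes of a VXLAN header."""
    _flags_word, vnid_word = struct.unpack("!II", header[:8])
    return vnid_word >> 8


header = build_vxlan_header(5000)
print(len(header), header.hex(), parse_vnid(header))
```

Everything else in the 50 bytes is ordinary Ethernet, IP and UDP; the only VXLAN-specific state on the wire is the flag and the segment ID.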
As mentioned with OTV, this kind of encapsulation requires jumbo frames in the data center fabric to enable full 1500 byte frames to be encapsulated. And if jumbo frames are being used by the end devices connected to the VXLAN, then the fabric MTU needs to accommodate the size of the jumbos plus the 50 bytes overhead.
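As a quick sanity check on the MTU math, the sketch below adds up the four outer headers and applies the result to a standard and a jumbo host MTU. It assumes an IPv4 outer header with no options and no 802.1Q tag on the transport frame; either of those would add more bytes. The constant and function names are mine.

```python
# Per-header overhead, in bytes, from the breakdown above.
OUTER_ETHERNET = 14
OUTER_IPV4 = 20      # assumes no IP options
OUTER_UDP = 8
VXLAN_HEADER = 8

VXLAN_OVERHEAD = OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER


def min_fabric_mtu(host_mtu: int) -> int:
    """Smallest fabric MTU that carries a full-size host frame unfragmented."""
    return host_mtu + VXLAN_OVERHEAD


for host_mtu in (1500, 9000):
    print(f"host MTU {host_mtu} -> fabric MTU of at least {min_fabric_mtu(host_mtu)}")
```

In practice most people simply set the fabric to the largest MTU the hardware supports rather than sizing it to the byte, but the calculation shows why a 1500-byte fabric is not enough.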
3 - VXLAN Does Not Have a Control Plane
If you read the IETF draft above, you'll notice that it defines the method by which VXLAN participants exchange VXLAN packets. It also defines the VXLAN packet format and how the levels of encapsulation work in order to get Layer 2 frames transported across a Layer 3 infrastructure. If you read carefully though, you'll notice that what's missing is any mention of an overarching protocol that operates between the VXLAN speakers, i.e. a control plane protocol.
A control plane protocol is responsible for driving the behavior of the data plane (what the VXLAN draft describes is the data plane protocol). Consider an IETF RFC that defines a control plane protocol such as RFC 4271 which defines the Border Gateway Protocol 4 (BGP). BGP is a control plane protocol which drives the behavior of the data plane, which in this case, is Internet Protocol (IP). BGP defines how speakers form relationships with one another, exchange topology information, and make best path decisions. Based on each of these processes, BGP then programs the data plane by inserting/removing forwarding information in the IP routing table. This action by BGP influences the behavior of the IP data plane: where it sends packets, what part(s) of the network are reachable, what links are preferred over others, and so on.
Without a control plane, how do VXLAN speakers understand the network topology? How do they build a forwarding table?
Well, without a control plane component, it's all up to the data plane. The data plane is responsible for performing address learning and understanding what hosts are "out there" on the network. Note however that each VXLAN speaker performs this function independently and does not share what it learns with its neighbors as that would be a function of a control plane.
A good reference for understanding data plane learning is actually Ethernet. Think about an Ethernet switch. It performs local address learning by storing a tuple of the input port, source MAC address and VLAN of any frame it receives. It stores this tuple in its forwarding table. However it does not advertise this information to its neighbors. Ethernet depends on each speaker to perform its own address learning.
On a bit of a side note, what this can lead to is the idea of "unknown unicast" frames, where a switch receives a frame destined to an address it doesn't have programmed in its forwarding table. It's possible a neighbor switch has knowledge of the destination, but again, it has no way to tell the first switch. In this case, the switch has no choice but to "flood" the frame out all ports in the VLAN and hope that the end device will eventually receive the frame.
The VXLAN data plane does address learning much like Ethernet: upon receipt of a VXLAN packet, the switch records the source VXLAN Tunnel Endpoint (VTEP) IP, the inner source MAC address, and the VXLAN Network ID (VNID) tuple in its forwarding table. The VXLAN speaker then knows that when it receives traffic in that VXLAN segment destined for that MAC address, it should encapsulate it in a VXLAN packet addressed to that particular VTEP.
In the case of an unknown unicast, the VXLAN speaker will "flood" the VXLAN packet by sending it to the multicast group associated with the VNID/VXLAN segment.
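The learn-then-forward behavior described above can be sketched in a few lines of Python. This is a toy model, not how any real VTEP is implemented: the class name, the method names, the addresses, and the VNID-to-multicast-group mapping are all made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Vtep:
    # Hypothetical static mapping of VNID -> multicast group used for flooding.
    flood_groups: dict[int, str]
    # Forwarding table: (VNID, inner MAC) -> IP of the remote VTEP.
    table: dict[tuple[int, str], str] = field(default_factory=dict)

    def learn(self, outer_src_ip: str, vnid: int, inner_src_mac: str) -> None:
        """Record which remote VTEP a MAC address was last seen behind."""
        self.table[(vnid, inner_src_mac)] = outer_src_ip

    def outer_destination(self, vnid: int, inner_dst_mac: str) -> str:
        """Return the known VTEP IP, or the segment's multicast group if unknown."""
        return self.table.get((vnid, inner_dst_mac), self.flood_groups[vnid])


vtep = Vtep(flood_groups={5000: "239.1.1.1"})

# Unknown unicast: nothing has been learned yet, so the packet is flooded.
before = vtep.outer_destination(5000, "00:00:00:aa:aa:aa")

# A VXLAN packet arrives from 10.0.0.2 carrying a frame sourced from that MAC.
vtep.learn("10.0.0.2", 5000, "00:00:00:aa:aa:aa")
after = vtep.outer_destination(5000, "00:00:00:aa:aa:aa")

print(before, "->", after)
```

Note that the table is keyed on the VNID as well as the MAC address, so the same MAC can exist in two segments without conflict, and that nothing learned here is ever shared with another VTEP.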
4 - VXLAN Does Not Define Security on the Overlay
The VXLAN draft is quite clear that it does not contain any security mechanisms nor provide any Confidentiality, Integrity, or Authentication (CIA) for VXLAN packets. Now on the flip side, neither does Ethernet. However in the case of Ethernet, in order for an attacker to inject frames into the Ethernet segment and cause mischief, they would have to be directly attached to that data link. This requirement creates a defense against a would-be attacker, as the network within the data center is usually quite physically secure and not likely to be an easy target for an attacker to attach to.
When adopting a MAC-in-IP scheme such as VXLAN, the accessibility of that network opens right up. Since the endpoints (VTEPs) on the network are addressable by IP, traffic can be directed towards them from anywhere else in the internetwork. Taken one step further, an inner Ethernet frame could be crafted to appear as though a legitimate server in the data center sent the packet thereby allowing the attacker to impersonate that server all without the attacker being anywhere near the data center.
Another, perhaps less likely, scenario would be for an attacker with a footprint in the data center network to issue IGMP messages which cause it to join the multicast group(s) used by specific VXLAN segment(s). This would allow the attacker to have full visibility of unknown unicast and broadcast messages sent on those segment(s).
Because VXLAN is still relatively new, I haven't seen a lot of discussion on how to protect VXLAN traffic from these issues (and, I'm sure, others), but here are some ideas I've come up with:
- Protect your VXLAN transport VLAN(s)/subnet(s). Make the VLAN dark; don't route it. If there are multiple transport VLANs, put them in their own VRF and don't leak those prefixes into VRFs where end users and non-infrastructure IP space is routed.
- If unable to perform the actions in the point above, employ Unicast Reverse Path Forwarding (uRPF) checks at the data center edge (at a minimum) to prevent an attacker outside the data center from spoofing a VXLAN packet from a valid VTEP IP address.
5 - VXLAN Packets Have Limited Entropy for ECMP/Hashing
Going back to #2 above and looking at the format of a VXLAN packet, we can see that it's IP and UDP that carry VXLAN across the network. If we go back to #1, we also see that a benefit of VXLAN is that we can use routed links in the fabric, which enables all links to be forwarding and carrying traffic. Now if we think of these two things together, we might start to wonder how the format of the packet would affect the equal cost load balancing and/or EtherChannel hashing algorithms between fabric nodes.
Within the underlying network that carries VXLAN traffic, the matrix of possible source IPs and destination IPs is quite small since the IPs in question are simply the IP addresses assigned to the VTEPs (and NOT the IPs belonging to the VMs and other end devices in the network). The entropy provided by those two header fields is further reduced because there's an N:1 relationship between VM/end device IPs and the VTEP IP which services them. It would be quite normal to have multiple (N) end devices all being served by a single (1) VTEP. As seen within the underlay network, all the traffic from these N hosts would have a source IP belonging to the VTEP. Same story for the destination IP.
If we look at the Layer 4 information, the UDP header that precedes the VXLAN header always has the same well-known destination port which IANA has assigned for VXLAN: 4789. The source UDP port would (hopefully!) be randomized by the sending VTEP.
So of the four fields normally checked as part of a load balancing hash (source IP, destination IP, source port, destination port), only the source port has any real degree of entropy. For that reason, it's critical to ensure the fabric links which carry VXLAN traffic are performing hashing based on Layer 3 and Layer 4 headers.
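Here's a rough simulation of that effect in Python. It is a sketch under several assumptions: the hash is SHA-256 truncated to 32 bits (real switches use their own hardware hash functions), the sending VTEP derives the UDP source port from a hash of the inner frame's addresses (one way of getting the "randomized" source port mentioned above), and the addresses, link count and function names are all invented for the example.

```python
import hashlib

NUM_LINKS = 4          # equal-cost links between two fabric nodes (made up)
VXLAN_DST_PORT = 4789  # IANA-assigned destination port for VXLAN


def stable_hash(*fields) -> int:
    """Deterministic 32-bit hash over a tuple of header fields."""
    digest = hashlib.sha256("|".join(map(str, fields)).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def source_port(inner_src_mac: str, inner_dst_mac: str) -> int:
    """Derive the outer UDP source port from the inner frame (range 49152-65535)."""
    return 49152 + stable_hash(inner_src_mac, inner_dst_mac) % 16384


def pick_link(*header_fields) -> int:
    """Choose an egress link from whichever header fields the hash is given."""
    return stable_hash(*header_fields) % NUM_LINKS


# 64 different VM-to-VM conversations, all between the same pair of VTEPs.
src_vtep, dst_vtep = "10.0.0.1", "10.0.0.2"
inner_flows = [(f"00:00:00:00:00:{i:02x}", "00:00:00:00:01:00") for i in range(64)]

l3_only = {pick_link(src_vtep, dst_vtep) for _ in inner_flows}
l3_l4 = {
    pick_link(src_vtep, dst_vtep, source_port(src, dst), VXLAN_DST_PORT)
    for src, dst in inner_flows
}

print("links used when hashing on L3 only:  ", len(l3_only))
print("links used when hashing on L3 and L4:", len(l3_l4))
```

With an L3-only hash, every one of the 64 conversations presents the same source and destination IP, so they all land on a single link. Once the source port is included in the hash, the flows spread across the available links.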
By default, Cisco Express Forwarding (CEF) does not utilize Layer 4 port information when making equal cost forwarding decisions.
I hope the information in this post was useful. Please keep in mind that VXLAN is evolving and that one day, parts or maybe all of this post will be out of date. I will do my best to update this post and point out parts that are no longer accurate.
Disclaimer: The opinions and information expressed in this blog article are my own and not necessarily those of Cisco Systems.