In the previous article in this DCI series (Why is there a “Wrong Way” to Interconnect Datacenters?) I explained the business case for having multiple data centers and then closed by warning that extending Layer 2 domains was a path to disaster and undermined the resiliency of having two data centers.
Why, then, is stretching Layer 2 a) needed and b) a go-to maneuver for DCI?
Let’s look at it from two points of view: technology and business.
The Technology Perspective
From a technology point of view, the need for extending Layer 2 between data centers is motivated by a few factors.
First, it’s easy and well understood. From a network administrator’s point of view, adding a second data center is often looked at just like adding more switches within the existing data center. We connect switches within the data center using Layer 2 links, so it’s natural to do the same for Data Center Interconnect (DCI). This preserves the same troubleshooting and operational semantics within the network that everyone is used to. It’s a well-understood model with low cost and a very low learning curve.
Second, stretching Layer 2 makes the server and application administrators happy. Servers can be placed in either data center without concern for what subnet they will live in – it’s just one big subnet. It also enables stretched clusters (which I’m NOT advocating for, merely pointing out that they are a real-world motivator) and live moves of virtual machines using technologies such as vMotion and Live Migration. Server administrators feel like they gain a lot of flexibility by having the same subnet in both sites. Additionally, they get this flexibility without having to interact too closely with the network team :-) The network “just works”.
Third, and this one is considered less often than the first two, there are stateful services such as firewalls and load balancers. In order to facilitate live migrations, active client sessions need to continue to function, which means network traffic for those sessions needs to continue to flow along the same north/south path and through the same firewalls/load balancers. Having the same Layer 2 domain in both sites means that north/south traffic path can be maintained for existing flows.
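To make that session-state problem concrete, here's a toy Python model of a stateful firewall. This is purely illustrative (the `StatefulFirewall` class and the flow tuple are my inventions, not real firewall code): return traffic only passes through a device that has already seen the session, which is why a migrated VM's traffic must keep flowing through the original site's firewall.

```python
# Toy model of why live migration needs the same north/south path:
# a stateful firewall only passes return traffic for sessions it has seen.

class StatefulFirewall:
    def __init__(self, name):
        self.name = name
        self.sessions = set()  # session table built from client-initiated traffic

    def outbound(self, flow):
        # Client-initiated traffic creates session state on THIS device only.
        self.sessions.add(flow)

    def inbound(self, flow):
        # Return traffic is permitted only for sessions this device knows about.
        return flow in self.sessions


fw_site_a = StatefulFirewall("site-a")
fw_site_b = StatefulFirewall("site-b")

# A client session established through site A's firewall (addresses are examples).
flow = ("203.0.113.10:51514", "10.1.10.25:443")
fw_site_a.outbound(flow)

print(fw_site_a.inbound(flow))  # True: site A holds the session state
print(fw_site_b.inbound(flow))  # False: if traffic shifts to site B's
                                # firewall mid-session, the flow is dropped
```

The asymmetry is the whole point: moving the VM is easy, but the session state stays pinned to the firewall that built it, so the traffic path has to stay pinned too.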
So that’s the technical point of view. Network administrators like it because it’s easy to understand and just an extension of what they already do, and server/application administrators like it because it gives them greater flexibility in placing their workloads.
The Business Perspective
In my experience, when considering the requirements around DCI, the business perspective is rarely, if ever, considered. The IT administrators usually gravitate to enabling Layer 2 connectivity and hot moves, but does the business even require that?
vMotion and Live Migration have become a sort of “easy button” for IT administrators. With a click, VMs can be moved from site to site. Who wouldn’t want that? What can easily be overlooked is that this capability is not free. The underlying infrastructure must be built to support it, which incurs capital costs and human resource costs, and of course consumes valuable time. Given this, it’s pretty important to understand whether this capability is even required.
As the consumer of the services that live in the data center, the business should be consulted to understand what sort of tolerance there is for application/service downtime. The difference between a requirement of “zero downtime” and “no more than 20 minutes” can have a huge impact on the design of the infrastructure and may even eliminate the need for stretching Layer 2 domains.
A business acceptance of greater-than-zero downtime would put cold migrations into play as a possible solution. Since active client sessions are not maintained during cold moves, the VM’s IP address doesn’t need to remain the same and the north/south traffic flows do not need to be maintained. The VM can pop onto the network in a different site with an IP address from that site’s assigned IP range. DNS and/or load balancer modifications can then redirect clients to the server’s new IP. Technologies such as the Locator/ID Separation Protocol (LISP) also enable this sort of cold move quite seamlessly.
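As a rough sketch of the addressing side of a cold move, here's a minimal Python example. Everything in it is hypothetical (the site names, the per-site subnets, and the `cold_move_dns_change` helper are illustrative assumptions, not a real API): the point is simply that the moved VM takes an address from the destination site's range, and the cutover is expressed as an A-record change rather than a stretched subnet.

```python
import ipaddress

# Hypothetical per-site address plan: each site owns its own subnet,
# so a cold-moved VM must take an address from the destination site's range.
SITE_SUBNETS = {
    "dc-east": ipaddress.ip_network("10.1.10.0/24"),
    "dc-west": ipaddress.ip_network("10.2.10.0/24"),
}

def cold_move_dns_change(hostname, old_ip, dest_site, host_id):
    """Return the DNS A-record update needed after a cold move.

    The VM powers up in dest_site with an address from that site's
    subnet; clients are redirected by updating the A record. Active
    sessions are lost, which is why this requires a downtime window.
    """
    dest_net = SITE_SUBNETS[dest_site]
    new_ip = dest_net.network_address + host_id
    if new_ip not in dest_net:
        raise ValueError("host id falls outside the destination subnet")
    return {
        "record": hostname,
        "type": "A",
        "old_value": old_ip,
        "new_value": str(new_ip),
    }


change = cold_move_dns_change("app01.example.com", "10.1.10.25", "dc-west", 25)
print(change["new_value"])  # 10.2.10.25 -- an address from dc-west's range
```

In a real environment the same idea shows up as a DNS dynamic update or a load balancer pool change, plus a low TTL on the record ahead of the move so clients pick up the new address quickly.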
The point I’m trying to make here is that the true need for stretched Layer 2 should be understood via consultation with the business users before time and money are put into designing a stretched Layer 2 environment. Not only could time and money be saved, but the risk of stretching Layer 2 between data centers can also be avoided.
What’s the Point?
OK, good question. What am I trying to say here?
First, it’s to outline where the need for stretched Layer 2 comes from. In the previous article of this DCI series, I made the case that stretched L2 is risky and shouldn’t be done. In future articles, I will explain some options for stretching L2 safely. I needed to bridge the gap and explain why we need to do all this in the first place. Basically: we’re familiar with bridging Ethernet, which makes it a go-to solution, and the app/server people want hot migrations of workloads.
Secondly, I wanted to highlight the fact that not every application needs to be hot migrated. We tend to start our DCI designs with that in mind when really it should be considered the platinum-with-gold-plating-and-carbon-fiber-moulding solution. Hot migrations should not be the default: they’re not free and there are consequences. If they’re really needed, go to it (and stay tuned for my future articles on how to do it with less risk than bridging), but if they’re not, consider all your options.