There’s certainly a lot of focus on data center interconnection (DCI) right now. And understandably so, since many trends in the industry are making IT organizations look at data center redundancy. Among these trends are:
- The business is telling IT that it requires its IT services to be available at all times. In effect, the business is saying that it wants to be shielded from technology issues, maintenance windows, and unplanned downtime, because the IT services it consumes (not all of them, mind you, but certainly some of them) are so critical to running the business that it cannot be without them — or at least cannot be without them for however long it would take IT to recover the service.
- The technical ability to move workloads between sites thanks to the near ubiquity of features like vMotion and Live Migration. The ability to pick up an application and swing it over to another location makes item #1 above far less daunting to IT shops and lowers the barrier to adoption.
In this post I’m going to talk about how IT can address item #1 above – the business need – in a manner that does not introduce hidden risk into the environment. This is a common conversation that a lot of IT organizations are having right now but unfortunately the easiest and most obvious outcome from those conversations is not always the one with the least risk.
In the second post of this DCI mini series, I’ll talk more about item #2 since that’s the one that drives a lot of the technical requirements that have to be met when delivering the overall solution to address #1.
Why Do We Need This?
Why do we need more than one data center at all? Isn’t one professionally run, highly redundant facility enough?
For some businesses, no, one is not enough. As I mentioned above, many businesses cannot accept the risk of having certain IT systems and services go offline. IT is so critical to business operations now that the impact of those services not being there can mean real money lost, real brand damage, and possibly very real health and safety issues.
Even though a single data center can have many redundant systems; even though the IT infrastructure inside the data center can be architected for high availability; even though many precautions can be taken to ensure maximum uptime of IT services, there are some risks that cannot be controlled with just a single data center. Natural disasters, pandemics, civil issues, massive facility failure. The list goes on. And unlikely as these situations may be, they do happen and the damage to a business can be irrecoverable.
So the answer to why we need more than one data center comes down to a business decision to manage the risk of a large-scale issue affecting the single data center and thereby affecting the business as a whole.
Now, in order to return value to the business from these data centers, they obviously must be capable of running the services that the business is depending on. The plumbing that makes that possible (at a foundational layer) is the network.
So if the objective of having more than one data center is to manage risk, the question has to be asked: how is the network helping achieve this? With multiple data centers, the areas of power, cooling, and facilities in general are handled implicitly – simply by having a second location you get redundancy and diversity in those systems (I know that this redundancy doesn’t necessarily come automatically and can require a lot of planning, but just go with it). Areas such as storage, compute, management and other infrastructure are handled explicitly through the purchase and configuration of additional storage subsystems, servers, appliances, etc.
All of these components are discrete within each site. The one component, however, that is not discrete is the network. By necessity, the network must touch all sites and it must be common to all sites in order to facilitate communication between them.
If we cannot break the network into discrete parts then how can we manage the risk of an issue affecting “The One Network” and taking out both data centers?
The key is to understand the concept of failure domains within the network and then manage them.
Defining the Failure Domain
In order to manage failure domains we first have to be able to identify one. From a network perspective, defining a failure domain is pretty simple.
The failure domain of the network has the same breadth and reach as the Layer 2 domain. That is to say, if a VLAN spans from A to B, the failure domain encompasses A and B. If a VLAN spans from A to B to C to D to E, the failure domain encompasses all of those devices. Every switch, router, firewall, hypervisor host, and end device that touches a common VLAN shares the same failure domain (Layer 2 switches and end hosts sit 100% within the domain, while routers and firewalls have at least one leg in it).
Why is this a failure domain? Well, a failure domain is the boundary within which a failure or disruption will be contained. From another perspective, it’s also the domain within which all entities will be affected by that failure or disruption. When we consider Ethernet and the kinds of disruptions that can occur, they are all bounded within the Layer 2 domain. And this makes sense: Ethernet is a Layer 2 protocol and cannot have any influence outside of the Layer 2 domain.
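The definition above is mechanical enough to sketch in code: devices that touch a common VLAN, directly or transitively, land in the same failure domain. The following is a minimal illustration, not a real discovery tool; the device names and VLAN assignments are hypothetical, and note how stretching VLAN 10 into the second data center pulls both sites into a single domain.

```python
from collections import defaultdict

# Hypothetical inventory: device -> set of VLANs it touches.
# VLAN 10 is stretched between data centers in this example.
device_vlans = {
    "dc1-switch1": {10, 20},
    "dc1-host1":   {10},
    "dc1-router":  {20, 30},   # one leg in VLAN 20, one in VLAN 30
    "dc2-switch1": {10},       # VLAN 10 stretched to DC2
    "dc2-host1":   {10},
}

def failure_domains(device_vlans):
    """Group devices into failure domains: devices sharing any VLAN
    (directly or transitively) end up in the same domain."""
    vlan_devices = defaultdict(set)
    for dev, vlans in device_vlans.items():
        for vlan in vlans:
            vlan_devices[vlan].add(dev)

    # Walk the device/VLAN graph; each connected component is a domain.
    seen, domains = set(), []
    for start in device_vlans:
        if start in seen:
            continue
        domain, queue = set(), [start]
        while queue:
            dev = queue.pop()
            if dev in seen:
                continue
            seen.add(dev)
            domain.add(dev)
            for vlan in device_vlans[dev]:
                queue.extend(vlan_devices[vlan] - seen)
        domains.append(domain)
    return domains

print(failure_domains(device_vlans))  # one domain containing all five devices
```

Because VLAN 10 reaches both sites, the walk returns a single domain covering every device in both data centers — exactly the situation the rest of this post argues against.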
If we consider common disruptors that affect an Ethernet network, such as broadcast floods, high levels of unknown unicast flooding, and topology loops, we realize that each of them is indeed limited to the Layer 2 domain in which it originates.
These disruptors have the ability to greatly affect all devices in a Layer 2 domain. In the worst case, if the Layer 2 domain is stretched between data centers and a disruption occurs in the network, all the time and expense that was put into isolating failure domains from a power, storage, compute, and application perspective will have been lost. The negative effect on the network will be felt globally, across both data centers, and ultimately by the applications and services that the data centers are supposed to be providing to the business in a resilient and highly available manner.
Managing Failure Domains
As most network engineers learn early on, the way you break up Layer 2 domains is with a router. Routers or other Layer 3 devices break up Layer 2 domains. They isolate the effects of broadcasts, flooding, and Layer 2 loops. Put simply, connecting data centers together over routed links is exactly how we create discrete networks within each data center: by creating separate islands of Layer 2 domains joined via a routed interconnect.
I say this quite confidently because of the largest and most well-known network on the planet that uses this very model: the Internet. Thousands of disparate networks all connected together using routed links. And it works. If a network somewhere has a meltdown, the rest of the global network keeps humming. It’s a proven model on the largest scale. And it’s a model that should be employed as part of a DCI strategy in order to manage the risk of a global network disruption within an organization’s own private network.
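The effect of a routed interconnect can be sketched with a toy model. Below, links are tagged as either Layer 2 (a shared VLAN on both ends) or Layer 3 (routed); only L2 links merge failure domains. The topology and names are hypothetical — the point is the contrast between a stretched VLAN and a routed DCI over the same four devices.

```python
def l2_domains(devices, links):
    """Union devices joined by L2 links; L3 links do not merge domains."""
    parent = {d: d for d in devices}

    def find(d):
        # Walk up to the root of d's group (with path compression).
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for a, b, kind in links:
        if kind == "l2":
            parent[find(a)] = find(b)

    domains = {}
    for d in devices:
        domains.setdefault(find(d), set()).add(d)
    return list(domains.values())

devices = ["dc1-host", "dc1-core", "dc2-core", "dc2-host"]

stretched = [("dc1-host", "dc1-core", "l2"),
             ("dc1-core", "dc2-core", "l2"),   # VLAN stretched across the DCI
             ("dc2-core", "dc2-host", "l2")]

routed    = [("dc1-host", "dc1-core", "l2"),
             ("dc1-core", "dc2-core", "l3"),   # routed interconnect
             ("dc2-core", "dc2-host", "l2")]

print(len(l2_domains(devices, stretched)))  # 1 — one global failure domain
print(len(l2_domains(devices, routed)))     # 2 — one discrete domain per site
```

Swapping the single DCI link from L2 to L3 is the only change between the two topologies, yet it is enough to turn one global failure domain into two discrete ones — which is the whole argument for routed interconnects.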
Now Having Said all That…
Having said that, it’s only part of the story. I just advocated against extending Layer 2 domains between data centers; however, there is a very real demand for doing just that. In my next post I’ll talk about what that need is and how it can be reconciled with the seemingly contradictory design of connecting data centers with routed links.