Anyone who uses Facebook or its associated services will have noticed the major outage that happened last Monday. It turns out that Facebook’s “backbone” was the cause of the problems. As a result, billions of people were unable to access their newsfeed.
But how can Facebook, one of the world’s biggest companies, experience an outage that shuts down not one, not two, but ALL of their services? Other than Facebook, affected services included Instagram, WhatsApp, Facebook Messenger, Oculus and even Facebook’s own internal network.
According to Facebook Engineer Santosh Janardhan, the backbone is “the network Facebook has built to connect all [of their] computing facilities together, which consists of tens of thousands of miles of fiber-optic cables crossing the globe”.
Managing infrastructure to serve billions of people globally is by no means a trivial task. In fact, it is such a monumental task that Facebook’s Engineering Team manages and constructs their own infrastructure and data centers.
You, the user, can see the result of this massive investment – no matter where in the world you are, you can see anyone’s profile, write them a message, and get your newsfeed delivered within seconds. The fact that Facebook is able to collect a ton of data about you and everyone you know, and make this data available across the globe, is the basis of how it makes revenue.
“[A] command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally.” (Santosh Janardhan, Facebook Engineer)
It becomes clear that availability is Facebook’s biggest concern. If data is not flowing and ads are not showing, Facebook is not earning revenue. And yet, we just experienced total darkness in all their services for several hours.
In 2020, Facebook made about $85 billion USD in total revenue. Based on this number, it is estimated that Facebook’s direct losses due to this outage exceed $87 million USD.
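To get a feel for numbers of this size, annual revenue can be spread evenly over the hours of a year. The sketch below does exactly that. The even spread and the outage durations passed in are assumptions made for illustration, not figures published by Facebook.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def estimated_revenue_loss(annual_revenue_usd: float, outage_hours: float) -> float:
    """Estimate lost revenue, assuming revenue accrues evenly over the year."""
    return annual_revenue_usd / HOURS_PER_YEAR * outage_hours


if __name__ == "__main__":
    annual_revenue = 85e9  # about $85 billion, the 2020 figure quoted above
    for hours in (1, 6, 9):  # illustrative outage durations (assumptions)
        loss = estimated_revenue_loss(annual_revenue, hours)
        print(f"{hours} h outage: about ${loss / 1e6:.1f} million")
```

Under this simple model, every hour of total darkness corresponds to roughly $9.7 million in revenue. Real losses depend on the time of day, the regions affected and how much ad spend is recovered later, so this is only a rough order of magnitude.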
Global availability, while complicated, is a well-understood problem. Huge internet companies have been around for a while, and have evolved the cloud to be able to provide software to the masses. Availability has been a concern for these companies from the very beginning, which is why certain standards have been established in the industry.
For example, it is possible to segment your infrastructure into several availability zones. Availability zones are independent of one another and can span several geographical regions. The idea is that if one availability zone is affected by an outage, another one could step in and serve the software as usual, with minimal impact to the end-user.
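The failover idea can be sketched in a few lines. This is a minimal illustration, not how any particular cloud provider routes traffic: the `Zone` class, the zone names and the `healthy` flag are all hypothetical, and in practice the health status would come from monitoring rather than a hard-coded value.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Zone:
    name: str      # hypothetical zone name
    healthy: bool  # would normally come from an external health check


def pick_zone(zones: list[Zone]) -> Zone:
    """Return the first healthy zone, in order of preference."""
    for zone in zones:
        if zone.healthy:
            return zone
    raise RuntimeError("no healthy availability zone left")


if __name__ == "__main__":
    zones = [Zone("zone-a", healthy=False), Zone("zone-b", healthy=True)]
    # zone-a is down, so requests are served from zone-b instead.
    print(pick_zone(zones).name)
```

Note the last line of `pick_zone`: if every zone is unhealthy at the same time, there is nothing left to fail over to.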
Does Facebook not employ multiple availability zones to provide their services?
Well, first of all, they do, and second of all, it’s not that easy. Facebook – more than any other company – collects and stores massive amounts of data. We are talking seriously massive amounts of interconnected data. This data is stored in databases that are optimized to store the data quickly, and to find and deliver it quickly when accessed.
This becomes a problem when multiple availability zones get involved. As discussed earlier, it is essential for Facebook to be globally available. Users want to be able to send messages to their friends and family and see their feeds no matter where they are.
Databases are also bound to the availability zone they are in. If data from one availability zone is to be made available in another, you cannot just query the database in the other availability zone. This would defeat the purpose of availability zones, because each one is intended to work independently of the others. Instead, the data must be replicated from one zone to the other. In any case, this introduces latency.
This latency is manageable for most applications. Facebook, however, works at such a massive scale that availability zones would crumble. Remember, data needs to be replicated across availability zones. As an example, AWS currently operates 81 availability zones across the world. If Facebook wanted to operate in all of these availability zones, they would have to replicate data 81-fold. Yes. Not easy.
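A back-of-the-envelope sketch shows why this gets expensive. It assumes the simplest possible model, in which every zone holds a full copy of the data and a write only counts as done once every replica has confirmed it. Real systems use smarter schemes, and the dataset size and round-trip times below are made-up numbers for illustration.

```python
from __future__ import annotations


def full_replication_cost(dataset_tb: float, zone_count: int) -> dict[str, float]:
    """Storage and per-write fan-out if every zone holds a full copy."""
    if zone_count < 1:
        raise ValueError("need at least one zone")
    return {
        "stored_tb": dataset_tb * zone_count,       # one full copy per zone
        "copies_sent_per_write": zone_count - 1,    # every other zone needs the update
    }


def synchronous_write_latency_ms(replica_round_trips_ms: list[float]) -> float:
    """With parallel synchronous replication, the slowest replica sets the pace."""
    return max(replica_round_trips_ms, default=0.0)


if __name__ == "__main__":
    print(full_replication_cost(dataset_tb=1000.0, zone_count=81))
    # Hypothetical round trips to three remote zones, in milliseconds.
    print(synchronous_write_latency_ms([12.0, 85.0, 240.0]))
```

The point of the second function is that a single distant zone is enough to slow down every write, no matter how close the other replicas are.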
This explains why Facebook relies on the “backbone”. It is the central piece of infrastructure that connects their global services together.
Unsurprisingly, Facebook routinely maintains the backbone and is able to take down part of it while still continuing service. Surprisingly, according to Janardhan, the outage was caused during routine maintenance.
Apparently, Facebook’s backbone gave in because a command was issued to check the availability of the backbone network. It is not clear whether the command was issued manually or by some automation. Janardhan explains that checks exist to prevent such an outcome even if the command is issued, but these checks failed to kick in due to a software bug.
The shutdown of the backbone caused a chain of events, one of which was that Facebook’s DNS servers became unreachable as well.
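To make the idea of such a check concrete, here is a minimal sketch of a pre-flight guard that refuses any command which would leave too few backbone links running. This is purely hypothetical: the function name, the link names and the `min_links_up` threshold are invented for illustration and say nothing about how Facebook’s own tooling works.

```python
from __future__ import annotations


def audit_command(
    links_up: set[str],
    links_to_disable: set[str],
    min_links_up: int = 1,
) -> None:
    """Raise if disabling the requested links would leave too few links up."""
    remaining = links_up - links_to_disable
    if len(remaining) < min_links_up:
        raise PermissionError(
            f"refusing command: only {len(remaining)} link(s) would remain, "
            f"need at least {min_links_up}"
        )


if __name__ == "__main__":
    links = {"link-1", "link-2", "link-3"}  # hypothetical backbone links
    audit_command(links, {"link-1"})        # fine, two links remain
    try:
        audit_command(links, links)         # would take down everything
    except PermissionError as err:
        print(err)
```

A guard like this only helps if it is correct itself. A bug in the check means the dangerous command goes through unnoticed.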
It is common practice to run health checks on pieces of infrastructure. For example, a DNS server that points to an IP address may also send requests to that IP address to see whether anything is responding there. If it is pointing to an unhealthy resource, such as a web server that is not handling requests correctly, it is able to make changes to its own DNS configuration.
With the backbone out of service, the health checks failed and the DNS servers became unreachable even though they remained operational. As a result, when browsers around the world tried to access facebook.com, there were no authoritative DNS records to point them to a server. This abruptly caused the services to stop working for the end-user.
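The behaviour described above, a DNS server that is running fine but takes itself out of service because its health check fails, can be sketched as follows. This is a simplified illustration: the `DnsNode` class and its `probe` callback are hypothetical stand-ins, and a real setup would probe actual servers and withdraw itself at the network level.

```python
from typing import Callable


class DnsNode:
    """A DNS server that only answers queries while its backend looks healthy."""

    def __init__(self, probe: Callable[[], bool]) -> None:
        self.probe = probe       # hypothetical check: can we reach the data center?
        self.operational = True  # the DNS software itself keeps running
        self.announced = True    # whether the node is reachable from the internet

    def run_health_check(self) -> None:
        # If the backend cannot be reached, stop announcing this node.
        self.announced = self.probe()

    def can_answer_queries(self) -> bool:
        return self.operational and self.announced


if __name__ == "__main__":
    backbone_up = False  # simulate the backbone being out of service
    node = DnsNode(probe=lambda: backbone_up)
    node.run_health_check()
    print(node.operational, node.can_answer_queries())
```

When every DNS node runs the same check against the same broken backbone, all of them withdraw at once, and nobody is left to answer for facebook.com.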
It is for this reason that we at meetergo provide you with multiple options to hold your meetings, so that you are not affected by major outages of individual players. Nobody should rely on a single provider. With meetergo, you can quickly switch to another provider. So if you’re planning a meeting and the service you rely on goes down, you can simply switch it and everyone is notified!