On October 4, Facebook was offline for about six hours due to human error. The company states that “configuration changes on our backbone routers” were the cause. In this post, I’ll explain what happened and walk through the takeaways for running your own business network.
Before I get into the details, you should first understand two important internet protocols: the Domain Name System (DNS) and Border Gateway Protocol (BGP). Both played important roles in the Facebook outage. DNS is the internet’s phone book, translating names such as facebook.com into the numeric IP addresses that identify its main servers. You can think of BGP as the internet’s traffic cop: it tells networks which paths are available for moving billions of packets of data from one place to another, steering them away from congested or non-working pathways.
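To make the phone-book idea concrete, here is a minimal Python sketch of a DNS lookup using the standard library's `socket.getaddrinfo`. The function name `resolve_ips` and the injectable `lookup` parameter are my own illustrative choices, not part of any tool mentioned in this post.

```python
import socket


def resolve_ips(hostname, lookup=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses DNS reports for hostname.

    Returns an empty list when the name cannot be resolved.
    `lookup` is a hypothetical hook so the function can be exercised
    without touching the network; it defaults to the real resolver.
    """
    try:
        results = lookup(hostname, None)
    except socket.gaierror:
        # The resolver could not translate the name into an address.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({entry[4][0] for entry in results})


# Example (performs a real DNS query when uncommented):
# print(resolve_ips("facebook.com"))
```

When the name servers for a domain cannot be reached, a lookup like this comes back empty even though the machines behind the name may be perfectly healthy.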
We’ve previously written about DNS and why it is important to secure it. BGP, for its part, has several weaknesses that are well known, at least to security experts. Back in 1998, members of an elite hacking group called L0pht Heavy Industries testified before Congress:
They warned that computer networks were embarrassingly insecure, bragging that any one of them could take the entire internet down in just a few minutes thanks to weaknesses in BGP routing.
Unfortunately, this time around, Facebook suffered a self-inflicted wound when one of its network engineers issued a command that effectively took the company’s entire server collection off the internet. The failure mostly followed the scenario first presented in that 1998 testimony. Although Facebook’s engineers are intelligent folks, they admit that “Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.” Oops.
Steve Bagley, a professor of computer science at the University of Nottingham, has explained what happened, using a route-mapping tool to illustrate how Facebook’s servers disappeared from the network until the company appeared to be completely offline. The important thing to note about this outage is that the servers were still running; they just weren’t reachable by anyone.
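A toy model helps show why running servers can still be unreachable. The Python sketch below is not real BGP and is not how Facebook's routers work; it only imitates a router's table of announced prefixes and the longest-prefix-match rule used to pick a route. The class name `RoutingTable`, the documentation-range prefixes, and the `AS64500`-style labels are all made up for illustration.

```python
import ipaddress


class RoutingTable:
    """Toy model of the routes a router has learned from announcements."""

    def __init__(self):
        self.routes = {}  # network prefix -> next-hop label

    def announce(self, prefix, next_hop):
        """Record that `prefix` is reachable through `next_hop`."""
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        """Forget a previously announced prefix (no error if it is unknown)."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def next_hop(self, address):
        """Longest-prefix match: the most specific covering route wins."""
        ip = ipaddress.ip_address(address)
        matches = [net for net in self.routes if ip in net]
        if not matches:
            # No route at all: traffic has nowhere to go,
            # regardless of whether the destination server is up.
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("203.0.113.0/24", "AS64500")    # hypothetical prefix and label
print(table.next_hop("203.0.113.10"))          # a route exists

table.withdraw("203.0.113.0/24")               # the announcement disappears
print(table.next_hop("203.0.113.10"))          # None: server is up, but unreachable
```

Nothing about the server at `203.0.113.10` changes between the two lookups; only the routing information does, which is the essence of what the rest of the internet saw.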
The Facebook outage has certainly been the most noteworthy recent problem related to BGP; however, hackers have been using BGP to complement their attacks for many years. BGP’s weaknesses were exploited in the MEWKit phishing attack back in April 2018 to hijack traffic bound for Amazon’s servers and direct it to the Russian hackers who were operating the malware.
In its post-mortem, Facebook mentions the “storm drills” in which it stress-tests its computing infrastructure to ensure that system-wide failures like this one don’t happen or can be recovered from quickly. Alas, the company never simulated the loss of its entire network backbone, nor did it put together a scenario in which operator error brought down the network. Both should have been part of these drills, and you can bet that they will be in the near future.
What can you do to prevent a similar situation?
To prevent a colossal outage in your own network, get serious about DNS protection by taking the following steps:
- Take stock of how you operate your DNS and follow some of the suggestions in our earlier blog post about ways to secure it. You might want to consider one of the numerous DNS hosting providers mentioned in the post.
- Examine your network servers, and take stock of the applications (besides your website) that run on them. Do any of your building services (such as physical access control, HVAC, or security cameras) require your domain name to be up and running? If you suffer a DNS outage or a DNS hijacking, will your engineers (or contractors or other technical support staff) be able to enter your facilities and fix the problem? This is what happened at Facebook, whose very sophisticated physical access controls at its data centers were rendered useless by the combination of BGP and DNS issues.
- Do some research: Does your email have a single point of failure? Even if you use Microsoft’s or Google’s servers to support your email, do you have an independent backup system that your staff can use to communicate? A backup like this is most often needed after a natural disaster, but an email outage could also be caused by criminals targeting your business.
- Identify potential weak spots: Finally, does your phone system depend on your domain being up and operational? If so, ensure that your technical team has alternative contacts (such as cell or home phone numbers) to reach the necessary parties.
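One way to start on this checklist is to write down every service you rely on, the hostname it needs, and whether an out-of-band fallback exists, then flag the ones that would fail along with your domain. The Python sketch below assumes you maintain that inventory by hand; the `Service` fields, the function names, and the `example.com` hostnames are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    hostname: str       # the name the service needs in order to work
    has_fallback: bool  # True if an out-of-band alternative exists


def depends_on_domain(hostname, domain):
    """True if hostname is the domain itself or any subdomain of it."""
    hostname = hostname.lower().rstrip(".")
    domain = domain.lower().rstrip(".")
    return hostname == domain or hostname.endswith("." + domain)


def at_risk(services, domain):
    """Names of services that would fail with the domain and have no fallback."""
    return [
        s.name
        for s in services
        if depends_on_domain(s.hostname, domain) and not s.has_fallback
    ]


# Hypothetical inventory for a company that owns example.com.
inventory = [
    Service("door badges", "badges.example.com", has_fallback=False),
    Service("email", "mail.example.com", has_fallback=True),
    Service("phones", "sip.provider.example.net", has_fallback=False),
]
print(at_risk(inventory, "example.com"))
```

Anything this flags is a candidate for a backup plan, whether that is a physical key, a second email provider, or a printed list of cell phone numbers.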