Facebook has admitted buggy auditing code was at the core of yesterday’s six-hour outage – and revealed a little more about its infrastructure to explain how it vanished from the internet.
In a write-up by infrastructure veep Santosh Janardhan, titled “More details about the October 4 outage,” the outrage-monetization giant confirmed early analyses that Facebook yesterday withdrew the Border Gateway Protocol (BGP) routes to its own DNS servers, causing its domain names to fail to resolve. That led to its websites disappearing and its apps ceasing to work, while internal tools and services broke down as well.
But this DNS and BGP borkage turns out to have been the consequence of other errors. Janardhan explained that Facebook operates two classes of data center.
One type was described as “massive buildings that house millions of machines,” performing core computation and storage tasks. The other bit barns are “smaller facilities that connect our backbone network to the broader internet and the people using our platforms.”
Users of Facebook’s services first touch one of those smaller facilities, which then send traffic over Facebook’s backbone to a larger data center. Like any complex system, that backbone is not set-and-forget – it requires maintenance. Facebook stuffed that up.
“During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network,” Janardhan revealed.
That should not have happened. As the post explains:
Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command
Once the command was executed, it “caused a complete disconnection of our server connections between our data centers and the internet,” Janardhan added.
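Facebook has not published how its audit tool works, so the following is only a minimal sketch of the general idea: a pre-flight check that refuses any command that would leave too few backbone links standing. The function name, the link names, and the 50 per cent threshold are all hypothetical.

```python
def audit_command(links_up, links_to_disable, min_fraction_up=0.5):
    """Return True if the command may run, False if it must be blocked.

    Hypothetical rule: after the command, at least `min_fraction_up`
    of the currently healthy backbone links must still be up.
    """
    links_up = set(links_up)
    if not links_up:
        # Nothing is healthy to begin with, so refuse any further change.
        return False
    remaining = links_up - set(links_to_disable)
    return len(remaining) / len(links_up) >= min_fraction_up


backbone = {"link-1", "link-2", "link-3", "link-4"}  # made-up link names

# A command touching one link passes the audit...
print(audit_command(backbone, {"link-1"}))
# ...while one that would disable every link is blocked.
print(audit_command(backbone, backbone))
```

A guard like this only helps if the check itself is correct, which is precisely where Facebook says things went wrong.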
Which was problematic. Facebook’s smaller bit barns handle DNS queries for facebook.com, fb.com, instagram.com, etc. “Those translation queries are answered by our authoritative name servers that occupy well-known IP addresses themselves, which in turn are advertised to the rest of the internet via … BGP,” as Janardhan put it.
Crucially, Facebook’s DNS servers disable their BGP advertisements when those machines can’t reach their own back-end data centers. That’s fair enough as this unavailability could be a sign of duff connectivity, and you’d only want to advertise routes to DNS servers that have robust links to their major centers.
So when the bad change hit Facebook’s backbone, and all the data centers disconnected, all of Facebook’s small bit barns declared themselves crocked and withdrew their BGP advertisements. So even though Facebook’s DNS servers were up, they couldn’t be reached by the outside world. Plus, the back-end systems were inaccessible due to the dead backbone, anyway. Failure upon failure.
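The failure mode described above can be sketched in a few lines. This is a toy model, not Facebook's code: the EdgeSite class, the site names, and the single backbone_up flag are assumptions made for illustration. Each site advertises its DNS route only while its server is up and it can reach the back-end data centers.

```python
from dataclasses import dataclass


@dataclass
class EdgeSite:
    """A hypothetical small facility hosting an authoritative DNS server."""
    name: str
    dns_up: bool = True
    advertising: bool = True

    def health_check(self, backbone_up: bool) -> None:
        # Advertise the DNS route only if the server is running AND
        # the site can still reach the back-end data centers.
        self.advertising = self.dns_up and backbone_up


def reachable_dns_sites(sites, backbone_up):
    """Run every site's health check and return the names still advertised."""
    for site in sites:
        site.health_check(backbone_up)
    return [site.name for site in sites if site.advertising]


sites = [EdgeSite("edge-a"), EdgeSite("edge-b"), EdgeSite("edge-c")]

print(reachable_dns_sites(sites, backbone_up=True))   # all three advertise
print(reachable_dns_sites(sites, backbone_up=False))  # nothing is advertised
print(all(site.dns_up for site in sites))             # yet every server is up
```

A check that is sensible for one site with a bad link becomes a global off switch when every site sees the same failure at once.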
Couldn’t happen to a nicer bunch of blokes
While Facebook’s post says it runs “storm” drills to ready itself to cope with outages, it had never simulated its backbone going down. Fixing the outage therefore proved … challenging.
“It was not possible to access our data centers through our normal means because their networks were down, and … the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this,” Janardhan stated.
Engineers were dispatched to Facebook facilities, but those sites are “designed with high levels of physical and system security” that make them “hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access.
“So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online.”
This follows reports of employees’ door keycards not even working on Facebook’s campuses during the downtime, let alone internal diagnosis tools. Once admins figured out the networking problem, they had to confront the impact of resuming service:
“We knew that flipping our services back on all at once could potentially cause a new round of crashes due to a surge in traffic. Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk.”
The post doesn’t explain how Facebook addressed those issues.
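One common way to avoid such a surge, offered here purely as an illustration and not as Facebook's method, is to readmit traffic in stages. The sketch below assumes a linear ramp and numeric request IDs; both function names are hypothetical.

```python
def ramp_schedule(steps):
    """Return the fraction of traffic to admit at each stage, ending at 1.0."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [i / steps for i in range(1, steps + 1)]


def admit(request_id, fraction):
    """Admit a deterministic slice of requests: IDs whose last two digits
    fall below the current percentage."""
    return request_id % 100 < round(fraction * 100)


for fraction in ramp_schedule(4):
    admitted = sum(admit(request_id, fraction) for request_id in range(100))
    print(f"stage at {fraction:.0%}: {admitted} of 100 sample requests admitted")
```

In practice an operator would wait for caches and power draw to settle before moving to the next stage, rather than stepping through on a fixed timer.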
Janardhan said he found it “interesting” to see how Facebook’s security measures “slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making.”
He owns those delays. “I believe a tradeoff like this is worth it – greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this,” he wrote.
The post concludes with Facebook’s usual admission of error despite earnest effort, apology, and pledge to improve. We’re assuming the social network is telling the truth in its write-up.
Facebook is not alone in breaking itself or having unhealthy reliance on its own resources: a massive AWS outage in 2017 was caused by a single mistyped command, and IBM Cloud’s June 2020 planet-wide outage was exacerbated by its status page being hosted on its own infrastructure, which left customers completely in the dark about the situation.
Site reliability engineers should know better. Especially in Facebook’s case, as it was unable to serve ads for hours, its federated identity services are used by countless third-party web sites, and the company has positioned itself as the ideal source of everyday personal and/or commercial communications for literally billions of people.
But as whistleblower Frances Haugen told US Congress, Facebook puts profit before people, many of its efforts to do otherwise are shallow and performative, and its sins of omission are many and constant. ®