So, Facebook went down last week. Hard down. Unreachable to anyone levels of down, for about six hours. And from all the publicly available information I’ve seen, it looks like it was an accident. So what happened?

Facebook posted an article about this on the 5th, and while it’s a decent explanation, I don’t know that it does the best job of explaining things to non-technical folks. The internet is a complicated topic, especially once you get below the surface, so I will try to break things down into simpler chunks.

How do I normally access Facebook?

You’re probably familiar with typing “www.facebook.com” into your browser and then getting your social media fix. Whether you access their service using a Web browser such as Chrome, Edge, or Firefox, or the Facebook app on your mobile phone, the same general process happens behind the scenes to connect you.

Step 1, your device needs to know where to connect. Low-level computer networking works based on numbers called IP (Internet Protocol) addresses, not domain names. Domain names are a helpful abstraction for humans because “facebook.com” is much easier to remember than “157.240.3.35.” There are other benefits of this abstraction, but they’re not relevant to what we’re talking about today. So much like the olden days of dialing telephone numbers by hand, your computer needs to know what IP address to contact. The metaphorical phone book used for the internet is called DNS (Domain Name System). Unlike traditional phone books, it’s a distributed system – this will come into play later.
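
If you want to see that lookup in action, here’s a tiny Python sketch that asks your operating system’s resolver (which in turn uses DNS) to translate a name into addresses. The hostname is just an example, and the addresses you get back will vary.

```python
import socket

def resolve(hostname):
    # Ask the OS resolver (and, through it, DNS) for the addresses behind a name.
    results = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address itself.
    return sorted({info[4][0] for info in results})

print(resolve("www.facebook.com"))  # a list of addresses, e.g. ['157.240.3.35', ...]
```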

Step 2, once your computer has found out what “number” to “dial” for Facebook, there’s the matter of actually getting your request there. Like telephone service, there’s a wide array of equipment between you and your destination, and all of it uses internationally standardized protocols to decide how to carry out the task. Relevant to this discussion is BGP (Border Gateway Protocol). This is how pieces of network equipment tell each other, “I can get your request to the IP address you’re trying to reach.”
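
Real BGP is far more involved than this, but the core idea – routers advertising “I can reach this block of addresses,” and traffic following the most specific advertised route – can be sketched in a few lines of Python. The prefixes and neighbor names below are made up for illustration.

```python
import ipaddress

# Toy routing table: "I can reach this block of addresses via this neighbor."
advertisements = {
    ipaddress.ip_network("157.240.0.0/16"): "neighbor-A",  # hypothetical
    ipaddress.ip_network("157.240.3.0/24"): "neighbor-B",  # more specific route
}

def next_hop(destination):
    addr = ipaddress.ip_address(destination)
    matches = [net for net in advertisements if addr in net]
    if not matches:
        return None  # nobody advertises a route: the address is unreachable
    # Prefer the most specific (longest) matching prefix.
    return advertisements[max(matches, key=lambda net: net.prefixlen)]

print(next_hop("157.240.3.35"))  # -> 'neighbor-B'

# If the advertisements are withdrawn, the same lookup finds no route at all:
advertisements.clear()
print(next_hop("157.240.3.35"))  # -> None
```

That last case, a destination with no advertised route, is essentially what the rest of the internet saw during the outage.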

Step 3, once you’ve figured out what number to dial and the internet’s infrastructure has delivered your request, you have reached Facebook and can start getting the latest cat pictures uploaded by your friends, post links to YouTube, share spicy memes, whatever it is you do on there. Preferably not spreading anti-vax propaganda, but that’s your decision.

What went wrong?

The morning of the 4th, Facebook went offline. At first, the community thought it was a DNS issue, a misconfiguration of some kind. Remember how I said above that DNS is a distributed system? Your device asks a local DNS server operated by your service provider, “what is the IP address for www.facebook.com?” That server probably maintains a cache of recently looked-up domain names. If it doesn’t have the address for the name you’re trying to reach, the DNS protocol follows a well-defined lookup path to eventually find the server which is authoritative for “www.facebook.com.” Early in this incident, the authoritative server for Facebook’s services became unavailable. So unless your device had cached the address for Facebook, it couldn’t even try to establish a connection.
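
As a rough picture of that caching behavior, here’s a simplified Python sketch of a resolver that answers from its cache when it can and only goes out to DNS when it has to. The TTL is an assumption for illustration; real resolvers honor per-record TTLs and walk the full delegation chain.

```python
import socket
import time

cache = {}        # hostname -> (addresses, expiry timestamp)
CACHE_TTL = 300   # assume a five-minute TTL for illustration

def cached_resolve(hostname):
    now = time.time()
    if hostname in cache and cache[hostname][1] > now:
        return cache[hostname][0]  # answered from cache, no DNS traffic at all
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        addresses = sorted({info[4][0] for info in infos})
        cache[hostname] = (addresses, now + CACHE_TTL)
        return addresses
    except socket.gaierror:
        # The lookup path (including the authoritative servers) is unreachable
        # and nothing is cached: the name simply cannot be resolved.
        return None
```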

To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection. In the recent outage the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers. 

Facebook engineering blog

The authoritative DNS server for Facebook became unavailable because the rest of the internet’s infrastructure could no longer reach Facebook’s networks: automated systems under Facebook’s control had stopped advertising the path to those networks. That meant even if you somehow had an IP address for Facebook (most likely cached on your device), you still wouldn’t be able to get there.
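
Put another way, each DNS site was running a health check along these lines. This is a drastic simplification and the function names are invented; Facebook hasn’t published its actual tooling.

```python
def health_check(can_reach_backbone, advertise_routes, withdraw_routes):
    # If this DNS site can't reach the data centers over the backbone,
    # assume it's unhealthy and stop telling the internet to send traffic here.
    if can_reach_backbone():
        advertise_routes()   # "you can reach our DNS servers through us"
    else:
        withdraw_routes()    # stop attracting traffic to an unhealthy site

# During the outage the backbone itself was gone, so every site took the
# withdraw branch at once, and the DNS servers vanished from the internet
# even though they were still running.
health_check(
    can_reach_backbone=lambda: False,            # the backbone was down everywhere
    advertise_routes=lambda: print("advertise"),
    withdraw_routes=lambda: print("withdraw"),   # -> this is what ran at every site
)
```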

To continue with the telephone service metaphor, they accidentally ripped out the page of the phone book with their number in it, then physically disconnected the phone line to their office.

Why did this happen?

To me, this is the most interesting question in all of this. I assume Facebook, like any internet company that cares about its availability and reliability, has safeguards in place to prevent accidental destructive changes, or sabotage by a single malicious actor. So something really must have gone sideways. Facebook says:

During one of these routine [network] maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command. 

Facebook engineering blog

So yes, they had safeguards in place. Unfortunately, software engineering is hard, especially when managing huge networks that serve billions of requests per day. And despite the best efforts of very smart people, mistakes do happen. This mistake resulted in a very public outage at a very poor time (PR-wise) for the company.
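
To give a feel for what such an audit check might look like (and this is purely hypothetical; none of it reflects Facebook’s actual tooling), imagine a guard that tries to recognize dangerous commands before running them. A bug in the recognition logic is all it takes for a destructive command to slip through.

```python
# Purely hypothetical sketch of a pre-execution audit. The keywords, commands,
# and behavior are invented for illustration only.
DANGEROUS_KEYWORDS = {"shutdown", "withdraw", "disable"}

def audit_allows(command):
    # Buggy check: it only looks for exact keyword matches, so a destructive
    # command phrased differently sails right through.
    return not (set(command.lower().split()) & DANGEROUS_KEYWORDS)

def run(command):
    if not audit_allows(command):
        raise PermissionError(f"blocked by audit: {command!r}")
    print(f"executing: {command}")

run("assess global backbone capacity")  # intended to be harmless, allowed
run("drain all backbone links")         # destructive, but the buggy audit allows it
```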

The relevant engineering teams will be poring over network logs, configuration files, the source code of their audit tools, and more to understand what happened and figure out how to reduce the chance of it happening again. I hope they post more technical details for the reliability community to really dig into and learn from, though as much as I would enjoy that, it probably won’t happen.

Why six hours?

There are two major factors at play here: security concerns and circular dependencies. I’ll start with the security concerns.

From trusted source: Person on FB recovery effort said the outage was from a routine BGP update gone wrong. But the update blocked remote users from reverting changes, and people with physical access didn’t have network/logical access. So blocked at both ends from reversing it.

Brian Krebs, independent journalist, via Twitter

it was not possible to access our data centers through our normal means because their networks were down

Facebook engineering blog

A common adage in information security is that once an attacker has physical access to your system, it’s all over. So high-value systems tend to have measures in place that make it harder for a malicious actor to take advantage of physical access. In this case, the people who maintain the physical network equipment didn’t have the authorization needed to repair its configuration, while the engineers who were authorized to make those repairs couldn’t reach the equipment, because the network connections they’d ordinarily use were down. That authorization is enforced in software; it’s not as if Mark Zuckerberg or any other executive could have granted verbal or written permission for the onsite technicians to do what had to be done to fix things.

Next, circular dependencies. This concept is pretty dangerous in the software world. If system A depends on system B to function, which in turn depends on system C, which then depends on system A, what happens when system B goes down? System A is going to stop working properly, which will have a knock-on effect on system C, which will make it harder to restore system B.
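
To make the A/B/C example concrete, here’s a small sketch that walks a dependency graph and reports a cycle. The systems and edges are invented; the point is that once a cycle exists, there’s no clean order in which to bring things back up, because each system needs another one that is also down.

```python
# Toy dependency graph matching the example above: A -> B -> C -> A.
depends_on = {
    "A": ["B"],
    "B": ["C"],
    "C": ["A"],
}

def find_cycle(graph):
    """Return one dependency cycle if the graph contains any, else None."""
    visiting, visited = set(), set()

    def dfs(node, path):
        visiting.add(node)
        path.append(node)
        for dep in graph.get(node, []):
            if dep in visiting:                       # back-edge: we found a cycle
                return path[path.index(dep):] + [dep]
            if dep not in visited:
                cycle = dfs(dep, path)
                if cycle:
                    return cycle
        visiting.discard(node)
        visited.add(node)
        path.pop()
        return None

    for node in graph:
        if node not in visited:
            cycle = dfs(node, [])
            if cycle:
                return cycle
    return None

print(find_cycle(depends_on))  # -> ['A', 'B', 'C', 'A']
```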

Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.

Sheera Frenkel, NYT tech reporter, via Twitter

Source at Facebook: “it’s mayhem over here, all internal systems are down too.” Tells me employees are communicating amongst each other by text and by Outlook email.

Philip Crowther, AP reporter, via Twitter

the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.

Facebook engineering blog

Facebook employees needed access to internal systems and tools to diagnose the problem and effect repairs. But the network was down, so they couldn’t access those systems, and many of the tools were broken. And to repair enough of the network to restore those tools, they needed physical access, which was not trivial. I speculate the reason they were unable to enter facilities is that their badging system depends on the network, which was down, to validate authorizations before unlocking doors. And there probably aren’t very many physical keys around to operate the locks manually.

Our primary and out-of-band network access was down, so we sent engineers onsite to the data centers to have them debug the issue and restart the systems. But this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online. 

Facebook engineering blog

To summarize:

  • Access to the buildings required the network to be functional.
  • Fixing the network required certain internal tools.
  • Those internal tools required some minimum threshold of network functionality.
  • Restoring that minimum threshold of functionality using alternate means was hard because they needed to bootstrap enough network to bring their authorization mechanisms back.

Would an emergency “break glass” procedure have let them fix things faster? Probably, but then that procedure is a new angle of attack they’d need to put resources into protecting.

Will something like this happen again?

Hopefully no. Probably yes.

It’s impossible to say which company it will happen to, when it will happen, or how long that outage will last. The fact is, the internet is a very complex system. Many brilliant professionals have put in lots of hard work to build safeguards that reduce the likelihood of failures, limit their impact when they do occur, and make them easier to understand and repair. Errors still happen. Every time something like this happens, we can learn from it and use that learning to make our systems more resilient going forward.