Facebook outage – a classic case of DR failure?

The serious problem facing users was not just that direct access was down, but that all the ancillary links – such as 'like' buttons posted on many hundreds of thousands of sites – were also disabled, effectively causing problems for users of those pages too.

This second issue arose because the Facebook API was down – a very rare occurrence, Infosecurity understands.

So what happened? Why wasn't there a disaster recovery (DR) system in place?

According to Robert Johnson, Facebook's director of software engineering, this was the worst outage the social networking site has seen in four years "and we wanted to first of all apologise for it."

"We also wanted to provide much more technical detail on what happened and share one big lesson learned. The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed", he said.

With more than half a billion users, several hundred million of whom reportedly access their accounts every day, security forum users surmised that the problem was so big that no DR system could have coped with the outage.

Johnson's detailed explanation of the problem seems to confirm this supposition: he noted that the key flaw that made the outage so severe was the unfortunate handling of an error condition.

"An automated system for verifying configuration values ended up causing much more damage than it fixed", he said, adding that the automated system aims to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store.

"This works well for a transient problem with the cache, but it doesn't work when the persistent store is invalid."

What triggered the system failure, however, was a change to the persistent copy of a configuration value that was interpreted as invalid.

"This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second", he explained.

"To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key", he said.

"This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover", he added.

The solution to the feedback cycle was, he noted, "quite painful", as engineers had to stop all traffic to the database cluster, which meant turning off the site.

"Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site."
