The Facebook Outage and the Case for Cyber-Resilience

Early reports of the Facebook outage were quick to include the comment that the outage was not due to a cyber-attack, the implication being that it was somewhat less worrisome than if it were. Millions of users, including small businesses that rely on Facebook services for their daily operations, and people in parts of the world in which these services are the primary means of reliable communication, were cut off from a vital resource. It was just a mistake, no need for concern. Facebook is protected from attacks. Yet, the impact of this mistake was real and widely felt, arguably more widely felt than most of the cyber-attacks that gain so much attention. This raises an interesting question: are we worried about the wrong thing?

The crime-fighting attack-and-defend language of cybersecurity has directed our attention toward addressing the malicious actions that lead to cyber-breaches, data losses and denial of service rather than addressing the consequences of those actions. Instead, we should make sure our systems stay operational or can quickly return to operational health and maintain our data integrity, regardless of the form of attack. In fact, when looked at in terms of system impact, an attack is the same as a mistake, power outage or earthquake. Actions taken to protect against the consequences of an attack can also address the recovery from these other disruptions. This is cyber-resilience.

Cyber-resilience is a company’s ability to minimize impact and recover if systems or data have been compromised. It covers both adversarial threats, such as hackers and other malicious actors, and non-adversarial threats, such as human error, natural disasters or failures in interrelated systems. Regardless of the cause of the problem, resilient protections minimize the effect.

To be sure, one dimension of resilience lies in protecting against particular causes of failure, and cause-specific solutions are needed for this protection. For example, you use a surge protector to guard against damage from lightning and a virus checker to guard against some cyber-attacks. (To be fair, Facebook had some audit functions in place that were meant to protect against error, though they proved inadequate.) However, the impact minimization and recovery aspects of resilience can generally address outages regardless of cause. Yet this dimension of the problem, which is boring system design and management stuff, typically receives less attention and less corporate investment than the more manly and exciting attack response provided by cybersecurity mechanisms.

Some of this disparity can be attributed to the way enterprises are organized. Generally, the resilience features of availability, reliability and recovery are the purview of the network or infrastructure departments, while vulnerability to attack is the domain of the security department. Departments often compete for funds, and requirements between departments are often thrown over the wall with little concern for their impact on other departments. In this way, we institutionalize the classic system engineering problem of not considering all concerns jointly. There was a telling statement at the end of Facebook’s October 5 blog posting on the cause of the network outage:

“We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making. I believe a trade-off like this is worth it – greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this.”

I think it is worth discussing whether the trade-off between security and recovery is even necessary, let alone worth it. Perhaps our artificial separation of cybersecurity into a thing in itself rather than an aspect of a unified response to risk is forcing us to make bad bargains.
