Cloudflare and the Art of Owning Your Mistakes

If you happened to visit Discord, OkCupid, CoinDesk, or several other popular websites earlier this month, you might have been greeted with a 502 Gateway Error message. This wide-ranging web blackout was caused by an outage of network service provider Cloudflare. Internet users weren’t even able to check popular internet performance site DownDetector, as the site itself was downed by the outage.

The industry was quick to speculate that the outage was caused by a hostile DDoS attack. This is unsurprising, given that massive internet outages have become synonymous with DDoS attacks in recent years. A key case is the 2016 Dyn cyber-attack, in which a series of DDoS attacks targeting DNS systems caused a similar major internet outage.

As for Cloudflare’s response, the company’s CEO, Matthew Prince, was quick to provide updates via Twitter. In the hours following the news, Prince confirmed the outage was caused by a massive spike in CPU usage, and quickly allayed the fears of users who presumed it was caused by an attack.

Cloudflare CTO John Graham-Cumming then published a company blog post confirming the outage was caused by a single misconfigured rule within the Cloudflare firewall reacting poorly to a standard rules update, causing the CPU of the company’s machines to spike to 100%.

While internet outages are frustrating for developers and internet users alike, the transparent way in which Cloudflare handled its outage deserves serious praise. Some companies may shiver at the thought of disclosing the technical details and cause of a network outage, whether because of the potential financial implications or just sheer embarrassment.

The fact is, though, customer loyalty and trust are more likely to be earned by companies willing to be fully transparent when an issue occurs. Doing so doesn’t take away the harm and inconvenience of an outage, but it does command respect. The positive reaction online to Cloudflare’s handling of the outage is a testament to this.

Being open also shows that companies rightly view an outage as more than just an IT issue. These situations ultimately have a wide-reaching impact on end users, and it’s only right to acknowledge these end users by involving them in the aftermath.

By having both its CEO and CTO respond to its network outage, Cloudflare successfully showed how seriously it regarded the matter. It’s also worth noting that Cloudflare isn’t bound to disclose security breaches the way European companies are. Despite this, it still provided clear statements, truly leading by example.

While Cloudflare’s response was commendable, the causes of the outage should still be assessed. The company has already admitted that its testing process before the downtime was insufficient, and it’s now looking to improve that process. This is a welcome step; constant testing is a must in keeping networks secure. It’s only through testing that network vulnerabilities and misconfigured rules are uncovered and addressed.

The outage also reinforces a message all IT pros should already be familiar with: network monitoring is just as important as establishing network defenses. While defending against external threats should be a priority for IT pros, so should the monitoring of networks with the correct tools and software.
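
As a rough illustration of the kind of check a monitoring tool might run, the sketch below flags a machine whose CPU usage stays at or above a threshold for several consecutive samples. The function name, the 90% threshold, the three-sample window, and the readings are all assumptions made for this example; they do not describe Cloudflare's own tooling.

```python
from collections import deque


def sustained_high_cpu(samples, threshold=90.0, window=3):
    """Return the index of the sample at which CPU usage has stayed at or
    above `threshold` for `window` consecutive samples, or None if it
    never does. All parameters are illustrative defaults."""
    recent = deque(maxlen=window)  # holds only the most recent `window` samples
    for index, value in enumerate(samples):
        recent.append(value)
        if len(recent) == window and all(v >= threshold for v in recent):
            return index
    return None


# Hypothetical per-minute CPU percentages for one machine.
readings = [35.0, 41.0, 97.0, 100.0, 100.0, 100.0]
alert_at = sustained_high_cpu(readings)
```

Requiring several consecutive high samples, rather than alerting on a single reading, is one simple way to avoid paging on brief, harmless spikes while still catching a sustained climb toward 100%.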

A lot can be learned from the recent Cloudflare episode. Approaching the fallout of an outage in a transparent and conscientious way is something all companies should aspire toward. The outage also demonstrates the damage internal IT errors can inflict. Cloudflare had the strong network visibility needed to quickly locate and address the cause of its error, and not all IT pros will have this visibility. If there were ever a call to action for network monitoring, this is it.
