Amazon Outage Caused by Simple Input Error

A major outage which struck Amazon’s US-EAST-1 region on Tuesday, rendering large swathes of the internet inaccessible, was caused by a simple input error on the part of an engineering team, AWS has revealed.

The cloud giant explained in a lengthy online post that a Simple Storage Service (S3) team was debugging an issue which had been causing the S3 billing system to run more slowly than expected.

It continued:

“At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”

The bad news continued when it turned out that the servers inadvertently removed were supporting two other S3 subsystems. The index subsystem manages the metadata and location information of all S3 objects in the region, and the placement subsystem “manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate.”

Amazon was forced to restart both subsystems, which in turn rendered a range of other services that rely on S3 for storage unavailable, including the S3 console, new Amazon Elastic Compute Cloud (EC2) instance launches, Amazon Elastic Block Store (EBS) volumes when data was needed from an S3 snapshot, and AWS Lambda.

AWS said it had not had to restart the index or placement subsystems for several years, during which time S3 has experienced massive growth. That growth made the whole restart process, including checks on the integrity of metadata, take longer than expected.

The cloud giant said it is making changes to prevent a similar incident happening in the future, including adding safeguards to the tooling used to remove capacity, but for many the episode is a reminder of what can go wrong even in organizations with the resources of Amazon Web Services.
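AWS has not published code for those safeguards, but the general shape is easy to imagine: a capacity-removal tool that refuses to take more than a small fraction of a fleet out of service in a single command, so that a mistyped input fails loudly rather than silently removing far more servers than intended. The sketch below is purely illustrative; the function name, threshold and fleet sizes are assumptions, not AWS's actual tooling.

```python
# Hypothetical illustration of a capacity-removal safeguard: refuse any
# single command that would remove more than a fixed fraction of a fleet.
# The threshold and all names here are assumptions for illustration only.

MAX_REMOVAL_FRACTION = 0.05  # assumed cap: at most 5% of a fleet per command


def plan_capacity_removal(fleet_size: int, requested: int) -> int:
    """Validate a capacity-removal request before any servers are touched.

    Raises ValueError if the request exceeds the allowed fraction of the
    fleet, forcing the operator to remove capacity in smaller, slower steps.
    """
    if requested <= 0:
        raise ValueError("requested removal must be positive")
    limit = max(1, int(fleet_size * MAX_REMOVAL_FRACTION))
    if requested > limit:
        raise ValueError(
            f"refusing to remove {requested} of {fleet_size} servers; "
            f"limit is {limit} per command"
        )
    return requested


if __name__ == "__main__":
    # A mistyped input (e.g. 400 instead of 4) is caught before execution.
    try:
        plan_capacity_removal(fleet_size=1000, requested=400)
    except ValueError as err:
        print(f"blocked: {err}")
    print("allowed:", plan_capacity_removal(fleet_size=1000, requested=4))
```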

The outage coincided with Amazon’s AWSome Day, an event designed to encourage UK start-ups to migrate to the cloud. Reports suggest websites and services including Quora, Imgur, GitHub, Zendesk and Yahoo Mail went down or were patchy for several hours.

Gavin Millard, EMEA technical director of Tenable Network Security, argued that cloud services are usually less prone to downtime than on-premises set-ups, but can cause a domino effect when they do hit trouble.

“When migrating critical infrastructure to a cloud provider, it’s important to remember that whilst they have robust strategies for dealing with outages to core services, single points of failure can still impact availability," he added. "Spreading the workloads across multiple regions and having a plan in place to deal with catastrophic issues like S3 going down would be wise.”
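One common way to act on that advice for data stored in S3 is cross-region replication, which copies newly written objects from a bucket in one region to a bucket in another. The boto3 sketch below is a minimal illustration under assumed names: the bucket names and IAM role ARN are placeholders, and a role with the appropriate replication permissions is assumed to already exist.

```python
# Minimal sketch of the multi-region advice above: S3 cross-region
# replication configured with boto3, so objects written to a bucket in
# us-east-1 are also copied to a bucket in another region.
# Bucket names and the role ARN below are placeholders, not real resources.
import boto3

SOURCE_BUCKET = "my-app-data-us-east-1"       # placeholder source bucket
DEST_BUCKET = "my-app-data-eu-west-1"         # placeholder destination bucket
REPLICATION_ROLE_ARN = "arn:aws:iam::123456789012:role/s3-replication"  # placeholder

# Replication requires versioning on both the source and destination buckets.
for bucket, region in ((SOURCE_BUCKET, "us-east-1"), (DEST_BUCKET, "eu-west-1")):
    boto3.client("s3", region_name=region).put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

# Replicate every new object from the source bucket to the destination bucket.
boto3.client("s3", region_name="us-east-1").put_bucket_replication(
    Bucket=SOURCE_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE_ARN,
        "Rules": [
            {
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {"Prefix": ""},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{DEST_BUCKET}"},
            }
        ],
    },
)
```

Replication of this kind applies only to objects written after the rule is created, and it protects the data rather than the application: a workload still needs its own plan for failing over to the replica region when the primary is unavailable.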
