AWS Outage Ends After Worldwide Disruption, But Cleanup Continues

The AWS outage that caused widespread global disruption for about half of Monday has officially been resolved, but numerous lingering issues remain that could take days or even weeks to clear up.

The outage began not long after dawn in the Pacific time zone on October 20 and lasted about 12 hours, hobbling a wide variety of websites and apps that rely on Amazon’s cloud services. The problem originated in US-EAST-1, an Amazon cluster located in northern Virginia that has already suffered two other major malfunctions that caused global disruptions within the last five years.

AWS outage caused by Amazon’s largest (and oldest) web services cluster

The AWS outage was the largest global internet disruption since the similar CrowdStrike failure of July 2024, which was traced to an errant security software update. That incident impacted banking, health care, airlines and shipping among a wide variety of industries, and also caused assorted knock-on issues that took an extended period of time to resolve.

Amazon’s lapse was likely even more disruptive, however. It’s almost easier to list the sites and apps that did not experience problems from the AWS outage. Delta Air Lines, one of the companies hardest hit by last year’s CrowdStrike issue, was once again affected (along with chief rival United) as customers reported issues with the mobile app and website. Amazon’s own shopping site and app experienced performance problems, as did Alexa and Ring devices. Finance and banking were also affected, as Robinhood, Venmo, Coinbase and Chime all had issues. While Verizon and AT&T’s lines did not go down, customers of both companies reported issues with assorted online services with AWS dependencies; the UK’s Vodafone and BT were similarly impacted. Even Google is heavily tied into the AWS cloud world and felt some impact: issues and errors were reported with Google Maps, YouTube, Gmail and the core search product. A complete list of services affected by the AWS outage, based on Downdetector and user reports, is available from Dataconomy.

The breadth of the AWS outage stemmed largely from the service’s widespread use for hosting applications and processes. The more specific technical problem was an issue with the Virginia cluster’s DynamoDB API: hosted processes were suddenly unable to resolve the API’s address through the Domain Name System (DNS). Amazon has yet to specify the reasons, other than to suggest that the internal network of its Elastic Compute Cloud (EC2) on-demand computing service had some sort of issue with its load balancing ability. AWS services generally default to the impacted US-EAST-1 site.
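
To make the DNS failure mode concrete, here is a minimal Python sketch of how a client might check whether a regional DynamoDB hostname still resolves and fall back to another region when it does not. It is an illustration only, not Amazon's remediation or anything the AWS SDKs do on their own: the preference order in `ENDPOINTS` and the helper names `resolves` and `first_resolvable` are assumptions made for this example, and the hostnames simply follow AWS's public `dynamodb.<region>.amazonaws.com` naming pattern.

```python
import socket
from typing import Callable, Iterable, Optional

# Hypothetical preference order chosen for this example. The hostnames follow
# AWS's public dynamodb.<region>.amazonaws.com naming pattern.
ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",
    "dynamodb.us-east-2.amazonaws.com",
    "dynamodb.us-west-2.amazonaws.com",
]


def resolves(hostname: str) -> bool:
    """Return True if the system resolver finds an address for hostname."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        # Raised when the DNS lookup fails, the symptom described above.
        return False


def first_resolvable(
    endpoints: Iterable[str],
    check: Callable[[str], bool] = resolves,
) -> Optional[str]:
    """Return the first endpoint whose name resolves, or None if none do."""
    for host in endpoints:
        if check(host):
            return host
    return None


if __name__ == "__main__":
    # Performs real DNS lookups; prints the chosen hostname or None.
    print(first_resolvable(ENDPOINTS))
```

A fallback like this only helps if the data is already replicated to the other regions (for example through DynamoDB global tables), and a successful lookup says nothing about whether the service behind the address is healthy.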

The two prior AWS outages involving this cluster took place in 2020 and 2023. Both incidents were shorter, lasting several hours at most, but impacted a similarly broad range of organizations. That is hardly a comprehensive list of AWS outages over the previous two decades, however. There was another broad-ranging outage involving the US-EAST-2 cluster in Ohio in late 2022, though this only lasted for 40 minutes. About a dozen other known outages at various locations span back to 2007, just after the AWS service was first launched.

Outage now “fully mitigated,” but struggles remain

Some estimates put the total loss from the 12-hour AWS outage at hundreds of billions of dollars due to disruptions to business operations, loss of productivity and the need to issue refunds and credits to impacted customers.

Legal action may also be an issue. Last year’s CrowdStrike outage provides some precedent for this, with Delta continuing to battle the company in court over a claimed $500 million in losses. That case has some key differences, however, most centrally Delta’s claim that most of its loss was due to having to manually reset some 40,000 servers over the course of several days due to the defective security update pushed to them.

At minimum, many customers impacted by the AWS outage will have to spend some time catching up on backlogs of missed payments, messages and work assignments generated over the course of a normal workday. The impacted airlines were forced to delay some flights, and some Amazon Prime customers took to social media to report that their package delivery dates were pushed back during and just after the incident (though neither of these issues appears to have been widespread).

One of the most serious impacts, and one that highlights customers’ need to have alternative backup systems in place, is the temporary outage of the California eFileCA and Odyssey platforms used for court filings in both criminal and civil cases, which can have very strict deadlines. Odyssey additionally reported that its support phone lines were down for some time along with its file upload tools, requiring all communications to go through several support email addresses.

Aras Nazarovas, senior security researcher at Cybernews, notes some possible legal ramifications: “From initial reporting there are no indications of any security breach, however failing to keep information or resources available for clients can be classified as a cyber incident, even if there was no malicious outsider or malicious intent. Similar outages occur almost every year, and they can be a reminder of how extensive software supply chains have become, showing how a simple issue on a handful of Amazon Data Centers caused thousands of issues to their clients. Clients of affected services were impacted by failing to access their resources and data hosted by AWS for ~4 hours. The impact of such a failure to ensure availability can vary greatly depending on the specific business and industry that used impacted AWS services; in worst case scenarios such an outage could have had serious consequences in critical infrastructure sectors. In the event of such disruptions users should immediately seek alternative solutions for communication (different app, phone calls, SMS, radio) to be able to coordinate next steps towards recovering from such a disruption. It is a good practice to have a ‘Disaster Recovery Plan’ where alternative communication channels and other critical steps have been planned in advance.”

Jeremy Turner, VP of Threat Intelligence and Research at SecurityScorecard, adds: “Today’s AWS outage is a stark reminder that cloud resilience is national resilience. A single disruption can ripple across critical services, finance, and infrastructure. We’re focused on helping organizations gain continuous visibility into their digital supply chain because resilience starts with knowing where you’re vulnerable. Resilience is measured in how many more nines come after 99.9%, but it’s never 100%. The cloud delivers incredible uptime, but it’s also a massive source of risk aggregation. In this case, a database issue with DynamoDB, the system storing DNS records for a significant share of the internet, shows just how much scale amplifies both impact and recovery. When systems and datasets are this large, restoring service within hours is, frankly, remarkable.”
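
Turner’s point about “nines” can be put into numbers. The short Python sketch below converts an availability target into the downtime it permits per year and compares that budget with the roughly 12-hour outage described in this article. The function name `downtime_budget_minutes` and the use of a 365-day year are assumptions for illustration; they do not reflect how AWS or any particular service-level agreement measures availability.

```python
# Minutes in a non-leap year (365 days), used as the measurement window.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


# Outage length reported in this article: about 12 hours.
OUTAGE_MINUTES = 12 * 60

if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        verdict = "exceeds" if OUTAGE_MINUTES > budget else "fits within"
        print(
            f"{target}% allows {budget:.1f} min/year; "
            f"a {OUTAGE_MINUTES}-minute outage {verdict} that budget"
        )
```

At “three nines” (99.9%), the annual budget works out to roughly 8.8 hours, so a single 12-hour disruption would use up more than a full year’s allowance for any service held to that target.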

And Sergiy Balynsky, VP of Engineering at Spin.AI, notes that this should be taken as a prompt to review resiliency: “The AWS outage is a reminder that business continuity planning isn’t optional. Organizations should maintain independent backups and diversify across multiple cloud providers – so a disruption in one platform doesn’t bring operations to a halt. Even the most reliable clouds can fail. A strong business continuity plan should include not only reliable backups, but also cross-platform and multi-cloud redundancy to minimize business disruption and maintain access to critical data when one provider experiences downtime.”