AWS Outage: Amazon Releases Full Root Cause Report
The AWS outage earlier this week crippled many internet services. Amazon has now published a comprehensive report.
(Image: JHVEPhoto/Shutterstock.com)
The outage of Amazon's AWS cloud services on Monday this week caused sleepless nights not only for IT experts but also for owners of connected mattresses, for example. Amazon's technicians have now published a complete analysis of the incidents, explaining how such widespread disruptions could occur.
Even the title of the Amazon analysis points to a single point of failure: "Summary of the Amazon DynamoDB service disruption in the Northern Virginia region (US-EAST-1)". The error that occurred there not only caused outages of Amazon services such as streaming with Prime or Amazon Music but also crippled the messenger Signal for hours. It is all the more exciting to understand how this happened.
AWS Outage: Detailed Error Log
While the technicians had already provided an initial brief summary of the incident when normal operations were restored, which Amazon announced shortly after midnight on Tuesday, October 21, the current analysis goes into considerably more depth. From Amazon's perspective, the error cascade unfolded in three phases. Between 8:48 AM CEST and 11:40 AM on October 20, 2025, Amazon's DynamoDB reported increased error rates for API access in the Northern Virginia region (US-EAST-1). According to Amazon, this was the first phase of the disruption.
The second phase lasted from 2:30 PM to 11:09 PM, during which some Network Load Balancers (NLBs) in Northern Virginia showed increased connection errors, caused by health check failures in the NLB fleet. The third phase ran from 11:25 AM to 7:36 PM, during which launching new EC2 instances failed. Some EC2 instances launched after 7:37 PM, however, still experienced connection problems until 10:50 PM.
DynamoDB Error
Amazon explains the problems with DynamoDB as a "latent defect" in the automated DNS management, which caused name resolution for DynamoDB endpoints to fail. "Many of AWS's largest services rely heavily on DNS to ensure seamless scalability, fault isolation and recovery, low latency, and locality. Services like DynamoDB manage hundreds of thousands of DNS records to operate a very large heterogeneous fleet of load balancers in each region," the technicians write. Such automation is necessary, for example, to add capacity as it becomes available and to handle hardware failures correctly. Although the system is designed for resilience, the cause of the problems was a latent race condition in the DynamoDB DNS management system, which resulted in an empty entry for the regional endpoint "dynamodb.us-east-1.amazonaws.com". The analysis gives interested readers a deep look into the structure of the DNS management system.
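Amazon describes the race only at a high level. How two automation components racing on the same DNS "plan" can leave an endpoint with an empty record set can be sketched in a few lines of Python; the plan model, the timing, and all names below are our own assumptions, not Amazon's implementation:

```python
import threading
import time

# Purely illustrative model of the automated DNS management described in the
# report: a "plan" maps the regional endpoint to a set of load-balancer IPs,
# and independent "enactor" processes apply plans to the DNS store.
# Names, data model, and timing are assumptions, not AWS's implementation.

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"
dns_store = {ENDPOINT: {"plan": 1, "ips": ["10.0.0.1"]}}

def slow_enactor_applies_stale_plan():
    stale = {"plan": 1, "ips": ["10.0.0.1"]}   # an old plan, delayed for a long time
    time.sleep(0.2)                            # newer plans land in the meantime
    dns_store[ENDPOINT] = stale                # blindly overwrites the newer plan

def fast_enactor_then_cleanup():
    dns_store[ENDPOINT] = {"plan": 7, "ips": ["10.0.0.5", "10.0.0.6"]}  # current plan
    time.sleep(0.3)                            # cleanup runs after the stale overwrite
    if dns_store[ENDPOINT]["plan"] < 7:        # record looks outdated -> purge it
        dns_store[ENDPOINT] = {"plan": None, "ips": []}

t1 = threading.Thread(target=slow_enactor_applies_stale_plan)
t2 = threading.Thread(target=fast_enactor_then_cleanup)
t1.start(); t2.start(); t1.join(); t2.join()

print(dns_store)  # {'dynamodb.us-east-1.amazonaws.com': {'plan': None, 'ips': []}}
# The endpoint now resolves to nothing -- the "empty entry" the report describes.
```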
Both internal AWS traffic and customer traffic relying on DynamoDB were affected: because name resolution failed, clients could no longer connect to the service. At 9:38 AM, IT staff identified the error in the DNS management. Initial temporary countermeasures took effect at 10:15 AM, enabling further repairs, so that by 11:25 AM all DNS information was restored.
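From a client's point of view, the effect was mundane: the endpoint name simply no longer yielded a usable address. As a pure illustration, a lookup like the following would have failed during the incident window, while today it resolves normally again:

```python
import socket

# Illustration only: resolve the affected regional endpoint. During the outage
# window the empty DNS entry effectively returned no usable address, so clients
# failed before a connection to DynamoDB was even attempted.
try:
    addresses = socket.getaddrinfo("dynamodb.us-east-1.amazonaws.com", 443)
    print("endpoint resolves to", len(addresses), "addresses")
except socket.gaierror as err:
    print("name resolution failed:", err)   # the failure mode during the outage
```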
EC2 Instances Not Starting
New EC2 instances stopped launching at 8:48 AM. Instances are placed onto physical servers, so-called droplets, by the Droplet Workflow Manager (DWFM). The DWFM tracks the state of these droplets via so-called leases and checks, for example, whether shutdown or reboot operations completed correctly. These checks run every few minutes, but the process depends on DynamoDB and could not be completed successfully during its disruption. State changes require a new lease, and the DWFM's attempts to establish new leases increasingly timed out between 8:48 AM and 11:24 AM. Even after DynamoDB became reachable again, EC2 instances could not start and produced "insufficient capacity errors": the backlog of pending leases created a bottleneck, so new requests still ran into timeouts. Only after the DWFMs were restarted, which cleared the queues, were all droplet leases restored by 2:28 PM. However, because IT staff temporarily throttled requests to reduce the overall load, error messages such as "request limit exceeded" still occurred.
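Amazon does not publish the DWFM code; the following Python sketch only illustrates, under our own assumptions, how a lease-renewal loop that depends on an unreachable backend builds up exactly this kind of backlog, and why clearing the queues helps:

```python
import collections
import random
import time

# Assumption-based sketch: a lease-renewal loop whose backend (standing in for
# DynamoDB) is unreachable for a while. All names and numbers are invented.

RENEW_TIMEOUT = 0.001         # seconds an attempt costs before it counts as failed
backend_available = False     # flips to True once "DynamoDB" is reachable again

def renew_lease(droplet_id: str) -> bool:
    """Pretend to renew a droplet lease; every attempt costs time, even a failed one."""
    time.sleep(RENEW_TIMEOUT)
    return backend_available and random.random() > 0.05

pending = collections.deque(f"droplet-{i}" for i in range(200))  # leases due for renewal

# Phase 1: backend down -> every attempt times out and is re-queued, the backlog persists.
for _ in range(len(pending)):
    droplet = pending.popleft()
    if not renew_lease(droplet):
        pending.append(droplet)            # retry later; the queue never shrinks
print("backlog while the backend is down:", len(pending))

# Phase 2: the backend recovers, but the accumulated backlog still has to drain,
# attempt by attempt -- which is why restarting the DWFMs and clearing the queues
# was the faster way out in the real incident.
backend_available = True
attempts = 0
while pending:
    attempts += 1
    droplet = pending.popleft()
    if not renew_lease(droplet):
        pending.append(droplet)
print("attempts needed to drain the backlog:", attempts)
```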
The droplets, and with them the EC2 instances, receive configuration information from a Network Manager that allows them to communicate with other instances in the same Virtual Private Cloud (VPC), with VPC appliances, and with the internet. Because of the DWFM problems, a large backlog of these configuration updates had built up, which led to significant propagation delays from 3:21 PM onwards. New EC2 instances did start, but they could not communicate on the network because their network configuration was not yet valid. IT staff resolved this by 7:36 PM, after which EC2 launches proceeded "normally" again.
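How such a propagation backlog translates into instances that are running but mute on the network can again be sketched briefly; the single queue and all numbers are assumptions made purely for illustration:

```python
import collections

# Assumption-based sketch (not Amazon's implementation): network configurations
# for droplets/instances are pushed out through a single propagation queue.
# Once a backlog builds up, instances that launch later wait behind it and come
# up before their network configuration has arrived.

PUSHES_PER_MINUTE = 100                       # invented propagation rate
backlog = collections.deque(f"pending-config-{i}" for i in range(20_000))

def launch_instance(instance_id: str) -> float:
    """Enqueue the new instance's network config and return the expected wait."""
    backlog.append(f"config-for-{instance_id}")
    return len(backlog) / PUSHES_PER_MINUTE   # everything ahead must drain first

wait_minutes = launch_instance("i-0abc1234567890def")
print(f"instance is running, but its network config arrives in ~{wait_minutes:.0f} min")
# Until then the instance cannot talk to its VPC or the internet -- the state the
# report describes for instances launched between 3:21 PM and 7:36 PM.
```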
Connection Errors Due to Network Load Balancer
AWS Network Load Balancers (NLBs) consist of load-balancing nodes that distribute traffic to backend systems, typically EC2 instances, and they rely on a monitoring system (the health check system). This system regularly checks all NLB nodes and backends and removes any that are identified as "unhealthy". During the disruption, however, the checks increasingly failed because newly started EC2 instances could not report their network status. In some cases the checks failed even though the NLB nodes and backend systems were working correctly. The check results alternated between healthy and failed, so NLB nodes and backend systems were removed from DNS only to be added back during the next successful run. Network monitoring picked this up around 3:52 PM.
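This flapping can be reduced to a tiny simulation; the probe behavior and the cadence below are our own assumptions, not the actual NLB health check logic:

```python
# Illustration only: an intermittently failing health check makes a target flap
# in and out of DNS. Probe behavior, names, and cadence are assumptions.

def probe(target: str, tick: int) -> bool:
    """Stand-in health probe: the backend is actually fine, but every third
    probe fails because its network state cannot be reported in time."""
    return tick % 3 != 0

in_dns = {"nlb-node-a": True}

for tick in range(1, 10):
    healthy = probe("nlb-node-a", tick)
    if not healthy and in_dns["nlb-node-a"]:
        in_dns["nlb-node-a"] = False      # removed from DNS after a failed check
        print(f"tick {tick}: removed from DNS")
    elif healthy and not in_dns["nlb-node-a"]:
        in_dns["nlb-node-a"] = True       # re-added on the next successful check
        print(f"tick {tick}: added back to DNS")
# Every removal/re-addition cycle costs capacity and adds load on the check system.
```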
The alternating check results also increased the load on the health check system itself, slowing it down and delaying further checks. As more and more NLB nodes were taken out of service, load-balancing capacity shrank, and applications saw increased connection errors whenever the remaining capacity was no longer sufficient for their load. At 6:36 PM, the IT team disabled the automatic health checks for the NLBs, allowing all still-functioning NLB nodes and backend systems to be put back into service. After the EC2 systems had also recovered, the health checks were reactivated at 11:09 PM.
Amazon also walks through the timeline of disruptions for the Amazon services that depend on the affected core systems. As a consequence of the major cloud outage, the IT team plans several changes. The DynamoDB "DNS Planner" and "DNS Enactor" automation, for example, has been disabled worldwide and will remain so until, among other things, the race condition that was encountered has been fixed. Network Load Balancers will get a velocity control that limits how much capacity a single NLB may remove after failed health checks. For EC2, Amazon is developing additional test suites, for example to analyze the DWFM workflow and to prevent future regressions.
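Amazon does not say how this control will be implemented; the idea of capping how much capacity failed health checks may take out of service can be sketched as follows, where the 20 percent budget and all names are invented for illustration:

```python
# Sketch of a capacity-removal cap as an assumption of what such a velocity
# control could look like; the 20 percent limit and all names are invented.

MAX_REMOVED_FRACTION = 0.20   # never take more than 20% of targets out of service

def apply_health_checks(targets: dict[str, bool], failed: set[str]) -> None:
    """Mark failed targets unhealthy, but stop once the removal budget is spent."""
    budget = int(len(targets) * MAX_REMOVED_FRACTION)
    already_removed = sum(not healthy for healthy in targets.values())
    for name in failed:
        if already_removed >= budget:
            # Keep the remaining targets in service even though their checks
            # failed -- degraded checks must not empty the whole fleet.
            break
        targets[name] = False
        already_removed += 1

fleet = {f"backend-{i}": True for i in range(10)}
apply_health_checks(fleet, failed={"backend-0", "backend-1", "backend-2", "backend-3"})
print(sum(fleet.values()), "of", len(fleet), "targets still in service")  # 8 of 10
```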
(dmk)