Some customers received HTTP 5XX errors accessing their applications

Incident Report for cloud.gov

Postmortem

As part of our normal post-incident process, cloud.gov has held an incident postmortem and determined that the primary causes of this incident were:

A significant and prolonged increase in traffic to the platform, which lasted from roughly 6 PM to 10:25 PM EDT on October 21
Degradation in the performance of load balancers due to the volume of traffic
Delayed or failed responses from routing infrastructure to incoming requests due to the platform traffic surge

To mitigate the causes of this incident and help prevent a similar event in the future, the team has identified several changes to make to our systems and processes:

We will implement more effective rate limiting for requests to our platform so that the impact of traffic surges can be mitigated
We will explore implementing AWS Shield Advanced as an additional layer of protection on the CloudFront CDNs brokered via the platform
We will add dashboards and monitoring to help identify future traffic surges in real time so that we can respond proactively rather than reactively
We will work to better separate the network paths for different types of traffic within our platform so that we can apply better-targeted mitigations for malicious traffic
We will work to provision separate routing infrastructure for different types of traffic to our platform
We will publish information about the DDoS protections included in our platform on cloud.gov
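
For illustration only: one common way to implement request rate limiting of the kind mentioned in the first item above is a token bucket. The sketch below is a minimal in-process Python version. The class name, parameters, and the choice of algorithm are assumptions made for this example and do not describe cloud.gov's actual or planned implementation.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical token-bucket limiter.

    Allows bursts of up to `capacity` requests and refills at
    `rate` tokens per second.
    """

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if a request costing `cost` tokens may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        # Over the limit: a caller would typically answer HTTP 429 here.
        return False
```

In a sketch like this, a router would usually keep one bucket per client or per route and reject requests for which allow() returns False, so that a surge from one source does not exhaust capacity shared by all traffic.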

Thank you. As always, please feel free to reach out to our support team at support@cloud.gov if you have questions or concerns.

Posted Nov 01, 2023 - 14:34 EDT

Resolved

The cloud.gov platform experienced a burst of traffic, which caused some customers to receive HTTP 5XX errors while trying to access their applications. This occurred on 10/21 from 18:30 to 19:30 and again from 20:00 to 22:00 EDT.

The cloud.gov support team has identified the issue, is monitoring for further events, and is working on deploying additional mitigations to production. The platform is now fully available. During the event, no customer applications were down; a subset of applications was not accessible from the internet.

If you have any questions or concerns, please reach out to cloud.gov support at support@cloud.gov.

Posted Oct 21, 2023 - 20:30 EDT