As part of our normal post-incident process, cloud.gov has held an incident postmortem and determined that the primary causes of this incident were:
- A significant and prolonged increase in traffic to the platform, which lasted from roughly 6 PM to 10:25 PM EDT on October 21
- Degradation in the performance of load balancers due to the volume of traffic
- Delayed or failed responses from routing infrastructure to incoming requests due to the platform traffic surge
To mitigate these causes and help prevent a similar event in the future, the team has identified several changes to make to our systems and processes:
- We will implement more effective rate limiting for requests to our platform so that the impact of traffic surges can be mitigated
- We will explore implementing AWS Shield Advanced as an additional layer of protection on the CloudFront CDNs brokered via the platform
- We will add dashboards and monitoring to help identify future traffic surges in real time so that we can respond proactively instead of reactively
- We will work to better separate the network paths for different types of traffic within our platform so that we can apply better-targeted mitigations for malicious traffic
- We will work to provision separate routing infrastructure for different types of traffic to our platform
- We will publish information about the DDoS protections included in our platform on cloud.gov
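For readers unfamiliar with rate limiting, one common technique is a token bucket: each client gets a budget of tokens that refills at a steady rate, and a request is rejected when the budget is empty. The sketch below is illustrative only. The `TokenBucket` class, its parameters, and the chosen rates are hypothetical and do not describe the rate limiting that cloud.gov uses or plans to use.

```python
import time


class TokenBucket:
    """Illustrative token-bucket limiter (hypothetical; not cloud.gov's implementation)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added back per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable clock, useful for testing
        self.tokens = float(capacity)  # start with a full bucket
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True if a request costing `cost` tokens may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    # Assumed example limits: 2 requests per second, bursts of up to 3.
    bucket = TokenBucket(rate=2, capacity=3)
    print([bucket.allow() for _ in range(5)])
```

With limits like these, a short burst is served immediately while a sustained surge is throttled to the refill rate, which is the general behavior the rate-limiting item above aims for.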
Thank you. As always, please feel free to reach out to our support team at support@cloud.gov if you have questions or concerns.