On October 21, 2023, the platform experienced a partial outage due to a sustained increase in traffic. In response to this incident, the cloud.gov team immediately prioritized work to mitigate the effects of traffic surges on the platform.
That work added valuable protections to the platform, but it was also a complex process. Because cloud.gov is multi-tenant, it is difficult to ensure that protections against malicious traffic do not also block legitimate traffic.
On October 27, 2023, the team received reports that some legitimate traffic to the platform was being blocked and began investigating. Once the causes of the traffic interruptions were identified, the team immediately applied the fixes so that the legitimate traffic could be restored.
Unfortunately, in the process of adjusting the web application firewall (WAF) rules that protect the platform from malicious traffic, around 1:35 PM ET an engineer made a change that blocked traffic from any IP address outside the internal IP CIDR ranges or public egress IP CIDR ranges for cloud.gov. Since customer traffic cannot come from these IP ranges, the effect of this change was to block almost all traffic into the platform.
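To illustrate why a rule of this shape blocks nearly everything, here is a minimal Python sketch of the logic. It is not the actual WAF configuration: the CIDR ranges, the sample addresses, and the `is_blocked` helper are all hypothetical placeholders chosen for illustration.

```python
import ipaddress

# Hypothetical placeholder ranges; the real cloud.gov ranges are not shown here.
INTERNAL_RANGES = [ipaddress.ip_network("10.0.0.0/8")]
EGRESS_RANGES = [ipaddress.ip_network("192.0.2.0/24")]
ALLOWED = INTERNAL_RANGES + EGRESS_RANGES


def is_blocked(source_ip: str) -> bool:
    """Return True if a rule that blocks every address outside ALLOWED
    would block a request from source_ip."""
    addr = ipaddress.ip_address(source_ip)
    return not any(addr in net for net in ALLOWED)


# Made-up sample sources: two external (customer-like) addresses,
# one internal address, and one platform egress address.
sample_sources = ["203.0.113.7", "198.51.100.20", "10.1.2.3", "192.0.2.15"]
blocked = [ip for ip in sample_sources if is_blocked(ip)]
print(blocked)
```

In this sketch only the addresses inside the two allowed ranges pass. Any request from an ordinary external address is rejected, which is the effect customers saw.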
In response to customers reporting outages for their sites and the team’s independent confirmation of a platform-wide outage, the problematic WAF rule was disabled around 1:38 PM ET and customer traffic was immediately restored.
As part of our normal post-incident process, the cloud.gov team has conducted a post-mortem for this incident and determined that its primary causes were:
To address these issues, the team will pursue the following changes to our systems and processes:
As always, thank you for being a cloud.gov customer. If you have any questions, don’t hesitate to contact us at support@cloud.gov.