Cloud.gov and api.fr.cloud.gov Outage
Incident Report for cloud.gov
Postmortem

On October 21, 2023, the platform experienced a partial outage due to a sustained increase in traffic. In response to this incident, the cloud.gov team immediately prioritized work to mitigate the effects of traffic surges on the platform.

While the team did add valuable protections to the platform as part of that work, it was also a complex process due to the multi-tenant nature of cloud.gov and the associated difficulty of ensuring that legitimate traffic is not blocked by any protections against malicious traffic.

On October 27, 2023, the team received reports that some legitimate traffic to the platform was being blocked and began investigating. Once the causes of the traffic interruptions were identified, the team immediately applied the fixes so that the legitimate traffic could be restored.

Unfortunately, in the process of adjusting the web application firewall (WAF) rules that protect the platform from malicious traffic, around 1:35 PM ET an engineer made a change that blocked traffic from any IP that was not in the internal IP CIDR ranges or public egress IP CIDR ranges for cloud.gov. Since customer traffic cannot come from these IP ranges, the effect of this change was to block almost all traffic into the platform.
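To illustrate why a rule of this shape blocks nearly everything, here is a minimal Python sketch of the logic described above. It is not the actual WAF configuration: the CIDR ranges are hypothetical placeholders (a private range and a reserved documentation range), and the names `INTERNAL_RANGES`, `EGRESS_RANGES`, and `is_blocked` are invented for this example.

```python
import ipaddress

# Hypothetical placeholder ranges. The real cloud.gov internal and
# public egress CIDR ranges are not given in this report.
INTERNAL_RANGES = [ipaddress.ip_network("10.0.0.0/8")]
EGRESS_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def is_blocked(source_ip: str) -> bool:
    """Return True if a 'block unless in an allowed range' rule rejects source_ip."""
    addr = ipaddress.ip_address(source_ip)
    allowed = INTERNAL_RANGES + EGRESS_RANGES
    # The rule blocks any address that falls outside every allowed range.
    return not any(addr in net for net in allowed)


# A customer on the public internet is outside both ranges, so it is blocked.
print(is_blocked("203.0.113.7"))
# Traffic originating inside the (hypothetical) internal range passes.
print(is_blocked("10.1.2.3"))
```

Because customer requests originate outside both sets of ranges, every customer address takes the blocked path in this sketch.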

In response to customers reporting outages for their sites and the team’s independent confirmation of a platform-wide outage, the problematic WAF rule was disabled around 1:38 PM ET and customer traffic was immediately restored.

As part of our normal post-incident process, the cloud.gov team has conducted a post-mortem for this incident and determined that its primary causes were:

  • Making changes to WAF rules directly in the production environment without promoting and testing them in lower environments first
  • Complexity of managing multiple conditions on firewall rules
  • Engineer fatigue and exhaustion from responding to multiple recent incidents
  • Difficulty of testing WAF rules in lower environments prior to production

To address these issues, the team will pursue the following changes to our systems and processes:

  • Make sure to always promote WAF changes through lower environments using normal CI deployment processes
  • Make sure to rotate team members doing incident response at least every 48 hours
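As one illustration of what a check in a lower environment could look like, the Python sketch below evaluates a candidate rule against a sample of source addresses that should be allowed and refuses promotion if any are blocked. This is an assumption about how such a check might be written, not a description of cloud.gov's CI tooling: the range, the sample addresses, and the names `candidate_rule_blocks`, `blocked_fraction`, and `safe_to_promote` are all hypothetical.

```python
import ipaddress
from typing import Callable, Iterable

# Hypothetical allowed range used by the candidate rule under test.
ALLOWED = [ipaddress.ip_network("10.0.0.0/8")]

# Hypothetical sample of legitimate customer source addresses
# (taken from reserved documentation ranges).
SAMPLE_CUSTOMER_IPS = ["198.51.100.10", "203.0.113.25", "192.0.2.200"]


def candidate_rule_blocks(source_ip: str) -> bool:
    """Candidate rule: block any address outside the allowed ranges."""
    addr = ipaddress.ip_address(source_ip)
    return not any(addr in net for net in ALLOWED)


def blocked_fraction(rule_blocks: Callable[[str], bool],
                     sample_ips: Iterable[str]) -> float:
    """Fraction of the sample addresses that the rule would block."""
    ips = list(sample_ips)
    if not ips:
        raise ValueError("sample_ips must not be empty")
    return sum(1 for ip in ips if rule_blocks(ip)) / len(ips)


def safe_to_promote(rule_blocks: Callable[[str], bool],
                    sample_ips: Iterable[str],
                    max_blocked: float = 0.0) -> bool:
    """Allow promotion only if the blocked fraction is within the threshold."""
    return blocked_fraction(rule_blocks, sample_ips) <= max_blocked


if __name__ == "__main__":
    if not safe_to_promote(candidate_rule_blocks, SAMPLE_CUSTOMER_IPS):
        print("Candidate rule blocks legitimate sample traffic; do not promote.")
```

A check like this only covers the addresses in the sample, so it complements rather than replaces testing the rule against real traffic in a lower environment.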

As always, thank you for being a cloud.gov customer. If you have any questions, don’t hesitate to contact us at support@cloud.gov.

Posted Nov 01, 2023 - 17:10 EDT

Resolved
From approximately 11:34 AM ET to 1:38 PM ET, while the team was attempting to mitigate previous DDoS attacks, new WAF rules were added to the platform load balancers. These rules caused some traffic targeting api.fr.cloud.gov to be blocked. An additional change at 1:34 PM ET blocked access to a majority of the platform until 1:38 PM ET.

The outage was resolved when the WAF rule changes were reverted and deployed into production at 1:38 PM EDT.

Timeline:
11:31 AM ET: An internal cloud.gov tool began failing, alerting the platform team to investigate.

12:15 PM ET: A small subset of cloud.gov customers connecting to api.fr.cloud.gov from within the platform began to notice failures to connect.

1:30 PM ET: The platform team began to investigate the latest changes to the WAF rules as a possible problem.

1:35 PM ET: Customers notified us they could no longer access cloud.gov or api.fr.cloud.gov.

1:38 PM ET: The WAF rule changes were reverted and platform functionality was restored.

Update to this incident: after this notice was posted, additional customers notified us that access to their applications had been lost across a large portion of the platform, and that access was then restored. This happened during the 1:34 PM to 1:38 PM EDT window.
Posted Oct 27, 2023 - 11:30 EDT