Partial outage for CDN-based traffic
Incident Report for cloud.gov
Postmortem

Incident timeline

At approximately 3:11 PM ET on November 1, 2023, a customer notified us that their site was intermittently failing to load.

The team immediately began investigating and found that the site in question was accessed through Amazon CloudFront. Further investigation revealed that the cause of the load issues was an improper configuration for identifying CloudFront traffic in our WAF rules, which was causing traffic from CloudFront to be rate-limited too aggressively.

At 3:21 PM ET, the team updated the configuration for identifying CloudFront traffic. After the change was made, the team did not observe any further improper rate limiting for CloudFront traffic.

Postmortem analysis

As of November 3, 2023, cloud.gov has two separate types of rate limits on our platform:

  • For traffic coming through CloudFront, we rate limit with a CHALLENGE action by requests per forwarded IP address per 5 minutes
  • For traffic not coming through CloudFront, we rate limit with a CHALLENGE action by requests per source IP address per 5 minutes

The distinction between forwarded IP address and source IP address is crucial for CloudFront traffic. The source IP address of a request arriving through Amazon CloudFront belongs to an IP range for the CloudFront service itself, while the address of the actual client making the request to CloudFront comes through as the forwarded IP (usually in an X-Forwarded-For header). Rate limiting CloudFront traffic by source IP address therefore counts every request arriving from a given CloudFront IP address against a single limit. Because that IP address can be shared by all customers using CloudFront, requests for any site behind CloudFront may be denied intermittently.
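As a rough illustration of why the choice of key matters, the following Python sketch counts requests per key in both modes. It is not our WAF configuration: the function name rate_limit_key, the sample IP addresses (taken from documentation ranges), and the simplification of reading the first X-Forwarded-For entry as the client address are all assumptions made for the example.

```python
from collections import Counter


def rate_limit_key(source_ip, headers, use_forwarded_ip):
    """Return the IP address a request is counted against (hypothetical helper)."""
    if use_forwarded_ip:
        forwarded = headers.get("X-Forwarded-For", "")
        # Simplifying assumption: the first entry is the original client.
        client_ip = forwarded.split(",")[0].strip()
        if client_ip:
            return client_ip
    # Fall back to the address the connection actually came from.
    return source_ip


# Three hypothetical clients reaching the platform through one CloudFront IP.
requests = [
    ("198.51.100.7", {"X-Forwarded-For": "203.0.113.10"}),
    ("198.51.100.7", {"X-Forwarded-For": "203.0.113.11"}),
    ("198.51.100.7", {"X-Forwarded-For": "203.0.113.12"}),
]

# Keyed by source IP: all three clients share a single counter.
by_source = Counter(rate_limit_key(ip, h, False) for ip, h in requests)

# Keyed by forwarded IP: each client gets its own counter.
by_forwarded = Counter(rate_limit_key(ip, h, True) for ip, h in requests)

print(dict(by_source))
print(dict(by_forwarded))
```

Keyed by source IP, the three unrelated clients exhaust one shared budget; keyed by forwarded IP, each is counted separately.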

In this case, CloudFront traffic on our platform was rate limited by source IP because of a simple misspelling in the User-Agent header value used to detect CloudFront traffic. The actual user agent for CloudFront traffic is Amazon CloudFront, whereas our rate limit rules matched on a header value of Amazon Cloudfront (note the lowercase "f"; the match is case-sensitive). Once the team recognized this problem, the User-Agent value used to identify CloudFront traffic was corrected at 3:21 PM ET, and the team saw no further improper rate limiting.
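The effect of the misspelling can be sketched in a few lines of Python. This is only an illustration of a case-sensitive exact match, not the actual WAF rule; the function select_rate_limit and the labels "forwarded-ip" and "source-ip" are hypothetical names chosen for the example.

```python
ACTUAL_USER_AGENT = "Amazon CloudFront"      # value sent by CloudFront
MISSPELLED_RULE_VALUE = "Amazon Cloudfront"  # value the rules matched on


def select_rate_limit(user_agent, expected_value):
    """Pick which rate limit applies, using a case-sensitive exact match."""
    if user_agent == expected_value:
        return "forwarded-ip"
    # Anything not recognized as CloudFront is limited by source IP.
    return "source-ip"


# With the misspelled value, CloudFront traffic is not recognized.
before_fix = select_rate_limit(ACTUAL_USER_AGENT, MISSPELLED_RULE_VALUE)

# With the corrected value, CloudFront traffic gets the intended limit.
after_fix = select_rate_limit(ACTUAL_USER_AGENT, ACTUAL_USER_AGENT)

print(before_fix, after_fix)
```

Because the comparison is exact, a single wrong letter case is enough to route all CloudFront traffic to the source IP limit.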

Posted Nov 03, 2023 - 15:06 EDT

Resolved
Customers whose traffic passes through a CloudFront CDN may have experienced intermittent failures when trying to load their applications or webpages.

The cause of the problem was an incorrect header value used to identify CDN traffic when applying rate limits. The problem has been resolved, and customers using a CDN should no longer see degraded performance or failures from their applications.
Posted Nov 01, 2023 - 15:21 EDT