For approximately 20 hours, six customer domains hosted through our custom domain service returned 503 errors for all requests.
The Custom Domain Broker allows customers to route custom domains to Cloud Foundry domains on cloud.gov. It is ultimately backed by Amazon Application Load Balancers (ALBs), with each custom domain pointing to one specific ALB. The custom domain broker is responsible for creating, updating, and deleting instances of the service, as well as enforcing the limit on the number of domains per ALB.
Amazon ALBs have one or more listeners that are configured via rules to route traffic to a target group containing one or more targets. In our case, the targets are virtual machines acting as Cloud Foundry routers.
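To make the ALB → listener → target group → target chain concrete, here is a minimal, hypothetical model of that hierarchy with a check for target groups that have no registered targets (the failure mode in this incident). The class names, instance IDs, and port are illustrative; in production this data comes from the AWS `elbv2` API (e.g. `describe-target-groups`, `describe-target-health`), which this sketch only mirrors.

```python
# Hypothetical, simplified model of the ALB -> listener -> target group -> target
# chain described above. Real data would come from the AWS elbv2 API; the names
# and IDs below are illustrative only.
from dataclasses import dataclass, field


@dataclass
class TargetGroup:
    name: str
    # e.g. EC2 instance IDs of the Cloud Foundry routers acting as targets
    targets: list = field(default_factory=list)


@dataclass
class Listener:
    port: int
    target_group: TargetGroup


@dataclass
class ALB:
    name: str
    listeners: list = field(default_factory=list)


def empty_target_groups(albs):
    """Return (alb_name, target_group_name) pairs with no registered targets.

    A target group with zero targets has nowhere to send traffic, so every
    request routed to it fails with a 503.
    """
    return [
        (alb.name, listener.target_group.name)
        for alb in albs
        for listener in alb.listeners
        if not listener.target_group.targets
    ]


# Example: an established ALB with routers registered, and a newly created
# ALB whose target group was left empty.
old = ALB("alb-1", [Listener(443, TargetGroup("tg-1", ["i-router-a", "i-router-b"]))])
new = ALB("alb-2", [Listener(443, TargetGroup("tg-2", []))])
print(empty_target_groups([old, new]))  # -> [('alb-2', 'tg-2')]
```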
2019-05-31 - We found that all of the ALBs available to the custom domain broker had as many domains provisioned to them as the broker would allow, so we scaled up the number of ALBs using Terraform. After restarting the broker to load the change, it began provisioning new domains to the new ALB.
2019-06-05 11:52 AM EDT - The first customer provisioned a new domain that landed on the new ALB, and alerted us that it was taking too long to provision.
2019-06-05 1:57 PM EDT - We found that the target group for the new ALB had no targets registered, and manually added the Cloud Foundry routers as targets. We then began investigating why they were not added automatically.
2019-06-06 11:00 AM EDT (approx.) - We ran a task as part of periodic platform maintenance that caused a rolling restart of the Cloud Foundry routers.
2019-06-06 6:12:39 PM EDT - The Cloud Foundry router rolling restart began.
2019-06-06 6:20:42 PM EDT - The Cloud Foundry router rolling restart ended.
The new routers were automatically added to all but the newest ALB used by the custom domain service. At this time, all six of the domains hosted on the newest ALB started serving 503s to all requests.
2019-06-07 1:55 PM EDT - A user representing one of those domains contacted us to report problems with their domain.
2019-06-07 2:05:05 PM EDT - The operations team received the request.
2019-06-07 2:15:00 PM EDT (approx.) - The operations team began researching the issue.
2019-06-07 2:30:17 PM EDT - The operations team identified the new ALB as the source of the issue, and realized that multiple users were affected.
2019-06-07 2:30:57 PM EDT - The operations team again updated the target groups manually.
2019-06-07 2:23:22 PM EDT - The operations team validated that traffic was flowing for the domain the user had reported, and continued to test the other domains.
Our system is designed around infrastructure as code, and we strive to keep hand-built infrastructure out of production. When we made this manual update on Friday, we did so acknowledging that we still had configuration created by hand. We discussed this as a team and determined it was the most stable option: the automated process that updates this configuration takes several hours to run, and we did not want to risk it causing issues while we were not around to correct them. Normally we trust our automated builds and deployments, but we had already demonstrated that we did not fully understand the scaling process for this application.
We plan to update this on Tuesday, June 11th.
We learned about this problem when a customer alerted us to it, which is never how we want to detect problems. We have plans to add alerting to detect problems like this in the near future.
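One way to catch this class of failure automatically is to alert whenever a target group behind one of the custom-domain ALBs has no healthy targets. A hedged sketch of that alert condition follows; the input shape loosely mirrors the target health states returned by the AWS `elbv2` `describe-target-health` API, and the target group names in the example are illustrative.

```python
# Sketch of an alert condition: flag any target group with no healthy targets.
# The input loosely mirrors the AWS elbv2 describe-target-health response
# (TargetHealthDescriptions[].TargetHealth.State); names here are illustrative.


def groups_with_no_healthy_targets(health_by_group):
    """health_by_group maps target group name -> list of target health states.

    Returns the names of target groups that would 503 all traffic, either
    because no targets are registered or because none are passing health checks.
    """
    return sorted(
        name
        for name, states in health_by_group.items()
        if not any(state == "healthy" for state in states)
    )


# Example: tg-new has routers registered but none passing health checks yet,
# and tg-empty has no targets at all -- both should page an operator.
sample = {
    "tg-old": ["healthy", "healthy"],
    "tg-new": ["unhealthy", "initial"],
    "tg-empty": [],
}
print(groups_with_no_healthy_targets(sample))  # -> ['tg-empty', 'tg-new']
```

In practice the same signal is available without custom polling: a CloudWatch alarm on the ALB's `HealthyHostCount` metric dropping to zero covers the registered-but-unhealthy case, though not a target group that was never wired up to targets at all.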
We learned that we did not fully understand the process of scaling up the ALBs before we began, so we are adding documentation for it.
Even before this incident began, we had discovered that the code for the custom domain broker (and the CDN broker, which is nearly identical) is fragile, difficult to debug, and scattered across several repositories. We were already working on a redesign of the custom domain and CDN brokers, and this incident is informing the redesign. Specifically, we’re aiming to make the scaling process a single operation from the operator’s perspective.