cloud.gov customer applications and platform returning 503 errors
Incident Report for cloud.gov
Postmortem

What happened

On Tuesday, January 9, 2018, from 17:09 EST to approximately 23:42 EST (6 hours and 33 minutes), the cloud.gov platform and customer applications were unavailable. This was the longest and most significant outage in the history of our platform. No data was lost.

The initial cause: Our team was developing a new feature in our development environment, which is separated from our staging and production environments using the Amazon Virtual Private Cloud (VPC) feature. This work included running a vendor software package that creates and deletes virtual machine instances to complete its tasks. Creating and deleting virtual machines is an operation that a VPC cannot logically restrict, because of the structure of AWS services and permissions: instance management permissions apply across the AWS account rather than being confined to a single VPC. At 17:09, the package’s cleanup task ran. This task is intended to remove the virtual machine instances the package created, but it had an unexpected effect: it removed all of the virtual machine instances in the AWS account, including production instances, because the task was not written to delete only the instances it had created. Databases and storage volumes were not affected.
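
As an illustration of the failure mode (a minimal sketch, not the vendor package’s actual code), the difference between the cleanup that ran and a safely scoped one comes down to whether the task filters to the instances it created, for example by a tag applied at creation time. The tag key and value below are hypothetical, and the sketch assumes AWS credentials and a default region are already configured in the environment.

```python
# Illustrative sketch only -- not the vendor package's actual cleanup code.
# Assumes boto3 credentials and region come from the environment; the tag
# key and value ("created-by": "dev-tooling") are hypothetical.
# Pagination of describe_instances is omitted for brevity.
import boto3

ec2 = boto3.client("ec2")

def instance_ids(reservations):
    return [i["InstanceId"] for r in reservations for i in r["Instances"]]

def cleanup_unscoped():
    # The failing pattern: lists every instance in the AWS account.
    # A VPC is a network boundary, not a permissions boundary, so
    # production instances in other VPCs are included.
    ids = instance_ids(ec2.describe_instances()["Reservations"])
    if ids:
        ec2.terminate_instances(InstanceIds=ids)

def cleanup_scoped_to_tool():
    # The safe pattern: only instances the tool tagged when it created them.
    reservations = ec2.describe_instances(
        Filters=[{"Name": "tag:created-by", "Values": ["dev-tooling"]}]
    )["Reservations"]
    ids = instance_ids(reservations)
    if ids:
        ec2.terminate_instances(InstanceIds=ids)
```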

We immediately received an alert and identified the cause of the deleted instances. We have a documented process for completely re-deploying the platform, and we began following that process. This is a “bootstrapping” process, where we sequentially deploy pieces of software which we then use to deploy the next pieces of the system. Each component is restored using our version-controlled configuration files, instead of being manually configured.

We first needed to re-deploy our core continuous integration and orchestration tools. Once these tools were in place, we used them to recreate all virtual machines required for the platform and linked them with our existing storage volumes and databases. Next, we verified that customer applications and services were available.

Our documented re-deployment process was successful, and we maintained our documented security controls throughout the recovery process.

What we’re doing

We conducted a full analysis of the reasons for this problem, focused on the structural and procedural conditions that enabled this outage to occur. We have planned a set of mitigations to address the root issues.

Coordinate with upstream to reduce technical risk

We will work with the vendor to improve the problematic task in this software package. We will also discuss with them how to better communicate the maturity and risk level of the packages available for teams to use. This will help our team and other teams judge how much evaluation a package needs before we test its code.

Reduce platform re-deployment time

To recover from the deletion of all virtual machines, we had to re-deploy the platform from scratch, including the “bootstrap” steps that install and configure our entire set of deployment tools, followed by deploying all of the platform components.

We ran into technical issues during the bootstrap steps, such as references to older versions of tools, which we had to troubleshoot and correct during the outage. We don’t routinely run the entire bootstrap as part of normal operations, so we’re committing to a more frequent cadence for reviewing, updating, and testing the bootstrap process. This will reduce the potential impact of other unexpected serious technical problems in the future.
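
As one example of the kind of routine check this cadence could include (a sketch only: the pinned-versions file, and the assumption that each tool prints its version with --version, are hypothetical and not our actual tooling), a scheduled job could flag bootstrap tool references that have drifted from what is installed:

```python
# Minimal sketch of a scheduled bootstrap check -- not our actual tooling.
# Assumes a hypothetical bootstrap-versions.json that pins each bootstrap
# tool to an expected version string, e.g. {"bosh": "X.Y.Z"}, and that each
# tool prints its version via "<tool> --version".
import json
import subprocess
import sys

def check_pinned_versions(pin_file="bootstrap-versions.json"):
    with open(pin_file) as f:
        pins = json.load(f)
    drift = []
    for tool, expected in pins.items():
        try:
            out = subprocess.run(
                [tool, "--version"], capture_output=True, text=True, check=False
            )
            installed = (out.stdout or out.stderr).strip()
        except FileNotFoundError:
            installed = "not installed"
        if expected not in installed:
            drift.append((tool, expected, installed))
    return drift

if __name__ == "__main__":
    problems = check_pinned_versions()
    for tool, expected, installed in problems:
        print(f"{tool}: pinned {expected}, found {installed}")
    sys.exit(1 if problems else 0)
```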

Isolate development environment

Separating our development environment from production using the Virtual Private Cloud feature was insufficient to prevent this problem. We will further isolate development from production by moving the development environment into its own AWS account, a technical measure that would have prevented this incident’s core technical issue from affecting production. This is the key mitigation: it is always possible to encounter new and unexpected kinds of technical errors, especially when working on improvements and new features, and completely isolating the environment where we test new things will prevent many kinds of potential risk from affecting production.
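
To sketch why account-level isolation is stronger than VPC separation (the account ID and role name below are hypothetical, and this is not our production configuration), development tooling that only ever obtains credentials for a dedicated development account cannot list or delete production instances, even if a cleanup task is unscoped:

```python
# Illustrative sketch of account-level isolation -- not our actual setup.
# The development account ID and role name are hypothetical; region and
# base credentials are assumed to come from the environment.
import boto3

DEV_ROLE_ARN = "arn:aws:iam::111111111111:role/dev-environment-operator"  # hypothetical

def dev_account_ec2_client():
    # Credentials come from assuming a role that exists only in the
    # development account, so any EC2 call made with this client is
    # confined to that account's resources.
    creds = boto3.client("sts").assume_role(
        RoleArn=DEV_ROLE_ARN,
        RoleSessionName="dev-environment-work",
    )["Credentials"]
    return boto3.client(
        "ec2",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

# Even an unscoped describe/terminate issued through this client can only
# reach instances in the development account, never production.
ec2_dev = dev_account_ec2_client()
```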

Follow-ups

We will note our progress on these mitigations in our platform release notes, which we publish at https://cloud.gov/updates/ and by email to cloud.gov users.

Posted Jan 11, 2018 - 15:22 EST

Resolved
We've monitored the platform for any lingering problems from yesterday evening's outage, and the core platform has been stable. cloud.gov account login (for non-SSO user accounts) was unavailable until 09:04 EST today. We will update again with a postmortem analysis.

Currently some Redis service instances may be slow or unavailable, and we're working on resolving that.
Posted Jan 10, 2018 - 13:52 EST
Monitoring
Customer applications are available again, including Elasticsearch and Redis services.

We’ve verified that the main platform components are working as expected, including logging, the dashboard, and the CF CLI.

Currently, Elasticsearch and Redis may respond slowly, and cf push may return errors. This will be resolved soon, and we are monitoring the platform for any additional issues.

We plan to post again tomorrow with any updates. We will conduct a root cause analysis, and publish our analysis and how we will prevent this kind of problem in the future.

We apologize for this major outage. We understand and care that your applications deliver important services to the public and internal staff. This is the longest downtime we’ve had in the entire life of cloud.gov as a system, and we will learn from this event and improve our operations.
Posted Jan 09, 2018 - 23:42 EST
Update
Most customer applications are now returning 404 errors ("Requested route ('example.gov') does not exist.") instead of the previous 503 errors.

This is a sign of progress: platform components that route requested URLs to customer applications are working. The platform has not yet started running the customer applications, so the result of routing is a Not Found error for now.
Posted Jan 09, 2018 - 23:08 EST
Update
We’ve restored some services in staging, including logging, and we continue to restore the rest of the staging environment.

We are also in the process of restoring production services. We expect login and the CF CLI to begin working within approximately the next hour, and some customer applications will start coming back online as well.

We expect full restoration of the Redis and Elasticsearch services to take additional time, because they depend on restoring additional production components.
Posted Jan 09, 2018 - 22:32 EST
Update
We continue to restore the tools and components that build the system. This is a “bootstrapping” process, where we sequentially deploy pieces of software which we then use to deploy the next pieces of the system. Each component is restored using our version-controlled configuration files, instead of being manually configured.

During normal operations, this heavily-automated structure enables us to deploy frequent updates rapidly and reliably across the entire system. Because the original misconfigured script completely removed some of these components, we need to re-construct these components step by step. This takes time, but it ensures a reliable result.
Posted Jan 09, 2018 - 21:27 EST
Update
We’re still re-deploying platform components. We’ve restored a few pieces using our deployment tools, and we’re still re-building the rest of the components necessary to restore services.
Posted Jan 09, 2018 - 20:26 EST
Update
We’re continuing to reconstruct platform components. We’ve re-established the core of our continuous integration and continuous deployment tool (Concourse) as well as our main component orchestration tool (BOSH). These tools enable us to deploy the rest of the platform components, which we are working on now.

We will continue to update once an hour until this work is complete. After we analyze the root causes in detail, we will post our analysis and our planned actions for preventing this from happening again.
Posted Jan 09, 2018 - 19:31 EST
Update
We’ve determined that a script removed some production platform components due to misconfiguration by a cloud.gov team member. This means that we need to re-construct these platform components from their stored configurations.

No customer data has been lost. The process of full reconstruction will take time, likely more than an hour but less than five hours. If you have any questions, email cloud-gov-support@gsa.gov.
Posted Jan 09, 2018 - 18:28 EST
Identified
Multiple customer applications and cloud.gov services are returning HTTP 503 errors as of 5:09 PM EST. We have identified the root cause of this issue and are working to restore service.
Posted Jan 09, 2018 - 17:29 EST