On Tuesday, January 9, from 17:09 EST to approximately 23:42 EST (6 hours and 33 minutes), the cloud.gov platform and customer applications were unavailable. This was the longest and most significant outage in the history of our platform. No data was lost.
The initial cause: Our team was developing a new feature in our development environment, which is separated from our staging and production environments using the Amazon Virtual Private Cloud (VPC) feature. This work included running a vendor software package that created and deleted virtual machine instances to complete its tasks. Creating and deleting virtual machines can’t be restricted by VPC boundaries, because AWS permissions for those operations apply to the whole account rather than to a single VPC. At 17:09, the package’s cleanup task ran. It was intended to remove the virtual machine instances the package had created, but it was not written to delete only those instances: it removed all of the virtual machine instances in the AWS account, including production instances. Databases and storage volumes were not affected.
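To make the failure mode concrete, here is a minimal sketch in Python using boto3, the AWS SDK. This is not the vendor’s actual code, and the region, tag name, and tag value are hypothetical; it only contrasts an account-wide cleanup with one scoped to the instances a tool created itself.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

def terminate_all_instances():
    """Unscoped cleanup: collects every instance the credentials can see
    in the account, including production, and terminates them all."""
    reservations = ec2.describe_instances()["Reservations"]
    ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if ids:
        ec2.terminate_instances(InstanceIds=ids)

def terminate_only_own_instances():
    """Scoped cleanup: terminates only instances carrying a tag the tool
    applied when it created them ("created-by" is a hypothetical tag)."""
    reservations = ec2.describe_instances(
        Filters=[{"Name": "tag:created-by", "Values": ["vendor-package"]}]
    )["Reservations"]
    ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if ids:
        ec2.terminate_instances(InstanceIds=ids)
```

Tagging instances at creation time and filtering on that tag during cleanup is the kind of scoping that would have limited the deletion to the development environment.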
We immediately received an alert and identified what had deleted the instances. We have a documented process for completely re-deploying the platform, and we began following it. This is a “bootstrapping” process: we sequentially deploy pieces of software, then use them to deploy the next pieces of the system. Each component is restored from our version-controlled configuration files instead of being configured manually.
We first needed to re-deploy our core continuous integration and orchestration tools. Once these tools were in place, we used them to recreate all of the virtual machines required for the platform and to link them with our existing storage volumes and databases, as sketched below. Next, we verified that customer applications and services were available.
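For illustration, the underlying AWS operation that re-links a surviving storage volume to a recreated instance looks roughly like this. The IDs and device name are placeholders, and in practice our tooling performs this step from version-controlled configuration rather than by hand.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

# Placeholder IDs: a surviving EBS data volume and a recreated instance.
ec2.attach_volume(
    VolumeId="vol-0123456789abcdef0",
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)
```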
Our re-deployment process worked as documented, and we maintained our documented security controls throughout the recovery.
We conducted a full analysis of the causes of this problem, focusing on the structural and procedural conditions that allowed the outage to occur. We have planned a set of mitigations to address the root issues.
Coordinate with upstream to reduce technical risk
We will work with the vendor to improve the problematic task in this software package. We will also discuss how they can better communicate the maturity and risk level of the packages available for teams to use. This will help our team and other teams judge how deeply to evaluate a package before testing its code.
Reduce platform re-deployment time
To recover from the deletion of all virtual machines, we had to re-deploy the platform from scratch, including the “bootstrap” steps that install and configure our entire set of deployment tools, followed by deployment of all of the platform components.
We ran into technical issues during the bootstrap steps, such as references to outdated versions of tools, and we had to troubleshoot and correct these during the outage. We don’t routinely run the entire bootstrap as part of normal operations, so we’re committing to a more frequent cadence for reviewing, updating, and testing the bootstrap process, as sketched below. This will reduce the potential impact of other unexpected serious technical problems in the future.
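A rehearsal of this kind can also include simple automated checks. As a sketch, assuming a hypothetical pin file that maps each bootstrap tool to the version the process expects (the tool names shown are examples, not our actual list), a script like this flags stale references before they surface during an emergency:

```python
import json
import subprocess

# Hypothetical pin file, e.g. {"terraform": "0.11.1", "bosh": "2.0.45"};
# tool names and versions here are illustrative only.
with open("bootstrap-tool-versions.json") as f:
    pins = json.load(f)

for tool, expected in pins.items():
    # Most CLIs report their version via a --version flag; some print
    # it to stderr, so check both streams.
    result = subprocess.run([tool, "--version"], capture_output=True, text=True)
    reported = (result.stdout + result.stderr).strip()
    if expected not in reported:
        print(f"{tool}: pinned {expected}, but found: {reported}")
```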
Isolate development environment
Separating our development environment from production using the Virtual Private Cloud feature was insufficient to prevent this problem. We will implement technical measures that further isolate our development environment from production by moving development into its own AWS account, which would have prevented our core technical issue from affecting production. This is the key mitigation, because it is always possible to encounter new and unexpected kinds of technical errors, especially when working on improvements and new features. By completely isolating the environment where we test new things, we will keep many kinds of potential risks from reaching production.
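For context, a separate account can be created as a member of an AWS Organization, so that account-scoped permissions (like instance deletion) stop at the account boundary. A minimal sketch using boto3 follows; the email address and account name are placeholders, not our actual values.

```python
import boto3

org = boto3.client("organizations")

# Create a dedicated member account for development (placeholder values).
resp = org.create_account(
    Email="cloud-gov-dev@example.gov",
    AccountName="cloud-gov-development",
)

# Account creation is asynchronous, so poll the request status.
status = org.describe_create_account_status(
    CreateAccountRequestId=resp["CreateAccountStatus"]["Id"]
)
print(status["CreateAccountStatus"]["State"])  # IN_PROGRESS, SUCCEEDED, or FAILED
```

Because credentials issued in the development account have no standing in the production account, a runaway cleanup task there could at worst delete development resources.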
Follow-ups
We will note our progress on these mitigations in our platform release notes, which we publish at https://cloud.gov/updates/ and by email to cloud.gov users.