Application and platform errors in GovCloud environment
Incident Report for cloud.gov
Postmortem

What Happened

We scheduled a maintenance window to update BOSH, the component we use to orchestrate and automate deployments of the rest of cloud.gov. In addition to deploying other parts of the platform, this orchestration component also supplies domain name resolution (DNS) services, so that the other components of the platform can contact each other by name without knowing each other's exact location (IP address) within the larger system, including the internet. Ordinarily, we do not expect an orchestrator update to affect the availability of customer applications.

This time, the orchestrator was offline for a longer period than usual. Systems that depend on DNS have some caching mechanisms built in, but the orchestrator was unavailable for longer than the caching period. Once the cached entries expired, the system could not look up other names, including customer application names, until the orchestrator upgrade had completed. This caused customer applications to be unavailable during the upgrade.
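
To illustrate the failure mode described above, here is a minimal sketch of a caching resolver in front of a single DNS server that goes offline. It is a simplified model, not the platform's actual resolver: the name router.internal, the address 10.0.0.5, and the TTL and outage durations are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


class ResolverUnavailable(Exception):
    """Raised when the upstream DNS server cannot be reached."""


@dataclass
class CachingResolver:
    """Answers from cache until an entry's TTL expires, then asks upstream."""
    upstream: Callable[[str], str]
    ttl_seconds: float
    _cache: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def resolve(self, name: str, now: float) -> str:
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]  # entry still fresh: upstream is not contacted
        address = self.upstream(name)  # may raise ResolverUnavailable
        self._cache[name] = (address, now + self.ttl_seconds)
        return address


def simulate(ttl_seconds: float, outage_seconds: float) -> List[str]:
    """Look up one name at t=0, 20, 40, 70 while upstream is down from t=10."""
    clock = {"now": 0.0}
    outage_start, outage_end = 10.0, 10.0 + outage_seconds
    records = {"router.internal": "10.0.0.5"}  # hypothetical record

    def upstream(name: str) -> str:
        # Stand-in for the orchestrator's DNS service being offline.
        if outage_start <= clock["now"] < outage_end:
            raise ResolverUnavailable(name)
        return records[name]

    resolver = CachingResolver(upstream, ttl_seconds)
    results = []
    for t in (0.0, 20.0, 40.0, 70.0):
        clock["now"] = t
        try:
            resolver.resolve("router.internal", t)
            results.append("ok")
        except ResolverUnavailable:
            results.append("fail")
    return results


if __name__ == "__main__":
    print(simulate(ttl_seconds=60.0, outage_seconds=20.0))  # outage shorter than TTL
    print(simulate(ttl_seconds=30.0, outage_seconds=50.0))  # outage longer than TTL
```

In this model, an outage shorter than the cache lifetime is invisible to clients, while an outage that outlasts it causes lookups to fail until the upstream server returns.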

What we're doing

We've verified that only a single component relies on the orchestrator for DNS. We will be phasing out the use of names for contacting this component and instead use its IP address, so we can also phase out this dependence on our orchestrator for DNS. This will allow us to use a more highly-available DNS system for all components of the platform.

Posted Mar 14, 2017 - 15:07 EDT

Resolved
We've identified a previously undocumented dependency between components of cloud.gov which was causing unexpected outages when certain key dependencies were updated. We will be making both a short-term change to stop this from happening again, and a longer-term change which will remove the need for these components to talk to each other.
Posted Mar 14, 2017 - 00:29 EDT
Identified
As of 10:16 am ET, applications that were previously down now seem to be available in a stable way (if you have a CDN, it may take a few minutes or refreshes to clear the cached error). We have identified this as likely a platform component deployment issue, and we’re continuing to monitor and analyze the platform until we’re confident that this will not recur.
Posted Feb 28, 2017 - 10:42 EST
Investigating
As of 10:02 am ET, some applications in the GovCloud environment are down and reporting errors (such as “Requested route does not exist”), and connecting to the platform is returning errors such as “Please ask your Cloud Foundry Operator to check the platform configuration”. We are investigating this problem and we will provide updates as we identify it and resolve it.
Posted Feb 28, 2017 - 10:06 EST
Update
Maintenance has been completed. We are continuing to monitor the platform to ensure the incident has been fully resolved.
Posted Feb 25, 2017 - 04:48 EST
Update
Update: We’re currently doing maintenance in the GovCloud environment. During this maintenance period, you should not push updates to apps or restart them, because the deploy may return errors and fail, causing an outage for your application with no way to restart it until the maintenance is complete. This work should be complete within a few hours, and we’ll post an update when finished. Hopefully this is outside of work hours for you, but we do apologize for the unscheduled maintenance.
Posted Feb 24, 2017 - 21:58 EST
Monitoring
Apps continue to be available consistently. We are reproducing the problem in another environment to confirm our hypothesis of the root cause.
Posted Feb 24, 2017 - 13:51 EST
Identified
Applications that were previously down have been intermittently available for about 20 minutes and now seem to be available in a stable way (if you have a CDN, it may take a few minutes or refreshes to clear the cached error). We have identified this as likely a platform component deployment issue, and we’re continuing to monitor and analyze the platform until we’re confident that this will not recur.
Posted Feb 24, 2017 - 13:42 EST
Investigating
As of 12:40 pm ET, some applications in the GovCloud environment are down and reporting errors (such as “Requested route does not exist”), and connecting to the platform is returning errors such as “Please ask your Cloud Foundry Operator to check the platform configuration”. We are investigating this problem with all hands on deck, and we will provide updates as we identify it and resolve it.
Posted Feb 24, 2017 - 13:15 EST