GovCloud dashboard was unavailable
Incident Report for cloud.gov
Postmortem

At 3:13 PM EDT on May 8, the cloud.gov team began receiving reports that the dashboard was not functioning correctly. Users saw the following message:

An error occurred while trying to check your authorization. You may need to login again. Error: Request failed with status code 503. 
Please check cloud.gov's status or try again.

Service was partially restored at 4:23 PM EDT and fully restored at 5:48 PM EDT. We apologize for any inconvenience this may have caused.

What happened

For context, the dashboard (dashboard.fr.cloud.gov) is a specialized application running on the cloud.gov Platform as a Service, similar to tenant applications.

We triggered a new deployment to refresh the application. The deployment failed when connecting to the Redis service, indicating a deeper issue with the cloud.gov environment.

As an application with multiple instances, the dashboard uses the Redis service for session storage. Without Redis, the dashboard cannot remember which users are authenticated between requests and therefore cannot perform any actions on behalf of the user.
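To illustrate why a shared session store matters for a multi-instance application, here is a minimal sketch. It is not the dashboard's actual code: `SessionStore`, `SessionBackendUnavailable`, and `handle_request` are hypothetical names, an in-memory dict stands in for Redis, and mapping a store outage to a 503 response is an assumption made for the example.

```python
from typing import Dict, Optional, Tuple


class SessionBackendUnavailable(Exception):
    """Raised when the shared session backend cannot be reached."""


class SessionStore:
    """Hypothetical shared store; a dict stands in for Redis in this sketch."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self.available = True  # flip to False to simulate an outage

    def save(self, session_id: str, user: str) -> None:
        if not self.available:
            raise SessionBackendUnavailable()
        self._data[session_id] = user

    def lookup(self, session_id: str) -> Optional[str]:
        if not self.available:
            raise SessionBackendUnavailable()
        return self._data.get(session_id)


def handle_request(store: SessionStore, session_id: str) -> Tuple[int, str]:
    """Any app instance can serve the request, because session state
    lives in the shared store rather than in one instance's memory."""
    try:
        user = store.lookup(session_id)
    except SessionBackendUnavailable:
        # Assumed behavior: without the store, the app cannot tell who is logged in.
        return 503, "session store unavailable"
    if user is None:
        return 401, "login required"
    return 200, f"hello {user}"
```

With the store available, a saved session is recognized by whichever instance calls `handle_request`. Once `available` is set to `False`, every request fails regardless of which instance receives it.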

We created a new Redis instance and then deployed the dashboard again successfully.

What we're doing

We could have identified and resolved this issue faster. The cloud.gov team was not automatically alerted to the outage because an alert was misconfigured, and we have fixed that misconfiguration. We have also added a Redis health check alongside the overall dashboard health check, giving us more visibility into the health of each component of the dashboard.
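A per-component health check of this kind might look roughly like the sketch below. It is an illustration, not the dashboard's implementation: `redis_ping` and `health_report` are hypothetical names, and the probe assumes a Redis server that accepts an unauthenticated inline `PING` and replies `+PONG` (a server that requires authentication would return an error and be reported as failing).

```python
import socket
from typing import Callable, Dict


def redis_ping(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if the server at host:port answers PING with +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")  # Redis inline command
            return conn.recv(64).startswith(b"+PONG")
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def health_report(checks: Dict[str, Callable[[], bool]]) -> Dict[str, object]:
    """Run each named check and report per-component and overall status."""
    components: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            components[name] = "ok" if check() else "failing"
        except Exception:
            # A check that raises is treated as a failing component.
            components[name] = "failing"
    healthy = all(state == "ok" for state in components.values())
    return {"status": "ok" if healthy else "degraded", "components": components}
```

Reporting each component separately means an alert can name the failing dependency (for example `redis`) rather than only signaling that the dashboard as a whole is unhealthy.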

The Redis instance used by the dashboard is provisioned via our Kubernetes broker. We have enabled more detailed logging, which will make it easier to diagnose what went wrong with the Kubernetes Pod that the Redis instance runs in.

Posted May 17, 2017 - 16:33 EDT

Resolved
We have improved our monitoring of the cloud.gov dashboard. This improvement will give us more insight into the dashboard so that we can keep it operational.
Posted May 10, 2017 - 10:13 EDT
Monitoring
The dashboard is functioning normally again. We are continuing to analyze the root cause so that we can be more confident of preventing this from happening again unnoticed.
Posted May 08, 2017 - 18:00 EDT
Identified
The cloud.gov dashboard was unavailable for a short time this afternoon, beginning around 3:13 pm ET and ending around 4:23 pm ET. This issue did not affect customer applications.

We have identified causes and put in place a workaround so that the dashboard is available again. We continue to resolve the underlying problem and will post an update when the dashboard is completely back to normal.
Posted May 08, 2017 - 16:50 EDT