At 3:13 PM EST May 8, the cloud.gov team started to receive reports that the dashboard was not functioning correctly and users reported the following message:
An error occurred while trying to check your authorization. You may need to login again. Error: Request failed with status code 503.
Please check cloud.gov's status or try again.
Service was partially restored at 4:23 PM EST and was fully restored at 5:48 PM EST. We apologize for any inconvenience this may have caused.
For context, the dashboard (dashboard.fr.cloud.gov) is a specialized application running on the cloud.gov Platform as a Service, similar to tenant applications.
We triggered a new deployment to refresh the application. The deployment failed when connecting to the Redis service, indicating a deeper issue with the cloud.gov environment.
As an application with multiple instances, the dashboard uses the Redis service for session storage. Without Redis, the dashboard cannot remember which users are authenticated between requests and therefore cannot perform any actions on behalf of the user.
We created a new Redis instance and then deployed the dashboard again successfully.
We could have identified and resolved this issue faster though. The cloud.gov team was not automatically alerted to the outage due to a misconfigured alert. We fixed that misconfiguration. Also, we have included a health check for Redis along with the overall dashboard health check in order to have more visibility of the health of each component of the dashboard.
The Redis instance used by the dashboard is provisioned via our Kubernetes broker. We have enabled better logging which will make it easier to diagnose what went wrong with the Kubernetes Pod that the Redis instance is running on.