The main web dashboard for cloud.gov was unavailable to customers for part of the day on February 27. This happened because we accidentally removed the configuration for the dashboard in the process of updating it. When the configuration was restored, the dashboard became available again.
The dashboard runs as an application on cloud.gov, much like a customer application, so we want to explain this issue in depth to help customers avoid similar problems in developing and operating their own applications.
One of our operators was trying to unset an environment variable for the dashboard application. These are the steps he took that made the application unavailable:
1) Our operator inspected the existing configuration using “cf env”, and found the environment variable he was looking for listed under a “User-Provided” heading in the output.
2) Using “cf update-user-provided-service”, the operator replaced the content of the bound user-provided service instance with the content listed under “User-Provided” in the “cf env” output, omitting the variable that he wanted to remove.
3) This user-provided service instance was actually supplying an unrelated set of configuration values to the dashboard application, despite sharing the “User-Provided” label, so the operator overwrote those values. The overwritten configuration included the credentials the dashboard application needs to operate properly, so the dashboard stopped working.
We noticed the dashboard was unavailable and restored the expected configuration, and it became available again.
To recap, the core problem was a confusing operator experience: the “User-Provided” heading in the “cf env” output was entirely unrelated to user-provided services. The operator should have used “cf unset-env” to unset just the intended environment variable, and left the user-provided service instance content intact.
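As a rough sketch of the difference between the two operations, the commands below contrast them. The app name `dashboard-app` and variable name `OLD_SETTING` are hypothetical placeholders rather than the values involved in this incident, and `remove_env_var` is an illustrative wrapper, not part of the cf CLI.

```shell
#!/bin/sh
# Hypothetical placeholders; substitute your own app and variable names.
APP_NAME="dashboard-app"
VAR_NAME="OLD_SETTING"

# Remove a single environment variable from an app, leaving bound
# service instances (including user-provided ones) untouched.
remove_env_var() {
  app="$1"
  var="$2"
  # "cf unset-env" only affects variables listed under "User-Provided"
  # in the "cf env" output.
  cf unset-env "$app" "$var" || return 1
  # Restage so the running app picks up the change.
  cf restage "$app"
}

# Usage (requires a logged-in cf CLI session):
#   remove_env_var "$APP_NAME" "$VAR_NAME"

# By contrast, this form replaces ALL credentials stored in the named
# service instance, so it is the wrong tool for removing an env var:
#   cf update-user-provided-service SERVICE_INSTANCE -p '{"key":"value"}'
```

Keeping the removal in a small function makes the intent explicit: only the named variable changes, and no service instance content is rewritten.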
We talked to the Cloud Foundry open source team that maintains this component, and we learned they are working on a less confusing interface for this situation. In the meantime, we can help prevent this kind of issue by documenting our mistake and encouraging customers to keep an eye out for it.
We are also moving soon to a new version of the dashboard (currently in beta) that has a different structure for managing configuration, which will reduce the chances of this problem happening again.
In addition, the best practice for modifying application configuration is to edit the application manifest file and deploy the application again using our CI/CD system. Instead, we were editing this application using CLI commands, even though the right configuration to update would have been obvious in the manifest. Based on this experience, we’re re-emphasizing within our team the importance of this best practice, and we encourage customers to follow it as well.
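For illustration, a minimal manifest might keep plain environment variables and bound services in separate blocks, which makes it easier to see which one a change touches. Every name here (`dashboard-app`, `LOG_LEVEL`, `dashboard-credentials`) is a hypothetical placeholder, not the actual dashboard configuration. The sketch writes the manifest from a shell function so it could sit alongside a deploy script.

```shell
#!/bin/sh
# Write a minimal example manifest to the given path (default: manifest.yml).
write_example_manifest() {
  target="${1:-manifest.yml}"
  cat > "$target" <<'EOF'
---
applications:
- name: dashboard-app          # hypothetical app name
  env:
    LOG_LEVEL: info            # plain environment variable (hypothetical)
  services:
  - dashboard-credentials      # bound service instance (hypothetical)
EOF
}

# Usage (the push step requires a logged-in cf CLI session):
#   write_example_manifest manifest.yml
#   cf push -f manifest.yml
```

In a layout like this, environment variables sit under `env` while service bindings sit under `services`, so the two are hard to mistake for each other.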