On February 15-16, 2019, our API responded inconsistently, returning 503 errors (which typically mean a service is unavailable) for some requests. As a result, some customers performing normal operations with the CF CLI or dashboard received errors. The cause was an internal technical operations problem, which we identified and resolved.
We run a set of automated tests as part of ordinary platform update operations. This automated system creates temporary orgs in which to run the tests, and normally deletes them automatically once the tests complete. A recently deployed version of this code had a bug that stopped the automatic deletion, so orgs accumulated within cloud.gov. This large number of orgs (more than 7,000) overwhelmed the API component in cloud.gov that manages information about orgs and other resources, causing it to return errors for some customer requests.
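The general failure mode here is cleanup logic that only runs on the happy path. A minimal sketch of the safer pattern, in Python with illustrative function names (not cloud.gov's actual test code), guarantees deletion even when the tests fail or crash:

```python
# Hypothetical sketch: ensure a temporary test org is always deleted,
# whether the tests pass, fail, or raise. Function names are illustrative.

def run_smoke_tests(create_org, delete_org, run_tests):
    """Create a temporary org, run tests in it, and always clean up."""
    org_name = create_org()
    try:
        return run_tests(org_name)
    finally:
        # This runs on every exit path, so orphaned orgs cannot accumulate.
        delete_org(org_name)
```

With this structure, a bug in the tests themselves still leaves the platform clean; only a bug in `delete_org` itself (as happened here) can leak orgs, which is why a hard cap and alerting are also useful layers.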
When we started seeing the errors, we investigated and noticed the large number of orgs. We immediately added more resources to the API component so that it could better handle the volume of information, and then deleted the unnecessary test orgs. This resolved the errors, and we also fixed the underlying bug in the test code.
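Cleaning up safely depends on being able to tell test orgs apart from customer orgs. A small sketch of that filtering step, assuming a naming convention like a `smoke-test-` prefix (an illustrative assumption, not necessarily cloud.gov's actual convention):

```python
# Hypothetical sketch: select leftover test orgs by naming convention.
# The "smoke-test-" prefix is an assumed convention for illustration.

def find_leftover_test_orgs(org_names, prefix="smoke-test-"):
    """Return only the org names that match the temporary-test convention."""
    return [name for name in org_names if name.startswith(prefix)]
```

In practice, each matching name could then be passed to the CF CLI's `cf delete-org -f` command; restricting deletion to an explicit prefix protects real customer orgs from accidental removal.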
Bugs are always possible in code, so it’s important to our team to build a resilient structure that limits the impact of bugs and reduces the chance that they’ll cause user-facing service disruptions. We identified the following actions to improve our system and reduce the likelihood that similar issues cause problems.
* We changed the permissions for the test org creation process to constrain how many orgs it can create.
* We improved our alerting processes so that our team automatically gets notified of similar errors involving this particular component.
* We have dashboards that show key metrics for the various components that make up cloud.gov. Because we do not need to consult these dashboards often, we had not noticed that they were no longer being populated with live metric data, which delayed our diagnosis and response. We will ensure that our metrics dashboards stay properly populated, and that we receive an alert if they do not.