Degraded CF API Performance
Incident Report for cloud.gov
Postmortem

On February 15-16, 2019, our API responded inconsistently, returning 503 server errors for some requests (503 typically means a service is unavailable). As a result, some customers received errors when performing normal operations with the CF CLI or dashboard. The cause was an internal technical operations problem; once we identified and resolved it, the errors stopped.

What happened

We run a set of automated tests as part of ordinary platform update operations. This automated system creates temporary orgs to run the tests, and normally deletes them once the tests complete. A recent version of the code had a bug that stopped it from deleting the orgs, so they accumulated within cloud.gov. This large number of orgs (more than 7,000) overwhelmed the API component in cloud.gov that manages information about orgs and other components, which caused it to return errors for some customer requests.

When we started seeing the errors, we researched the problem and noticed the large number of orgs. We immediately added more resources to the API component so that it could better handle the large amount of information, and then we deleted the unnecessary test orgs. This resolved the problem, and we also worked to resolve the code issue.
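
As an illustration of the cleanup failure described above, the following is a minimal sketch of one way a test harness can guarantee that a temporary org is deleted even when the tests fail or raise. It is not our actual test code: `FakeOrgClient`, `create_org`, `delete_org`, and `temporary_org` are hypothetical names invented for this example, and the client is an in-memory stand-in rather than a real API client.

```python
from contextlib import contextmanager


class FakeOrgClient:
    """Hypothetical in-memory stand-in for an org-management API client."""

    def __init__(self):
        self.orgs = set()

    def create_org(self, name):
        self.orgs.add(name)

    def delete_org(self, name):
        # discard() does not raise if the org is already gone.
        self.orgs.discard(name)


@contextmanager
def temporary_org(client, name):
    """Create an org for the duration of a test run, then always delete it."""
    client.create_org(name)
    try:
        yield name
    finally:
        # Runs whether the tests pass, fail, or raise an exception.
        client.delete_org(name)
```

With this shape, a test that raises inside `with temporary_org(client, "test-org-1"):` still leaves no org behind, because the deletion sits in the `finally` clause.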

How will this change our behavior

It’s always possible for bugs to happen in code, so it’s important to our team to build a resilient structure that limits the impact of bugs and reduces the chance that they’ll cause user-facing service disruptions. We identified the following actions to improve our system and limit the potential for similar issues to cause problems.

* We changed the permissions for the test org creation process to constrain how many orgs it can create.

* We improved our alerting processes so that our team automatically gets notified of similar errors involving this particular component.

* We have dashboards that show key metrics for the various components that make up cloud.gov. We do not need to consult these dashboards often, so we had not noticed that they were no longer being populated with live metric data. This delayed our diagnosis and response. We will ensure that our metrics dashboards are properly populated and that we receive an alert if they are not.
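
To make the last action item concrete, here is a minimal sketch of a staleness check that flags metrics whose most recent sample is older than a threshold. It is an assumption-laden illustration, not a description of our monitoring stack: the function name `stale_metrics`, the metric names, and the 10-minute default threshold are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_metrics(last_seen, now, max_age=timedelta(minutes=10)):
    """Return the names of metrics whose newest sample is older than max_age.

    last_seen maps a metric name to the timestamp of its latest sample.
    The 10-minute default is an arbitrary value chosen for this example.
    """
    return sorted(name for name, ts in last_seen.items() if now - ts > max_age)


# Example with made-up metric names and timestamps.
now = datetime(2019, 2, 15, 18, 0, tzinfo=timezone.utc)
last_seen = {
    "api.latency": now - timedelta(minutes=2),  # still being populated
    "api.errors": now - timedelta(hours=3),     # stopped receiving data
}
print(stale_metrics(last_seen, now))  # prints ['api.errors']
```

A check like this could run on a schedule and raise an alert whenever the returned list is non-empty, so that an unpopulated dashboard is noticed before it is needed during an incident.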

Posted Apr 29, 2019 - 18:18 EDT

Resolved
As of 6:34pm ET Friday, the API began responding consistently again. We have identified the root cause and will be taking steps to prevent a recurrence of the same problem. We'll follow up with a postmortem report next week for those interested in the details. (Please accept our apologies for keeping this incident open 7 hours longer than necessary! The dates on this post have been edited to more accurately represent the window during which the problem existed.)
Posted Feb 16, 2019 - 19:00 EST
Update
We are continuing to investigate this issue.
Posted Feb 15, 2019 - 18:13 EST
Investigating
Despite earlier maintenance, we've been unable to pinpoint the root cause of continued inconsistent responses in the Cloud Foundry API, which is used by both the CLI and web dashboard to access your orgs and spaces. You may experience this as commands failing with a 404 or 5xx error, then succeeding when you retry them. We are rolling out additional instances of the API service to spread out the impact while we continue to investigate.
Posted Feb 15, 2019 - 18:11 EST
This incident affected: cloud.gov customer access (Dashboard, API).