Degraded CF API performance

Incident Report for cloud.gov

Postmortem

On February 19-20, 2019, customers reported intermittent 503 errors from the API, for example when making requests with the cloud.gov command-line interface or dashboard. We identified the problem and resolved it by replacing a virtual machine that was not behaving correctly.

What happened

This happened a few days after we resolved an issue where our automated tests created orgs but did not delete them as expected.

We investigated this new issue and recognized that the test process was still generating extra orgs. We identified one virtual machine that had not been updated with the fix we had deployed. This is a high-availability component running on several virtual machines, so we were able to delete that one problematic virtual machine and let the system automatically replace it with the updated version.

How will this change our behavior

This happened a few days after the original issue, so we had not yet implemented the mitigations we identified to reduce the chances of that issue recurring. Our mitigations for this follow-on issue are the same. Primarily, we needed to add an alert to automatically notify our operations team when there’s an unusual number of 503 errors on this component. Our dashboards would also have shown quickly that the 503 errors were happening on only one virtual machine, which would have reduced our diagnosis time and helped us resolve this more quickly. We have worked on both mitigations since then.
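As an illustration only, the per-VM check described above might look like the following minimal sketch. It is not our actual alerting configuration: the record format (pairs of VM name and HTTP status), the function names, the VM names, and the threshold are all hypothetical.

```python
from collections import Counter


def count_503s_by_vm(records):
    """Count 503 responses per VM.

    `records` is an iterable of (vm_name, http_status) pairs. This record
    shape is an assumption made for the example, not a real log format.
    """
    return Counter(vm for vm, status in records if status == 503)


def vms_to_alert(records, threshold):
    """Return the sorted names of VMs with at least `threshold` 503 responses."""
    counts = count_503s_by_vm(records)
    return sorted(vm for vm, n in counts.items() if n >= threshold)


# Hypothetical sample: three identical VMs, one of which returns most of the 503s.
sample = [
    ("api-0", 200), ("api-1", 503), ("api-2", 200),
    ("api-1", 503), ("api-0", 200), ("api-1", 503),
    ("api-2", 503), ("api-1", 200),
]
print(vms_to_alert(sample, threshold=3))
```

Grouping errors by VM before applying the threshold is what makes a single misbehaving instance stand out among otherwise healthy ones.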

Posted Apr 29, 2019 - 18:23 EDT

Resolved

We discovered that a single VM among several identical ones was responsible for all the observed errors. Redeploying that VM has resolved the situation and APIs are now responding consistently.

Posted Feb 20, 2019 - 15:01 EST

Update

The cloud.gov engineering team is continuing to investigate this issue. We are seeing intermittent errors when pushing applications. Users may continue to see the 503 error "Stats unavailable: Stats server temporarily unavailable." during a cf push, restart, or restage of their applications. We will continue to report the incident status as we gather more information.

At this time, we recommend retrying any CF API call that fails, so that the call eventually goes through successfully.
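For scripted use of the API, retrying could be sketched roughly as follows. This is a minimal example under stated assumptions, not an official client: `ServiceUnavailable` and `call_with_retry` are hypothetical names, and `call` stands in for whatever function issues your API request and raises on a 503 response.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error raised by the caller's code when a request returns HTTP 503."""


def call_with_retry(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(), retrying on ServiceUnavailable with exponential backoff.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    If the last attempt also fails, the error is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ServiceUnavailable:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Capping the number of attempts keeps a script from looping indefinitely if the errors persist, and the growing delay avoids adding load while the API is degraded.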

Posted Feb 19, 2019 - 17:24 EST

Investigating

We are currently investigating an issue with our CF API returning 503 errors intermittently. Customers may encounter the following error message when using the CF CLI or the cloud.gov Dashboard:
```
error: Server error, status code: 503, error code: 200002, message: Stats unavailable: Stats server temporarily unavailable.
error running command: exit status 1
```

Calls to the CF API may fail at this time.

Posted Feb 19, 2019 - 10:47 EST

This incident affected: cloud.gov customer access (Dashboard, API).