On February 19-20, we received reports from customers of intermittent API 503 errors, such as when making requests using the cloud.gov command-line interface and dashboard. We identified the problem and resolved it by replacing a virtual machine that was not behaving correctly.
This happened a few days after we resolved an issue where our automated tests created orgs but did not delete them as expected.
We investigated this new issue and recognized that the test process was still generating extra orgs. We identified one virtual machine that had not been updated with the fix we had deployed. This is a high-availability component running on several virtual machines, so we were able to delete that one problematic virtual machine and let the system automatically replace it with the updated version.
This happened a few days after the original issue, so we had not yet implemented the mitigations we identified to reduce the chances of that issue recurring. Our mitigations for this follow-on issue is the same. Primarily, we needed to add an alert to automatically notify our operations team when there’s an unusual number of 503 errors on this component. Our dashboards would have also shown quickly that the 503 errors were happening only on one virtual machine, which would have reduced our diagnosis time and helped us resolve this more quickly. We have worked on both mitigations since then.