The cloud.gov Ops team saw an increase in customer reports of HTTP 5xx errors from their applications. On investigation, the team determined that the errors were likely caused by CPU exhaustion: a condition in which an overloaded virtual machine cannot schedule its workloads properly, which generally results in I/O timeouts or connection failures. The team scaled the platform’s compute cells to handle the increased load and asked customers to restage their applications to rebalance the compute pool. The rate of HTTP 5xx errors then decreased over time.
To prevent situations like this from happening again, we have started the following long-term investigations:
The cloud.gov platform was originally designed around memory-optimized instances. The cloud.gov workload, however, is primarily compute-intensive: it consists of containers that mostly perform network-based operations, which are CPU-intensive. The cloud.gov Ops team believes that running this compute-intensive workload on memory-optimized instances was the largest contributing cause.
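To make the mismatch concrete, here is a minimal sketch of how an instance's shape decides which resource runs out first. Every name and number in it is a hypothetical assumption chosen for illustration: the instance shapes, the per-container demand, and the `binding_resource` helper are not the instance types, measurements, or tooling used on cloud.gov.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceShape:
    """A hypothetical VM shape (not a real cloud.gov instance type)."""
    name: str
    vcpus: int
    memory_gib: int


@dataclass(frozen=True)
class ContainerDemand:
    """Assumed steady-state demand of one application container."""
    cpu_cores: float
    memory_gib: float


def binding_resource(shape: InstanceShape, demand: ContainerDemand) -> tuple[str, int]:
    """Return which resource is exhausted first and how many containers fit.

    The count is the smaller of "containers that fit by CPU" and
    "containers that fit by memory"; ties are reported as memory-bound.
    """
    fit_by_cpu = int(shape.vcpus / demand.cpu_cores)
    fit_by_memory = int(shape.memory_gib / demand.memory_gib)
    resource = "cpu" if fit_by_cpu < fit_by_memory else "memory"
    return resource, min(fit_by_cpu, fit_by_memory)


# Illustrative values only: same memory, different vCPU counts.
MEMORY_OPTIMIZED = InstanceShape("memory-optimized (hypothetical)", vcpus=4, memory_gib=32)
COMPUTE_OPTIMIZED = InstanceShape("compute-optimized (hypothetical)", vcpus=16, memory_gib=32)
NETWORK_HEAVY_APP = ContainerDemand(cpu_cores=0.5, memory_gib=1.5)

if __name__ == "__main__":
    for shape in (MEMORY_OPTIMIZED, COMPUTE_OPTIMIZED):
        resource, count = binding_resource(shape, NETWORK_HEAVY_APP)
        print(f"{shape.name}: {count} containers, limited by {resource}")
```

Under these assumed numbers, the memory-optimized shape has memory for 21 containers but CPU for only 8. If containers are placed mainly according to available memory, such a cell can be filled well past the point where its CPU is saturated, which matches the failure mode described above.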