Intermittent HTTP 5xx Errors
Incident Report for


The Ops team saw an increase in customer-reported HTTP 5xx errors in customers' applications. Upon investigation, the Ops team determined the errors were likely caused by CPU exhaustion: in the cloud, a situation in which an overloaded virtual machine cannot properly schedule its workloads, generally causing I/O timeouts or connection failures. The Ops team scaled the platform’s compute cells to handle the increased load and asked customers to restage their applications to rebalance the compute pool; the rate of HTTP 5xx errors decreased over time.
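The restage step above can be scripted with the cf CLI. This is a minimal sketch, assuming the v6-style `cf apps` output (a header row whose first column is `name`) and that you are logged in and targeted at the affected space; the actual restage loop is left commented out so the parsing can be exercised safely.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Extract application names from `cf apps`-style output: skip everything
# up to and including the header row (the line starting with "name"),
# then print the first column of each non-empty row.
list_app_names() {
  awk 'seen { if (NF) print $1 } /^name[[:space:]]/ { seen = 1 }'
}

# Uncomment to actually restage every application in the current space:
# cf apps | list_app_names | while read -r app; do
#   cf restage "$app"
# done
```

Restaging each application forces it onto the rebalanced compute pool, which is why it was the suggested customer action during this incident.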

Long-Term Changes

To prevent situations like this from recurring, we have started the following long-term investigations:

  • Implement a rate-of-change alarm on the percentage of 5xx errors
  • Implement an alert for a sudden drop in traffic
  • Improve resource planning for Diego
  • Investigate how to determine what the platform’s scaling targets should be
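The first alarm in the list could work along these lines. This is a minimal sketch where the two window percentages and the threshold are illustrative inputs, not the team's actual metrics pipeline: it compares the 5xx percentage of two consecutive sampling windows and alarms when the increase exceeds a threshold.

```shell
#!/usr/bin/env bash
set -euo pipefail

# rate_of_change_alarm PREV_PCT CURR_PCT THRESHOLD
# Prints "ALARM" when the 5xx percentage rose by more than THRESHOLD
# points between the previous and current windows, otherwise "OK".
rate_of_change_alarm() {
  awk -v prev="$1" -v curr="$2" -v thresh="$3" 'BEGIN {
    if (curr - prev > thresh) print "ALARM"; else print "OK"
  }'
}
```

Alarming on the rate of change, rather than on an absolute error percentage, catches a sudden jump like this incident's even while the absolute rate is still below a static threshold.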

Why did this occur?

The platform was originally designed using memory-optimized instances. The workload, however, is primarily compute-intensive: it is a container-based workload performing mostly network-based operations, which demand CPU rather than memory. The Ops team believes that running a compute-intensive workload on memory-optimized instances was the largest contributing cause.

Posted Feb 20, 2020 - 14:18 EST

This incident has been resolved.
Posted Feb 13, 2020 - 18:36 EST
A fix has been implemented and we are monitoring the results. If you continue to see HTTP 5xx errors in your applications, please run `cf restage APP_NAME` on any affected applications, if possible.
Posted Feb 13, 2020 - 16:02 EST
We believe we have identified the issue and are in the process of deploying a resolution.
Posted Feb 13, 2020 - 14:30 EST
Some customers are reporting HTTP 5xx errors. We are currently investigating the issue.
Posted Feb 13, 2020 - 13:44 EST
This incident affected: customer applications (Applications).