Intermittent HTTP 5xx Errors

Incident Report for cloud.gov

Postmortem

Summary

The cloud.gov Ops team saw an increase in customer-reported HTTP 5xx errors in customer applications. Upon further investigation, the team determined that the errors were likely caused by CPU exhaustion. In a cloud environment, CPU exhaustion is a situation in which an overloaded virtual machine cannot properly schedule workloads, which generally causes I/O timeouts or connection failures. The team scaled the platform’s compute cells to handle the increased load and asked customers to restage their applications to rebalance the compute pool. The rate of HTTP 5xx errors then decreased over time.

Long-Term Changes

To help prevent similar incidents, we have started the following long-term investigations:

Implement a rate-of-change alarm on the percentage of HTTP 5xx errors
Implement an alert for a drop in the rate of traffic
Improve resource planning for Diego
Investigate how to determine what the platform's scaling should be
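
As an illustration of the first two items, here is a minimal Python sketch of what such checks might look like. It is not the cloud.gov implementation. The `Sample` type, the function names, and the default thresholds (`max_rise`, `max_drop`) are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Request counts for one monitoring interval (hypothetical shape)."""
    total: int       # all HTTP responses seen in the interval
    errors_5xx: int  # responses with a 500-599 status code


def error_percent(s: Sample) -> float:
    """Percent of responses that were 5xx; 0.0 when there was no traffic."""
    return 100.0 * s.errors_5xx / s.total if s.total else 0.0


def error_rate_rising(prev: Sample, curr: Sample, max_rise: float = 2.0) -> bool:
    """Rate-of-change alarm: True when the 5xx percentage rises by more than
    max_rise percentage points between two consecutive intervals.
    The 2.0 default is an assumed, illustrative threshold."""
    return error_percent(curr) - error_percent(prev) > max_rise


def traffic_dropping(prev: Sample, curr: Sample, max_drop: float = 0.5) -> bool:
    """Traffic-drop alert: True when total traffic falls by more than
    max_drop (a fraction of the previous interval's traffic).
    The 0.5 default is an assumed, illustrative threshold."""
    if prev.total == 0:
        return False  # no baseline to compare against
    return (prev.total - curr.total) / prev.total > max_drop
```

Comparing consecutive intervals, rather than checking a fixed error threshold, is meant to catch a sudden change even when the absolute error percentage is still low. Suitable interval lengths and thresholds would depend on the platform's normal traffic patterns.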

Why did this occur?

The cloud.gov platform was originally designed around memory-optimized instances. However, the cloud.gov workload is primarily compute-intensive: it consists of containers that mostly perform network-based operations, which are CPU-intensive. The cloud.gov Ops team believes that running a compute-intensive workload on memory-optimized instances was the largest contributing cause.

Posted Feb 20, 2020 - 14:18 EST

Resolved

This incident has been resolved.

Posted Feb 13, 2020 - 18:36 EST

Monitoring

A fix has been implemented and we are monitoring the results. If you continue to see HTTP 5xx errors in your applications, please run `cf restage APP_NAME` on any affected applications, if possible.

Posted Feb 13, 2020 - 16:02 EST

Identified

We believe we have identified the issue and are in the process of deploying a resolution.

Posted Feb 13, 2020 - 14:30 EST

Investigating

Some customers are reporting HTTP 5xx errors. We are currently investigating the issue.

Posted Feb 13, 2020 - 13:44 EST

This incident affected: cloud.gov customer applications (Applications).