Pushing application updates may fail
Incident Report for cloud.gov
Postmortem

What happened?

During routine platform updates that happen a few times a week, hosts for customer applications are rotated out of circulation. When this happens, customer applications are automatically relocated onto other available hosts. This time, our hosts were near their reserved disk capacity because of the number of apps the platform was already hosting. Taking a single host out of rotation was enough to put other hosts over their limit as apps were migrated. This prevented new apps from being scheduled. Users began seeing an “Insufficient Resources” error message when pushing applications, and our own tests started failing. Once we identified the problem, we increased the disk available to all hosts. After this change rolled out, the problem was resolved.
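
To illustrate the capacity arithmetic behind this failure, here is a minimal sketch in Python. It is not the platform's scheduler or our actual tooling. The function name, the units, and every number are hypothetical, and it only compares totals: it asks whether the disk reserved on any one host would fit into the combined free space of the remaining hosts. A real scheduler also has to fit each app onto an individual host, so this check is optimistic.

```python
def can_rotate_one_host(capacities, reserved):
    """Return True if any single host can be drained onto the others.

    capacities[i] is the total disk on host i and reserved[i] is the disk
    already reserved by apps on host i (same hypothetical units).
    This is an aggregate check only; it ignores per-app placement.
    """
    for i, to_move in enumerate(reserved):
        # Free space on every host except the one being rotated out.
        free_elsewhere = sum(
            cap - res
            for j, (cap, res) in enumerate(zip(capacities, reserved))
            if j != i
        )
        if to_move > free_elsewhere:
            return False
    return True


# Hypothetical numbers: three hosts, each with 80 of 100 units reserved.
# Draining one host needs 80 units, but the other two have only 40 free.
before = can_rotate_one_host([100, 100, 100], [80, 80, 80])

# Same reservations after raising each host's capacity to 150 units.
# The other two hosts now have 140 units free between them.
after = can_rotate_one_host([150, 150, 150], [80, 80, 80])
```

With these made-up values, `before` is `False` and `after` is `True`, which mirrors the incident: rotation was only safe again once the disk available to the hosts had been increased.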

What we’re doing about it

  • We’re adding alerting so we’ll know when application hosts are nearing capacity for new applications.
  • We’re adding alerting for the specific “Insufficient Resources” situation.
  • We’re documenting how to increase capacity so anyone on the team can do it.
  • We’re going to start formally reviewing our capacity with each new customer added to cloud.gov.
  • We’re going to start formally reviewing our utilization on a regular basis so we can make better tuning decisions.
Posted May 11, 2017 - 15:24 EDT

Resolved
All application hosts now have sufficient capacity. You can push apps again without issue.
Posted May 11, 2017 - 15:23 EDT
Monitoring
We have identified that we are hitting reserved disk capacity limits for our application hosts. We have raised some of our hosts’ disk capacity and tested that application pushes are succeeding. We are monitoring the situation until the capacity on all hosts has been raised.
Posted May 11, 2017 - 14:06 EDT
Investigating
As of 12:16 PM ET, if you cf push an application, it is likely to fail with an “InsufficientResources” error, and you won’t be able to deploy it (so your application will be down). We recommend not pushing applications until we post a further update with a resolution.

The cloud.gov team is currently upgrading underlying platform components, which seems to be causing this error; we will post an update when we have further information.
Posted May 11, 2017 - 12:42 EDT