Out of memory issues in running and staging apps
Incident Report for cloud.gov
Resolved
Since rolling out stemcell version 1.404 to the platform last week, we have received no further reports of out of memory issues and our own internal metrics show a decline in these errors, so this incident is resolved.

If you are still experiencing issues with your applications, please contact support@cloud.gov.
Posted Mar 26, 2024 - 15:59 EDT
Update
After further testing and debugging, members of the Cloud Foundry community were able to isolate the cause of the "out of memory" issues to an incompatibility between Linux cgroups v1 and version 6.5 of the Linux kernel, both of which were used by the latest stemcells. cgroups (control groups) are a Linux kernel mechanism for limiting and accounting for the resources, such as memory, that a group of processes can use. They are often used to manage container processes, including customer applications running on cloud.gov.
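As an illustration of the combination described above, the following is a minimal Python sketch that reports whether a Linux host matches it. It is not a cloud.gov or Cloud Foundry tool. The function names are hypothetical, and the check assumes that a `cgroup.controllers` file at the root of `/sys/fs/cgroup` indicates a unified cgroups v2 hierarchy and that anything else can be treated as v1. It only compares against the kernel 6.5 and cgroups v1 pairing named in this update and makes no claim about which other versions may be affected.

```python
import os
import platform
import re
from typing import Tuple


def kernel_major_minor(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '6.5.0-17-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def cgroup_version(root: str = "/sys/fs/cgroup") -> int:
    """Return 2 if `root` looks like a unified cgroups v2 hierarchy, else 1.

    Assumption: the v2 hierarchy exposes a cgroup.controllers file at its
    root; any other layout (including hybrid setups) is reported as v1.
    """
    return 2 if os.path.exists(os.path.join(root, "cgroup.controllers")) else 1


def matches_reported_combination(release: str, cgroups: int) -> bool:
    """True only for the pairing named in this update: kernel 6.5 with cgroups v1."""
    return kernel_major_minor(release) == (6, 5) and cgroups == 1


if __name__ == "__main__":
    release = platform.release()
    version = cgroup_version()
    print(f"kernel release: {release}, cgroups v{version}")
    print("matches reported combination:",
          matches_reported_combination(release, version))
```

Passing a release string and a directory explicitly, rather than reading them inside the functions, keeps the checks usable against values collected from another host.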

To fix the "out of memory" issues, the Cloud Foundry community has released a new stemcell version, 1.404, which uses version 5.15 of the Ubuntu Jammy kernel. Version 5.15 is the long-term support kernel for Ubuntu Jammy (https://ubuntu.com/about/release-cycle#ubuntu-kernel-release-cycle), so this release will continue to receive security patches and other fixes.

We are rolling out stemcell version 1.404 to the platform today and expect to see a reduction in memory use across our platform, including the possible resolution of all "out of memory" issues for customer applications.

Even though we only expect these changes to benefit our platform and our customers, we will still be closely monitoring our platform for stability as we roll out these changes. If you experience any issues, don't hesitate to contact us at support@cloud.gov.
Posted Mar 19, 2024 - 10:42 EDT
Update
Since last week's changes, which increased the number of VMs available to host customer applications and doubled the amount of memory available for staging operations, customers have reported fewer "out of memory" issues but are still experiencing them.

The cloud.gov team has continued to investigate the cause of these issues. After consulting with the Cloud Foundry community, we believe that these issues may be caused by faulty memory allocation in the Linux kernel that is built into the stemcells for Cloud Foundry VMs. This GitHub issue is being used to track investigation and resolution of the stemcell memory issues: https://github.com/cloudfoundry/bosh-linux-stemcell-builder/issues/318

One of the recommendations from the community to resolve the "out of memory" issues was to roll back to stemcell version 1.340. However, every stemcell release includes fixes for a number of CVEs (https://github.com/cloudfoundry/bosh-linux-stemcell-builder/releases), so rolling back would expose the platform and our customers to CVEs that are patched in the current stemcell version. At this time, cloud.gov does not plan to roll back our stemcell version given the potential security risk.

Another recommendation from the community was to increase the amount of memory available for staging, which we did last week when we increased the value from 1024 MB to 2048 MB, but customers continue to experience issues.

At this point, the plan for mitigation is to pursue ad hoc memory increases for applications that are still experiencing issues until a fix for the kernel/stemcell issue is released from upstream and can be deployed to our platform.

If your applications are still experiencing issues, please contact us at support@cloud.gov so we can work to resolve them for you.

Thank you for being a cloud.gov customer!
Posted Feb 20, 2024 - 12:04 EST
Update
The cloud.gov team is continuing to investigate the causes of "out of memory" errors that are being seen for some customer applications.

In order to address these errors, at approximately 10:48 AM ET, the cloud.gov team deployed two changes to our production environment:

- Increased the number of VMs available to host customer applications
- Doubled the amount of memory available for staging applications from 1024 MB to 2048 MB

Customers experiencing "out of memory" errors for their applications should try restaging their applications via `cf restage` or `cf restage --strategy rolling` to see if the issue is resolved.
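For customers who restage several applications from a script, the two command forms above can be assembled as in the following minimal Python sketch. The helper names and the example app name `my-app` are hypothetical. The sketch assumes the cf CLI is installed, already logged in and targeted at the correct org and space, and recent enough to accept the `--strategy rolling` flag.

```python
import subprocess
from typing import List


def restage_command(app_name: str, rolling: bool = False) -> List[str]:
    """Build the argument list for restaging one app with the cf CLI."""
    command = ["cf", "restage", app_name]
    if rolling:
        # Same flag as the `cf restage --strategy rolling` form above.
        command += ["--strategy", "rolling"]
    return command


def restage(app_name: str, rolling: bool = False) -> int:
    """Run the restage and return the cf CLI's exit code (0 on success)."""
    return subprocess.run(restage_command(app_name, rolling), check=False).returncode
```

For example, `restage("my-app", rolling=True)` runs `cf restage my-app --strategy rolling` and returns its exit code.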

Please contact support@cloud.gov if you have further questions or concerns.
Posted Feb 13, 2024 - 14:37 EST
Update
Due to an ongoing security incident, we have temporarily paused internal cloud.gov platform deployments. This pause will continue to impact the time to resolution for the out-of-memory (OOM) issue that we are still addressing. We are continuing to mitigate the OOM issue in the meantime.

Please reach out to support@cloud.gov if you are experiencing issues with your applications and we will assist you with mitigations while we work to resolve these incidents.
Posted Feb 12, 2024 - 13:40 EST
Monitoring
The cloud.gov team has deployed a fix and is monitoring the result. Customers whose applications are failing with memory-related errors should restage their applications with `cf restage` or `cf restage --strategy rolling` and reach out to cloud.gov support via support@cloud.gov if they continue to experience errors.
Posted Feb 08, 2024 - 18:26 EST
Update
We believe the OOM errors may be caused by a bug in the latest stemcell version, which was pushed to production on January 30. We are deploying an updated version that contains a fix. Deployment is expected to complete after East Coast close of business. We will monitor the rollout and post updates as we have them.
Posted Feb 08, 2024 - 12:25 EST
Investigating
Since January 30, some applications on cloud.gov have been experiencing intermittent out-of-memory (OOM) errors while staging or running on the platform. The cloud.gov team is investigating the issue. For apps experiencing OOM errors, `cf restage` or `cf restage --strategy rolling` may temporarily resolve the issue.
Posted Feb 08, 2024 - 10:54 EST
This incident affected: cloud.gov customer applications (Applications).