Update - Last week we deployed changes that increased the number of VMs available to host customer applications and doubled the amount of memory available for staging operations. Since then, customers have reported fewer "out of memory" issues but are still experiencing them.

The cloud.gov team has continued to investigate the cause of these issues. After consulting with the CloudFoundry community, we believe that these issues may be caused by faulty memory allocation in the Linux kernel that is built into the stemcells for CloudFoundry VMs. This GitHub issue is being used to track investigation and resolution of the stemcell memory issues: https://github.com/cloudfoundry/bosh-linux-stemcell-builder/issues/318

One of the recommendations from the community to resolve the "out of memory" issues was to roll back to stemcell version 1.340. However, every stemcell release includes fixes for a number of CVEs (https://github.com/cloudfoundry/bosh-linux-stemcell-builder/releases), so rolling back would expose the platform and our customers to CVEs that are patched in the current stemcell version. Given this security risk, cloud.gov does not plan to roll back our stemcell version at this time.

Another recommendation from the community was to increase the amount of memory available for staging. We did so last week, raising the value from 1024 MB to 2048 MB, but customers continue to experience issues.

At this point, our mitigation plan is to pursue ad hoc memory increases for applications that are still experiencing issues until a fix for the kernel/stemcell issue is released upstream and can be deployed to our platform.
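
For customers who manage their own memory allocation, an increase might look like the following sketch. It assumes the change is made with `cf scale`, which this update does not specify; the app name `my-app` and the size `1G` are hypothetical, and the helper only prints the command unless `DRY_RUN=0` is set.

```shell
# Sketch only: print (or run) a memory increase for a single app.
# Assumption: the increase is applied with `cf scale -m`.
# The app name and memory size used below are hypothetical.
scale_memory() {
  app="$1"
  memory="$2"   # e.g. 512M or 1G

  # Loose sanity check: must start with a digit and end in M or G.
  case "$memory" in
    [0-9]*M|[0-9]*G) ;;
    *) echo "memory must look like 512M or 1G" >&2; return 1 ;;
  esac

  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: show the command without running it.
    echo "cf scale $app -m $memory"
  else
    cf scale "$app" -m "$memory"
  fi
}

scale_memory my-app 1G   # prints: cf scale my-app -m 1G
```

Changing memory with `cf scale` restarts the app and asks for confirmation, so it is best run when a restart is acceptable.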

If your applications are still experiencing issues, please contact us at support@cloud.gov so we can work to resolve them for you.

Thank you for being a cloud.gov customer!

Feb 20, 2024 - 12:04 EST
Update - The cloud.gov team is continuing to investigate the causes of "out of memory" errors that are being seen for some customer applications.

In order to address these errors, at approximately 10:48 AM ET, the cloud.gov team deployed two changes to our production environment:

- Increased the number of VMs available to host customer applications
- Doubled the amount of memory available for staging applications from 1024 MB to 2048 MB

Customers experiencing "out of memory" errors for their applications should try restaging their applications via 'cf restage' or 'cf restage --strategy rolling' to see if the issue is resolved.
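
The restage step above can be sketched as a small helper. This is an illustration rather than an official cloud.gov script: the app name `my-app` is hypothetical, and the helper only prints the command unless `DRY_RUN=0` is set.

```shell
# Sketch only: restage one app, using the rolling strategy by default.
# The app name below is hypothetical.
restage_app() {
  app="$1"
  strategy="${2:-rolling}"   # pass any other value for a plain restage

  # Build the command in the function's positional parameters.
  if [ "$strategy" = "rolling" ]; then
    set -- cf restage "$app" --strategy rolling
  else
    set -- cf restage "$app"
  fi

  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: show the command without running it.
    echo "$*"
  else
    "$@"
  fi
}

restage_app my-app   # prints: cf restage my-app --strategy rolling
```

The rolling strategy is the default here because it is intended to keep existing instances serving traffic while the restaged instances start.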

Please contact support@cloud.gov if you have further questions or concerns.

Feb 13, 2024 - 14:37 EST
Update - Due to an ongoing security incident, we have temporarily paused internal cloud.gov platform deployments. This pause will continue to delay resolution of the out of memory (OOM) issue that we are still addressing. We are continuing to mitigate the OOM issue in the meantime.

Please reach out to support@cloud.gov if you are experiencing issues with your applications and we will assist you with mitigations while we work to resolve these incidents.

Feb 12, 2024 - 13:40 EST
Monitoring - The cloud.gov team has deployed a fix and is monitoring the result. Customers whose applications are failing with memory-related errors should restage their applications with `cf restage` or `cf restage --strategy rolling` and reach out to cloud.gov support via support@cloud.gov if they continue to experience errors.
Feb 08, 2024 - 18:26 EST
Update - We believe the OOM errors may be caused by a bug in the latest stemcell version pushed to production on January 30. We are deploying an updated version which contains a fix. Deployment is expected to complete after east coast close of business. We will monitor the rollout and post updates as we have them.
Feb 08, 2024 - 12:25 EST
Investigating - Starting on January 30, some applications on cloud.gov have been experiencing intermittent out-of-memory (OOM) errors while staging or running on the platform. The cloud.gov team is investigating the issue. For apps experiencing OOM errors, 'cf restage' or 'cf restage --strategy rolling' may temporarily resolve the issue.
Feb 08, 2024 - 10:54 EST

About This Site

Scheduled maintenance and outage information for cloud.gov customers

Component status (uptime figures cover the past 90 days):

cloud.gov customer applications: Operational, 99.99% uptime
Applications: Operational, 99.95% uptime
Logs intake and storage: Operational, 100.0% uptime
Service - CDN (cdn-route): Operational, 100.0% uptime
Service - Relational databases (RDS): Operational, 100.0% uptime
Service - S3: Operational, 100.0% uptime
Service - Service account: Operational, 100.0% uptime
Redis: Operational, 100.0% uptime
Elasticsearch: Operational, 100.0% uptime
Service - Custom Domain Service: Operational, 100.0% uptime
External domain service: Operational, 100.0% uptime
External domain service - CDN: Operational, 100.0% uptime
cloud.gov customer access: Operational, 99.98% uptime
Dashboard: Operational, 100.0% uptime
Logs front end: Operational, 99.92% uptime
Login: Operational, 100.0% uptime
API: Operational, 100.0% uptime
cloud.gov Pages: Operational, 99.22% uptime
Web Application: Operational, 100.0% uptime
Builds: Operational, 97.68% uptime
Hosted Sites: Operational, 100.0% uptime
Services cloud.gov depends on: Operational, 100.0% uptime
AWS CloudFront: Operational
AWS elb-us-gov-west-1: Operational
AWS s3-us-gov-west-1: Operational
AWS rds-us-gov-west-1: Operational
AWS ec2-us-gov-west-1: Operational
AWS elasticsearch-us-gov-west-1: Operational
AWS elasticache-us-gov-west-1: Operational
GSA SecureAuth: Operational, 100.0% uptime
GSA Corporate Email: Operational, 100.0% uptime
cloud.gov website: Operational, 100.0% uptime
cloud.gov compliance notification: Operational
Services cloud.gov Pages depends on: Operational, 99.62% uptime
GitHub: Operational, 100.0% uptime
GitHub Webhooks: Operational, 99.25% uptime

Past Incidents
Mar 18, 2024

No incidents reported today.

Mar 17, 2024

No incidents reported.

Mar 16, 2024

No incidents reported.

Mar 15, 2024

No incidents reported.

Mar 14, 2024

No incidents reported.

Mar 13, 2024

No incidents reported.

Mar 12, 2024

No incidents reported.

Mar 11, 2024

No incidents reported.

Mar 10, 2024

No incidents reported.

Mar 9, 2024

No incidents reported.

Mar 8, 2024
Postmortem - Read details
Mar 13, 10:44 EDT
Resolved - The log cache system has been updated with the renewed certificate. Our testing indicates that real-time logs can now be successfully retrieved using the "cf logs" CLI commands.

As with all incidents, the cloud.gov team will conduct a post-mortem analysis of this incident in the coming days and post our findings here as an update.

Thank you for being a cloud.gov customer!

Mar 8, 19:35 EST
Update - We have renewed the certificate for the log cache component and we have started a full redeployment of our production system to apply the renewed certificates to the log cache.

It may take several hours for the renewed certificate to roll out through the system, but we will post an update once we can confirm the updated certificate has been applied.

Mar 8, 17:23 EST
Identified - We have received reports from customers that using the "cf logs" CLI command to retrieve logs from their applications is either not working or not showing recent logs.

Customers have confirmed that real-time logs are still being received in the customer logs Elasticsearch/Kibana instance at https://logs.fr.cloud.gov and are being sent correctly through log drains.

Our team has already identified the possible cause of this issue as an expired certificate for the Log Cache component, which is the component that the "cf logs" CLI command uses to retrieve logs. The certificate expired at approximately 1:18 PM ET. We are working to remediate the issue.
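
An expiry like this can, in principle, be caught ahead of time with a periodic check. The sketch below is illustrative only and is not the cloud.gov team's tooling; the certificate path and the 14-day window in the usage comment are hypothetical, and it assumes the certificate is available as a PEM file and that `openssl` is installed.

```shell
# Sketch only: report whether a PEM certificate expires within N days.
# Returns 0 if it expires (or has already expired) within the window,
#         1 if it stays valid beyond the window,
#         2 if the file cannot be read.
cert_expires_within() {
  cert_file="$1"
  days="$2"

  if [ ! -r "$cert_file" ]; then
    echo "cannot read certificate: $cert_file" >&2
    return 2
  fi

  seconds=$((days * 86400))

  # `openssl x509 -checkend N` exits 0 when the certificate is still
  # valid N seconds from now, and non-zero otherwise.
  if openssl x509 -checkend "$seconds" -noout -in "$cert_file" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Usage (hypothetical path and window):
#   if cert_expires_within /path/to/log-cache.pem 14; then
#     echo "certificate expires within 14 days" >&2
#   fi
```
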

Mar 8, 17:01 EST
Mar 7, 2024

No incidents reported.

Mar 6, 2024

No incidents reported.

Mar 5, 2024

No incidents reported.

Mar 4, 2024

No incidents reported.