Logs delayed on logs.fr.cloud.gov

Incident Report for cloud.gov

Postmortem

Summary

On August 14, 2025, log ingestion on logs.fr.cloud.gov was intermittently delayed. During these periods, customers may not have seen their logs from the previous 1 to 2 hours. We believe we have identified and fixed the cause, so we do not expect the problem to recur.

Timeline

  • August 14, 9:13 AM - A smoke test fails for the logs system, logs.fr.cloud.gov. A Cloud.gov engineer begins investigating the test failure and notices that log ingestion is delayed
  • 10:30 AM - An engineer scales up the log ingestion infrastructure
  • 1:52 PM - Log ingestion rates return to normal and the system is ingesting logs in near real time again
  • 3:20 PM - A Cloud.gov engineer notices that log ingestion is delayed again
  • 4:12 PM - An engineer makes an OpenSearch configuration change to control how data ingestion is distributed across nodes
  • 5:47 PM - Log ingestion rates return to normal and the system is ingesting logs in near real time again

Impact

Intermittently, customers may not have seen their logs appear on logs.fr.cloud.gov in real time. Logs may have taken 1 to 2 hours to appear. Despite these delays, no logs were lost.

Root Cause

While investigating the second case of delayed log ingestion, we discovered that a disproportionate share of log ingestion was being allocated to a single data node, which overwhelmed that node’s CPU and delayed ingestion. To fix this issue, we updated an OpenSearch setting that controls how many shards of each index can be allocated to a single node, which allowed data ingestion to be distributed more evenly across all of the data nodes.
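
As a rough illustration of the kind of per-index shard cap described above, the sketch below computes a cap and builds a settings payload. The report does not name the setting or any values; this assumes the setting is index.routing.allocation.total_shards_per_node, and the shard, replica, and node counts are hypothetical, not the production numbers.

```python
import json
import math


def shards_per_node_cap(primary_shards: int, replicas: int, data_nodes: int) -> int:
    """Smallest per-node cap that still lets every shard copy be assigned."""
    if data_nodes < 1:
        raise ValueError("data_nodes must be at least 1")
    total_copies = primary_shards * (1 + replicas)
    return math.ceil(total_copies / data_nodes)


def allocation_settings_body(cap: int) -> str:
    """JSON body for an update-index-settings request (PUT /<index>/_settings).

    Assumption: the setting changed in the incident is
    index.routing.allocation.total_shards_per_node; the report does not say.
    """
    return json.dumps({"index.routing.allocation.total_shards_per_node": cap})


# Hypothetical numbers: 6 primary shards, 1 replica each, 4 data nodes.
cap = shards_per_node_cap(primary_shards=6, replicas=1, data_nodes=4)
body = allocation_settings_body(cap)
print(cap, body)
```

A cap set to the exact minimum leaves no headroom: if a data node is lost, some shard copies may stay unassigned until it returns, so a slightly higher value may be a safer choice.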

Since we changed the shard allocation setting, log ingestion rates have remained stable and logs have been ingested in near real time.

Next Steps

  • We will improve our internal documentation on diagnosing and troubleshooting delayed log ingestion.
  • We will leave the log ingestion infrastructure at its scaled-up levels.

Thank you for your patience. If you have any questions, please contact us at support@cloud.gov.

Posted Aug 15, 2025 - 11:38 EDT

Resolved

After we adjusted the data node configuration, log ingestion recovered and has stayed up to date.

As always, the Cloud.gov team takes these incidents very seriously. We will conduct a post-mortem analysis of this incident in the coming days and publish our findings.
Posted Aug 14, 2025 - 17:56 EDT

Identified

We are again having issues with our log ingestion rate for https://logs.fr.cloud.gov. Consequently, there may be a delay before customer application logs appear on the system.

We are actively investigating the cause of the slow log ingestion and working towards a solution.
Posted Aug 14, 2025 - 15:51 EDT
This incident affected: cloud.gov customer applications (Logs intake and storage).