Log outage for OpenSearch

Incident Report for cloud.gov

Resolved

Status:
• The OpenSearch cluster has processed the entire backlog and is now ingesting logs in real time without delay.
• All indices are writable and healthy, and write throughput remains stable.

Resolution Details:
• We increased disk capacity on the affected data nodes and rebalanced shard allocation to clear the high‐watermark condition.
• The cluster’s health is green and all new log events are being indexed successfully.
• Live log streaming via cf logs APP_NAME continues to work as expected.

Next Steps:
• We will closely monitor disk usage and shard distribution over the next 24 hours to confirm the cluster stays healthy.
• If you notice any further issues with log visibility or performance, please open a support ticket.

Thank you for your patience. We apologize for any inconvenience.
Posted Apr 17, 2025 - 15:42 EDT

Monitoring

Update – 12:56 PM ET

Status:
- Logs are flowing into the OpenSearch cluster again, but indices are still catching up to real time.
- Full real-time ingestion is expected to resume within the next few hours.
- In the meantime, stream live application logs with:
cf logs APP_NAME

----

Technical Details

Durable storage & caching:
Application logs are first written to S3 for durability, then passed through a cache before landing in OpenSearch. This two‑step process ensures no data loss even if the cluster becomes temporarily unavailable.
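
To illustrate why this ordering protects against loss, here is a minimal sketch of a durable-first pipeline. It is a toy model, not the actual cloud.gov implementation: the class name, the in-memory lists standing in for S3, the cache, and OpenSearch, and the `index_writable` flag are all hypothetical and chosen only for illustration.

```python
from collections import deque


class LogPipeline:
    """Toy model (hypothetical): durable store first, then a cache that feeds the index."""

    def __init__(self):
        self.durable = []           # stands in for S3
        self.cache = deque()        # stands in for the cache in front of OpenSearch
        self.index = []             # stands in for OpenSearch
        self.index_writable = True  # False simulates the cluster rejecting writes

    def ingest(self, event):
        """Record the event durably, queue it, then try to index it."""
        self.durable.append(event)
        self.cache.append(event)
        self.drain()

    def drain(self):
        """Move cached events into the index, oldest first, while writes are accepted."""
        while self.cache and self.index_writable:
            self.index.append(self.cache.popleft())


pipeline = LogPipeline()
pipeline.ingest("event-1")

pipeline.index_writable = False   # simulate the write rejection during the outage
pipeline.ingest("event-2")
pipeline.ingest("event-3")
backlog_during_outage = len(pipeline.cache)

pipeline.index_writable = True    # cluster recovers
pipeline.drain()                  # backlog is replayed in arrival order
```

In this model, events that arrive while the index rejects writes stay in the cache and remain in the durable store, so they can be indexed once writes are accepted again.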

Root cause:
Several OpenSearch data nodes exceeded their disk‑usage high watermark. When this threshold is crossed, OpenSearch marks the affected indices as read‑only and rejects new writes.
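
For context, OpenSearch defines three disk watermarks. With the documented defaults (low 85%, high 90%, flood stage 95%), crossing the high watermark causes shards to be relocated away from the node, and crossing the flood-stage watermark applies a write block to indices with shards on that node. The thresholds configured on this cluster are not stated here. The sketch below classifies per-node disk usage against the default values. The function names and the sample node data are hypothetical, and treating "at or above" as crossing a threshold is a simplification.

```python
# OpenSearch's documented default disk watermarks, as percent of disk used.
# Assumption: the real cluster may be configured with different values.
DEFAULT_WATERMARKS = {"low": 85.0, "high": 90.0, "flood_stage": 95.0}


def classify_disk_usage(used_percent, watermarks=DEFAULT_WATERMARKS):
    """Return the highest watermark that used_percent has reached, or "ok"."""
    if used_percent >= watermarks["flood_stage"]:
        return "flood_stage"
    if used_percent >= watermarks["high"]:
        return "high"
    if used_percent >= watermarks["low"]:
        return "low"
    return "ok"


def nodes_at_or_over_high(nodes, watermarks=DEFAULT_WATERMARKS):
    """Return the sorted names of nodes at or above the high watermark."""
    return sorted(
        name
        for name, used in nodes.items()
        if classify_disk_usage(used, watermarks) in ("high", "flood_stage")
    )


# Hypothetical sample data: node name -> percent of disk used.
sample_nodes = {"data-0": 72.5, "data-1": 91.0, "data-2": 96.3}
affected = nodes_at_or_over_high(sample_nodes)
```

A check like this could run against per-node disk figures from the cluster's allocation statistics to flag nodes before they reach the point where writes are blocked.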

Mitigation:
We increased storage capacity on the affected nodes and rebalanced shard allocation across the cluster. The cluster is now healthy and processing the backlog of cached logs.

----

Next Update:
We will continue to monitor cluster health and ingestion rates. Our next status update will be posted by 3:30 PM ET, or sooner if anything changes.
Posted Apr 17, 2025 - 12:56 EDT

Investigating

No customer logs have appeared in OpenSearch (https://logs.fr.cloud.gov) since approximately 10:11 AM ET. We are investigating and will provide an update as soon as we know more.
Posted Apr 17, 2025 - 11:46 EDT
This incident affected: cloud.gov customer access (Logs front end).