Status:
• The OpenSearch cluster has processed the entire backlog and is now ingesting logs in real time without delay.
• All indices are writable and healthy, and write throughput remains stable.
Resolution Details:
• We increased disk capacity on the affected data nodes and rebalanced shard allocation to clear the high‐watermark condition.
• The cluster’s health is green and all new log events are successfully indexed.
• Live log streaming via cf logs APP_NAME continues to work as expected.
Next Steps:
• We will keep a heightened watch on disk usage and shard distribution over the next 24 hours to ensure sustained health.
• If you notice any further issues with log visibility or performance, please open a support ticket.
Thank you for your patience. We apologize for any inconvenience.
Posted Apr 17, 2025 - 15:42 EDT
Monitoring
Update – 12:56 PM ET
Status:
- Logs are flowing into the OpenSearch cluster again, but indices are still catching up to real time.
- Full real-time ingestion is expected to resume within the next few hours.
- In the meantime, stream live application logs with: cf logs APP_NAME
----
Technical Details
Durable storage & caching: Application logs are first written to S3 for durability, then passed through a cache before landing in OpenSearch. This two‑step process ensures no data loss even if the cluster becomes temporarily unavailable.
Root cause: Several OpenSearch data nodes exceeded their disk‑usage high watermark. When this threshold is crossed, OpenSearch marks the affected indices as read‑only and rejects new writes.
Mitigation: We increased storage capacity on the affected nodes and rebalanced shard allocation across the cluster. The cluster is now healthy and processing the backlog of cached logs.
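As a rough illustration of the disk-usage condition described above, the following Python sketch classifies data nodes by how full their disks are. The thresholds are assumptions: they are the commonly documented OpenSearch defaults for the low, high, and flood-stage disk watermarks (85%, 90%, 95%), not values taken from this incident. The node names, the `NodeDisk` type, and the helper functions are hypothetical.

```python
from dataclasses import dataclass

# Assumed thresholds (percent of disk used). These mirror commonly documented
# OpenSearch defaults; the values configured on the affected cluster are not
# stated in this report.
LOW_WATERMARK = 85.0
HIGH_WATERMARK = 90.0
FLOOD_STAGE_WATERMARK = 95.0


@dataclass
class NodeDisk:
    """Disk usage for one data node (hypothetical structure)."""
    name: str
    used_bytes: int
    total_bytes: int

    @property
    def used_percent(self) -> float:
        return 100.0 * self.used_bytes / self.total_bytes


def classify(node: NodeDisk) -> str:
    """Return the highest watermark the node has reached, or "ok"."""
    pct = node.used_percent
    if pct >= FLOOD_STAGE_WATERMARK:
        return "flood_stage"
    if pct >= HIGH_WATERMARK:
        return "high"
    if pct >= LOW_WATERMARK:
        return "low"
    return "ok"


def nodes_over_high_watermark(nodes: list[NodeDisk]) -> list[str]:
    """Names of nodes at or above the assumed high watermark."""
    return [n.name for n in nodes if n.used_percent >= HIGH_WATERMARK]
```

A check like this could be fed from any source of per-node disk figures; alerting before a node reaches the high watermark leaves time to add capacity or rebalance shards.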
----
Next Update: We will continue to monitor cluster health and ingestion rates. Our next status update will be posted by 3:30 PM ET, or sooner if anything changes.
Posted Apr 17, 2025 - 12:56 EDT
Investigating
No customer logs have appeared in OpenSearch (https://logs.fr.cloud.gov) since approximately 10:11 AM ET. We are investigating and will provide an update as soon as we know more.
Posted Apr 17, 2025 - 11:46 EDT
This incident affected: cloud.gov customer access (Logs front end).