Log cache component not returning logs
Incident Report for cloud.gov
Postmortem

As part of our normal incident response process, we conducted a post-mortem analysis to determine why this incident occurred and how to improve our operations going forward.

Our main findings as to why this incident occurred were:

  • Monitoring for upcoming certificate expirations is currently a manual process
  • The week of this incident in particular was very busy due to other incidents
  • The user interface for monitoring expiring certificates shows some “false positives”, which create confusion

To address these findings and to prevent a recurrence of a similar incident in the future, we have planned the following work:

  • Remove the “false positive” expired certificates in our certificate monitoring tool
  • Add Slack alerts for expiring certificates to make the review process less manual and ensure that expiring certificates don’t get missed
  • Schedule formal handoffs between engineers on maintenance rotations who are responsible for certificate renewal to ensure continuity of operations
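
As an illustration of the planned expiry alerting, the check below is a minimal sketch and not the cloud.gov implementation. It assumes certificate expiry times are already available as timezone-aware UTC datetimes. The `CertRecord` type, the 30-day threshold, and the function names are hypothetical. The returned payload is a dictionary with a single "text" field, which could be posted to a chat webhook; the sending step is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class CertRecord:
    name: str            # hypothetical label for the certificate
    not_after: datetime  # expiry time, timezone-aware (UTC)


def expiring_soon(certs, now, threshold=timedelta(days=30)):
    """Return certificates that are expired or expire within the threshold, soonest first."""
    flagged = [c for c in certs if c.not_after - now <= threshold]
    return sorted(flagged, key=lambda c: c.not_after)


def build_alert(certs, now):
    """Build a JSON-serializable payload with one line per flagged certificate."""
    lines = []
    for c in certs:
        remaining = c.not_after - now
        if remaining <= timedelta(0):
            status = "EXPIRED"
        else:
            status = f"expires in {remaining.days} day(s)"
        lines.append(f"{c.name}: {status} ({c.not_after:%Y-%m-%d %H:%M} UTC)")
    return {"text": "\n".join(lines)}
```

Running a check like this on a schedule, rather than relying on someone to review a dashboard, would mean a busy week does not delay the warning.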

As always, we appreciate your patience and thank you for being a cloud.gov customer. If you have any questions, don’t hesitate to contact us at support@cloud.gov.

Posted Mar 13, 2024 - 10:44 EDT

Resolved
The log cache system has been updated with the renewed certificate. Our testing indicates that real-time logs can now be successfully retrieved using the "cf logs" CLI commands.

As with all incidents, the cloud.gov team will conduct a post-mortem analysis of this incident in the coming days and post our findings here as an update.

Thank you for being a cloud.gov customer!
Posted Mar 08, 2024 - 19:35 EST
Update
We have renewed the certificate for the log cache component and started a full redeployment of our production system to apply the renewed certificate to the log cache.

It may take several hours for the renewed certificate to roll out through the system, but we will post an update once we can confirm the updated certificate has been applied.
Posted Mar 08, 2024 - 17:23 EST
Identified
We have received reports from customers that using the "cf logs" CLI command to retrieve logs from their applications is either not working or not showing recent logs.

Customers have confirmed that real-time logs are still being received in the customer logs Elasticsearch/Kibana instance at https://logs.fr.cloud.gov and are being sent correctly through log drains.

Our team has already identified the possible cause of this issue as an expired certificate for the Log Cache component, which is the component that the "cf logs" CLI command uses to retrieve logs. The certificate expired at approximately 1:18 PM ET. We are working to remediate the issue.
Posted Mar 08, 2024 - 17:01 EST
This incident affected: cloud.gov customer access (Logs front end).