Outage for Cloud.gov logging system

Incident Report for cloud.gov

Postmortem

The Cloud.gov team has conducted a post-mortem analysis of this incident. A timeline of the incident, our findings of what caused it, and our actions taken to prevent it from recurring are summarized below.

Timeline

  • On Monday, May 12, at 8:15 AM, a deployment of our production customer logging system was initiated to upgrade the system from OpenSearch version 2.19 to version 3.0.
  • At 9:12 AM, logs.fr.cloud.gov became inoperable due to a failure in the deployment of the nodes for OpenSearch Dashboards, which provide the visual user interface.
  • Once the outage was detected, the Cloud.gov team began investigating. We quickly discovered the issue and implemented a temporary fix.
  • At 9:38 AM, a new deployment was started in production to bring the OpenSearch Dashboards nodes back online.
  • By 10:10 AM, the new deployment had finished and the OpenSearch Dashboards nodes were restored to full health, running on version 3.0. At this time, the outage was resolved.

Findings

  • The deployment plan for the OpenSearch system was configured to update the nodes for different components (data nodes, Dashboards) in serial, one at a time.
  • In the deployment plan, the data nodes were upgraded before the Dashboards nodes.
  • In the initial deployment, the upgrade of the data nodes completed successfully without any issues; only the OpenSearch Dashboards nodes failed to upgrade.
  • When the Dashboards nodes attempted to upgrade individually, they recognized that the data nodes were on version 3.0, but that at least one of the Dashboards nodes was still on version 2.19, which prevented the deployment from succeeding. This error was observed in the logs for OpenSearch Dashboards:

    This version of OpenSearch Dashboards (v2.19.0) is incompatible with the following OpenSearch nodes in your cluster: v3.0.0  
    

  • To fix the deployment issues, the team updated the deployment plan to upgrade all OpenSearch Dashboards nodes at the same time rather than serially. As a result, all Dashboards nodes moved to version 3.0 together, and no incompatibility error with the data nodes occurred.
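
The version mismatch described in these findings can be illustrated with a minimal sketch. It is not the check that OpenSearch Dashboards performs and not part of the Cloud.gov deployment tooling. The function names are hypothetical, and the rule that versions are compatible only when their major versions match is a simplifying assumption made for illustration.

```python
from typing import List


def major(version: str) -> int:
    """Return the major component of a version string such as 'v3.0.0' or '2.19.0'."""
    return int(version.lstrip("v").split(".")[0])


def incompatible_nodes(dashboards_version: str, node_versions: List[str]) -> List[str]:
    """List the data node versions that a Dashboards node would reject.

    Assumption (illustrative only): a Dashboards node is compatible with a
    data node only when both share the same major version.
    """
    return [v for v in node_versions if major(v) != major(dashboards_version)]


# A Dashboards node still on 2.19.0 talking to data nodes already on 3.0.0.
mixed = incompatible_nodes("2.19.0", ["3.0.0", "3.0.0"])

# A Dashboards node on 3.0.0 talking to the same upgraded data nodes.
aligned = incompatible_nodes("3.0.0", ["3.0.0", "3.0.0"])

print("2.19.0 Dashboards rejects:", mixed)
print("3.0.0 Dashboards rejects:", aligned)
```

Under that assumption, the first call reports both data nodes as incompatible, which mirrors the log message above, while the second call reports none.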

Actions taken

To prevent this incident from recurring, we have taken the following actions:

As always, we appreciate your patience as a customer of Cloud.gov. If you have any questions about this incident, don’t hesitate to contact us at support@cloud.gov.

Posted May 14, 2025 - 17:19 EDT

Resolved

The Cloud.gov logging system is fully stable and healthy.

As with all incidents, we will be conducting a post-mortem of this outage in the coming days. Once our analysis is complete, we will share our findings and our plans to prevent a future recurrence of similar outages.

Thank you for your patience and for being a Cloud.gov customer.
Posted May 12, 2025 - 11:08 EDT

Monitoring

The logging system, https://logs.fr.cloud.gov/, is now up and appears to be responding normally to requests. We will continue to monitor the system closely for any issues.
Posted May 12, 2025 - 10:04 EDT

Identified

In the process of upgrading our Cloud.gov logging system from version 2.19 to 3.0, the deployment experienced issues resulting in downtime for https://logs.fr.cloud.gov. We are working to fix the problem and will post an update as soon as possible.
Posted May 12, 2025 - 09:59 EDT
This incident affected: cloud.gov customer access (Logs front end).