Around 10 pm PT on July 1, the logging system that imports customer application logs into http://logs.fr.cloud.gov/ got stuck in a way that its normal health-checking and restarting systems could not detect. As a result, customer logs were not imported into logsearch until we understood the problem and restarted the stuck component at 1:32 pm PT on July 2. We lost approximately 15.5 hours of customer application logs.
We conducted an analysis of the reasons for this problem, focused on the structural and procedural conditions that enabled this outage to occur. We have planned a set of mitigations to address the root issues.
The logging system includes a code component with upstream improvements that we had not yet merged into our system. Some of these improvements may prevent the component from getting stuck.
We have incorporated the latest upstream changes into our logging system code.
We maintain an automated alerting system that notifies our team when components of the cloud.gov system stop working. This system correctly sent an alert when the problem started happening, but the alert language seemed to indicate that there was only a low-severity logging backup issue, rather than an important full outage of a logging component. This delayed our investigation. Once we investigated the alert and identified the severity of the issue, we resolved it within a few minutes.
We updated the alert text to be clearer and more actionable, so that if a similar situation happens again, we will notice, investigate, and resolve it more quickly.
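As a rough illustration of the distinction described above, the sketch below shows one way alert text could separate a full ingestion stall from a low-severity backlog. It is not our actual alerting configuration. The names `IngestStatus` and `build_alert`, the thresholds, and the message wording are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class IngestStatus:
    """Hypothetical snapshot of the log importer's state."""
    last_imported_at: datetime  # when a log line was last imported successfully
    queue_depth: int            # messages still waiting to be imported


def build_alert(
    status: IngestStatus,
    now: datetime,
    stall_after: timedelta = timedelta(minutes=10),  # assumed threshold
    backlog_limit: int = 10_000,                     # assumed threshold
) -> Optional[str]:
    """Return alert text, or None when ingestion looks healthy."""
    stalled_for = now - status.last_imported_at
    # A stall is judged by lack of progress, not by whether the process is alive.
    if stalled_for >= stall_after:
        minutes = int(stalled_for.total_seconds() // 60)
        return (
            f"CRITICAL: log ingestion has stopped. No customer logs imported "
            f"for {minutes} minutes. Restart the stuck component."
        )
    # Logs are still flowing, but slower than they arrive.
    if status.queue_depth > backlog_limit:
        return (
            f"WARNING: log ingestion backlog of {status.queue_depth} messages. "
            f"Logs are delayed but still being imported."
        )
    return None
```

The point of the sketch is that the stalled case names the impact and the action in the message itself, so a responder does not have to infer severity from a generic backlog warning.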