Customer application logs gap
Incident Report for

What happened

Around 10 pm PT on July 1, the part of our logging system that imports customer application logs got stuck in a way that its normal health-checking and restarting systems could not detect. As a result, customer logs were not imported into logsearch until we understood the problem and restarted the stuck component at 1:32 pm PT on July 2. We lost approximately 15.5 hours of customer application logs.
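The 15.5-hour figure follows directly from the two timestamps above. A minimal sketch of the arithmetic, assuming the year 2018 (taken from the posting dates on this report) and treating the approximate 10 pm start as exact:

```python
from datetime import datetime

# Both times are Pacific Time, with the same UTC offset on both days,
# so naive datetimes are enough for the subtraction.
start = datetime(2018, 7, 1, 22, 0)   # around 10 pm PT on July 1 (approximate)
end = datetime(2018, 7, 2, 13, 32)    # 1:32 pm PT on July 2

gap = end - start
gap_hours = gap.total_seconds() / 3600

print(f"{gap_hours:.1f} hours")  # prints "15.5 hours"
```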

What we're doing

We conducted an analysis of the reasons for this problem, focused on the structural and procedural conditions that enabled this outage to occur. We have planned a set of mitigations to address the root issues.

Update Components

The logging system includes a code component with upstream improvements that we had not yet merged into our system. Some of these improvements may prevent the component from getting stuck.

We have incorporated the latest upstream changes into our logging system code.

Improve Alerts

We maintain an automated alerting system that notifies our team when components of the system stop working. This system correctly sent an alert when the problem started, but the alert language suggested a low-severity logging backup issue rather than a full outage of a logging component. This delayed our investigation. Once we investigated the alert and identified the severity of the issue, we resolved it within a few minutes.

We updated this alert text to be clearer and more actionable, so that if a similar situation happens again, we will notice, investigate, and resolve it more quickly.
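The difference between a backlog and a full stall can be made explicit in the alert wording itself. The following is an illustrative sketch only, not our actual alerting code: the `IntakeSample` type, its fields, and the message texts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IntakeSample:
    """One hypothetical measurement of the log intake component."""
    queued: int    # log messages waiting to be imported
    imported: int  # log messages imported since the previous sample


def alert_text(previous: IntakeSample, current: IntakeSample) -> str:
    """Return alert wording that separates a full stall from a backlog."""
    # Nothing imported while messages are waiting: treat as a full outage.
    if current.imported == 0 and current.queued > 0:
        return ("CRITICAL: log intake has stopped; no logs are being "
                "imported. Restart the intake component.")
    # Imports continue but the queue is growing: a lower-severity backlog.
    if current.queued > previous.queued:
        return "WARNING: log intake backlog is growing; imports continue."
    return "OK: log intake is keeping up."
```

For example, `alert_text(IntakeSample(10, 5), IntakeSample(50, 0))` returns the CRITICAL message, because nothing was imported while messages were waiting.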

Posted Jul 23, 2018 - 13:33 EDT

We’ve been monitoring the logging system, and it has worked normally since 1:32 pm Pacific, so we’re closing this incident as resolved. We will do further analysis of the issue and provide a postmortem explanation.
Posted Jul 02, 2018 - 18:01 EDT
We lost logs for applications from around 10 pm Pacific on July 1 until 1:32 pm Pacific on July 2. We have cleared the problem, and logs are now flowing, but we are still investigating the root cause.
Posted Jul 02, 2018 - 16:51 EDT
This incident affected: customer applications (Logs intake and storage).