In the cloud.gov GovCloud environment, the system component that collects and stores customer application logs (Logsearch, which powers logs.fr.cloud.gov) unexpectedly stopped working from January 10 at 7:27 pm EST to January 11 at 3:04 pm EST. For approximately 12 hours of that period, the system did not store incoming customer application logs for applications in the GovCloud environment.
We apologize for this loss of log data; we deeply understand how important it is to have consistent and continuous logs for all applications and systems.
We’ve analyzed how this outage happened. We’re working with the maintainers of the relevant open source project (Riemann) to fix a bug that contributed to the data loss, and we’re improving our alerting and storage systems to be more resilient to unexpected bugs. We’re sharing our analysis below to explain how we’re making sure this doesn’t happen again, and to help maintainers of other systems that use Riemann prevent this issue.
Several days ago, we changed the configuration of the log system in the cloud.gov GovCloud environment (logs.fr.cloud.gov) to increase the number of days of customer application logs that it stores for online customer access. We didn’t realize that we also needed to increase the configured storage capacity for log data. A few days later, the log storage system filled up with logs, so it couldn’t accept new ones.
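The interaction between retention and capacity comes down to simple arithmetic: keeping more days of logs multiplies the disk required. Here’s a rough sketch of that check; all of the figures are invented for illustration and are not cloud.gov’s actual numbers:

```python
def required_storage_gb(daily_log_volume_gb, retention_days):
    """Disk needed to keep `retention_days` of logs at a given daily volume."""
    return daily_log_volume_gb * retention_days

# Hypothetical figures: doubling retention from 7 to 14 days doubles the
# disk requirement, which can silently exceed a fixed storage allocation.
CAPACITY_GB = 400
if required_storage_gb(40, 14) > CAPACITY_GB:
    print("retention increase exceeds allocated storage")
```

A check like this, run whenever the retention setting changes, would flag the mismatch before the disks fill.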
cloud.gov has automatic tests that check whether the system is storing logs, and those tests correctly failed because the storage system was full. When tests fail, they’re supposed to trigger the monitoring system to send alerts to our team. However, the monitoring component that should have alerted us had crashed silently due to an unexpected bug, so we received no alert about this situation.
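One general defense against this failure mode is a watchdog that monitors the monitoring component itself, for example by verifying that it has written a recent heartbeat. A minimal sketch of the pattern — the threshold and names here are our own illustration, not part of any cloud.gov or Riemann API:

```python
import time

# Assumed threshold: treat the monitor as dead if no heartbeat in 5 minutes.
HEARTBEAT_MAX_AGE_SECONDS = 300

def monitor_is_alive(last_heartbeat_ts, now=None):
    """True if the monitoring process has recorded a heartbeat recently enough."""
    if now is None:
        now = time.time()
    return (now - last_heartbeat_ts) <= HEARTBEAT_MAX_AGE_SECONDS
```

An independent scheduler on a separate host can run this check and escalate when it returns False, so a silent crash of the monitor surfaces as an alert rather than as silence.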
Ordinarily, short outages in the log system do not result in data loss, because a buffer (a Redis queue) holds logs until they can be stored properly, giving us time to fix problems when we see alerts. However, the buffer was not large enough to hold the logs generated during the hours between the failed automatic test and our discovery of the main storage problem. Once the buffer filled up, newly generated logs were lost.
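The window a buffer buys is just its capacity divided by the rate at which logs arrive. A back-of-the-envelope sketch, again with invented numbers rather than cloud.gov’s actual figures:

```python
def hours_of_buffering(buffer_capacity_mb, ingest_rate_mb_per_hour):
    """How long a fixed-size queue (such as a Redis buffer) can absorb
    incoming logs while the downstream store is unavailable."""
    return buffer_capacity_mb / ingest_rate_mb_per_hour

# Hypothetical: a 1 GiB buffer at 256 MB/hour of incoming logs covers only
# about 4 hours -- far less than an overnight outage.
window = hours_of_buffering(1024, 256)
```

Sizing the buffer to cover the longest plausible gap between a failure and a human response (for example, an overnight outage) is what makes the queue an effective safety net.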
We noticed the storage problem the next morning, when we saw a failed deployment (due to the failed automatic tests) in our continuous deployment system. We opened a StatusPage notification and began resolving the issue. We fixed the main storage problem, which restored the availability of the log system. However, we saw that no new logs were being recorded in the system. We restarted the components that move logs from the buffer to the log storage system, which resolved the issue.
The queued logs surged in, then returned to normal weekday levels, as shown in this chart from logs.fr.cloud.gov (timestamps are in PST):
In analyzing this outage, we have identified multiple steps to prevent this and similar problems from occurring in the future.
In addition, this quarter we’re focusing on improving the resilience of existing cloud.gov subsystems, including logging, and we’ve delayed adding significant new platform features until we’ve completed this work.