In the cloud.gov GovCloud environment, the system component that collects and stores customer application logs (Logsearch, which powers logs.fr.cloud.gov) unexpectedly stopped working from January 10 at 7:27 pm EST to January 11 at 3:04 pm EST. For approximately 12 hours of that period, the system did not store incoming customer application logs for applications in the GovCloud environment.
We apologize for this loss of log data; we deeply understand how important it is to have consistent and continuous logs for all applications and systems.
We’ve analyzed how this outage happened. We’re working with the maintainers of the relevant open source project (Riemann) to fix a bug that contributed to the data loss, and we’re improving our alerting and storage systems to be more resilient to unexpected bugs. We’re sharing our analysis below to explain how we’re making sure this doesn’t happen again, and to help maintainers of other systems that use Riemann prevent this issue.
Several days ago, we changed the configuration of the log system in the cloud.gov GovCloud environment (logs.fr.cloud.gov) to increase the number of days of customer application logs that it stores for online customer access. We didn’t realize that we also needed to increase the configured storage capacity for log data. A few days later, the log storage system filled up with logs, so it couldn’t accept new ones.
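The interaction between retention and capacity comes down to simple arithmetic: keeping more days of logs multiplies the disk required. Here’s a rough sketch of that check; all of the figures are invented for illustration and are not cloud.gov’s actual numbers:

```python
def required_storage_gb(daily_log_volume_gb, retention_days):
    """Disk needed to keep `retention_days` of logs at a given daily volume."""
    return daily_log_volume_gb * retention_days

# Hypothetical figures: doubling retention from 7 to 14 days doubles the
# disk requirement, which can silently exceed a fixed storage allocation.
CAPACITY_GB = 400
if required_storage_gb(40, 14) > CAPACITY_GB:
    print("retention increase exceeds allocated storage")
```

A check like this, run whenever the retention setting changes, would flag the mismatch before the disks fill.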
cloud.gov has automatic tests that check whether the system is storing logs, and those tests correctly failed because the storage system was full. When tests fail, they’re supposed to trigger the monitoring system to send alerts to our team. However, the monitoring component that should have alerted us had crashed silently due to an unexpected bug, so we received no alert about this situation.
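One general defense against this failure mode is a watchdog that monitors the monitoring component itself, for example by verifying that it has written a recent heartbeat. A minimal sketch of the pattern — the threshold and names here are our own illustration, not part of any cloud.gov or Riemann API:

```python
import time

# Assumed threshold: treat the monitor as dead if no heartbeat in 5 minutes.
HEARTBEAT_MAX_AGE_SECONDS = 300

def monitor_is_alive(last_heartbeat_ts, now=None):
    """True if the monitoring process has recorded a heartbeat recently enough."""
    if now is None:
        now = time.time()
    return (now - last_heartbeat_ts) <= HEARTBEAT_MAX_AGE_SECONDS
```

An independent scheduler on a separate host can run this check and escalate when it returns False, so a silent crash of the monitor surfaces as an alert rather than as silence.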
Ordinarily, short outages in the log system do not result in data loss, because a buffer (a Redis queue) holds logs until they can be stored properly, giving us time to fix problems when we see alerts. However, the buffer was not large enough to hold the logs generated during the hours between the failed automatic test and our discovery of the main storage problem. Once the buffer filled up, newly generated logs were lost.
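The window a buffer buys is just its capacity divided by the rate at which logs arrive. A back-of-the-envelope sketch, again with invented numbers rather than cloud.gov’s actual figures:

```python
def hours_of_buffering(buffer_capacity_mb, ingest_rate_mb_per_hour):
    """How long a fixed-size queue (such as a Redis buffer) can absorb
    incoming logs while the downstream store is unavailable."""
    return buffer_capacity_mb / ingest_rate_mb_per_hour

# Hypothetical: a 1 GiB buffer at 256 MB/hour of incoming logs covers only
# about 4 hours -- far less than an overnight outage.
window = hours_of_buffering(1024, 256)
```

Sizing the buffer to cover the longest plausible gap between a failure and a human response (for example, an overnight outage) is what makes the queue an effective safety net.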
We noticed the storage problem the next morning, when we saw a failed deployment (due to the failed automatic tests) in our continuous deployment system. We opened a StatusPage notification and began resolving the issue. We fixed the main storage problem, which restored the availability of the log system. However, we saw that no new logs were being recorded in the system. We restarted the components that move logs from the buffer to the log storage system, which resolved the issue.
The queued logs surged in, then returned to normal weekday levels, as shown in this chart from logs.fr.cloud.gov (timestamps are in PST):
In analyzing this outage, we have identified multiple steps to prevent this and similar problems from occurring in the future.
In addition, this quarter we’re focusing on improving the resilience of existing cloud.gov subsystems, including logging, and we’ve delayed adding significant new platform features until we’ve completed this work.