Beginning August 6th, customer instances of Redis and Elasticsearch were subject to instability. This instability worsened August 12th and became a partial outage for approximately 6 hours on August 19th.
For most affected high-availability (HA) clusters, the periods of instability meant degraded performance; for affected non-HA instances and a small subset of HA instances, this meant periodic downtime ranging from minutes to hours.
Our brokered Elasticsearch and Redis services are run on a cloud.gov-managed Kubernetes (k8s) cluster.
When we update the cluster, our automation tool (BOSH) updates one worker node at a time, with the following procedure:
Our current alerts for k8s (and the components that depend on it) only alert us when there is a failure or when we are over capacity. We believe the system failed so severely because we did not start increasing capacity until we had already overloaded the cluster.
To prevent this from happening again, we’re working to improve alerts so we know when we’re nearing our anticipated capacity.
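As a rough illustration of the difference between alerting at capacity and alerting before it, here is a minimal sketch of a threshold check. The function name `capacity_status`, the 80% warning ratio, and the status labels are assumptions made for illustration, not values or tooling from our monitoring setup.

```python
def capacity_status(used: float, total: float, warn_ratio: float = 0.8) -> str:
    """Classify cluster usage as "ok", "warning", or "over".

    `warn_ratio` is a hypothetical early-warning threshold: the fraction
    of total capacity at which we would want to start adding capacity.
    """
    if total <= 0:
        raise ValueError("total capacity must be positive")
    ratio = used / total
    if ratio >= 1.0:
        # At or beyond capacity: the only capacity condition the old alerts caught.
        return "over"
    if ratio >= warn_ratio:
        # Nearing capacity: the early signal that was missing.
        return "warning"
    return "ok"
```

With a check like this, usage of 85 out of 100 units would be reported as "warning" rather than going unnoticed until the cluster is overloaded.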
We under-communicated this incident, especially in its early stages. This was not a conscious decision to withhold communication; rather, nobody made the decision to declare a service degradation. We’re planning to:
We are running very old versions of both k8s and Docker. During this incident, we encountered several bugs and rough edges in both systems that are resolved in later versions. We’re planning to update our existing cluster or build out a new cluster.
Additionally, we had mistakenly believed that k8s proactively balances our cluster. We now know that it does not and that our usage patterns exacerbate the imbalance, so we are planning to implement one of the cluster-balancing solutions available from the k8s community.
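To make the balancing problem concrete, the sketch below flags worker nodes that carry noticeably more pods than the cluster average. It is only an illustration: the function name `overloaded_nodes`, the 25% tolerance, and the use of pod count as the load measure are all assumptions, and it is not how any particular community balancing tool works.

```python
from statistics import mean
from typing import Dict, List


def overloaded_nodes(pods_per_node: Dict[str, int], tolerance: float = 0.25) -> List[str]:
    """Return names of nodes whose pod count exceeds the mean by more than `tolerance`.

    `pods_per_node` maps a (hypothetical) node name to the number of pods on it.
    """
    if not pods_per_node:
        return []
    average = mean(pods_per_node.values())
    limit = average * (1 + tolerance)
    # Sort for a stable, readable result.
    return sorted(name for name, count in pods_per_node.items() if count > limit)
```

A report like this only detects imbalance. Correcting it still requires moving workloads off the flagged nodes, which is the part a cluster-balancing solution would handle.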
2019-08-06 (approximately)
2019-08-09
2019-08-12
2019-08-14
12:00 PM PDT - We made a plan:
12:25 PM PDT - We disabled non-HA Redis and Elasticsearch plans, preventing people from creating new instances that were unlikely to function at all.
2019-08-15
2019-08-16
2019-08-19
2019-08-20
2019-08-20