Intermittent Redis service failures
Incident Report for cloud.gov
Postmortem

What Happened

Impact

Beginning August 6th, customer instances of Redis and Elasticsearch were subject to instability. This instability worsened on August 12th and became a partial outage for approximately 6 hours on August 20th.

For most affected high-availability (HA) clusters, the periods of instability meant degraded performance; for affected non-HA instances and a small subset of HA instances, this meant periodic downtime ranging from minutes to hours.

Background

Our brokered Elasticsearch and Redis services are run on a cloud.gov-managed Kubernetes (k8s) cluster.

When we update the cluster, our automation tool (BOSH) updates one worker node at a time, with the following procedure (a rough kubectl equivalent is sketched after the list):

  1. Prevent new instances (“pods”) from being scheduled to the worker
  2. Evict the existing pods from the worker
  3. Wait for the evicted pods to leave the worker
  4. Stop Kubernetes and its dependencies on the worker
  5. Upgrade code (scripts, binaries, etc.) on the worker
  6. Start Kubernetes and its dependencies on the worker
  7. Allow new pods to be scheduled to the worker
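
For reference, steps 1-3 and 7 above map onto standard kubectl operations. The following is an illustrative sketch, not our actual BOSH drain scripts; the node name is hypothetical, and flag support varies by kubectl version:

    # 1. Cordon the worker so no new pods are scheduled to it
    kubectl cordon worker-node-0

    # 2-3. Evict the existing pods and wait for them to leave the worker
    #      (--ignore-daemonsets skips DaemonSet-managed pods, which would
    #      otherwise block the drain)
    kubectl drain worker-node-0 --ignore-daemonsets --timeout=600s

    # Steps 4-6 (stopping Kubernetes, upgrading code on the worker, and
    # restarting) are handled by BOSH-specific scripts and are not shown.

    # 7. Uncordon the worker so new pods can be scheduled to it again
    kubectl uncordon worker-node-0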

What We’re Doing

Increasing and improving alerting

Our current alerts for k8s (and the components that depend on it) only fire when there is a failure or when we are already over capacity. We believe the system failed so severely because we did not start increasing capacity until we had already overloaded the cluster.

To prevent this from happening again, we’re working to improve alerts so we know when we’re nearing our anticipated capacity.
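
As an illustration only (this report does not describe our monitoring stack, so the tooling here is an assumption): if the capacity alerts were driven by Prometheus with kube-state-metrics, a "nearing capacity" warning on memory could look like the sketch below. The metric names and the 80% threshold are placeholders, and metric names differ across kube-state-metrics versions.

    # Hypothetical Prometheus alerting rule: warn when the memory requested by
    # pods approaches the total allocatable memory of the cluster's workers.
    groups:
      - name: k8s-capacity
        rules:
          - alert: ClusterMemoryNearCapacity
            expr: |
              sum(kube_pod_container_resource_requests{resource="memory"})
                /
              sum(kube_node_status_allocatable{resource="memory"})
                > 0.80
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Pod memory requests exceed 80% of allocatable cluster memory"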

Improving communications

We under-communicated during this incident, especially in its early stages. This was not a conscious decision to withhold information; rather, no one made the call to declare a service degradation. We're planning to:

  • Review our current procedures
  • Clarify thresholds to reduce guesswork about when to declare incidents
  • Add information to our alerting systems to suggest declaring incidents when appropriate

Stabilizing Kubernetes

We are running very old versions of both k8s and Docker. During this incident, we encountered several bugs and rough edges in both systems that are resolved in later versions. We're planning to either update our existing cluster or build out a new one.

Additionally, we had mistakenly believed that k8s proactively rebalances our cluster. We now know that it does not, and that our usage patterns make the imbalance worse, so we are planning to implement one of the cluster-balancing solutions available from the k8s community.
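
One community option of this kind (named here as an example of what we are evaluating, not a commitment) is the kubernetes-sigs descheduler, which periodically evicts pods from over-utilized nodes so the scheduler can place them more evenly. A sketch of a LowNodeUtilization policy in the descheduler's v1alpha1 format follows; the threshold values are illustrative only.

    # Hypothetical descheduler policy (kubernetes-sigs/descheduler).
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      "LowNodeUtilization":
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            # Nodes below all of these thresholds are considered under-utilized.
            thresholds:
              "cpu": 20
              "memory": 20
              "pods": 20
            # Pods are evicted from nodes above any of these thresholds, until
            # those nodes fall back below them.
            targetThresholds:
              "cpu": 50
              "memory": 50
              "pods": 60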

Timeline

  • 2019-08-06 (approximately)

    • We determined we did not have as much memory overhead on the Kubernetes cluster as we needed and decided to solve this by moving to larger instance types (i.e., more memory and CPU per worker)
    • Beginning at this point, some customers saw degraded performance
  • 2019-08-09

    • We tested deploying the larger instance types in our staging cluster without incident
  • 2019-08-12

    • 11:09 AM PDT - We promoted the staging configuration into production
    • 12:12 PM PDT - The first worker completed its update and reported success
    • 12:12 PM PDT - The second worker began draining, causing pods to schedule on the newly-updated worker
    • 12:12 PM PDT (approximately) - Some pods scheduled to the updated worker failed because their storage failed to attach. At this point, due to the reduced capacity in the cluster, non-HA customers began seeing intermittent outages, and HA customers began seeing increasingly degraded performance.
    • 12:24 PM PDT - The second worker being updated failed to drain, causing the configuration update deployment to fail
    • 12:24 - 1:45 PM PDT - We manually corrected the failed pods by forcibly detaching their storage and deleting the pods, causing them to reschedule
    • 1:48 PM PDT - We attempted the deployment again
    • 2:04 PM PDT - The deployment failed again
    • 2:19 PM PDT - We attempted the deployment again
    • 2:31 PM PDT - The deployment failed again
    • 3:00 PM PDT - We cordoned the new instance, causing it to be unavailable for new pods
  • 2019-08-14

    • 11:00 AM PDT - We pulled the entire engineering team into a meeting to discuss the incident. Until this point, only one engineer had been working on it because the team had not fully realized its scope and impact.
    • 11:00 AM PDT - We found that AWS enforces a hard limit on the number of attached volumes for the new instance type we were trying to upgrade to, and that we were probably hitting that limit. With the old instance type this was not a hard limit - AWS recommends no more than 40 attached volumes, and we were attaching as many as 38. Because of this, we decided to add more instances of the existing type rather than move to the same number of larger instances.
    • 12:00 PM PDT - We made a plan:

      • Spend the rest of the day getting the load on the staging cluster more similar to the load in production
      • Spend Thursday, 8/15, testing the upgrade plan
      • Spend Friday, 8/16, releasing into production
    • 12:25 PM PDT - We disabled non-HA Redis and Elasticsearch plans, preventing people from making new instances that were unlikely to function at all

  • 2019-08-15

    • We devoted the entire day to testing theories in our staging environment. Setup and testing were slower than anticipated, so we did not successfully release into staging by the end of the day
  • 2019-08-16

    • We continued testing in staging
    • We thought the drain step might be optional - a shut-down worker’s pods get rescheduled either way, and they are terminated in the same manner in either case. We tested the deployment without the drain step, and it failed due to timeouts. We misinterpreted the failures and ended up troubleshooting a non-existent issue.
  • 2019-08-19

    • We continued testing deployment strategies in staging
    • We discovered our previous misinterpretation and began troubleshooting the timeouts directly
    • We decided to modify the deployment’s drain stage to a best-effort model, so that instead of failing when pods remained on the node, it would evacuate as many as it could. We ran into issues testing this, though: the drain script is updated by the same process that updates the rest of the node, so rolling out the fixed drain script requires draining each node with the old script that does not yet include the fix.
  • 2019-08-20

    • We manually updated the drain scripts on all of the staging nodes and ran our staging deployment, which completed and passed acceptance tests
    • 12:31 PM PDT - We started the production deployment
    • 12:31 PM PDT - We realized we had not merged the pull request to add more instances and stopped the deployment
    • 12:32 PM PDT - We merged the pull request
    • 12:47 PM PDT - We started the production deployment again
    • 12:47 PM PDT - We realized that the configuration was pointed at an old branch that would actually scale production down
    • 12:48 PM PDT - We stopped the deployment in our CI server. This appeared successful, but we later found that it had not canceled the scale-down
    • 12:50 PM PDT - We changed the configuration branch back to master, triggering staging acceptance tests to run again
    • 1:11 PM PDT - The acceptance test suite failed due to a timeout on one of ten tests
    • 1:34 PM PDT - The acceptance test suite failed again due to a timeout on a different test
    • 2:12 PM PDT - The deployment we believed had been canceled scaled the production cluster down, reducing capacity by 12.5%. At this point, many non-HA instances became completely unavailable, many HA instances became very unstable, and some HA instances became completely unavailable.
    • 2:29 PM PDT - The acceptance test suite failed yet again due to a timeout on another, different test
    • 2:30 PM PDT - We verified that the failing test had actually completed its operations, just slightly slower than the test expected
    • 2:34 PM PDT - We made the decision to bypass the acceptance tests
    • 2:34 PM PDT - We started the production deployment again
    • 2:35 PM PDT - The production deployment paused because the deployment tool was unable to lock the resource it needed
    • 2:40 PM PDT (approx) - We found that the deployment we canceled at 12:48 PM had continued to run, despite appearing canceled in our CI server
    • 5:55 PM PDT - The reduced capacity in production began causing worse failures, so we updated the status from “Degraded Service” to “Partial Outage”
    • 6:54 PM PDT - The new worker nodes came online, increasing our capacity to approximately 20% more than the original level prior to 8/11
    • 7:16 PM PDT - We started manually forcing pods to schedule on the new nodes. At this point, affected customers began coming back online
    • 10:58 PM PDT - We believed all but four pods had recovered. We determined that all four unhealthy pods were part of HA deployments and had tried, unsuccessfully, to schedule them. We made the call that we had stabilized the cluster enough for the evening.
  • 2019-08-21

    • 08:55 AM PDT - We began working again on the last failing pods. We discovered at this point that the filter we’d been using to find unhealthy pods was overly strict, and there were actually ten unhealthy pods
    • 10:15 AM PDT - All pods were healthy and we declared this incident resolved
Posted Aug 23, 2019 - 17:32 EDT

Resolved
As of approximately 10 AM PDT today, all customer Redis and Elasticsearch instances were fully operational, and no new failures have occurred since then.
Posted Aug 21, 2019 - 15:57 EDT
Update
The fix we applied seems to have worked, and the Redis and Elasticsearch services are now operational.
We are continuing to monitor to ensure there are no further problems.
Posted Aug 21, 2019 - 12:37 EDT
Monitoring
As of approximately 19:00 PDT, we increased the capacity of the backing kubernetes cluster to a level we believe will provide stability.
Since then, we have been restoring failing customer Redis and Elasticsearch instances. Most of these are now healthy, and we are working to fix the rest of them.
Posted Aug 21, 2019 - 00:23 EDT
Update
We've begun implementing a fix for this issue.
Due to an unexpected error, the process of deploying the fix will result in temporarily reduced capacity on the backing kubernetes clusters, resulting in increased instability for customer Elasticsearch and Redis instances.
We anticipate capacity to begin increasing by 8 PM Pacific time.
Posted Aug 20, 2019 - 20:55 EDT
Update
We have identified the technical issue causing these problems for Elasticsearch and Redis. We have implemented a solution in our staging environment and have verified that it resolves the problem. We’re working to make sure that our deployment process for this solution works well, and after we have finished that work, we expect to deploy this solution to production.
Posted Aug 20, 2019 - 17:14 EDT
Update
The Kubernetes cluster backing customer Redis and Elasticsearch instances is experiencing issues balancing load. This is causing intermittent failures for some customers. A bugfix was tested on Friday, August 16th; however, this did not perform in a satisfactory manner. We’ve increased capacity in production to help alleviate the issue. As of August 19, 2019 @ 1:45PM ET, cloud.gov staff continue to troubleshoot the root cause of the degraded operation and are working to restore normal operation of all services.
Posted Aug 19, 2019 - 17:04 EDT
Identified
We believe we've identified the problem with scaling out the backing service that supports Redis and Elasticsearch. We are going to update that service in a couple of hours in order to restore the degraded service to normal operation. We anticipate this operation will take about 3 hours.
Posted Aug 16, 2019 - 16:10 EDT
Update
The Kubernetes cluster backing customer Redis and Elasticsearch instances is experiencing issues balancing load. This is causing intermittent failures for some customers. As of August 15, 2019 @ 2:30PM, cloud.gov staff have identified suspect components that appear to be causing these failures and are testing a bugfix to restore normal operation of all services.
Posted Aug 15, 2019 - 16:35 EDT
Investigating
The kubernetes cluster backing customer Redis and Elasticsearch instances is experiencing issues balancing load. This is causing intermittent failures for some customers.
Posted Aug 14, 2019 - 17:12 EDT
This incident affected: cloud.gov customer applications (Redis, Elasticsearch).