Some customer instances of Redis and Elasticsearch were degraded or unavailable beginning at 16:46 PDT on September 3rd and ending at 16:47 PDT on September 4th.
Our brokered Elasticsearch and Redis services are run on a cloud.gov-managed Kubernetes (k8s) cluster.
High-availability (HA) instances of Redis have 3 masters, 3 sentinels, and 2 proxies per cluster. HA Elasticsearch clusters have 3 masters and 3 data nodes.
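For context, clients of a Sentinel-managed Redis deployment don't connect to a fixed master address; they ask the sentinels which node currently holds the master role. Here's a minimal sketch of that discovery flow using redis-py; the hostnames, port, and the `mymaster` service name are illustrative, not our actual configuration:

```python
# Minimal sketch: discover the current Redis master through Sentinel.
# Hostnames, port, and the "mymaster" service name are illustrative.
from redis.sentinel import Sentinel

sentinel = Sentinel(
    [("sentinel-0", 26379), ("sentinel-1", 26379), ("sentinel-2", 26379)],
    socket_timeout=1.0,
)

# Ask the sentinels which node they currently consider the master.
host, port = sentinel.discover_master("mymaster")
print(f"current master: {host}:{port}")

# Or get a client that transparently follows the master as it moves.
master = sentinel.master_for("mymaster", socket_timeout=1.0)
master.ping()
```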
Thursday, August 29th:
- we updated the k8s cluster in production. The release ran without incident, but we discovered afterwards that it had reverted some changes the cluster relies on during deployments. We decided to re-release on Tuesday so we could have operators on hand to follow the update.
Tuesday, September 3rd:
- 12:15 PDT - we began the k8s release to production
- 16:46 PDT - the first Redis instance failed
- 17:00 PDT (approx.) - we began fixing individual failed instances; this continued until 23:30 PDT
- 19:38 PDT - the k8s deployment finished
- 23:30 PDT - we determined that all of the unhealthy instances fell into a single failure category we’d seen previously, one that requires restarting a Kubernetes node to resolve, and that the issue was spread across 3 Kubernetes nodes. Because a restart could take additional instances offline, and given the late hour, we decided to hold until morning.
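Restarting a node can take other tenants' instances offline along with it, which is why we held off. A standard way to reduce that risk is to cordon the node and evict its pods so they reschedule elsewhere before the restart. The sketch below uses the official Kubernetes Python client; the node name is hypothetical, and this shows the generic cordon-and-drain pattern rather than our exact procedure:

```python
# Sketch of the generic cordon-and-drain pattern before a node restart.
# The node name is hypothetical; DaemonSet pods and other drain details
# (grace periods, PodDisruptionBudgets) are omitted for brevity.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NODE = "k8s-worker-3"  # hypothetical node name

# Cordon: mark the node unschedulable so no new pods land on it.
v1.patch_node(NODE, {"spec": {"unschedulable": True}})

# Evict each pod currently on the node so it reschedules elsewhere.
pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={NODE}")
for pod in pods.items:
    v1.create_namespaced_pod_eviction(
        name=pod.metadata.name,
        namespace=pod.metadata.namespace,
        body=client.V1Eviction(
            metadata=client.V1ObjectMeta(
                name=pod.metadata.name,
                namespace=pod.metadata.namespace,
            )
        ),
    )
```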
Wednesday, September 4th:
- 06:37 PDT - we began troubleshooting again. Due to a bad handoff, the morning operators duplicated some of the previous night’s troubleshooting steps
- 09:22 PDT - we restarted the first of the three nodes we believed needed restarting
- 09:48 PDT - we restarted the second of those nodes
- 10:06 PDT - we restarted the third
- 10:06-16:47 PDT - we worked to troubleshoot individual containers, pods, and clusters. Notably, during this time we discovered three new failure modes that we had either not seen or not understood before, all involving HA Redis and Elasticsearch clusters being unable to determine which node is the master (see the sketch after this timeline)
- 16:47 PDT - the last failing cluster was corrected.
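The Redis side of this failure mode can be made visible by asking each sentinel independently which node it believes is the master and comparing the answers: disagreement, or no answer at all, means the cluster can’t reliably route writes. A rough sketch of that check, again with illustrative hostnames and service name:

```python
# Sketch: compare each sentinel's view of which node is the master.
# Hostnames and the "mymaster" service name are illustrative.
import redis

SENTINELS = [("sentinel-0", 26379), ("sentinel-1", 26379), ("sentinel-2", 26379)]

views = set()
for host, port in SENTINELS:
    try:
        conn = redis.Redis(host=host, port=port, socket_timeout=1.0)
        # SENTINEL GET-MASTER-ADDR-BY-NAME returns the (ip, port) this
        # particular sentinel currently believes is the master.
        ip, master_port = conn.execute_command(
            "SENTINEL", "GET-MASTER-ADDR-BY-NAME", "mymaster"
        )
        views.add((ip, master_port))
    except redis.RedisError as exc:
        print(f"sentinel {host}:{port} unreachable: {exc}")

if len(views) == 1:
    print(f"sentinels agree on master: {views.pop()}")
else:
    print(f"no consensus on master ({views}); manual intervention likely")
```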
What we’re doing
This incident shared many of the same problems as our most recent previous incident. We are still working through the action items from that incident, and many of the findings of our internal postmortem of this one underscored the need to complete them.
Looking for alternative solutions
In addition to the prior tasks to improve the stability of this Kubernetes cluster, we’re also investigating alternatives. Specifically, we hope to replace our home-grown solution with AWS ElastiCache (for Redis) and AWS Elasticsearch Service (for Elasticsearch). This effort has been ongoing but slow due to compliance requirements; we’ve now requested a Risk-Based Decision from FedRAMP to speed up approval so we can deliver this stability sooner.
Note that this is not a done deal and does not have an ETA - this is a path we’re looking into.
Increasing individual cluster robustness
During this incident, we discovered application-level failures within customer Redis and Elasticsearch clusters: multiple customer clusters became unavailable because they could not determine which node was the master. We’re looking into how to make the clusters more robust.
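For Elasticsearch, the equivalent symptom is a cluster with no elected master. Below is a small sketch of the kind of health check that surfaces this, using the official Python client; the endpoint URL is illustrative:

```python
# Sketch: check whether an Elasticsearch cluster is healthy and has an
# elected master. The endpoint URL is illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch-client:9200")

# _cluster/health reports overall status (green/yellow/red); with no
# elected master this call itself tends to fail or time out.
health = es.cluster.health()
print(health["status"], health["number_of_nodes"])

# _cat/master shows which node, if any, is the currently elected master.
print(es.cat.master(format="json"))
```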