As part of our normal incident response process, we conducted a post-mortem analysis to determine why this incident occurred and how to improve our operations going forward.
Our main findings as to why this incident occurred were:
- Monitoring for pending certificate expirations is currently a manual process
- The week of this incident in particular was very busy due to other incidents
- The user interface for monitoring expiring certificates shows some “false positives,” which creates confusion
To address these findings and to prevent a recurrence of a similar incident in the future, we have planned the following work:
- Remove the “false positive” expired certificates from our certificate monitoring tool
- Add Slack alerts for expiring certificates to make the review process less manual and ensure that expiring certificates don’t get missed
- Schedule formal handoffs between engineers on maintenance rotations who are responsible for certificate renewal to ensure continuity of operations
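
To illustrate the kind of automated check the Slack alert item describes, here is a minimal sketch. It is not our actual tooling: the 30-day warning window, the function names, the example hostnames, and the use of a Slack incoming webhook are all assumptions made for the example.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumed warning threshold; the real value would be an operational decision.
WARNING_WINDOW = timedelta(days=30)


def find_expiring(certs, now, window=WARNING_WINDOW):
    """Return (name, days_left) pairs for certificates that expire within
    `window` of `now` (or have already expired), soonest first.

    `certs` maps a hypothetical certificate name to its expiry datetime.
    """
    expiring = []
    for name, not_after in certs.items():
        remaining = not_after - now
        if remaining <= window:
            expiring.append((name, remaining.days))
    return sorted(expiring, key=lambda item: item[1])


def build_slack_payload(expiring):
    """Build the JSON body for a Slack incoming webhook message."""
    lines = []
    for name, days in expiring:
        if days < 0:
            lines.append(f"- {name}: ALREADY EXPIRED")
        else:
            lines.append(f"- {name}: expires in {days} day(s)")
    return {"text": "Certificates needing renewal:\n" + "\n".join(lines)}


def post_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook URL (supplied by the caller)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical inventory; a real check would read expiry dates from the
    # certificates themselves or from the monitoring tool.
    inventory = {
        "app.example.com": now + timedelta(days=10),
        "api.example.com": now + timedelta(days=90),
    }
    expiring = find_expiring(inventory, now)
    if expiring:
        # Printed rather than posted so the sketch runs without a webhook.
        print(json.dumps(build_slack_payload(expiring), indent=2))
```

Run on a schedule, a check like this sends a reminder to a shared channel instead of relying on someone remembering to review a dashboard during a busy week.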
As always, we appreciate your patience and thank you for being a cloud.gov customer. If you have any questions, don’t hesitate to contact us at support@cloud.gov.