On July 18, 2023, the cloud.gov identity provider was unavailable for 1 hour, 6 minutes. Developers who use https://idp.fr.cloud.gov, instead of their agency's identity provider, got TLS certificate expiration errors and were unable to log in from 12:55 pm EDT until 2:01 pm.
The impact was fairly modest: about 8-14 developers typically log in during that time period, but it was surely annoying to those dozen people who were unable to get work done, so we want to ensure outages of this type don't recur.
At cloud.gov, we thought we had solved TLS certificate expiration issues. All of our TLS endpoints have automatic processes to rotate certificates, we rotate them every 60 days (30 days before expiration) via timed pipeline processes, we use a certificate alert dashboard (doomsday) to notify us of expiring certs, and we monitor extensively. However:
Our engineers have not yet determined why the rotation job failed, nor been able to recreate it, so we will continue to monitor our Terraform jobs for similar errors.
On the monitoring side, we have backfilled Doomsday to include all the endpoints for active web applications. In the two weeks since this disruption, we have:
We apologize for the inconvenience caused by this service disruption. As with all such incidents, we strive to learn from them and improve our processes and practices.