cloud.gov logins using idp.fr.cloud.gov failing
Incident Report for cloud.gov
Postmortem

On July 18, 2023, the cloud.gov identity provider was unavailable for 1 hour and 6 minutes. Developers who use https://idp.fr.cloud.gov, instead of their agency's identity provider, got TLS certificate expiration errors and were unable to log in from 12:55 pm until 2:01 pm EDT.

The impact was fairly modest: about 8-14 developers typically log in during that time period, but it was surely annoying to those dozen people who were unable to get work done, so we want to ensure outages of this type don't recur.

At cloud.gov, we thought we had solved TLS certificate expiration issues. All of our TLS endpoints have automatic processes to rotate certificates: timed pipeline jobs rotate them every 60 days (30 days before expiration), a certificate alert dashboard (Doomsday) notifies us of expiring certs, and we monitor extensively. However:

  • The certificate rotation job on June 21 reported "Success" but did not, in fact, update the certificate on the impacted load balancer.
  • We had not backfilled https://idp.fr.cloud.gov into our Doomsday tracker, so we were unaware it was approaching expiration. Doomsday self-discovers the certificates in our secure storage, but it does not self-discover the load balancer endpoints. In this case the updated certificate existed in storage, but we were unaware that the load balancer had silently failed to receive it.
  • We didn't have a simple uptime/availability monitor on this key URL, so we didn't act until users notified us.
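
The gap described above is that the certificate in storage was current while the one actually served by the load balancer was not. A minimal sketch of a check that reads the expiration date from the live endpoint itself is shown below. It is not the tooling cloud.gov uses; the function names and the 30-day default threshold (chosen to match the "30 days before expiration" rotation point) are assumptions for illustration.

```python
import socket
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400.0


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return days from `now` until a certificate's notAfter timestamp.

    `not_after` uses the format returned by ssl.getpeercert(),
    e.g. "Jul 18 16:55:00 2023 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / SECONDS_PER_DAY


def needs_alert(days_left: float, threshold_days: float = 30.0) -> bool:
    """True when the served certificate is inside the alert window.

    threshold_days is a hypothetical default, not a cloud.gov setting.
    """
    return days_left < threshold_days


def served_cert_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch notAfter from the certificate the endpoint actually serves.

    With the default context, an already-expired certificate makes the
    handshake raise ssl.SSLCertVerificationError; a monitor should treat
    that exception as an alert too.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `needs_alert(days_until_expiry(served_cert_not_after("idp.fr.cloud.gov")))` for each endpoint. Because it inspects what the load balancer serves rather than what is in storage, a rotation that reports success without taking effect would still show up.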

Our engineers have not yet determined why the rotation job failed, and have not been able to reproduce the failure, so we will continue to monitor our Terraform jobs for similar errors.

On the monitoring side, we have backfilled Doomsday to include all the endpoints for active web applications. In the two weeks since this disruption, we have:

  • Added all missing URLs in cloud.gov to our uptime monitoring system 
  • Synchronized those same URLs to our Doomsday certificate expiration alert system
  • Updated our deployment documentation to improve monitoring coverage
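
Keeping two monitoring systems in step with one list of URLs comes down to a set comparison. The sketch below shows one way to report endpoints that are required but missing from a given monitor. The function names are hypothetical and `app.example.gov` is a placeholder; how URLs are exported from the uptime monitor or from Doomsday is not covered here.

```python
from typing import Iterable, List


def normalize(url: str) -> str:
    """Trim whitespace and any trailing slash so equivalent URLs compare equal."""
    return url.strip().rstrip("/")


def coverage_gaps(required: Iterable[str], monitored: Iterable[str]) -> List[str]:
    """Return required URLs that do not appear in the monitored list, sorted."""
    required_set = {normalize(u) for u in required}
    monitored_set = {normalize(u) for u in monitored}
    return sorted(required_set - monitored_set)


# Illustrative data only: one real URL from this report, one placeholder.
required_urls = ["https://idp.fr.cloud.gov", "https://app.example.gov/"]
uptime_urls = ["https://app.example.gov"]
missing_from_uptime = coverage_gaps(required_urls, uptime_urls)
```

Running the same comparison against each monitor's URL list, and failing a deployment check when the result is non-empty, is one way to keep a newly added endpoint from going unmonitored.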

We apologize for the inconvenience caused by this service disruption. As with all such incidents, we strive to learn from them and improve our processes and practices.

Posted Aug 03, 2023 - 11:16 EDT

Resolved
This incident was resolved by the time we posted this notice.

Logins were failing from 12:55 pm Eastern until 2:01 pm Eastern.

We are investigating why the certificate was not automatically updated and why alerting did not catch this before expiration.

We will post a retrospective in the coming days.
Posted Jul 18, 2023 - 14:06 EDT
Identified
cloud.gov Pages and cloud.gov Platform developers are unable to log in using the cloud.gov IdP due to an expired TLS (SSL) certificate. We expect to resolve this shortly.
Posted Jul 18, 2023 - 14:03 EDT
This incident affected: cloud.gov Pages (Web Application) and cloud.gov customer access (Login).