As always, the cloud.gov team takes any incident that disrupts customer traffic very seriously. We apologize sincerely for the impact this incident had on your applications. It is also critical that we as a team identify the causes of this incident and address them so that it does not recur. Accordingly, the cloud.gov team has conducted a post-mortem analysis of the incident, and we are now sharing our findings and how we plan to address them.
If you have any questions or concerns about this analysis, please do not hesitate to contact us at email@example.com.
Thanks for being a cloud.gov customer!
March 31, 2023 - pyOpenSSL makes breaking change: https://github.com/pyca/pyopenssl/pull/1208/files
May 30, 2023 - pyOpenSSL releases breaking change in 23.2.0: https://pypi.org/project/pyOpenSSL/
June 18, 2023 - Certificate renewal jobs download pyOpenSSL 23.2.0 and start failing due to the breaking changes
Tuesday, Jun 20, 2023
~8 AM ET - The cloud.gov team upgrades the certificate renewal jobs to the python:3.7-buster Docker image to address failures with pyOpenSSL 23.2.0: https://github.com/cloud-gov/cg-provision/pull/1419/files
~9 AM ET - After the upgrade to the python:3.7-buster image for certificate renewal jobs, certificates are renewed and uploaded to load balancers successfully, but these certificates use ECDSA P-256 keys rather than RSA-2048 keys
~1 PM ET - Production bootstrap job runs and applies the new ECDSA P-256 certificates to load balancers
1:31 PM ET - cloud.gov customers reach out about 502 errors on several of their domains served directly by our load balancers.
2:32 PM ET - The cloud.gov team applies a temporary fix by re-applying the previous RSA-2048 certificates that had been removed from the load balancer HTTPS listeners. After this change, customers report that 502 errors are no longer occurring.
Wednesday, Jun 21, 2023
9:30 AM ET - Certificate renewal jobs are re-run after being updated to explicitly generate RSA-2048 certificates. The new certificates are rolled out to production. Several of the customer endpoints that had failed the day before are retested successfully in a web browser.
The initial failures observed in the certificate renewal job were due to an incompatibility between Python 3.6 and pyOpenSSL 23.2.0. Although pyOpenSSL 23.2.0 was released on May 30, 2023, the certificate renewal job only attempts renewal when a certificate is less than 30 days from expiration. The platform certificates first entered that window on Sunday, June 18, 2023, which is when the job downloaded the new release and began failing.
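The 30-day renewal window explains the gap between the release date and the first failure. A minimal sketch of that check follows; the function name and the expiration date are illustrative assumptions, not taken from the actual renewal job.

```python
from datetime import datetime, timedelta, timezone

# Renewal is only attempted inside this window before expiration.
RENEWAL_WINDOW = timedelta(days=30)


def is_due_for_renewal(not_after, now, window=RENEWAL_WINDOW):
    """Return True when the certificate expires in less than `window`."""
    return not_after - now < window


# Illustrative expiration date only; the real certificate dates are not
# stated in this analysis.
expires = datetime(2023, 7, 18, tzinfo=timezone.utc)

# 31 days before expiration: outside the window, so no renewal is attempted.
print(is_due_for_renewal(expires, datetime(2023, 6, 17, tzinfo=timezone.utc)))

# 29.5 days before expiration: inside the window, so renewal is attempted.
print(is_due_for_renewal(expires, datetime(2023, 6, 18, 12, tzinfo=timezone.utc)))
```

Under this kind of logic, a broken dependency can sit unnoticed for weeks, because the code path that exercises it does not run until a certificate enters the window.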
As part of regular platform maintenance, the cloud.gov team identified the failing certificate renewal job and began working to address it on Tuesday, June 20, 2023. While investigating the failure, the team noticed that the job was running on a Python 3.6-based Docker image (python:3.6-buster), which reached end-of-life support on December 31, 2021. The cloud.gov team therefore decided to upgrade to a Python 3.7-based image (python:3.7-buster) to rule out Python version incompatibility issues.
Upgrading the job to use Python 3.7 did allow the certificate renewal job to renew the certificates and upload them successfully to be used by the platform load balancers which handle customer applications.
However, upgrading to Python 3.7 also caused the versions of installed packages to change. Notably, the installed version of the certbot library used to handle the certificate provisioning changed from 1.23.0 to 2.6.0, which is a major semantic version increase. One of the key breaking changes from certbot 1.x to 2.x is that certbot 1.x defaulted to provisioning RSA-2048 certificate private keys while certbot 2.x defaults to provisioning ECDSA secp256r1 (P-256) certificate private keys.
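One way to avoid depending on a library default is to state the key type explicitly on every invocation. The sketch below assumes the job calls the certbot command-line tool; `build_certbot_command` and the example domain are hypothetical, and the authenticator options a real job would need are omitted. The `--key-type` and `--rsa-key-size` flags are certbot options.

```python
def build_certbot_command(domains, key_type="rsa", rsa_key_size=2048):
    """Build a certbot argument list that pins the key type explicitly.

    Hypothetical helper for illustration; authenticator/plugin options
    that a real renewal job needs are intentionally left out.
    """
    if not domains:
        raise ValueError("at least one domain is required")

    command = ["certbot", "certonly", "--non-interactive", "--key-type", key_type]
    if key_type == "rsa":
        # The key size flag only applies to RSA keys.
        command += ["--rsa-key-size", str(rsa_key_size)]
    for domain in domains:
        command += ["-d", domain]
    return command


# Example with a placeholder domain.
print(" ".join(build_certbot_command(["example.gov"])))
```

With the key type spelled out, the same command yields RSA-2048 keys under both certbot 1.x and 2.x, so a change in the default no longer changes the result.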
While ECDSA (elliptic curve) certificates are an emerging industry standard with several benefits over RSA, they are not supported by some older software and browsers, so some clients started to see errors once the platform load balancers were updated to use these certificates.
Ultimately, restoring the previous certificates that used RSA-2048 encryption to the platform load balancers resolved the 502 errors experienced by some customers.
From this analysis, we can identify a number of issues that contributed to the incident:
The cloud.gov team has already implemented several remediations to prevent a similar incident from occurring in the future:
Locked dependencies on certbot used by our certificate renewal job
Updated the job to explicitly set the key type and size for certbot, rather than letting it default to ECDSA
Added an extra check using OpenSSL to verify that the certificate is RSA-2048 before applying it to the load balancer
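The pre-deployment check in the last item could look roughly like the sketch below. It is not the check the team implemented, which is not described here. It assumes the `openssl` command-line tool is on the PATH and inspects the text output of `openssl x509 -noout -text`. All function names are hypothetical.

```python
import re
import subprocess


def describe_public_key(openssl_text):
    """Extract (algorithm, bits) from `openssl x509 -noout -text` output."""
    algorithm = re.search(r"Public Key Algorithm:\s*(\S+)", openssl_text)
    # Matches both "RSA Public-Key: (2048 bit)" and "Public-Key: (2048 bit)".
    bits = re.search(r"Public-Key:\s*\((\d+) bit\)", openssl_text)
    return (
        algorithm.group(1) if algorithm else None,
        int(bits.group(1)) if bits else None,
    )


def is_rsa_2048(openssl_text):
    """Return True only for an RSA public key of exactly 2048 bits."""
    return describe_public_key(openssl_text) == ("rsaEncryption", 2048)


def certificate_is_rsa_2048(cert_path):
    """Run openssl on a PEM certificate file and check its key type."""
    result = subprocess.run(
        ["openssl", "x509", "-in", cert_path, "-noout", "-text"],
        check=True,
        capture_output=True,
        text=True,
    )
    return is_rsa_2048(result.stdout)
```

A renewal job could call `certificate_is_rsa_2048` on the newly issued certificate and refuse to upload it when the result is False, so an unexpected key type fails the job instead of reaching the load balancers.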
With the above remediations in place, we successfully completed certificate renewals on the platform application load balancers on Wednesday June 21, 2023 without any adverse impact to customers.
We also plan to schedule the following work: