Application 502 errors
Incident Report for cloud.gov
Postmortem

As always, the cloud.gov team takes any incident that disrupts customer traffic very seriously. We apologize sincerely for the impact this incident had on your applications. It is critical that we as a team identify the causes of this incident and address them so that it does not recur. Accordingly, the cloud.gov team has conducted a post-mortem analysis of the incident, and we are now sharing our findings and how we plan to address them going forward.

If you have any questions or concerns about this analysis, please do not hesitate to contact us at support@cloud.gov.

Thanks for being a cloud.gov customer!

Timeline

March 31, 2023 - pyOpenSSL makes breaking change: https://github.com/pyca/pyopenssl/pull/1208/files

May 30, 2023 - pyOpenSSL releases breaking change in 23.2.0: https://pypi.org/project/pyOpenSSL/

June 18, 2023 - Certificate renewal jobs download pyOpenSSL 23.2.0 and start failing due to the breaking changes

Tuesday, Jun 20, 2023

~8 AM ET - The cloud.gov team upgrades the certificate renewal jobs to the python:3.7-buster Docker image to address failures with pyOpenSSL 23.2.0: https://github.com/cloud-gov/cg-provision/pull/1419/files

~9 AM ET - After the upgrade to the python:3.7-buster image, the certificate renewal jobs renew certificates and upload them to the load balancers successfully, but these certificates use ECDSA P-256 keys rather than RSA-2048 keys

~1 PM ET - Production bootstrap job runs and applies new ECDSA-256 certificates to load balancers

1:31 PM ET - cloud.gov customers reach out about 502 errors on several of their domains served directly by our load balancers.

2:32 PM ET - As a temporary fix, the cloud.gov team re-applies the previous RSA-2048 certificates that had been removed from the load balancer HTTPS listeners. After this change, customers report that the 502 errors are no longer occurring.

Wednesday, Jun 21, 2023

9:30 AM ET - Certificate renewal jobs are re-run after being updated to explicitly generate RSA-2048 certificates. The new certificates are rolled out to production. Several of the customer endpoints that failed the day before are retested successfully in a web browser.

Analysis

The initial failures observed in the certificate renewal job were due to an incompatibility between Python 3.6 and pyOpenSSL 23.2.0. Although pyOpenSSL 23.2.0 was released on May 30, 2023, the certificate renewal job only tries to renew certificates when they are less than 30 days from expiration, so renewal attempts, and with them the failures, did not begin until Sunday, June 18, 2023.
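The gap between the release date and the first failure follows from the 30-day renewal window described above. A minimal sketch of such a check is shown here. The function name, the constant, and the sample expiration date are hypothetical and are not taken from the actual renewal job.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: renewal is attempted only when a certificate is
# less than 30 days from expiration.
RENEWAL_WINDOW = timedelta(days=30)


def needs_renewal(not_after: datetime, now: datetime) -> bool:
    """Return True when the certificate expires in less than 30 days."""
    return not_after - now < RENEWAL_WINDOW


# Hypothetical expiration date, chosen only for illustration.
expires = datetime(2023, 7, 18, tzinfo=timezone.utc)

# On the pyOpenSSL 23.2.0 release date the certificate is still outside
# the window, so no renewal is attempted.
release_day = datetime(2023, 5, 30, tzinfo=timezone.utc)

# Weeks later the certificate enters the window and renewal is attempted.
later = datetime(2023, 6, 18, 12, tzinfo=timezone.utc)

print(needs_renewal(expires, release_day))
print(needs_renewal(expires, later))
```

Under these assumed dates the first call reports that no renewal is needed and the second reports that one is, which is why a job can keep passing for weeks after an incompatible package is published.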

As part of regular platform maintenance, the cloud.gov team identified the failing certificate renewal job and began working to address it on Tuesday, June 20, 2023. While investigating the failure, the team noticed that the job was running on a Python 3.6-based Docker image (python:3.6-buster), and Python 3.6 reached end of life on December 31, 2021. The cloud.gov team therefore decided to upgrade to a Python 3.7-based image (python:3.7-buster) to rule out Python version incompatibility issues.

Upgrading the job to use Python 3.7 did allow the certificate renewal job to renew the certificates and upload them successfully to be used by the platform load balancers which handle customer applications.

However, upgrading to Python 3.7 also caused the versions of installed packages to change. Notably, the installed version of the certbot library used to handle the certificate provisioning changed from 1.23.0 to 2.6.0, which is a major semantic version increase. One of the key breaking changes from certbot 1.x to 2.x is that certbot 1.x defaulted to provisioning RSA-2048 certificate private keys while certbot 2.x defaults to provisioning ECDSA secp256r1 (P-256) certificate private keys.
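One way to avoid depending on a version-specific default is to request the key type explicitly. The sketch below assumes certbot is invoked through its command line and uses certbot's `--key-type` and `--rsa-key-size` options. The helper name and the placeholder domain are hypothetical, and the authenticator and other options a real job would need are omitted.

```python
from typing import List


def certbot_renew_args(domain: str) -> List[str]:
    """Build a certbot command line that requests an RSA-2048 key explicitly."""
    return [
        "certbot",
        "certonly",
        "--key-type", "rsa",       # do not rely on the version-dependent default
        "--rsa-key-size", "2048",  # match the key size of the earlier certificates
        "-d", domain,
    ]


# "example.gov" is a placeholder domain used only for illustration.
args = certbot_renew_args("example.gov")
print(" ".join(args))
```

With the key type stated in the command, a change in certbot's default between major versions would no longer change the kind of certificate that is provisioned.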

While ECDSA (an elliptic curve, or ECC, algorithm) is an emerging industry standard with several benefits over RSA, it is not supported by some older software and browsers, so some clients started to see errors once the platform load balancers were updated to use these certificates.

Ultimately, restoring the previous RSA-2048 certificates to the platform load balancers resolved the 502 errors experienced by some customers.

From this analysis, we can identify a number of issues that contributed to the incident:

  • Running critical infrastructure jobs on Python 3.6, which reached end of life on Dec 31, 2021: https://endoflife.date/Python
  • Insufficient testing of the provisioned certificates before attaching them to load balancers
  • Python dependencies for the certificate renewal job were not pinned, so the job was susceptible to breaking changes in upstream packages
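The unpinned-dependency issue in the list above can be caught with a simple check in the build. The following is a minimal sketch, assuming the job's dependencies are listed in a pip-style requirements file and that only the plain `name==version` form counts as pinned. The function name and the sample input are hypothetical, and the version shown is used only for illustration.

```python
import re
from typing import List

# Matches "name==version", optionally with extras such as "name[extra]==1.0".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,-]+\])?==[A-Za-z0-9.*+!_-]+$")


def unpinned_requirements(text: str) -> List[str]:
    """Return requirement lines that are not pinned to an exact version."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blank lines and pip options
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


# Illustrative input: one pinned and one unpinned requirement.
sample = """
certbot==1.23.0
pyOpenSSL
"""

print(unpinned_requirements(sample))
```

A check like this only covers the packages listed directly. Pinning transitive dependencies as well would need a fully resolved lock file.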

Remediations

The cloud.gov team has already implemented several remediations to prevent a similar incident from occurring in the future:

With the above remediations in place, we successfully completed certificate renewals on the platform application load balancers on Wednesday June 21, 2023 without any adverse impact to customers.

We also plan to schedule the following work:

  • Check whether we’re using outdated versions of Python anywhere else in our automated jobs for cloud.gov and update as necessary
  • Move the certificate renewal job to the latest stable version of Python
Posted Jun 21, 2023 - 12:05 EDT

Resolved
At approximately 1:30 PM ET, customers began reporting 502 errors from their applications. Initial investigation revealed that the errors were limited to applications served directly by the cloud.gov application load balancers, not applications served by a CDN.

Subsequent investigation has revealed that the outage was caused by breaking changes in the packages used by our certificate renewal job, which led to the provisioning of ECDSA rather than RSA certificates on our load balancers. As with all incidents, the cloud.gov team will conduct an internal post-mortem of the incident and publish a detailed analysis of the causes in the coming days.
Posted Jun 20, 2023 - 13:30 EDT