Unable to provision new domains with cdn-route
Incident Report for cloud.gov
Postmortem

What happened

Impact

All cloud.gov users were unable to create new instances of the CDN service from approximately April 8th until April 17th. Existing instances of the service were unaffected.

Background

The CDN service creates AWS CloudFront distributions and provisions certificates from Let’s Encrypt.

We use two methods to get certificates from Let’s Encrypt:

1. DNS-01: We request a certificate for example.gov from Let’s Encrypt. They respond with a DNS challenge to add a TXT record for _acme-challenge.example.gov with a random string they generate.
2. HTTP-01: We request a certificate for example.gov from Let’s Encrypt. They respond with an HTTP challenge to serve a document under http://example.gov/.well-known/acme-challenge/ containing a random string they generate.
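
As a rough illustration of where each challenge answer ends up, the sketch below derives the DNS-01 TXT record and the HTTP-01 document location following the ACME specification (RFC 8555). It is not taken from the CDN service's code; the function names, the token, and the account thumbprint are placeholders chosen for this example.

```python
import base64
import hashlib
from typing import Tuple


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, the encoding ACME uses."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def key_authorization(token: str, account_thumbprint: str) -> str:
    """Combine the challenge token with the ACME account key thumbprint."""
    return f"{token}.{account_thumbprint}"


def dns01_record(domain: str, token: str, account_thumbprint: str) -> Tuple[str, str]:
    """Return (name, value) of the TXT record that answers a DNS-01 challenge."""
    key_auth = key_authorization(token, account_thumbprint).encode("ascii")
    return f"_acme-challenge.{domain}", b64url(hashlib.sha256(key_auth).digest())


def http01_resource(domain: str, token: str, account_thumbprint: str) -> Tuple[str, str]:
    """Return (url, body) of the document that answers an HTTP-01 challenge."""
    url = f"http://{domain}/.well-known/acme-challenge/{token}"
    return url, key_authorization(token, account_thumbprint)


if __name__ == "__main__":
    # Placeholder values; a real token and thumbprint come from the ACME exchange.
    token, thumbprint = "example-token", "example-thumbprint"
    print(dns01_record("example.gov", token, thumbprint))
    print(http01_resource("example.gov", token, thumbprint))
```

The practical difference is that the DNS-01 answer lives in the domain's DNS zone, while the HTTP-01 answer has to be served by whatever answers HTTP requests for the domain.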

CloudFront distributions have default domain names in the form of .cloudfront.net, and they allow you to create CNAMEs on the distribution so it will serve traffic for the name you want, such as my-site.example.gov. (Note that CNAME here is an overloaded word: a CloudFront CNAME is not the same thing as a CNAME DNS record, and this document deals with both.)

Prior to this incident, the workflow was:

1. User creates a domain in Cloud Foundry:
$ cf create-domain my-org my-site.example.gov
2. User creates a CDN service instance in Cloud Foundry:
$ cf create-service cdn-route cdn-route my-cdn-route -c '{"domain": "my-site.example.gov"}'
3. The CDN service requests a certificate challenge from Let’s Encrypt via both DNS-01 and HTTP-01.

4. The CDN service creates the distribution in CloudFront.

5. The CDN service sets the service status on my-cdn-route to provisioning.

6. The CDN service creates an origin in CloudFront that points to an S3 bucket with a file set to answer the Let’s Encrypt HTTP-01 challenge.

7. The user gets the service information from Cloud Foundry, which includes instructions on both the HTTP-01 and DNS-01 challenges:

$ cf service my-cdn-route
Last Operation
Status: create in progress
Message: Provisioning in progress [my-site.example.gov => cdn-broker-origin.fr.cloud.gov]; CNAME or ALIAS domain my-site.example.gov to .cloudfront.net or create TXT record(s):
name: _acme-challenge.my-site.example.gov., value: , ttl: 120

8. The user completes one or the other of these challenges.

9. Meanwhile, the CDN service repeatedly asks Let’s Encrypt to check on the challenges. As soon as either one is completed, Let’s Encrypt gives us a certificate.

10. The CDN service stores the certificate and key and updates CloudFront with them.

11. The CDN service sets the status of the service to provisioned.

On April 10th, a user reported they were unable to create a CDN service instance. We initially believed this was a one-off issue related to that user’s service instance.

We worked with that user on April 10th and 11th. On the 11th we determined the issue affected all users, and that no users were able to create new instances of the service. At that point we communicated to all users via cloudgov.statuspage.io that we were experiencing issues with provisioning.

Further research revealed that around April 8th, AWS changed CloudFront to prevent domain fronting, a technique used to circumvent content filtering. After the change:
- Users must provide a valid certificate for any domain name added as a CNAME to a distribution.
- Any request sent to a distribution for a host that is not a CNAME on the distribution is refused with an error message. (According to CloudFront this will be a 421, but we’ve seen 403 errors as well.)

This puts us in a chicken-and-egg situation when trying to use HTTP-01 challenges: we can't add a domain name to a CloudFront distribution until we have a certificate, and Let’s Encrypt can't validate a domain on CloudFront until we add the domain to the distribution's list of CNAMEs.
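
The deadlock can be shown with a toy model of the two CloudFront rules described above. This is a simplified sketch for illustration only: ToyDistribution and http01_can_validate are hypothetical names, not CloudFront's API or the broker's code, and the refusal is modeled as a 421 even though we have also seen 403s.

```python
class ToyDistribution:
    """A deliberately simplified stand-in for a CloudFront distribution."""

    def __init__(self):
        self.aliases = set()        # the "CNAMEs" configured on the distribution
        self.cert_domains = set()   # domains covered by the attached certificate

    def add_alias(self, domain):
        # Rule 1: an alias is only accepted with a certificate that covers it.
        if domain not in self.cert_domains:
            raise PermissionError(f"no valid certificate for {domain}")
        self.aliases.add(domain)

    def serve(self, host, path):
        # Rule 2: requests for hosts that are not aliases are refused.
        if host not in self.aliases:
            return 421
        return 200


def http01_can_validate(dist, domain):
    """Model Let's Encrypt fetching the challenge file with the user's domain as Host."""
    return dist.serve(domain, "/.well-known/acme-challenge/some-token") == 200


if __name__ == "__main__":
    dist = ToyDistribution()
    try:
        dist.add_alias("my-site.example.gov")
    except PermissionError as err:
        print("alias rejected:", err)
    print("HTTP-01 reachable:", http01_can_validate(dist, "my-site.example.gov"))
```

In this model neither step can go first: the alias needs the certificate, and the HTTP-01 validation needs the alias. DNS-01 never sends a request to the distribution, so it is not caught in the loop.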

When we realized the change AWS had made and how it was causing our service to fail, we began working on a code change to restore the service.

On the evening of April 15th we created a pull request to correct the issue. Overnight, the GOV.UK PaaS team overseas, which maintains a fork of our CDN service code (a copy of the code with their own improvements), saw the pull request, tested it, improved it, and submitted a pull request with their improvements. We merged this change on the 16th, then tested it in our staging environment. We then deployed the change on the 17th, and updated the status to “monitoring”. We reached out to users who had reported the issue to test it, and we confirmed the fix on the 18th, at which point we considered the issue resolved.

(Note: we haven't done anything to prevent HTTP-01 from working. If CloudFront were to undo their changes, we expect HTTP-01 would still work.)

Our new workflow, as of April 17th:

1. User creates a domain in Cloud Foundry:
$ cf create-domain my-org my-site.example.gov
2. User creates a CDN service instance in Cloud Foundry:
$ cf create-service cdn-route cdn-route my-cdn-route -c '{"domain": "my-site.example.gov"}'
3. The CDN service requests a certificate challenge from Let’s Encrypt via both DNS-01 and HTTP-01.

4. The CDN service creates the distribution in CloudFront, but it omits the CNAMEs the user requested.

5. The CDN service sets the service status on my-cdn-route to provisioning.

6. The CDN service creates an origin in CloudFront that points to an S3 bucket with a file set to answer the Let’s Encrypt HTTP-01 challenge.

7. The user gets the service information from Cloud Foundry, which includes instructions on both the HTTP-01 and DNS-01 challenges:

$ cf service my-cdn-route
Last Operation
Status: create in progress
Message: Provisioning in progress [my-site.example.gov => cdn-broker-origin.fr.cloud.gov]; CNAME or ALIAS domain my-site.example.gov to .cloudfront.net or create TXT record(s):
name: _acme-challenge.my-site.example.gov., value: , ttl: 120

8. The user completes one or the other of these challenges. We've updated the documentation to instruct users to use the DNS-01 challenge, but the broker's output is the same.
9. Meanwhile, the CDN service repeatedly asks Let’s Encrypt to check on the challenges. As soon as either one is completed, Let’s Encrypt gives us a certificate.
10. The CDN service stores the certificate and key and updates the CloudFront distribution to present them instead of the default CloudFront certificate. The CDN service simultaneously adds the new CNAME(s) to the CloudFront distribution.
11. The CDN service sets the status of the service to provisioned.
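
The ordering in the new workflow can be sketched as follows. This is a minimal in-memory model, assuming the certificate is obtained via DNS-01; ToyDistribution, provision, and dns_challenge_done are hypothetical names for this example and do not reflect the broker's actual Go code or the CloudFront API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDistribution:
    """Simplified stand-in for a CloudFront distribution."""
    aliases: set = field(default_factory=set)
    certificate_for: set = field(default_factory=set)

    def update(self, aliases, certificate_for):
        """Apply the aliases and the certificate in one call, as in step 10."""
        uncovered = set(aliases) - set(certificate_for)
        if uncovered:
            raise PermissionError(f"no valid certificate for {sorted(uncovered)}")
        self.aliases = set(aliases)
        self.certificate_for = set(certificate_for)


def provision(domain, dns_challenge_done):
    """Walk the new workflow once.

    dns_challenge_done is a callable standing in for asking Let's Encrypt
    whether the DNS-01 challenge has been satisfied.
    """
    dist = ToyDistribution()          # step 4: created without the user's alias
    status = "provisioning"           # step 5
    if not dns_challenge_done():      # step 9 (a real broker would keep polling)
        return dist, status
    issued_for = {domain}             # Let's Encrypt issues the certificate
    dist.update(aliases={domain}, certificate_for=issued_for)  # step 10
    return dist, "provisioned"        # step 11


if __name__ == "__main__":
    print(provision("my-site.example.gov", lambda: False)[1])
    print(provision("my-site.example.gov", lambda: True)[1])
```

Because the alias and the certificate arrive in the same update, the distribution never has an alias that lacks a covering certificate.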

What we’re doing

Fully restore functionality

We recognize that the HTTP-01 challenge type is preferable to many users, so we are working on creating a workflow that allows users to leverage that challenge type.

Scheduled testing

We found out about this issue from user reports, and it took us some time to determine that this was a global issue. We’re adding scheduled, automatic testing of our interactions with Let’s Encrypt and CloudFront, as well as full end-to-end tests of the CDN service.

Decouple Let’s Encrypt and CloudFront code

Separating the two will give us simpler, more maintainable code, which should reduce time to resolution for any future issues with this broker.

Persist more information about service instances up front

We realized during this incident that we could not reliably recreate a service instance without the full details of the initial request. Persisting those details when the instance is first requested will make that possible.

Modernize tooling

The CDN service currently uses an older dependency management tool for Go, which makes it difficult to work with using current tools. Updating the dependency management tool should reduce time to resolution for future issues with this code.

Prevent duplicate certificates from being uploaded

We discovered that in certain error modes, the CDN service repeatedly uploads certificates to AWS IAM, which can quickly cause us to reach our account limits and can cause another provisioning outage. We’ve already eliminated one of those error modes, but we want to ensure that there’s only ever one copy of a certificate in AWS.
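
One possible way to guarantee a single copy is to key uploads on a hash of the certificate and skip the upload when that hash is already known. The sketch below illustrates the idea with an in-memory store; ToyCertificateStore and upload_once are hypothetical names, and nothing here calls AWS IAM or reflects how the broker will actually implement this.

```python
import hashlib


class ToyCertificateStore:
    """In-memory stand-in for a certificate store with an account limit."""

    def __init__(self, limit):
        self.limit = limit
        self.by_fingerprint = {}  # hash of certificate text -> stored name

    def upload_once(self, name, pem):
        """Store pem unless an identical certificate is already present.

        Returns the name under which the certificate is stored.
        """
        # Hashing the PEM text is a simplification; a real X.509 fingerprint
        # is computed over the DER-encoded certificate.
        fingerprint = hashlib.sha256(pem.strip().encode("utf-8")).hexdigest()
        existing = self.by_fingerprint.get(fingerprint)
        if existing is not None:
            return existing  # retry after an error: reuse, do not re-upload
        if len(self.by_fingerprint) >= self.limit:
            raise RuntimeError("certificate limit reached")
        self.by_fingerprint[fingerprint] = name
        return name


if __name__ == "__main__":
    store = ToyCertificateStore(limit=2)
    pem = "-----BEGIN CERTIFICATE-----\nMIIB\n-----END CERTIFICATE-----\n"
    for attempt in range(5):
        print(store.upload_once(f"attempt-{attempt}", pem))
```

With this check in place, repeated retries of a failed provisioning step reuse the first upload instead of consuming more of the account limit.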

Posted 2 months ago. May 02, 2019 - 14:52 EDT

Resolved
This incident has been resolved. Note that the instructions for creating a cdn-route have changed slightly. The new changes should be reflected in the docs very soon. https://cloud.gov/docs/services/cdn-route/
Posted 3 months ago. Apr 18, 2019 - 18:26 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted 3 months ago. Apr 17, 2019 - 14:13 EDT
Update
We are continuing to work on a fix for this issue.
Posted 3 months ago. Apr 17, 2019 - 12:32 EDT
Update
We are continuing to work on a fix for this issue.
Posted 3 months ago. Apr 15, 2019 - 11:37 EDT
Identified
We've identified this as a breaking change in an upstream API and are working on a fix.
Posted 3 months ago. Apr 12, 2019 - 13:35 EDT
Update
We are continuing to investigate this issue.
Posted 3 months ago. Apr 12, 2019 - 09:46 EDT
Update
We are continuing to investigate this issue.
Posted 3 months ago. Apr 12, 2019 - 07:57 EDT
Investigating
Some users are unable to provision new domains with cdn-route. This does not affect any existing domains.
Posted 3 months ago. Apr 11, 2019 - 20:18 EDT
This incident affected: cloud.gov customer applications (Service - CDN (cdn-route)).