cloud.gov UAA/login timeouts
Incident Report for cloud.gov
Postmortem

What happened

A core component of cloud.gov for application owners is the Cloud Foundry User Account and Authentication (UAA) Server. We use it for authentication when there’s no agency identity provider available for a given user, and for all user authorization (for example, what organizations a user has access to).

The Cloud Foundry UAA server release 4.6.0 of 2017-09-12 had a bug that caused it to consume all available memory over the course of days or weeks and stop working. We upgraded our production UAA to this buggy version on 2017-09-26 as part of a routine update.

On the morning of 2017-10-11, one of our two UAA servers stopped responding due to the memory leak issue. Traffic continued to be routed to the failed UAA server, causing most logins to fail, and the UAA process failed to restart.

The failures started at about 05:47 EDT and continued until the first operators came on duty and responded to the failed health checks at 09:07 EDT. Service was restored at 09:55 EDT after a new UAA server was brought online. Later that same day, we upgraded UAA in production after first testing the upgrade in our staging environment.
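
For readers who want the timeline as durations, the arithmetic can be sketched as follows. The three timestamps come from this report (the start time is approximate); the variable and function names are illustrative only.

```python
from datetime import datetime

# Timestamps from the report, all EDT on 2017-10-11.
# The failure start is approximate ("about 05:47").
FMT = "%Y-%m-%d %H:%M"
failures_start = datetime.strptime("2017-10-11 05:47", FMT)
operators_respond = datetime.strptime("2017-10-11 09:07", FMT)
service_restored = datetime.strptime("2017-10-11 09:55", FMT)


def minutes_between(start, end):
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


# Time from first failures until operators responded.
time_to_respond = minutes_between(failures_start, operators_respond)
# Time from operator response until service was restored.
time_to_restore = minutes_between(operators_respond, service_restored)
# Total duration of the login outage.
total_outage = minutes_between(failures_start, service_restored)

print(time_to_respond, time_to_restore, total_outage)
```

Under these timestamps, most of the outage elapsed before anyone responded to the failed health checks, and the recovery itself took under an hour.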

A patched UAA server, 4.6.1, had been released with Cloud Foundry release 276 on 2017-10-03, one week earlier. Our team did not expedite deployment of this release because the release notes did not mention the underlying UAA memory leak issue.

What we’re doing

The cloud.gov operations team is taking the following steps:

  • Working with the Cloud Foundry release team to better surface critical fixes of components in release notes
  • Ensuring router health checks are up-to-date. We’ll also schedule migration to the upstream cf-deployment once there is a migration path from cf-release
  • Examining the UAA process supervisor to make sure it’s resilient in the event of failure.
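
To illustrate why current health checks matter here, the following is a minimal, generic sketch of health-aware backend selection. It is not the Cloud Foundry router's implementation; `Backend`, `HealthAwarePool`, the `check` probe, and the `max_failures` threshold are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Backend:
    name: str
    check: Callable[[], bool]  # hypothetical probe; True means healthy
    failures: int = 0          # consecutive failed probes


class HealthAwarePool:
    """Round-robin over backends that have not failed too many probes."""

    def __init__(self, backends: List[Backend], max_failures: int = 3):
        self.backends = list(backends)
        self.max_failures = max_failures
        self._next = 0

    def run_health_checks(self) -> None:
        # Reset the counter on success, increment it on failure.
        for backend in self.backends:
            if backend.check():
                backend.failures = 0
            else:
                backend.failures += 1

    def healthy(self) -> List[Backend]:
        return [b for b in self.backends if b.failures < self.max_failures]

    def pick(self) -> Backend:
        # Only route to backends still considered healthy.
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("no healthy backends")
        backend = candidates[self._next % len(candidates)]
        self._next += 1
        return backend


# Two illustrative servers: one responding, one not.
pool = HealthAwarePool([
    Backend("uaa-0", check=lambda: True),
    Backend("uaa-1", check=lambda: False),
])
for _ in range(3):
    pool.run_health_checks()
picked = [pool.pick().name for _ in range(4)]
```

In this sketch, a server that fails three consecutive probes stops receiving traffic, so requests go only to the responding server rather than being split between a working server and a failed one.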
Posted Oct 24, 2017 - 03:14 EDT

Resolved
We are increasing the available resources for UAA to resolve this problem.

UPDATE: We have been running a UAA version that is now known to have a memory leak, which caused this issue. We will be upgrading UAA in the next 24 hours.
Posted Oct 11, 2017 - 15:51 EDT
Monitoring
We have restored login services to cloud.gov. All systems should be back to normal.
Only the UAA (internal authentication) system was affected, so no external users of cloud.gov systems were impacted.
We will continue to monitor UAA availability closely, allocate more resources to UAA, and follow up with a post-mortem.
Posted Oct 11, 2017 - 09:58 EDT
Investigating
Logins to cloud.gov are currently timing out. We are investigating.

Applications and systems running on the platform do not appear to be impacted.
Posted Oct 11, 2017 - 09:19 EDT