We scheduled a maintenance window to update BOSH, the component we use to orchestrate and automate deployments of the rest of cloud.gov. In addition to deploying other parts of the platform, this orchestration component also supplies domain name resolution (DNS) services so that the other components of the platform can contact each other without knowing their exact location (IP address) within the larger system, including the internet. Ordinarily, we do not expect that updating the orchestrator will affect the availability of customer applications.
This time, the orchestrator was offline for a longer period than usual. Systems that depend on DNS have some caching mechanisms built in, but the orchestrator was unavailable for longer than the caching period. The system could not look up other names, including customer application names, until the orchestrator upgrade had completed. This caused customer applications to be unavailable during the upgrade.
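To illustrate why caching only helps for short outages, here is a minimal sketch of a time-to-live (TTL) cache in front of a DNS source that goes offline. It is not how BOSH or cloud.gov implement DNS; the class name, the hostname, the address, and the 30-second TTL are all hypothetical values chosen for the example.

```python
from typing import Callable, Dict, Optional, Tuple


class TtlDnsCache:
    """Toy resolver cache: an answer is reused until its TTL expires."""

    def __init__(self, upstream: Callable[[str], Optional[str]], ttl_seconds: float):
        self.upstream = upstream
        self.ttl_seconds = ttl_seconds
        # name -> (address, expiry time in seconds)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def resolve(self, name: str, now: float) -> Optional[str]:
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]  # still fresh: no upstream query needed
        address = self.upstream(name)  # expired or missing: ask upstream
        if address is None:
            return None  # upstream is down and the cache cannot help
        self._entries[name] = (address, now + self.ttl_seconds)
        return address


def survives_outage(outage_seconds: float, ttl_seconds: float = 30.0) -> bool:
    """Return True if a lookup made near the end of the outage still succeeds."""
    records = {"app.example.internal": "10.0.0.5"}  # hypothetical record
    state = {"up": True}

    def upstream(name: str) -> Optional[str]:
        return records.get(name) if state["up"] else None

    cache = TtlDnsCache(upstream, ttl_seconds)
    cache.resolve("app.example.internal", now=0.0)  # warm the cache
    state["up"] = False  # the upgrade starts at t=0
    # Look the name up one second before the upstream comes back.
    answer = cache.resolve("app.example.internal", now=outage_seconds - 1.0)
    return answer is not None


if __name__ == "__main__":
    print("20s outage survived:", survives_outage(20.0))
    print("120s outage survived:", survives_outage(120.0))
```

In this sketch, a 20-second outage is absorbed by the cache, while a 120-second outage outlasts the cached entry and lookups start to fail, which is the pattern described above.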
We've verified that only a single component relies on the orchestrator for DNS. We will phase out the use of names for contacting this component and use its IP address instead, so we can also phase out this dependence on our orchestrator for DNS. This will allow us to use a more highly available DNS system for all components of the platform.