Slow Registry Operations
Incident Report for Quay.io
Postmortem

Summary: Quay.io experienced severe latency and significant timeouts on registry requests over a period of roughly two days, starting on the morning of July 26th and ending on the afternoon of July 27th. The cause appears to have been a networking issue between production machines running in AWS us-east-1 and the database running under RDS in the same region. After repeated requests to Amazon failed to resolve the issue, we deployed workarounds to reduce the database connection count and enable connection pooling, which mitigated the underlying issue.

Detailed description

Initial alert and handling

Starting on the morning (EST) of July 26th, the alerting system paged the Quay.io on-call engineers, indicating that our automated monitoring had detected an increase in latency and occasional timeouts on registry operations. After a brief investigation, the on-call engineer found a production machine holding a large number of database connections and, following standard procedure, removed that machine from our fleet. Past events suggested this would likely fix the latency issue: Quay.io runs behind a load balancer which round-robins traffic between our production machines, and the team has seen instances in the past where one EC2 machine will start exhibiting network problems when speaking to RDS (despite both being in the same region). Removing the offending machine results in a new machine taking its place, and thus (typically) prevents database connections from hanging. In this case, once the machine was removed, connection counts dropped and latency returned to normal for around 20 minutes.
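For illustration, the kind of check involved looks roughly like the sketch below: asking the database which client hosts hold the most open connections. This is a hypothetical example rather than Quay.io's actual tooling; it assumes a MySQL-compatible RDS instance, the PyMySQL driver, and placeholder hostnames and credentials.

```python
# Hypothetical diagnostic sketch (not from the Quay codebase): count open database
# connections per client host on a MySQL-compatible RDS instance, to spot a
# machine holding an unusually large number of connections. Assumes the PyMySQL
# driver and read access to information_schema.
import pymysql

def connections_by_host(db_host, user, password):
    conn = pymysql.connect(host=db_host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT SUBSTRING_INDEX(host, ':', 1) AS client, COUNT(*) AS conns
                FROM information_schema.processlist
                GROUP BY client
                ORDER BY conns DESC
                """
            )
            return cur.fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    # Placeholder endpoint and credentials.
    for client, conns in connections_by_host("prod-db.example.com", "monitor", "secret"):
        print(f"{client}: {conns} connections")
```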

Problem persists

Following a brief interlude, the latency problem returned, with more machines exhibiting an unusually large number of connections to the database, and random requests (which typically took no more than 500ms) occasionally taking as long as 20 seconds. As Quay.io had not had any recent deployments, we were confident that the problem was not due to a change we had made. A ticket was immediately opened with Amazon support, and a database failover was performed to switch to a different database instance, in case the original had entered an odd state. The failover mitigated the problem for a few minutes, after which it returned.

While awaiting a response from Amazon, the engineering team broke into two groups, with the on-call engineers redeploying our production machines in an effort to keep production traffic flowing while the rest of the team tracked down the root cause. It was quickly determined that the process of making (and breaking) connections to the database was occasionally hanging on nodes under high load; operations that should have taken no more than a few milliseconds were instead taking tens of seconds. As investigation began into the reasons for the delay, work was done in parallel to reduce the number of connections the software makes to the database. Deployment of this change and a follow-up change took some time (as Quay deploys from itself), but eventually the outstanding connection count was reduced significantly. Unfortunately, while this reduced the number of connections, the latency to the database remained. Various other solutions were tried during this period, including scaling up our production node sizes, switching OS versions, and changing the networking settings on RDS; all had minimal effect on the underlying issue.
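One simple way to separate connection-setup latency from query latency is to time each independently, along the lines of the hedged sketch below. This is an illustration, not the tooling used during the incident; the endpoint, credentials, and use of the PyMySQL driver are assumptions.

```python
# Hypothetical latency probe: time how long it takes to open a fresh database
# connection versus running a trivial query on an already-open connection.
# A large gap between the two points at connection setup, not query execution,
# as the bottleneck. Assumes PyMySQL and a MySQL-compatible RDS endpoint.
import time
import pymysql

DB = dict(host="prod-db.example.com", user="monitor", password="secret")  # placeholders

def probe(samples=20):
    for _ in range(samples):
        start = time.monotonic()
        conn = pymysql.connect(**DB)          # connection setup (TCP + auth handshake)
        connect_ms = (time.monotonic() - start) * 1000

        start = time.monotonic()
        with conn.cursor() as cur:
            cur.execute("SELECT 1")           # trivial query on the open connection
            cur.fetchone()
        query_ms = (time.monotonic() - start) * 1000

        conn.close()
        print(f"connect: {connect_ms:8.1f} ms   query: {query_ms:6.1f} ms")

if __name__ == "__main__":
    probe()
```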

Mitigation

Despite our efforts, we could not pin down any root cause for the sudden increase in latency. Realizing that we could not immediately fix the underlying issue, we then set out to work around it.

Quay.io had attempted to enable connection pooling a few years earlier; that attempt was rolled back when it was determined that there was (ironically) a connection issue when using the feature via our underlying ORM. Since that time, no additional attempt at enabling connection pooling had been made. However, as it appeared the underlying issue was related to making connections to the database, the team decided to make another attempt, in the hope that once a connection was made, it could be reused. Further, checking the source of our ORM showed that a number of issues previously associated with connection pooling had been patched, so we were reasonably confident the earlier problem would no longer be relevant.
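The report does not name the ORM or its configuration, so the following is only a minimal sketch of what enabling pooling can look like, using peewee's playhouse pooling extension with made-up connection parameters. The key property is that connections are reused rather than re-established for every operation.

```python
# Illustrative sketch only (the specific ORM, database name, endpoint, and
# parameters are assumptions, not taken from the report): switching from one
# connection per operation to a pool of reusable connections using peewee's
# playhouse pooling extension.
from playhouse.pool import PooledMySQLDatabase

db = PooledMySQLDatabase(
    "quay",                      # hypothetical database name
    host="prod-db.example.com",  # placeholder RDS endpoint
    user="app",
    password="secret",
    max_connections=20,   # cap on pooled connections per process
    stale_timeout=300,    # recycle connections idle for more than 5 minutes
)

# With pooling, db.connect() hands back an existing connection when one is
# available, so the expensive setup step that was hanging happens far less often;
# db.close() returns the connection to the pool instead of tearing it down.
```

Capping the pool size per process also reinforces the earlier work to reduce the overall number of connections held open against the database.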

A new release of Quay was built (manually, as Quay normally builds itself), pushed to Quay and deployed as a single instance in the cluster. Once deployed, the new instance was monitored to ensure that no new exceptions were raised and to check that the latency problem was no longer present. Verification showed that requests were indeed uniformly fast, and no new problems were appearing under heavy production workloads. Following verification, the remainder of the cluster was swapped for machines running the new image, and the latency problem disappeared shortly thereafter.

Returning to normal service

While the deployment of the mitigation solved the immediate registry latency issues, the build queue remained heavily backlogged with builds that could only now proceed. It took an additional eight hours for the build pipeline to process the backlog, after which the service returned to its normal build queue latency.

Changes to Quay procedure

As the underlying cause of the latency and outage is still unknown (but appears to be unrelated to changes made by the Quay team), we are unable to make direct changes to Quay procedure to prevent such issues from occurring again. That said, we recognize that connection pooling should have been tried earlier in the process; our hesitation stemmed from the issues encountered in the past. We have initiated procedure changes to ensure that if we need to deploy a workaround fix in the future, we can do so faster and with less hesitation, allowing us to resolve such situations more quickly.

We once again apologize for the inconvenience and frustration that this issue caused to all our customers, and hope in the future we can address such oddities at a faster pace.

Posted Aug 07, 2017 - 14:24 EDT

Resolved
We have resolved the issue with registry traffic internally. We will be writing up and publishing a post-mortem in the coming days. All registry operations are now nominal and the build queue has caught up with current requests. We are very sorry about the inconvenience.
Posted Jul 28, 2017 - 12:02 EDT
Update
We've deployed an additional change to further reduce latency on registry requests. Builds have accelerated as a result, but the queue is still quite deep. We will continue to monitor both latency and the build queue size.
Posted Jul 27, 2017 - 17:23 EDT
Update
We're continuing to monitor occasional slowdowns in registry requests. In addition, the previous slowdowns have caused a large backlog in builds, which we are currently processing.
Posted Jul 27, 2017 - 11:35 EDT
Monitoring
Quay.io has recovered for the time being. We are continuing to monitor the situation as time goes on. There is currently a large build backlog due to the registry issues from earlier.
Posted Jul 26, 2017 - 19:41 EDT
Update
We've deployed a temporary fix. We are continuing to investigate and will follow up by deploying a larger fix. Thank you for your patience and we apologize for the inconvenience.
Posted Jul 26, 2017 - 19:35 EDT
Investigating
We are currently experiencing slow registry operations as a result of continued heavy database usage. We are sorry for the inconvenience while we address this issue.
Posted Jul 26, 2017 - 09:28 EDT