Summary: Quay.io experienced severe latency and significant timeouts on registry requests over two days, starting on the morning of July 26th and ending on the afternoon of July 27th. The cause appears to have been a networking issue between production machines running in AWS us-east-1 and the database running under RDS in the same region. Having failed to resolve the issue via repeated requests to Amazon, we deployed workarounds to reduce the database connection count and enable connection pooling, which mitigated the problem.
Starting on the morning (EST) of July 26th, the alerting system paged the Quay.io on-call engineers indicating our automated monitoring detected an increase in latency and occasional timeouts on registry operations. After a brief investigation, a production machine that was holding a large number of database connections was found. Following standard procedure, the on-call engineer removed that machine from our fleet. Past events suggested this would likely fix the latency issue: Quay.io runs behind a load balancer which round-robins traffic between our production machines, and the team has seen instances in the past where one EC2 machine will start exhibiting network problems when speaking to RDS (despite both being in the same region). Removing the offending machine results in a new machine taking its place, and thus (typically) prevents database connections from hanging. Following the removal of the machine in the current situation, connection counts dropped and latency returned to normal for around 20 minutes.
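The "find the machine holding too many connections" step can be sketched as a simple outlier check. This is only an illustration: the host names, the counts, and the rule of flagging anything above three times the fleet median are assumptions made for the example, not the thresholds or tooling used by the Quay.io team.

```python
import statistics


def hosts_with_excess_connections(connections_by_host, factor=3.0):
    """Return hosts whose connection count exceeds factor * fleet median.

    Both the input mapping and the factor are hypothetical; real monitoring
    would pull counts from the database or from per-host metrics.
    """
    median = statistics.median(connections_by_host.values())
    threshold = factor * median
    return sorted(
        host for host, count in connections_by_host.items() if count > threshold
    )


# Made-up fleet snapshot: one machine holds far more connections than its peers.
snapshot = {"prod-a": 40, "prod-b": 42, "prod-c": 310, "prod-d": 38}
suspects = hosts_with_excess_connections(snapshot)
print(suspects)
```

A machine flagged this way would be a candidate for removal from the load balancer rotation, as described above.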
Following a brief interlude, the latency problem returned, with more machines exhibiting an unusually large number of connections to the database, and random requests (which typically took no more than 500ms) occasionally taking as long as 20 seconds. As Quay.io had not had any recent deployments, we were confident that the problem was not due to a change we had made. A ticket was immediately opened with Amazon support, and a database failover was performed to change our database instance, in case it had entered an odd state. The failover helped mitigate the problem for a few minutes, after which it returned.
While awaiting a response from Amazon, the engineering team broke into two groups, with the on-call engineers redeploying our production machines in an effort to keep production traffic flowing while the rest of the team tracked down the root cause. It was quickly determined that the process of making (and breaking) connections to the database was occasionally hanging on nodes under high load; operations that should have taken no more than a few milliseconds were instead taking tens of seconds. As investigation began into the reasons for the delay, work was done in parallel to reduce the number of connections that the software makes to the database. Deployment of this change and a followup change took some time (as Quay deploys from itself), but eventually the outstanding connection count was reduced significantly. Unfortunately, while this reduced the number of connections, the latency to the database remained. Various other solutions were tried during this period, including scaling of our production node sizes, switching OS versions, and changing the networking settings on RDS; all had minimal effect on the underlying issue.
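One way to confirm that connection setup and teardown, rather than query execution, is the slow part is to time each step on its own. The following is a minimal sketch under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for the real database driver, and the function name and the 0.5 second threshold are hypothetical.

```python
import sqlite3
import time


def time_connection_cycle(connect, slow_threshold_s=0.5):
    """Open and close one connection, timing each step separately."""
    start = time.perf_counter()
    conn = connect()
    connected = time.perf_counter()
    conn.close()
    closed = time.perf_counter()

    connect_s = connected - start
    close_s = closed - connected
    return {
        "connect_s": connect_s,
        "close_s": close_s,
        # Flag the cycle if either step took longer than the threshold.
        "slow": max(connect_s, close_s) > slow_threshold_s,
    }


def slow_connect():
    # Simulates a connection attempt that stalls before completing.
    time.sleep(0.05)
    return sqlite3.connect(":memory:")


fast = time_connection_cycle(lambda: sqlite3.connect(":memory:"), slow_threshold_s=10.0)
slow = time_connection_cycle(slow_connect, slow_threshold_s=0.01)
```

Running a probe like this repeatedly from each production node would show whether the stalls cluster on particular machines or on the connect step in general.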
Despite our efforts, we could not pin down any root cause for the sudden increase in latency. Realizing that we could not immediately fix the underlying issue, we then set out to work around it.
Quay.io had previously attempted to enable connection pooling a few years ago; the attempt had been rolled back when it was determined that there was (ironically) a connection issue when using the feature via our underlying ORM. Since that time, no additional attempt at enabling connection pooling had been made. However, as it appeared the underlying issue was related to making connections to the database, the team decided to make another attempt, in the hopes that once a connection was made, it could be reused. Further, checking the source for our ORM showed that a number of issues previously associated with connection pooling had been patched, so we were reasonably confident the previous issue would no longer be relevant.
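The idea behind the workaround, paying the connection cost once and then reusing the connection, can be illustrated with a small fixed-size pool. This is a sketch only, not the pooling implementation in Quay's ORM: the ConnectionPool class is hypothetical, and sqlite3 stands in for the production database driver.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """A tiny fixed-size pool: connections are opened once, then reused."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        self.opened = 0  # number of real connections ever created
        for _ in range(size):
            self._idle.put(connect())
            self.opened += 1

    @contextmanager
    def connection(self, timeout=5.0):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Return the connection to the pool instead of closing it.
            self._idle.put(conn)


pool = ConnectionPool(lambda: sqlite3.connect(":memory:"), size=2)

results = []
for _ in range(10):
    with pool.connection() as conn:
        results.append(conn.execute("SELECT 1").fetchone())
```

Ten queries run here over only two connections, so the expensive connect step happens twice rather than ten times. A production pool would also need to detect and replace stale or broken connections, which this sketch deliberately leaves out.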
A new release of Quay was built (manually, as Quay normally builds itself), pushed to Quay and deployed as a single instance in the cluster. Once deployed, the new instance was monitored to ensure that no new exceptions were raised and to check that the latency problem was no longer present. Verification showed that requests were indeed uniformly fast, and no new problems were appearing under heavy production workloads. Following verification, the remainder of the cluster was swapped for machines running the new image, and the latency problem disappeared shortly thereafter.
While the deployment of the mitigation solved the immediate registry latency issues, the build queue was still heavily overloaded due to the number of queued builds that could only now proceed. It took an additional eight hours for the build pipeline to process the backlog, after which the service returned to its normal build queue latency.
As the underlying cause of the latency and outage is still unknown (but appears to be unrelated to changes made by the Quay team), we are unable to make direct changes to Quay procedure to prevent such issues from occurring again. That being said, we recognize that connection pooling should have been tried earlier in the process; we hesitated because of the issues encountered in the past. We’ve initiated procedure changes to ensure that if we need to deploy a workaround fix in the future, we can do so faster and with less hesitation, thus allowing us to resolve the situation more quickly.
We once again apologize for the inconvenience and frustration that this issue caused to all our customers, and hope in the future we can address such oddities at a faster pace.