On the afternoon of March 22, 2021, the quay.io team noticed a gradual increase in database connections. The number of connections continued to grow while the team attempted to resolve the issue. Eventually, the connection load overwhelmed the database instance and the service experienced an outage. The database was restarted, but each attempt to restart the quay.io pods drove the number of database connections back up. The service was restored in read-only mode while the team continued to investigate the state of the database. Shortly thereafter, read/write mode was re-enabled and the incident was resolved.
The root cause of the incident was traced back to a combination of factors. Just prior to the database connections climbing, the quay.io team had deployed a minor configuration change that required all of the pods to be restarted. Each quay.io pod maintains its own separate cache service to reduce load on the database. When the pods were restarted, these caches had to be rebuilt directly from the database. The net effect of all quay.io pods restarting with cold caches had historically been well within the capacity of our database server; however, because of the recent growth in quay.io usage, the database server was no longer sized to handle all of the pods recreating their caches simultaneously. Because quay.io is used by many CI/CD systems, its traffic never pauses, so the database had no opportunity to catch up on incoming requests while the caches warmed. Queries began to back up, causing connections to stay active longer than usual.
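To make the cold-cache effect concrete, the sketch below shows a minimal read-through cache of the kind each pod keeps in front of the database. It is an illustration only, not quay.io's actual cache service: the class name, key scheme, and TTL are all hypothetical. The relevant behavior is that a freshly restarted pod starts with an empty cache, so every incoming request falls through to the database until entries are repopulated.

```python
import time


class ReadThroughCache:
    """Minimal sketch of a per-pod, in-memory read-through cache.

    Hypothetical illustration only; the names and TTL are not taken
    from quay.io's code. After a pod restart the dictionary is empty
    ("cold"), so every request becomes a database query until the
    entries are repopulated.
    """

    def __init__(self, db_query, ttl_seconds=300):
        self._db_query = db_query   # callable that performs the real database query
        self._ttl = ttl_seconds
        self._entries = {}          # empty after every restart

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None:
            value, stored_at = entry
            if time.time() - stored_at < self._ttl:
                return value        # warm path: no database load
        # Cold or expired path: fall through to the database and repopulate.
        value = self._db_query(key)
        self._entries[key] = (value, time.time())
        return value
```

When a configuration change restarts every pod at once, every pod takes the cold path for its first wave of requests at the same time, so the database briefly sees close to the full request rate instead of only occasional cache misses. At earlier traffic levels that burst fit within the server's headroom; after the recent growth in usage, it no longer did.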
Deeper investigation into the issue showed that a particular database query, used when fetching image layers, had begun taking an order of magnitude longer than usual. Analysis showed the query joining two tables that together held a massive number of rows, and the query plan chosen by the optimizer required scanning the entirety of one of those tables to complete. This slow query performance exacerbated the impact of the cache warming activity, eventually causing the database server to crash.
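The post does not identify the specific tables or query involved, so the sketch below uses SQLite and invented table names (manifest, manifest_blob) purely to illustrate the failure mode: a join against a column with no usable index forces the optimizer to scan an entire table, and an EXPLAIN-style plan makes that visible.

```python
import sqlite3

# Hypothetical schema for illustration only; quay.io's real tables,
# columns, and database engine (and its optimizer) differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE manifest (id INTEGER PRIMARY KEY, digest TEXT);
    CREATE TABLE manifest_blob (id INTEGER PRIMARY KEY,
                                manifest_digest TEXT,  -- no index on this join column
                                blob_id INTEGER);
""")

query = """
    SELECT mb.blob_id
    FROM manifest m
    JOIN manifest_blob mb ON mb.manifest_digest = m.digest
    WHERE m.id = ?
"""

# EXPLAIN QUERY PLAN reveals how the optimizer intends to execute the join.
for row in conn.execute("EXPLAIN QUERY PLAN " + query, (1,)):
    print(row)
# Typical output includes a SCAN of manifest_blob (alias mb): with no
# index on manifest_digest, the only option is to read the whole table.
```

On a table with many millions of rows, a plan like this turns a previously fast lookup into one that is dramatically slower; adding an appropriate index, or otherwise steering the optimizer back to an indexed search, is the usual remedy.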
The service was restored by moving the quay.io database to a larger instance size. This additional capacity was able to absorb both the slower query performance and the onslaught of traffic during pod restarts. quay.io is currently operating on this larger instance while longer-term changes are rolled out; we recognize that simply increasing the size of the database instance is not a sustainable way to keep up with service growth.
The quay.io team identified several key areas for improvement and created the following plan of action:
We have already completed much of the engineering work to address these learnings and will be rolling out service improvements in the near future. The quay.io team continues to focus on improving the scalability and resiliency of the service as usage grows. Service outages are never welcome, and the quay.io engineering team seeks to learn as much as possible from each incident so that our platform and processes continuously improve and we can offer the best customer experience possible.