Sporadic image pull failures
Incident Report for Quay.io
Postmortem

On Sunday, November 12, 2023, the Quay engineering team migrated quay.io’s database from its original MySQL RDS instance (circa 2013) to an Aurora Postgres instance. We did this because maintaining an older version of MySQL and updating the database without causing downtime were becoming increasingly difficult. This migration was a large undertaking that required months of planning. Because quay.io’s database holds metadata linking all customer images to their underlying layer blobs, losing any of this information would have been catastrophic.

To facilitate the migration between databases we used Amazon’s Data Migration Service (DMS), which is specifically designed for this type of task. The day after we went live on Aurora Postgres, we discovered that some images were not pulling successfully. Upon investigation, we traced these failures to truncated manifests in our new database.

This truncation occurred because DMS had been configured to limit the size of some MySQL text fields to 32K when they were converted to Postgres. This resulted in very large manifests being truncated. New image pushes were not affected as these were being written directly to the database.
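As an illustration of why this kind of truncation breaks pulls, the sketch below relies on the fact that image manifests are addressed by a SHA-256 digest of their content, so a manifest cut short no longer hashes to its own digest. It is a minimal sketch, not the tooling used during the incident: the function names are hypothetical, the manifest is a simplified stand-in, and "32K" is assumed to mean 32 * 1024 characters, which the report does not specify.

```python
import hashlib
import json

# Assumption: "32K" is read here as 32 * 1024 characters; the report does not
# say whether the limit was counted in bytes or characters.
ASSUMED_LIMIT = 32 * 1024


def sha256_digest(manifest_text: str) -> str:
    """Return a registry-style digest string for the manifest text."""
    return "sha256:" + hashlib.sha256(manifest_text.encode("utf-8")).hexdigest()


def simulate_truncation(manifest_text: str, limit: int = ASSUMED_LIMIT) -> str:
    """Mimic a migration step that silently cuts text at `limit` characters."""
    return manifest_text[:limit]


def is_intact(stored_text: str, expected_digest: str) -> bool:
    """A stored manifest is intact only if it still hashes to its digest."""
    return sha256_digest(stored_text) == expected_digest


def make_manifest(layer_count: int) -> str:
    """Build a hypothetical manifest-like JSON document with many layers."""
    layers = [
        {"size": i, "digest": "sha256:" + hashlib.sha256(str(i).encode()).hexdigest()}
        for i in range(layer_count)
    ]
    return json.dumps({"schemaVersion": 2, "layers": layers})


small = make_manifest(5)     # well under the assumed limit
large = make_manifest(1000)  # well over the assumed limit

small_ok = is_intact(simulate_truncation(small), sha256_digest(small))
large_ok = is_intact(simulate_truncation(large), sha256_digest(large))
```

In this sketch the small manifest passes through unchanged, while the large one is cut at the limit and fails its digest check, which matches the report's observation that only very large manifests were affected.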

We immediately wrote and executed a script to traverse our old MySQL database and reconcile all manifests against our Postgres database, correcting any truncated manifests. This was a large task, as quay.io currently holds over 60 million manifests. This work was completed on November 16th.
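The shape of such a reconciliation pass can be sketched as follows. This is a hypothetical illustration rather than the script we ran: the two dictionaries are in-memory stand-ins for the old MySQL and new Postgres manifest tables, keyed by manifest digest, and the function name is invented for the example.

```python
from typing import Dict, List


def reconcile(source: Dict[str, str], target: Dict[str, str]) -> List[str]:
    """Overwrite any target manifest that differs from its source copy.

    Both arguments map a manifest digest to its stored text. They are
    hypothetical in-memory stand-ins for the old and new tables.
    Returns the digests that were repaired, in source iteration order.
    """
    repaired = []
    for digest, source_text in source.items():
        target_text = target.get(digest)
        if target_text is None:
            continue  # row was never migrated; outside the scope of this sketch
        if target_text != source_text:
            target[digest] = source_text  # restore the full manifest
            repaired.append(digest)
    return repaired


# Placeholder digests and contents, for illustration only.
old_db = {"sha256:aaa": "x" * 40000, "sha256:bbb": "small manifest"}
new_db = {
    "sha256:aaa": "x" * 32768,             # truncated during migration
    "sha256:bbb": "small manifest",        # migrated intact
    "sha256:ccc": "pushed after cutover",  # exists only in the new database
}

fixed = reconcile(old_db, new_db)
```

Iterating over the old database rather than the new one means that manifests pushed after the cutover, which exist only in the new database, are never touched.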

While only a small subset of images was affected, we are deeply sorry for any inconvenience this may have caused.

This database migration has been an interesting journey in migrating a critical architectural component while trying to keep quay.io functioning. We will share a write-up soon with more details on our migration process, the issues we encountered, and the improvements we look forward to making in the coming months.

Posted Nov 18, 2023 - 09:27 EST

Resolved
The issue with image pulls has been resolved. We will, however, need to keep builders disabled until Monday 11/20. We apologize for the inconvenience.
Posted Nov 16, 2023 - 16:01 EST
Monitoring
Our fix has been completed. Quay.io is operating correctly for pushes and pulls, and all write operations should work with the exception of builds. We are continuing to evaluate an issue with builds at this time.
Posted Nov 14, 2023 - 18:50 EST
Update
We are continuing to resolve the pull issue and are investigating potential issues with our CDN. Thanks for your patience.
Posted Nov 14, 2023 - 17:32 EST
Update
We are experiencing some instability and are moving to read only mode.
Posted Nov 14, 2023 - 15:59 EST
Update
We are continuing to resolve this issue. Thanks for your patience and we apologize for the inconvenience.
Posted Nov 14, 2023 - 08:55 EST
Identified
We have identified an issue where some pulls of older images may not succeed, and we are working on a fix. We apologize for the inconvenience.
Posted Nov 13, 2023 - 17:04 EST
This incident affected: Registry, API, and Build System.