We are now confident that the service has returned to stability.
The following is the series of events that took place:

7:05AM EDT - A single machine begins crashing after exhausting its connection pool to the database
7:45AM EDT - Our continuous registry push/pull monitor pages the on-call engineer due to timeouts, and investigation begins
8:25AM EDT - The machine exhausting its connection pool is manually terminated by the on-call SRE
8:30AM EDT - A cascading failure takes all app servers out of service, resulting in a full service outage
8:35AM EDT - A few app servers manage to enter service, returning the service to a partial outage
9:02AM EDT - Full service is restored as the desired number of healthy app servers are back in service
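For context on the 7:05AM entry: connection pool exhaustion is typically contained by capping the pool on each app server, so one misbehaving host degrades locally instead of starving the shared database. Below is a minimal sketch of that, assuming a Go service using database/sql; the driver, connection string, and limits are illustrative assumptions, not our actual stack or configuration.

    package main

    import (
    	"database/sql"
    	"log"
    	"time"

    	_ "github.com/lib/pq" // hypothetical driver; the backing database is not named in this report
    )

    func main() {
    	// The DSN here is a placeholder, not a real endpoint.
    	db, err := sql.Open("postgres", "postgres://registry:secret@db.internal:5432/registry?sslmode=require")
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer db.Close()

    	// Bound the pool so a single app server cannot hold an
    	// unbounded number of connections against the shared database.
    	db.SetMaxOpenConns(50)                 // illustrative ceiling per app server
    	db.SetMaxIdleConns(10)                 // small warm set for reuse
    	db.SetConnMaxLifetime(5 * time.Minute) // recycle so wedged connections eventually drain

    	if err := db.Ping(); err != nil {
    		log.Fatal(err)
    	}
    	log.Println("connection pool configured")
    }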
We are continuing our investigation internally to determine the root cause of the single machine failure that triggered the cascading outage.
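For reference, the continuous registry push/pull monitor that paged the on-call is, in essence, a synthetic check run on an interval that alerts when requests time out. A minimal sketch of that pattern follows; the endpoint, interval, and timeout are hypothetical, and the real monitor performs full push/pull operations rather than a single HTTP GET.

    package main

    import (
    	"context"
    	"fmt"
    	"log"
    	"net/http"
    	"time"
    )

    // probe runs one synthetic check against the registry API and fails
    // if the request does not complete within the deadline.
    func probe(url string, timeout time.Duration) error {
    	ctx, cancel := context.WithTimeout(context.Background(), timeout)
    	defer cancel()

    	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    	if err != nil {
    		return err
    	}
    	resp, err := http.DefaultClient.Do(req)
    	if err != nil {
    		return err // wraps context.DeadlineExceeded on timeout
    	}
    	defer resp.Body.Close()
    	if resp.StatusCode != http.StatusOK {
    		return fmt.Errorf("unexpected status %d", resp.StatusCode)
    	}
    	return nil
    }

    func main() {
    	// Poll on a fixed interval; a production monitor would page
    	// the on-call only after several consecutive failures.
    	for range time.Tick(30 * time.Second) {
    		if err := probe("https://registry.example.com/v2/", 10*time.Second); err != nil {
    			log.Printf("probe failed: %v", err)
    		}
    	}
    }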
Posted Aug 28, 2019 - 11:25 EDT
We are monitoring the service after recovering from the cascading failure. The registry and API should be operating normally.
Posted Aug 28, 2019 - 09:07 EDT
We are currently investigating a full outage of the registry and API service.