This is Docker’s first time publishing an incident report publicly. While we have always done detailed internal post-mortems on incidents, as part of the changing culture at Docker, we want to be more open externally as well. For example, this year we have started publishing our roadmap publicly and asking our users for their input. You should expect to see us continue publishing reports for significant incidents.
In publishing these reports, we hope others can learn from the issues we have faced and how we have dealt with them. We hope it builds trust in our services and our teams. We also think this one is pretty interesting due to the complex interaction between multiple services and stakeholders.
Amazon Linux users in several regions encountered intermittent hanging downloads of Docker images from the Docker Hub registry between roughly July 5 19:00 UTC and July 6 06:30 UTC. The issue stemmed from an anti-botnet protection mechanism our CDN provider Cloudflare had deployed. Teams from Docker, Cloudflare, and AWS worked together to pinpoint the issue and the mechanism in question was disabled, leading to full service restoration.
At about 01:45 UTC on Monday July 6th (Sunday evening Pacific time), Docker was contacted by AWS about image pulls from Docker Hub failing for a number of their services and users. Both the Docker Hub and Infrastructure teams immediately started digging into the problem.
The initial troubleshooting step was of course to try doing image pulls from our local machines. These all worked, and combined with our monitoring and alerting showing no issues, ruled out a service-wide issue with the registry.
Next, we checked pulls in our own infrastructure running in AWS. As expected by the lack of alarms in our own monitoring, this worked. This told us that the issue was more specific than “all AWS infrastructure” – it was either related to region or a mechanism in the failing services themselves.
Based on early feedback from AWS engineers that the issue affected systems running Amazon Linux (including higher-level services like Fargate), the Docker team started spinning up instances with Amazon Linux and another OS in multiple AWS regions. The results showed two things: both operating systems worked fine in us-east-1, while in some other regions Amazon Linux failed to pull images where the other OS succeeded.
The fact that us-east-1 worked for both operating systems told us the problem was related to our CDN, Cloudflare. This is because Docker Hub image data is stored in S3 buckets in us-east-1, so requests from that region are served directly from S3. Other regions, where we saw issues, were served via the CDN. Docker opened an incident with Cloudflare at 02:35 UTC.
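As an illustration of this reasoning (not the exact tooling used during the incident), one way to tell whether a response came through Cloudflare rather than directly from S3 is to check for CDN-specific response headers such as `cf-ray`, which Cloudflare adds to proxied responses. A minimal sketch:

```python
def served_via_cloudflare(headers):
    """Return True if response headers suggest the request was proxied
    through Cloudflare rather than served directly (e.g. from S3).

    Cloudflare adds a `cf-ray` header (and typically `Server: cloudflare`)
    to responses it proxies; a direct S3 response carries neither.
    """
    normalized = {k.lower(): v.lower() for k, v in headers.items()}
    return "cf-ray" in normalized or normalized.get("server") == "cloudflare"

# Example header sets (illustrative values, not captured from the incident):
cdn_headers = {"Server": "cloudflare", "CF-RAY": "5ae07f2a1b3c-IAD"}
s3_headers = {"Server": "AmazonS3", "x-amz-request-id": "EXAMPLE123"}
```

Checking headers like this on a failing pull quickly separates “origin problem” from “CDN path problem.”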
Because we only observed the issue on Amazon Linux, engineers from all three companies began digging into what the interaction between that OS and Cloudflare was. A number of avenues were examined – was Amazon Linux using custom docker/containerd packages? No. Did the issue still exist when replicating a pull using curl rather than Docker Engine? Yes. It was now clear that the issue was some sort of low-level network implementation detail, and all teams focused their investigation there.
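A pull can be replicated outside Docker Engine because the Docker Registry HTTP API v2 is plain HTTPS: request an auth token, fetch the manifest, then fetch each layer blob. A hedged sketch of the endpoints involved (the repository name used below is just a placeholder):

```python
# Sketch of the Registry HTTP API v2 endpoints a pull touches.
# The repository and reference values are illustrative placeholders.
AUTH = "https://auth.docker.io/token"
REGISTRY = "https://registry-1.docker.io"

def token_url(repo):
    # Anonymous pull token for a public repository.
    return f"{AUTH}?service=registry.docker.io&scope=repository:{repo}:pull"

def manifest_url(repo, reference):
    # `reference` is a tag (e.g. "latest") or a sha256 digest.
    return f"{REGISTRY}/v2/{repo}/manifests/{reference}"

def blob_url(repo, digest):
    # Layer blobs are content-addressed by digest.
    return f"{REGISTRY}/v2/{repo}/blobs/{digest}"
```

With curl, one would GET the token URL, pass the token as a Bearer header to the manifest URL, then GET each blob. The blob GETs are the requests served from S3 or the CDN, which is where the hangs were reproducible.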
At about 05:00 UTC, engineers from AWS examining networking differences between Amazon Linux and other operating systems discovered that modifying a network packet attribute to match other systems made the issue disappear. This information was shared with Cloudflare.
Cloudflare investigated given this new information and found that some traffic to Docker Hub was being dropped by an anti-botnet mitigation system. This system had recently had a new detection signature added that flagged packets with a certain attribute as potentially part of an attack.
The fact that packets from Amazon Linux matched this signature combined with the large scale of traffic to Docker Hub activated this mechanism in several regions. While Cloudflare had been monitoring this change for some time before enabling it, this interaction had not been uncovered before the mechanism was switched from monitoring to active.
Cloudflare then disabled this mechanism for Docker Hub traffic and all parties confirmed full resolution at about 06:30 UTC.
So what did we learn?
First, we learned that our visibility into the end-user experience of our CDN downloads was limited. In our own monitoring, we have identified that we can track increases in 206 (Partial Content) response codes to indicate that such an issue may be occurring: when a download hangs, the client often reconnects and requests the partial content it did not previously receive. This monitoring is now in place, and it will lead to much quicker resolution of any future incidents of this kind.
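A rough sketch of that kind of check (the threshold and data shape here are illustrative placeholders, not our actual alerting configuration): compute the share of 206 responses in a window of access-log entries and flag when it spikes.

```python
def partial_content_ratio(status_codes):
    """Fraction of responses in the window that were 206 Partial Content.

    Hanging downloads tend to show up as clients reconnecting with Range
    requests for the bytes they did not receive, which the origin answers
    with 206 rather than 200.
    """
    if not status_codes:
        return 0.0
    return sum(1 for s in status_codes if s == 206) / len(status_codes)

def should_alert(status_codes, threshold=0.05):
    # Threshold is an illustrative placeholder, not a production value.
    return partial_content_ratio(status_codes) > threshold
```

In practice a check like this would run over a sliding window and compare against a historical baseline rather than a fixed threshold.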
Additionally, we will work with Cloudflare to increase our visibility into mitigations that are actively affecting traffic for Docker Hub.
Lastly – this reaffirmed that the internet is a complicated place. This issue involved details all the way from low-level network implementation up to higher-order services that abstract such things away. Never underestimate the impact of every layer of your stack and its dependencies on other parties.
We’d like to thank our partners at Cloudflare and AWS for their work in diagnosing the problem. It took all three parties working together to resolve this issue for our joint users. We know developers and operators rely on many different providers to power their systems, and the closer we work together, the better we can support you.