Over the last decade, the popularity of microservices and highly scalable systems has grown, and with it the complexity of our applications: they are now distributed heavily across the network, with many moving pieces and many potential failure modes.
This architectural evolution has changed monitoring requirements, creating a need for scalable, insightful tooling and practices that help us identify, debug, and resolve issues in our systems before they impact the business and our end users (internal and/or external).
I recently gave a talk at DockerCon SF 18 discussing functionality in Docker Enterprise that enables operators to more easily monitor their container platform environment, along with some key metrics and best practices to triage and remediate issues before they cause downtime.
One of the best-known early monitoring techniques is the USE method from Brendan Gregg at Netflix. USE says that for every resource we should monitor utilization (time spent servicing work), saturation (the degree to which a resource has work it can't service), and errors (the number of error events). This model works well for hardware- and node-centric metrics, but network-based applications require a tweaked model.
One of the most popular models for network-oriented, cloud-native apps is the '4 Golden Signals' (Latency, Traffic, Errors & Saturation) called out in the Google SRE book. These monitoring methods are useful at the application and platform level, but still lack some of the detail we need to triage complex scenarios and failures.
Application and Platform Observability
Observability goes a step further than simple metrics: it is the measure of how well we can infer the state of our systems by reviewing their outputs. Observability comprises monitoring, logging (events), tracing and alerting to build a full picture of the state of the system. For our applications to be 'observable' it's important to instrument them so that we can pull out key information and analyze it. Recent years have seen a tooling renaissance in this area, with the likes of DataDog, Instana, Prometheus, Sumo Logic and many others catering to the increased need for advanced functionality.
Observability in Docker Enterprise
The Docker Enterprise container platform has a number of features built in to make monitoring and metric collection easier. A few particularly useful ones are health checks, engine metrics and logging:
Health checks: Health checks are built into the Dockerfile specification and allow users to write monitoring checks against their applications. This information is reported through the engine and up through the Docker Enterprise web admin UI. Docker Enterprise will automatically reschedule workloads that fail health checks.
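As a sketch, a health check can be declared directly in the Dockerfile. The image, probe endpoint and timings below are illustrative assumptions; point the check at whatever endpoint your application actually exposes:

```dockerfile
FROM nginx:alpine

# Probe the app every 30s, failing the probe after 5s; mark the
# container unhealthy after three consecutive failures. The root
# path here is an example -- many apps expose a dedicated
# health endpoint instead.
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD wget -q --spider http://localhost/ || exit 1
```

A container failing this check shows as unhealthy in `docker ps` and in the Docker Enterprise UI, and, as noted above, its workload is rescheduled.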
Engine metrics: The Docker Engine – Enterprise exposes an endpoint that emits Prometheus-formatted metrics data for easy integration into monitoring tooling. There are hundreds of individual metrics available, including data about builds, Swarm status (to detect when leaders are down, loss of quorum, etc.), daemon events (e.g. network creation) and many more.
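If the metrics endpoint is not already enabled, a minimal daemon configuration sketch looks like the following (the address and port are the commonly documented defaults, but treat them as an example for your environment):

```json
{
  "metrics-addr": "0.0.0.0:9323",
  "experimental": true
}
```

After restarting the daemon, fetching `http://localhost:9323/metrics` returns the Prometheus text exposition, and a Prometheus server can scrape that address as a normal target.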
Logging: Docker Enterprise has built-in support for a number of different logging drivers, including the ability to tag services with metadata to make querying easier once the logs are shipped to an aggregator.
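For example, a service can be pointed at a logging driver and tagged with template metadata at create time. The driver choice and the aggregator address below are a sketch, assuming a syslog endpoint is reachable from the nodes:

```shell
# Ship this service's logs via the syslog driver, tagging each
# line with the container name and ID so the aggregator can be
# queried per service.
docker service create \
  --name web \
  --log-driver syslog \
  --log-opt syslog-address=udp://logs.example.com:514 \
  --log-opt tag="{{.Name}}/{{.ID}}" \
  nginx:alpine
```

The `tag` log-opt accepts Go templates, so the same pattern works for other drivers that support tagging.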
Observability of Docker @ Docker
Our infrastructure team runs Docker Hub & Store in the cloud and sees an incredible amount of traffic through the platform, with over one billion image pulls every two weeks. Below are some of the stats from our production environment:
This all runs on the Docker Enterprise platform and utilizes many of the tools and techniques outlined above.
To learn more about how Docker Enterprise can help you streamline container operations:
- Learn more about Docker Enterprise
- Try Docker Enterprise for free
- Discover more about health checks, engine metrics & logging drivers