We’ve all been there: you’re 90% of the way through downloading a massive, multi-gigabyte GGUF model file for llama.cpp when your internet connection hiccups. The download fails, and the progress bar resets to zero. It’s a frustrating experience that wastes time, bandwidth, and momentum.
Well, the llama.cpp community has just shipped a fantastic quality-of-life improvement that puts an end to that frustration: resumable downloads!
This is a significant step forward for making large models more accessible and reliable to work with. Let’s take a quick look at what this new feature does and then explore how to achieve a truly seamless, production-grade model management workflow with Docker.
What’s New in llama.cpp Model Pulling?
Based on a recent pull request, the file downloading logic within llama.cpp has been overhauled to be more robust and efficient.
Previously, if a download was interrupted, you had to start over from the beginning. Even worse, if a new version of a model was released at the same URL, the old file would be deleted entirely to make way for the new one, forcing a complete re-download.
The new implementation is much smarter. Here are the key improvements (the same pattern is sketched with curl right after this list):
- Resumable Downloads: The downloader now checks if the remote server supports byte-range requests via the Accept-Ranges HTTP header. If it does, any interrupted download can be resumed exactly where it left off. No more starting from scratch!
- Smarter Updates: It still checks for remote file changes using ETag and Last-Modified headers, but it no longer immediately deletes the old file if the server doesn’t support resumable downloads.
- Atomic File Writes: The code now writes downloads and metadata files to a temporary location before atomically renaming them. This prevents file corruption if the program is terminated mid-write, ensuring the integrity of your model cache.
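To make this concrete, here is a rough sketch of the same pattern using plain curl. This is not llama.cpp’s actual downloader code, and the model URL is just a placeholder, but it illustrates the three checks described above:
# Ask the server whether it supports byte-range requests (look for Accept-Ranges)
curl -sI https://example.com/models/model.gguf | grep -i '^accept-ranges'
# Download to a temporary file; -C - resumes from wherever the partial file left off
curl -L -C - -o model.gguf.tmp https://example.com/models/model.gguf
# Atomically rename the finished file into place so an interrupted write never corrupts the cache
mv model.gguf.tmp model.gguf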
This is an enhancement that makes the ad-hoc experience of fetching models from a URL much smoother. However, as you move from experimentation to building real applications, managing models via URLs can introduce challenges around versioning, reproducibility, and security. That’s where a fully integrated Docker workflow comes in.
From Better Downloads to Best-in-Class Model Management
While the new llama.cpp feature fixes the delivery of a model from a URL, it doesn’t solve the higher-level challenges of managing the models themselves. You’re still left asking:
- Is this URL pointing to the exact version of the model I tested with?
- How do I distribute this model to my team or my production environment reliably?
- How can I treat my AI models with the same rigor as my application code and container images?
For a complete, Docker-native experience, the answer is Docker Model Runner.
The Docker-Native Way: Docker Model Runner
Docker Model Runner is a tool that lets you manage, run, and distribute AI models using Docker Desktop (via GUI or CLI) or Docker CE and the ecosystem you already know and love. It bridges the gap between AI development and production operations by treating models as first-class citizens alongside your containers.
Instead of depending on an application’s internal downloader and pointing it at a URL, you can manage models with familiar commands (examples below) and enjoy powerful benefits:
- OCI Push and Pull Support: Docker Model Runner treats models as Open Container Initiative (OCI) artifacts. This means you can store them in any OCI-compliant registry, like Docker Hub. You can docker model push and docker model pull your models just like container images.
- Versioning and Reproducibility: Tag your models with versions (e.g., my-company/my-llama-model:v1.2-Q4_K_M). This guarantees that you, your team, and your CI/CD pipeline are always using the exact same file, ensuring reproducible results. The URL to a file can change, but a tagged artifact in a registry is immutable.
- Simplified and Integrated Workflow: Pulling and running a model becomes a single, declarative command. Model Runner handles fetching the model from the registry and mounting it into the container for llama.cpp to use.
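For example, pulling and pushing a versioned model artifact looks just like working with container images; the repository and tag below are purely illustrative:
# Pull a pinned, versioned model artifact from an OCI-compliant registry
docker model pull my-company/my-llama-model:v1.2-Q4_K_M
# Push the same artifact so your team and CI/CD pipeline get exactly the same file
docker model push my-company/my-llama-model:v1.2-Q4_K_M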
Here’s how simple it is to run a model from Docker Hub using the llama.cpp image with Model Runner:
# Run a Gemma 3 model, asking it a question
# Docker Model Runner will automatically pull the model
docker model run ai/gemma3 "What is the Docker Model Runner?"
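After the pull completes, you can confirm the model is available locally (assuming a recent Docker Desktop or Docker CE setup with Model Runner enabled):
# List the models stored in your local model cache
docker model list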
The resumable download feature in llama.cpp is a community contribution that makes getting started easier. When you’re ready to level up your MLOps workflow, embrace Docker Model Runner for a truly integrated, reproducible, and scalable way to manage your AI models. We are also working on resumable downloads in Docker Model Runner to bring the same smoother pulling experience to the Docker-native workflow.
We’re Building This Together!
Docker Model Runner is a community-friendly project at its core, and its future is shaped by contributors like you. If you find this tool useful, please head over to our GitHub repository. Show your support by giving us a star, forking the project to experiment with your own ideas, and contributing. Whether it’s improving documentation, fixing a bug, or adding a new feature, every contribution helps. Let’s build the future of model deployment together!
Learn more:
- Check out the Docker Model Runner General Availability announcement
- Visit our Model Runner GitHub repo! Docker Model Runner is open-source, and we welcome collaboration and contributions from the community!
- Read our blog on llama.cpp’s support for pulling GGUF models directly from Docker Hub