vLLM has quickly become the go-to inference engine for developers who need high-throughput LLM serving. We brought vLLM to Docker Model Runner for NVIDIA GPUs on Linux, then extended it to Windows via WSL2.
Until now, macOS was the missing piece. That changes today: Docker Model Runner now supports vllm-metal, a new backend that brings vLLM inference to macOS using Apple Silicon’s Metal GPU. If you have a Mac with an M-series chip, you can now run MLX models through vLLM with the same OpenAI-compatible API, the same Anthropic-compatible API for tools like Claude Code, and the same Docker workflow.
What is vllm-metal?
vllm-metal is a plugin for vLLM that brings high-performance LLM inference to Apple Silicon. Developed in collaboration between Docker and the vLLM project, it unifies MLX, Apple’s machine learning framework, and PyTorch under a single compute pathway, plugging directly into vLLM’s existing engine, scheduler, and OpenAI-compatible API server.
The architecture is layered: vLLM’s core (engine, scheduler, tokenizer, API) stays unchanged on top. A plugin layer consisting of MetalPlatform, MetalWorker, and MetalModelRunner handles the Apple Silicon specifics. Underneath, MLX drives the actual inference while PyTorch handles model loading and weight conversion. The whole stack runs on Metal, Apple’s GPU framework.
+-------------------------------------------------------------+
| vLLM Core |
| Engine | Scheduler | API | Tokenizers |
+-------------------------------------------------------------+
|
v
+-------------------------------------------------------------+
| vllm_metal Plugin Layer |
| +-----------+ +-----------+ +------------------------+ |
| | Platform | | Worker | | ModelRunner | |
| +-----------+ +-----------+ +------------------------+ |
+-------------------------------------------------------------+
|
v
+-------------------------------------------------------------+
| Unified Compute Backend |
| +------------------+ +----------------------------+ |
| | MLX (Primary) | | PyTorch (Interop) | |
| | - SDPA | | - HF Loading | |
| | - RMSNorm | | - Weight Conversion | |
| | - RoPE | | - Tensor Bridge | |
| | - Cache Ops | | | |
| +------------------+ +----------------------------+ |
+-------------------------------------------------------------+
|
v
+-------------------------------------------------------------+
| Metal GPU Layer |
| Apple Silicon Unified Memory Architecture |
+-------------------------------------------------------------+
Figure 1: High-level architecture diagram of vllm-metal. Credit: vllm-metal
What makes this particularly effective on Apple Silicon is unified memory. Unlike discrete GPUs where data must be copied between CPU and GPU memory, Apple Silicon shares a single memory pool. vllm-metal exploits this with zero-copy tensor operations. Combined with paged attention for efficient KV cache management and Grouped-Query Attention support, this means you can serve longer sequences with less memory waste.
vllm-metal runs MLX models published by the mlx-community on Hugging Face. These models are built specifically for the MLX framework and take full advantage of Metal GPU acceleration. Docker Model Runner automatically routes MLX models to vllm-metal when the backend is installed, falling back to the built-in MLX backend otherwise.
How vllm-metal works
vllm-metal runs natively on the host. This is necessary because Metal GPU access requires direct hardware access and there is no GPU passthrough for Metal in containers.
When you install the backend, Docker Model Runner:
- Pulls a Docker image from Docker Hub that contains a self-contained Python 3.12 environment with vllm-metal and all dependencies pre-packaged.
- Extracts it to `~/.docker/model-runner/vllm-metal/`.
- Verifies the installation by importing the `vllm_metal` module.
When a request comes in for a compatible model, Docker Model Runner’s scheduler starts a vllm-metal server process that communicates over TCP, serving the standard OpenAI API. The model is loaded from Docker’s shared model store, which contains all the models you pull with `docker model pull`.
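For example, once a compatible model is up, a chat completion request could look like the sketch below. It assumes host-side TCP access is enabled on Docker Model Runner’s default port 12434 and uses one of the MLX models listed in the next section; your endpoint and model reference may differ.

curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'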
Which models work with vllm-metal?
vllm-metal works with safetensors models in MLX format. The mlx-community on Hugging Face maintains a large collection of quantized models optimized for Apple Silicon. Some examples you can try (one of them is pulled in the sketch after this list):
- https://huggingface.co/mlx-community/Llama-3.2-1B-Instruct-4bit
- https://huggingface.co/mlx-community/Mistral-7B-Instruct-v0.3-4bit
- https://huggingface.co/mlx-community/Qwen3-Coder-Next-4bit
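For instance, the first of these can be pulled straight from Hugging Face. This is a sketch that assumes Docker Model Runner’s `hf.co/` prefix for Hugging Face model references:

docker model pull hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit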
vLLM everywhere with Docker Model Runner
With vllm-metal, Docker Model Runner now supports vLLM across the three major platforms:
| Platform | Backend | GPU |
|---|---|---|
| Linux | vllm | NVIDIA (CUDA) |
| Windows (WSL2) | vllm | NVIDIA (CUDA) |
| macOS | vllm-metal | Apple Silicon (Metal) |
The same docker model commands work regardless of platform. Pull a model, run it. Docker Model Runner picks the right backend for your platform.
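For instance, after pulling the MLX model above on a Mac, a quick interactive check is just the sketch below; the prompt-argument form assumes a current `docker model` CLI, so check `docker model --help` on your version:

docker model run hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit "Write a haiku about containers."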
Get started today
Update to Docker Desktop 4.62 or later for Mac, and install the backend:
docker model install-runner --backend vllm-metal
Check out the Docker Model Runner documentation to learn more. For contributions, feedback, and bug reports, visit the docker/model-runner repository on GitHub.
Giving Back: vllm-metal is Now Open Source
At Docker, we believe that the best way to accelerate AI development is to build in the open. That is why we are proud to announce that Docker has contributed the vllm-metal project to the vLLM community. Originally developed by Docker engineers to power Model Runner on macOS, the project now lives under the vLLM GitHub organization. This ensures that every developer in the ecosystem can benefit from and contribute to high-performance inference on Apple Silicon. The project has also received significant contributions from Lik Xun Yuan, Ricky Chen, and Ranran Haoran Zhang.
The $599 AI Development Rig
For a long time, high-throughput vLLM development was gated behind a significant GPU cost. To get started, you typically needed a dedicated Linux box with an RTX 4090 ($1,700+) or enterprise-grade A100/H100 cards ($10,000+).
vllm-metal changes the math
Now, a base $599 Mac Mini with an M4 chip becomes a viable vLLM development environment. Because Apple Silicon uses Unified Memory, that 16GB (or upgraded 32GB/64GB) of RAM is directly accessible by the GPU. This allows you to:
- Develop & Test Locally: Build your vLLM-based applications on the same machine you use for coding.
- Production-Mirroring: Use the exact same OpenAI-compatible API on your Mac Mini as you would on an H100 cluster in production (see the sketch after this list).
- Energy Efficiency: Run inference at a fraction of the power consumption (and heat) of a discrete GPU rig.
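As a sketch of what that mirroring can look like: with many OpenAI-compatible clients, switching between the Mac Mini and the cluster is just a base-URL change. The port and production URL below are assumptions for illustration:

# local development against Docker Model Runner (assumes host TCP access on the default port 12434)
export OPENAI_BASE_URL=http://localhost:12434/engines/v1
# production: point the same client at the cluster's vLLM endpoint (hypothetical URL)
export OPENAI_BASE_URL=https://inference.example.com/v1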
How does vllm-metal compare to llama.cpp?
We benchmarked both backends using Llama 3.2 1B Instruct with comparable 4-bit quantization, served through Docker Model Runner on Apple Silicon.
| | llama.cpp | vLLM-Metal |
|---|---|---|
| Model | unsloth/Llama-3.2-1B-Instruct-GGUF:Q4_0 | mlx-community/llama-3.2-1b-instruct-4bit |
| Format | GGUF (Q4_0) | Safetensors (MLX 4-bit) |
Throughput (tokens/sec, wall-clock)
| max_tokens | llama.cpp | vLLM-Metal | Speedup (llama.cpp over vLLM-Metal) |
|---|---|---|---|
| 128 | 333.3 | 251.5 | 1.3x |
| 512 | 345.1 | 279.0 | 1.3x |
| 1024 | 338.5 | 275.4 | 1.2x |
| 2048 | 339.1 | 279.5 | 1.2x |
Each configuration was run 3 times across 3 different prompts (9 total requests per data point).
Throughput is measured as completion_tokens / wall_clock_time, applied consistently to both backends.
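The measurement itself is easy to reproduce in spirit. The snippet below is a simplified sketch, not the benchmark script from the Gist; it assumes the local endpoint used earlier, the MLX model from the table, and `jq` and `bc` installed:

# time a single completion and compute completion_tokens / wall_clock_time
resp=$(curl -s -w '\n%{time_total}' http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "hf.co/mlx-community/llama-3.2-1b-instruct-4bit", "max_tokens": 512,
       "messages": [{"role": "user", "content": "Tell me a short story."}]}')
secs=$(echo "$resp" | tail -n 1)                        # wall-clock time reported by curl
tokens=$(echo "$resp" | sed '$d' | jq '.usage.completion_tokens')
echo "scale=1; $tokens / $secs" | bc                    # tokens per second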
Key observations:
- llama.cpp is consistently about 1.2x to 1.3x faster than vLLM-Metal across all output lengths.
- llama.cpp throughput is remarkably stable (~333-345 tok/s regardless of max_tokens), while vLLM-Metal shows more variance between individual runs (134-343 tok/s).
- Both backends scale well. Neither backend shows significant degradation as output length increases.
- Quantization methods differ (GGUF Q4_0 vs MLX 4-bit), so this benchmarks the full stack, engine + quantization, rather than the engine alone.
The benchmark script used for these results is available as a GitHub Gist.
How to get involved
Docker Model Runner’s strength lies in its community, and there is always room to grow. To get involved:
- Star the repository: Show your support by starring the Docker Model Runner repository.
- Contribute your ideas: Open issues and submit pull requests. We look forward to your ideas!
- Spread the word: Tell friends and colleagues who are interested in running AI models with Docker.
We are excited about this new chapter for Docker Model Runner and can’t wait to see what we build together. Let’s get to work!
Learn more
- Read the related post: OpenCode with Docker Model Runner for Private AI Coding
- Check out the Docker Model Runner general availability announcement
- Visit our Model Runner GitHub repository
- Get started with a simple Hello GenAI application