Embeddings have become the backbone of many modern AI applications. From semantic search to retrieval-augmented generation (RAG) and intelligent recommendation systems, embedding models enable systems to understand the meaning behind text, code, or documents, not just the literal words.
But generating embeddings comes with trade-offs. Using a hosted API for embedding generation often means reduced data privacy, per-call costs that add up, and slow, expensive re-embedding whenever your data changes. When your data is private or constantly evolving (think internal documentation, proprietary code, or customer support content), these limitations quickly become blockers.
Instead of sending data to a remote service, you can run embedding models locally with Docker Model Runner. Model Runner brings the power of modern embeddings to your local environment, giving you privacy, control, and cost-efficiency out of the box.
In this post, you’ll learn how to use embedding models for semantic search. We’ll start by covering the theory behind embeddings and why developers should run them locally. Then, we’ll wrap up with a practical example using Model Runner to help you get started.
Understanding semantic search embeddings
Let’s take a moment to first demystify what embeddings are.
Embeddings represent words, sentences, and even code as high-dimensional numerical vectors that capture semantic relationships. In this vector space, similar items cluster together, while dissimilar ones are farther apart.
For example, a traditional keyword search looks for exact matches. If you search for “authentication”, you’ll only find documents containing that exact term. But with embeddings, searching for “user login” might also surface results about authentication, session management, or security tokens because the model understands that these are semantically related ideas.
This makes embeddings the foundation for more intelligent search, retrieval, and discovery — where systems understand what you mean, not just what you type.
For a deeper perspective on how language and meaning intersect in AI, check out “The Language of Artificial Intelligence”.
How vector similarity enables semantic search with embeddings
Here’s where the math behind semantic search comes in, and it’s elegantly simple.
Once text is converted into vectors (lists of numbers), we can measure how similar two pieces of text are using cosine similarity:

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

Where:
- A is your query vector (e.g., “user login”),
- B is another vector (e.g., a code snippet or document).
The result is a similarity score between -1 and 1 (for text embeddings it typically falls between 0 and 1), where values closer to 1 mean the texts are closer in meaning.
In practice:
- A search query and a relevant document will have a high cosine similarity.
- Irrelevant results will have low similarity.
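As a quick illustration of the math above, here’s a minimal Python sketch that computes cosine similarity with NumPy. The vectors are made-up toy values, not real model output:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product of the vectors divided by the product of their norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors; real embedding models produce hundreds or thousands of dimensions.
query = np.array([0.9, 0.1, 0.4, 0.3])   # e.g., "user login"
doc_a = np.array([0.8, 0.2, 0.5, 0.3])   # e.g., an authentication snippet
doc_b = np.array([0.1, 0.9, 0.0, 0.7])   # e.g., an unrelated snippet

print(cosine_similarity(query, doc_a))   # high score: semantically close
print(cosine_similarity(query, doc_b))   # lower score: less related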
This simple mathematical measure allows you to rank documents by how semantically close they are to your query, which powers features like:
- Natural language search over docs or code
- RAG pipelines that retrieve contextually relevant snippets
- Deduplication or clustering of related content
With Model Runner, you can generate these embeddings locally, feed them into a vector database (like Milvus, Qdrant, or pgvector), and start building your own semantic search system without sending a single byte to a third-party API.
Why use Docker Model Runner to run embedding models
With Model Runner, you don’t have to worry about setting up environments or dependencies. Just pull a model, start the runner, and you’re ready to generate embeddings, all inside a familiar Docker workflow.
Full data privacy
Your sensitive data never leaves your environment. Whether you’re embedding source code, internal documents, or customer content, you can rest assured that everything stays local — no third-party API calls, no network exposure.
Zero cost per embedding
There are no usage-based API costs. Once you have the model running locally, you can generate, update, or rebuild your embeddings as often as you need, at no extra cost.
That means iterating on your dataset or experimenting with new prompts won’t affect your budget.
Performance and control
Run the model that best fits your use case, leveraging your own CPU or GPU for inference.
Models are distributed as OCI artifacts, so they integrate seamlessly into your existing Docker workflows, CI/CD pipelines, and local development setups. This means you can manage and version models just like any other container image, ensuring consistency and reproducibility across environments.
Model Runner lets you bring models to your data, not the other way around, unlocking local, private, and cost-effective AI workflows.
Hands-on: Generating embeddings with Docker Model Runner
Now that we understand what embeddings are and how they capture semantic meaning, let’s see how simple it is to generate embeddings locally using Model Runner.
Step 1. Pull the model
docker model pull ai/qwen3-embedding
Step 2. Generate embeddings
Model Runner exposes an OpenAI-compatible embeddings endpoint on localhost. You can send text to it with curl or your preferred HTTP client:
curl http://localhost:12434/engines/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "ai/qwen3-embedding",
"input": "A dog is an animal"
}'
The response includes the embedding vector: a list of floating-point numbers that represents the meaning of your input text.
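If you prefer working from code, here’s a minimal Python sketch that calls the same endpoint with the requests library and pulls the vector out of the OpenAI-style response. The embed_text helper name is just for illustration:

import requests

# Docker Model Runner's OpenAI-compatible embeddings endpoint (same URL as the curl example above).
EMBEDDINGS_URL = "http://localhost:12434/engines/v1/embeddings"

def embed_text(text: str) -> list[float]:
    """Send one string to the local embeddings endpoint and return its vector."""
    response = requests.post(
        EMBEDDINGS_URL,
        json={"model": "ai/qwen3-embedding", "input": text},
        timeout=60,
    )
    response.raise_for_status()
    # OpenAI-style responses put the vectors under "data", one entry per input.
    return response.json()["data"][0]["embedding"]

vector = embed_text("A dog is an animal")
print(len(vector), vector[:5])  # dimensionality and the first few values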
You can store these vectors in a vector database like Milvus, Qdrant, or pgvector to perform semantic search or similarity queries.
Example use case: Semantic search over your codebase
Let’s make it practical.
Imagine you want to enable semantic code search across your project repository.
The process looks like this:
Step 1. Chunk and embed your code
Split your codebase into logical chunks. Generate embeddings for each chunk using your local Docker Model Runner endpoint.
Step 2. Store embeddings
Save those embeddings along with metadata (file name, path, etc.). You would usually keep them in a vector database, but for simplicity in this demo we’ll store them in a plain file.
Step 3. Query by meaning
When a developer searches “user login”, you embed the query and compare it to your stored vectors using cosine similarity.
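Here’s a minimal end-to-end sketch of that flow in Python, reusing the local endpoint from the earlier example. The chunking is deliberately naive (one chunk per file), and the function names, index file, and project path are placeholders; a real setup would use smarter chunking and a vector database:

import json
import pathlib
import numpy as np
import requests

EMBEDDINGS_URL = "http://localhost:12434/engines/v1/embeddings"
INDEX_FILE = "code_index.json"  # simple file-based store for this demo

def embed_text(text: str) -> list[float]:
    """Embed one string with the local Docker Model Runner endpoint."""
    resp = requests.post(
        EMBEDDINGS_URL,
        json={"model": "ai/qwen3-embedding", "input": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]

def build_index(repo_dir: str) -> None:
    """Steps 1 and 2: embed each source file (one chunk per file) and save vectors plus metadata."""
    index = []
    for path in pathlib.Path(repo_dir).rglob("*.py"):  # placeholder: index Python files only
        code = path.read_text(errors="ignore")
        index.append({"path": str(path), "embedding": embed_text(code)})
    pathlib.Path(INDEX_FILE).write_text(json.dumps(index))

def search(query: str, top_k: int = 3) -> list[tuple[float, str]]:
    """Step 3: embed the query and rank stored chunks by cosine similarity."""
    index = json.loads(pathlib.Path(INDEX_FILE).read_text())
    q = np.array(embed_text(query))
    scored = []
    for entry in index:
        v = np.array(entry["embedding"])
        score = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((score, entry["path"]))
    return sorted(scored, reverse=True)[:top_k]

if __name__ == "__main__":
    build_index("./my-project")          # placeholder path
    for score, path in search("user login"):
        print(f"{score:.3f}  {path}")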
We have included a demo in the Docker Model Runner repository that does exactly that.
Figure 1: Codebase example demo with embeddings stats, example queries, and search results.
Conclusion
Embeddings let applications work with meaning, not just keywords. The old hassle was wiring up third-party APIs, juggling data privacy, and watching per-call costs creep up.
Docker Model Runner flips the script. You can run embedding models locally, where your data lives, with full control over your data and infrastructure. Ship semantic search, RAG pipelines, or custom search with a consistent Docker workflow: private, cost-effective, and reproducible.
No usage fees. No external dependencies. By bringing models directly to your data, Docker makes it easier than ever to explore, experiment, and innovate, safely and at your own pace.
How you can get involved
The strength of Docker Model Runner lies in its community, and there is always room to grow. We need your help to make this project the best it can be. Here’s how you can get involved:
- Star the repository. Show your support and help boost visibility by starring the Docker Model Runner repository.
- Contribute your ideas. Have an idea for a new feature or a bug fix? Create an issue to discuss it, or fork the repository, make your changes, and submit a pull request. We look forward to seeing what you come up with!
- Spread the word: Tell your friends, colleagues, and anyone else who might be interested in running AI models with Docker.
We’re incredibly excited about this new chapter for Docker Model Runner, and we can’t wait to see what we build together. Let’s get to work!
Get started with Docker Model Runner →
Learn more
- Check out the Docker Model Runner integration with vLLM announcement
- Visit the Model Runner GitHub repository. Docker Model Runner is open source and welcomes collaboration and contributions from the community.
- Get started with Docker Model Runner using the simple hello GenAI application