How to solve context size issues with context packing using Docker Model Runner and Agentic Compose

Posted Feb 13, 2026

If you’ve worked with local language models, you’ve probably run into the context window limit, especially when using smaller models on less powerful machines. While it’s an unavoidable constraint, techniques like context packing make it surprisingly manageable.

Hello, I’m Philippe, and I am a Principal Solutions Architect helping customers with their usage of Docker. In my previous blog post, I wrote about how to make a very small model useful by using RAG. In that post, I had limited the message history to 2 messages to keep the context length short.

But in some cases, you’ll need to keep more messages in your history. For example, a long conversation to generate code:

- generate an http server in golang
- add a human structure and a list of humans
- add a handler to add a human to the list
- add a handler to list all humans
- add a handler to get a human by id
- etc...

Let’s imagine we have a conversation for which we want to keep 10 messages in the history. Moreover, we’re using a very verbose model (which generates a lot of tokens), so we’ll quickly encounter this type of error:

{
  error: {
    code: 400,
    message: 'request (8860 tokens) exceeds the available context size (8192 tokens), try increasing it',
    type: 'exceed_context_size_error',
    n_prompt_tokens: 8860,
    n_ctx: 8192
  },
  code: 400,
  param: undefined,
  type: 'exceed_context_size_error'
}


What happened?

Understanding context windows and their limits in local LLMs

Our LLM has a context window, and that window has a limited size. If the conversation becomes too long, requests start to fail, as we just saw.

This window is the total number of tokens the model can process at once, like a short-term working memory. Read this IBM article for a deep dive on context windows.

In our example in the error above, this size was set to 8192 tokens. The engines that power local LLMs, like Docker Model Runner, Ollama, or llama.cpp, let you configure this size.

This window includes everything: system prompt, user message, history, injected documents, and the generated response. Refer to this Redis post for more info. 

Example: if the model has a 32k context window, the sum (input + history + generated output) must remain ≤ 32k tokens. Learn more here.
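
To make that concrete, here is a rough back-of-the-envelope check in JavaScript, assuming the ~4 characters per token heuristic used later in this post (real tokenizers vary by model, and the strings and values below are made up for illustration):

// Rough token budget check (illustrative only; assumes ~4 characters per token)
const contextSize = 8192;        // the model's context window
const reservedForOutput = 1024;  // keep room for the generated response

const estimateTokens = (text) => Math.ceil(text.length / 4);

const systemPrompt = "You are a Golang expert...";
const history = ["generate an http server in golang", "add a human structure and a list of humans"];
const userMessage = "add a handler to list all humans";

const promptTokens =
  estimateTokens(systemPrompt) +
  history.reduce((acc, msg) => acc + estimateTokens(msg), 0) +
  estimateTokens(userMessage);

console.log(`~${promptTokens} prompt tokens, ~${contextSize - reservedForOutput - promptTokens} tokens left before hitting the limit`);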

It’s possible to change the default context size (up or down) in the compose.yml file:

models:
  chat-model:
    model: hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m
    # Increased context size for better handling of larger inputs
    context_size: 16384

You can also do this from the Docker CLI with the following command: docker model configure --context-size 8192 ai/qwen2.5-coder

And so we solve the problem, but only part of it. Indeed, it’s not guaranteed that your model supports a larger context size (like 16384), and even if it does, a larger context can very quickly degrade the model’s performance.

For example, with hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m, when the number of tokens in the context approaches 16384, generation can become (much) slower, at least on my machine. Again, this will depend on the model’s capacity (read its documentation). And remember: the smaller the model, the harder it is to handle a large context and stay focused.

Tip: always provide an option in your application (a /clear command, for example) to empty or reduce the message list, whether automatically or manually. Keep the initial system instructions, though.
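
Here is a minimal sketch of such a command, assuming the history is a plain array of [role, content] pairs kept separately from the system instructions (the function name is illustrative, not the post’s actual code):

// Hypothetical /clear handler: empties the conversation history but keeps
// the system instructions, which live outside of the history array.
function handleCommand(input, history) {
  if (input.trim() === "/clear") {
    history.length = 0; // drop all previous user/assistant messages in place
    console.log("Conversation history cleared (system instructions are kept).");
    return true; // command handled, skip generation for this turn
  }
  return false; // not a command, proceed with the normal chat flow
}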

So we’re at an impasse. How can we go further with our small models?

Well, there is still a solution, which is called context packing.

Using context packing to fit more information into limited context windows

We can’t increase the context size indefinitely. To still fit more information into the context, we can use a technique called “context packing”: we have the model itself (or another model entrusted with the task) summarize the previous messages, then replace the history with that summary, which frees up space in the context.

So we decide that beyond a certain token limit, we summarize the history of previous messages and replace it with the generated summary.
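
The essence of that decision looks like this (a minimal sketch with illustrative names — estimateTokens, packHistoryIfNeeded, summarize; the actual implementation used in this post follows below):

// Sketch of the context packing trigger: estimate the size of the history
// and, past a threshold, replace it with a summary produced by a model.
function estimateTokens(messages) {
  // Rough heuristic: ~4 characters per token.
  return messages.reduce((acc, [, content]) => acc + Math.ceil(content.length / 4), 0);
}

async function packHistoryIfNeeded(history, tokenLimit, summarize) {
  if (estimateTokens(history) < tokenLimit) {
    return history; // still fits, keep the full history
  }
  // `summarize` can be the chat model itself, or a dedicated
  // context packing model as in the rest of this post.
  const summary = await summarize(history);
  return [["assistant", summary]]; // the summary replaces the whole history
}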

I’ve therefore modified my example to add a context packing step. For the exercise, I decided to use another model to do the summarization.

Modification of the compose.yml file

I added a new model in the compose.yml file: ai/qwen2.5:1.5B-F16

models:
  chat-model:
    model: hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m

  embedding-model:
    model: ai/embeddinggemma:latest

  context-packing-model:
    model: ai/qwen2.5:1.5B-F16

Then:

- I added the model in the models section of the service that runs our program.
- I increased the number of messages in the history to 10 (instead of 2 previously); a sketch of how such a cap can be applied follows this list.
- I set a token limit of 5120 before triggering context compression.
- And finally, I defined instructions for the “context packing” model, asking it to summarize previous messages.
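
As a hypothetical illustration of the history cap (the actual truncation logic lives in index.js; the names below are made up for the example), keeping only the last HISTORY_MESSAGES entries could look like this:

// Hypothetical sketch: keep only the most recent HISTORY_MESSAGES entries.
const historyLimit = parseInt(process.env.HISTORY_MESSAGES, 10) || 10;

function trimHistory(history) {
  // history is an array of [role, content] pairs, oldest first
  return history.slice(-historyLimit);
}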

excerpt from the service:

golang-expert-v3:
  build:
    context: .
    dockerfile: Dockerfile

  environment:
    HISTORY_MESSAGES: 10
    TOKEN_LIMIT: 5120
    # ...

  configs:
    - source: system.instructions.md
      target: /app/system.instructions.md
    - source: context-packing.instructions.md
      target: /app/context-packing.instructions.md

  models:
    chat-model:
      endpoint_var: MODEL_RUNNER_BASE_URL
      model_var: MODEL_RUNNER_LLM_CHAT

    context-packing-model:
      endpoint_var: MODEL_RUNNER_BASE_URL
      model_var: MODEL_RUNNER_LLM_CONTEXT_PACKING

    embedding-model:
      endpoint_var: MODEL_RUNNER_BASE_URL
      model_var: MODEL_RUNNER_LLM_EMBEDDING

You’ll find the complete version of the file here: compose.yml

System instructions for the context packing model

Still in the compose.yml file, I added a new system instruction for the “context packing” model, in a context-packing.instructions.md file:

context-packing.instructions.md:
  content: |
    You are a context packing assistant.
    Your task is to condense and summarize provided content to fit within token limits while preserving essential information.
    Always:
    - Retain key facts, figures, and concepts
    - Remove redundant or less important details
    - Ensure clarity and coherence in the condensed output
    - Aim to reduce the token count significantly without losing critical information

    The goal is to help fit more relevant information into a limited context window for downstream processing.

All that’s left is to implement the context packing logic in the assistant’s code.

Applying context packing to the assistant’s code

First, I define the connection with the context packing model in the Setup part of my assistant:

// ChatOpenAI comes from the @langchain/openai package
const contextPackingModel = new ChatOpenAI({
  model: process.env.MODEL_RUNNER_LLM_CONTEXT_PACKING || `ai/qwen2.5:1.5B-F16`,
  apiKey: "",
  configuration: {
    // Docker Model Runner exposes an OpenAI-compatible endpoint
    baseURL: process.env.MODEL_RUNNER_BASE_URL || "http://localhost:12434/engines/llama.cpp/v1/",
  },
  temperature: 0.0,
  topP: 0.9,
  presencePenalty: 2.2,
});

I also retrieve the system instructions I defined for this model, as well as the token limit:

let contextPackingInstructions = fs.readFileSync('/app/context-packing.instructions.md', 'utf8');

let tokenLimit = parseInt(process.env.TOKEN_LIMIT) || 7168

Once in the conversation loop, I estimate the number of tokens consumed by the previous messages. If this number exceeds the defined limit, I call the context packing model to summarize the history of previous messages and replace that history with the generated summary (as an assistant-type message: [“assistant”, summary]). Then I continue generating the response with the main model.

excerpt from the conversation loop:

  let estimatedTokenCount = messages.reduce((acc, [role, content]) => acc + Math.ceil(content.length / 4), 0);
  console.log(` Estimated token count for messages: ${estimatedTokenCount} tokens`);

  if (estimatedTokenCount >= tokenLimit) {
    console.log(` Warning: Estimated token count (${estimatedTokenCount}) exceeds the model's context limit (${tokenLimit}). Compressing conversation history...`);

    // Calculate original history size
    const originalHistorySize = history.reduce((acc, [role, content]) => acc + Math.ceil(content.length / 4), 0);

    // Prepare messages for context packing
    const contextPackingMessages = [
      ["system", contextPackingInstructions],
      ...history,
      ["user", "Please summarize the above conversation history to reduce its size while retaining important information."]
    ];

    // Generate summary using context packing model
    console.log(" Generating summary with context packing model...");
    let summary = '';
    const summaryStream = await contextPackingModel.stream(contextPackingMessages);
    for await (const chunk of summaryStream) {
      summary += chunk.content;
      process.stdout.write('\x1b[32m' + chunk.content + '\x1b[0m');
    }
    console.log();

    // Calculate compressed size
    const compressedSize = Math.ceil(summary.length / 4);
    const reductionPercentage = ((originalHistorySize - compressedSize) / originalHistorySize * 100).toFixed(2);

    console.log(` History compressed: ${originalHistorySize} tokens → ${compressedSize} tokens (${reductionPercentage}% reduction)`);

    // Replace all history with the summary
    conversationMemory.set("default-session-id", [["assistant", summary]]);

    estimatedTokenCount = compressedSize

    // Rebuild messages with compressed history
    messages = [
      ["assistant", summary],
      ["system", systemInstructions],
      ["system", knowledgeBase],
      ["user", userMessage]
    ];
  }

You’ll find the complete version of the code here: index.js

All that’s left is to test our assistant and have it hold a long conversation, to see context packing in action.

docker compose up --build -d
docker compose exec golang-expert-v3 node index.js

And after a while in the conversation, you should see the warning message about the token limit, followed by the summary generated by the context packing model, and finally, the reduction in the number of tokens in the history:

Estimated token count for messages: 5984 tokens
Warning: Estimated token count (5984) exceeds the model's context limit (5120). Compressing conversation history...
Generating summary with context packing model...
Sure, here's a summary of the conversation:

1. The user asked for an example in Go of creating an HTTP server.
2. The assistant provided a simple example in Go that creates an HTTP server and handles GET requests to display "Hello, World!".
3. The user requested an equivalent example in Java.
4. The assistant presented a Java implementation that uses the `java.net.http` package to create an HTTP server and handle incoming requests.

The conversation focused on providing examples of creating HTTP servers in both Go and Java, with the goal of reducing the token count while retaining essential information.
History compressed: 4886 tokens → 153 tokens (96.87% reduction)

This way, we ensure that our assistant can handle a long conversation while maintaining good generation performance.

Summary

The context window is an unavoidable constraint when working with local language models, particularly with small models and on machines with limited resources. However, by using techniques like context packing, you can easily work around this limitation. Using Docker Model Runner and Agentic Compose, you can implement this pattern to support long, verbose conversations without overwhelming your model.

All the source code is available on Codeberg: context-packing. Give it a try! 
