There’s a weird contradiction in modern AI development. We have better tools than ever. We’re building smarter systems with cleaner abstractions. And yet, every time you try to swap out a component in your stack, things fall apart. Again.
This isn’t just an inconvenience. It’s become the norm.
You’d think with all the frameworks and libraries out there (LangChain, Hugging Face, MLflow, Airflow) we’d be past this by now. These tools were supposed to make our workflows modular and composable. Swap an embedding model? No problem. Try a new vector store? Easy. Switch from OpenAI to an open-source LLM? Go ahead. That was the dream.
But here’s the reality: we’ve traded monoliths for a brittle patchwork of microtools, each with its own assumptions, quirks, and “standard interfaces.” And every time you replace one piece, you end up chasing down broken configs, mismatched input/output formats, and buried side effects in some YAML file you forgot existed.
Tooling was supposed to be the glue. But most days, it still feels like duct tape.
The composability myth
A lot of the tooling that’s emerged in AI came with solid intentions. Follow the UNIX philosophy. Build small pieces that do one thing well. Expose clear interfaces. Make everything swappable.
In theory, this should’ve made experimentation faster and integration smoother. But in practice, most tools were built in isolation. Everyone had their own take on what an embedding is, how prompts should be formatted, what retry logic should look like, or how to chunk a document.
So instead of composability, we got fragmentation. Instead of plug-and-play, we got “glue-and-hope-it-doesn’t-break.”
And this fragmentation isn’t just annoying; it slows everything down. Want to try a new RAG strategy? You might need to re-index your data, adjust your chunk sizes, tweak your scoring functions, and migrate your vector DB schema. None of that should be necessary. But it is.
The stack is shallow and wide
AI pipelines today span a bunch of layers:
- Data ingestion
- Feature extraction or embeddings
- Vector storage and retrieval
- LLM inference
- Orchestration (LangChain, LlamaIndex, etc.)
- Agent logic or RAG strategies
- API / frontend layers
Each one looks like a clean block on a diagram. But under the hood, they’re often tightly coupled through undocumented assumptions about tokenization quirks, statefulness, retry behavior, latency expectations, etc.
The result? What should be a flexible stack is more like a house of cards. Change one component, and the whole thing can wobble.
Why everything keeps breaking
The short answer: abstractions leak — a lot.
Every abstraction simplifies something. And when that simplification doesn’t match the underlying complexity, weird things start to happen.
Take LLMs, for example. You might start with OpenAI’s API and everything just works. Predictable latency, consistent token limits, clean error handling. Then you switch to a local model. Suddenly:
- The input format is different
- You have to manage batching and GPU memory
- Token limits aren’t well documented
- Latency increases dramatically
- You’re now in charge of quantization and caching
What was once a simple `llm.predict()` call becomes a whole new engineering problem. The abstraction has leaked, and you’re writing glue code again.
This isn’t just a one-off annoyance. It’s structural. We’re trying to standardize a landscape where variability is the rule, not the exception.
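To make that concrete, here is a rough sketch of the glue you end up writing. The class names and the local-model details are hypothetical, not any provider’s or library’s actual API; the point is how much machinery suddenly lives behind the same predict() call.

```python
# Hypothetical wrapper classes; names and details are illustrative,
# not any provider's or library's real API.
from dataclasses import dataclass

@dataclass
class Completion:
    text: str
    tokens_used: int

class HostedLLM:
    """One HTTP call; batching, memory, and caching are the provider's problem."""
    def __init__(self, client):
        self.client = client

    def predict(self, prompt: str) -> Completion:
        raw = self.client.complete(prompt)  # assumed client method
        return Completion(raw["text"], raw["usage"]["total_tokens"])

class LocalLLM:
    """The same predict() now owns context limits, batching, and decoding."""
    def __init__(self, model, tokenizer, max_context: int = 4096):
        self.model = model
        self.tokenizer = tokenizer
        self.max_context = max_context

    def predict(self, prompt: str) -> Completion:
        ids = self.tokenizer.encode(prompt)
        if len(ids) > self.max_context:       # context handling is yours now
            ids = ids[-self.max_context:]
        # batching, GPU memory, quantization, and caching are also yours
        out_ids = self.model.generate(ids)
        return Completion(self.tokenizer.decode(out_ids), len(out_ids))
```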
Where are the standards?
One big reason for the current mess is the lack of solid standards for interoperability.
In other fields, we’ve figured this out:
- Containers → OCI, Docker
- APIs → OpenAPI
- Observability → OpenTelemetry
- Data formats → Parquet, JSON Schema, Avro
In AI? We’re not there yet. Most tools define their own contracts. Few agree on what’s universal. And as a result, reuse is hard, swapping is risky, and scaling becomes painful.
Look at where AI tooling stands today:
- There’s still no widely adopted standard for model I/O signatures.
- Prompt formats, context windows, and tokenizer behavior vary across providers.
- We do see promising efforts like MCP (Model Context Protocol) emerging, and that’s a good sign, but in practice, most RAG pipelines, agent tools, and vector store integrations still lack consistent, enforced contracts.
- Error handling? It’s mostly improvised: retries, timeouts, fallbacks, and silent failures become your responsibility.
So yes, standards like MCP are starting to show up, and they matter. But today, most teams are still stitching things together manually. Until these protocols become part of the common tooling stack, supported by vendors and respected across libraries, the glue will keep leaking.
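For a sense of what “a standard for model I/O signatures” could even mean, here is a hypothetical sketch of a shared request/response shape. It is not MCP or any existing spec, just the kind of shape a standard would have to pin down so that tools could agree on it.

```python
# Hypothetical, provider-agnostic I/O signature. Not an existing standard;
# just the kind of contract a standard would need to define.
from typing import Literal, TypedDict

class GenerationRequest(TypedDict):
    prompt: str
    max_tokens: int
    temperature: float

class GenerationResponse(TypedDict):
    text: str
    finish_reason: Literal["stop", "length", "error"]
    prompt_tokens: int
    completion_tokens: int
```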
Local glue ≠ global composability
It’s tempting to say: “But it worked in the notebook.”
Yes, and that’s the problem.
The glue logic that works for your demo, local prototype, or proof-of-concept often breaks down in production. Why?
- Notebooks aren’t production environments—they don’t have retries, monitoring, observability, or proper error surfaces.
- Chaining tools with Python functions is different from composing them with real-time latency constraints, concurrency, and scale in mind.
- Tools like LangChain often make it easy to compose components, until you hit race conditions, cascading failures, or subtle bugs in state management.
Much of today’s tooling is optimized for developer ergonomics during experimentation, not for durability in production. The result: we demo pipelines that look clean and modular, but behind the scenes are fragile webs of assumptions and implicit coupling.
Scaling this glue logic, making it testable, observable, and robust, requires more than clever wrappers. It requires system design, standards, and real engineering discipline.
The core problem: Illusion of modularity
What makes this even more dangerous is the illusion of modularity. On the surface, everything looks composable – API blocks, chain templates, toolkits – but the actual implementations are tightly coupled, poorly versioned, and frequently undocumented.
The AI stack doesn’t break because developers are careless. It breaks because the foundational abstractions are still immature, and the ecosystem hasn’t aligned on how to communicate, fail gracefully, or evolve in sync.
Until we address this, the glue will keep breaking, no matter how shiny the tools become.
Interface contracts, not SDK hype
Many AI tools offer SDKs filled with helper functions and syntactic sugar. But this often hides the actual interfaces and creates tight coupling between your code and a specific tool. Instead, composability means exposing formal interface contracts, like:
- OpenAPI for REST APIs
- Protocol Buffers for efficient, structured messaging
- JSON Schema for validating data structures
These contracts:
- Allow clear expectations for inputs/outputs.
- Enable automated validation, code generation, and testing.
- Make it easier to swap out models/tools without rewriting your code.
- Encourage tool-agnostic architecture rather than SDK lock-in.
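As a small illustration, a response contract expressed as JSON Schema can be checked mechanically at the boundary instead of trusted. The schema and field names below are made up; the validation itself uses the jsonschema package.

```python
# Validate a provider's response against an explicit contract instead of
# trusting an SDK's return type. Schema and field names are hypothetical.
from jsonschema import validate
from jsonschema.exceptions import ValidationError

RESPONSE_SCHEMA = {
    "type": "object",
    "required": ["text", "finish_reason"],
    "properties": {
        "text": {"type": "string"},
        "finish_reason": {"enum": ["stop", "length", "error"]},
        "usage": {
            "type": "object",
            "properties": {
                "prompt_tokens": {"type": "integer"},
                "completion_tokens": {"type": "integer"},
            },
        },
    },
}

def check_response(payload: dict) -> dict:
    """Fail loudly at the boundary if the response breaks the contract."""
    try:
        validate(instance=payload, schema=RESPONSE_SCHEMA)
    except ValidationError as err:
        raise ValueError(f"Response violates contract: {err.message}") from err
    return payload
```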
Build for failure, not just happy paths
Most current AI systems assume everything works smoothly (“happy path”). But in reality:
- Models time out
- APIs return vague errors
- Outputs may be malformed or unsafe
A truly composable system should:
- Provide explicit error types (e.g., `RateLimitError`, `ModelTimeout`, `ValidationFailed`)
- Expose retry and fallback mechanisms natively (not hand-rolled)
- Offer built-in observability—metrics, logs, traces
- Make failure handling declarative and modular (e.g., try model B if model A fails)
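Here is a rough sketch of what that could look like: typed errors plus a small fallback helper. The class names and the helper are hypothetical, not any framework’s API.

```python
# Hypothetical error taxonomy and fallback helper; not any framework's API.
class ModelError(Exception): ...
class RateLimitError(ModelError): ...
class ModelTimeout(ModelError): ...
class ValidationFailed(ModelError): ...

def with_fallback(primary, fallback):
    """Try the primary model; fall back only on declared, recoverable errors."""
    def call(prompt: str) -> str:
        try:
            return primary(prompt)
        except (RateLimitError, ModelTimeout):
            # Typed failures make the fallback decision explicit and testable.
            return fallback(prompt)
    return call

# e.g. generate = with_fallback(model_a.predict, model_b.predict)
```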
Shift toward declarative pipelines
Today, most AI workflows are written in procedural code:
```python
response = model.generate(prompt)
if response.score > 0.8:
    store(response)
```
But this logic is hard to:
- Reuse across tools
- Observe or debug
- Cache intermediate results
A declarative pipeline describes the what, not the how:
```yaml
pipeline:
  - step: generate
    model: gpt-4
    input: ${user_input}
  - step: filter
    condition: score > 0.8
  - step: store
    target: vector_database
```
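A spec like this only pays off if something interprets it. Below is a deliberately tiny, hypothetical runner, just to show that the steps become data and the handlers become swappable; the step names, handlers, and registry are illustrative, not a real framework.

```python
# Hypothetical runner for a spec shaped like the YAML above.
from typing import Callable

STEP_REGISTRY: dict[str, Callable[..., dict]] = {}

def step(name: str):
    """Register a handler for a named pipeline step."""
    def register(fn: Callable[..., dict]) -> Callable[..., dict]:
        STEP_REGISTRY[name] = fn
        return fn
    return register

@step("generate")
def generate(state: dict, model: str, input: str) -> dict:
    # Placeholder: a real handler would call the named provider here.
    state["response"] = f"[{model} output for {input!r}]"
    state["score"] = 0.9
    return state

@step("filter")
def filter_by_score(state: dict, condition: str) -> dict:
    # A real runner would parse `condition`; hard-coded here for brevity.
    state["passed"] = state.get("score", 0.0) > 0.8
    return state

@step("store")
def store(state: dict, target: str) -> dict:
    if state.get("passed"):
        print(f"storing response in {target}")
    return state

def run(pipeline: list[dict], state: dict) -> dict:
    """Execute each step in order, threading accumulated state through."""
    for spec in pipeline:
        kwargs = dict(spec)
        handler = STEP_REGISTRY[kwargs.pop("step")]
        state = handler(state, **kwargs)
    return state
```

Swapping gpt-4 for a local model then means editing the spec’s model field, not hunting down call sites.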
Benefits of declarative pipelines:
- Easier to optimize and cache
- Tool-agnostic, works across providers
- More maintainable and easier to reason about
- Supports dynamic reconfiguration instead of rewrites
Key takeaways for developers
1. Be skeptical of “seamless” tools without contracts
Be skeptical of tools that promise seamless plug-and-play but lack strong interface contracts.
If a tool markets itself as easy to integrate but doesn’t offer:
- A clear interface contract (OpenAPI, Protobuf, JSON Schema)
- Versioned APIs
- Validation rules for input/output
- Language-agnostic interfaces
Then the “plug-and-play” claim is misleading. These tools often lock you into an SDK and hide the true cost of integration.
2. Design defensively
Design your workflows defensively: isolate components, standardize formats, and expect things to break.
Good system design assumes things will fail.
- Isolate responsibilities: e.g., don’t mix prompting, retrieval, and evaluation in one block of code.
- Standardize formats: Use common schemas across tools (e.g., JSON-LD, shared metadata, or LangChain-style message objects).
- Handle failures: Build with fallbacks, timeouts, retries, and observability from the start.
Tip: Treat every tool like an unreliable network service, even if it’s running locally.
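One way to act on that tip is to put a single defensive wrapper around every tool call. A minimal sketch, with retry counts and delays as placeholders (timeouts still belong on the underlying client):

```python
# Hypothetical defensive wrapper: bounded retries with backoff and a log line,
# even for tools that run locally. Values are placeholders, not advice.
import logging
import random
import time

log = logging.getLogger("tools")

def call_defensively(fn, *args, retries: int = 3, base_delay: float = 1.0, **kwargs):
    """Retry a flaky tool call with exponential backoff and jitter."""
    for attempt in range(1, retries + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as err:  # tools fail in ways they never document
            log.warning("%s failed (attempt %d/%d): %s",
                        getattr(fn, "__name__", "tool"), attempt, retries, err)
            if attempt == retries:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random())
```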
3. Prefer declarative, interoperable pipelines
Embrace declarative and interoperable approaches: less code, more structure.
Declarative tools (e.g., YAML workflows, JSON pipelines) offer:
- Clarity: You describe what should happen, not how.
- Modularity: You can replace steps without rewriting everything.
- Tool-neutrality: Works across providers or frameworks.
This is the difference between wiring by hand and using a circuit board. Declarative systems give you predictable interfaces and reusable components.
Examples:
- LangGraph
- Flowise
- PromptLayer + OpenAPI specs
- Tools that use JSON as input/output with clear schemas
Conclusion
We’ve all seen what’s possible: modular pipelines, reusable components, and AI systems that don’t break every time you swap a model or change a backend. But let’s be honest, we’re not there yet. And we won’t get there just by waiting for someone else to fix it. If we want a future where AI workflows are truly composable, it’s on us, the people building and maintaining these systems, to push things forward.
That doesn’t mean reinventing everything. It means starting with what we already control: write clearer contracts, document your internal pipelines like someone else will use them (because someone will), choose tools that embrace interoperability, and speak up when things are too tightly coupled. The tooling landscape doesn’t change overnight, but with every decision we make, every PR we open, and every story we share, we move one step closer to infrastructure that’s built to last, not just duct-taped together.