There’s a weird contradiction in modern AI development. We have better tools than ever. We’re building smarter systems with cleaner abstractions. And yet, every time you try to swap out a component in your stack, things fall apart. Again.
This isn’t just an inconvenience. It’s become the norm.
You’d think with all the frameworks and libraries out there (LangChain, Hugging Face, MLflow, Airflow) we’d be past this by now. These tools were supposed to make our workflows modular and composable. Swap an embedding model? No problem. Try a new vector store? Easy. Switch from OpenAI to an open-source LLM? Go ahead. That was the dream.
But here’s the reality: we’ve traded monoliths for a brittle patchwork of microtools, each with its own assumptions, quirks, and “standard interfaces.” And every time you replace one piece, you end up chasing down broken configs, mismatched input/output formats, and buried side effects in some YAML file you forgot existed.
Tooling was supposed to be the glue. But most days, it still feels like duct tape.
The composability myth
A lot of the tooling that’s emerged in AI came with solid intentions. Follow the UNIX philosophy. Build small pieces that do one thing well. Expose clear interfaces. Make everything swappable.
In theory, this should’ve made experimentation faster and integration smoother. But in practice, most tools were built in isolation. Everyone had their own take on what an embedding is, how prompts should be formatted, what retry logic should look like, or how to chunk a document.
So instead of composability, we got fragmentation. Instead of plug-and-play, we got “glue-and-hope-it-doesn’t-break.”
And this fragmentation isn’t just annoying; it slows everything down. Want to try a new RAG strategy? You might need to re-index your data, adjust your chunk sizes, tweak your scoring functions, and migrate your vector DB schema. None of that should be necessary. But it is.
The stack is shallow and wide
AI pipelines today span a bunch of layers:
- Data ingestion
- Feature extraction or embeddings
- Vector storage and retrieval
- LLM inference
- Orchestration (LangChain, LlamaIndex, etc.)
- Agent logic or RAG strategies
- API / frontend layers
Each one looks like a clean block on a diagram. But under the hood, they’re often tightly coupled through undocumented assumptions about tokenization quirks, statefulness, retry behavior, latency expectations, etc.
The result? What should be a flexible stack is more like a house of cards. Change one component, and the whole thing can wobble.
Why everything keeps breaking
The short answer: abstractions leak — a lot.
Every abstraction simplifies something. And when that simplification doesn’t match the underlying complexity, weird things start to happen.
Take LLMs, for example. You might start with OpenAI’s API and everything just works. Predictable latency, consistent token limits, clean error handling. Then you switch to a local model. Suddenly:
- The input format is different
- You have to manage batching and GPU memory
- Token limits aren’t well documented
- Latency increases dramatically
- You’re now in charge of quantization and caching
What was once a simple `llm.predict()` call becomes a whole new engineering problem. The abstraction has leaked, and you’re writing glue code again.
This isn’t just a one-off annoyance. It’s structural. We’re trying to standardize a landscape where variability is the rule, not the exception.
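To make that concrete, here is a rough sketch of the glue you end up writing. The class names and the local-model details are hypothetical, not any provider’s or library’s actual API; the point is how much machinery suddenly lives behind the same predict() call.

```python
# Hypothetical wrapper classes; names and details are illustrative,
# not any provider's or library's real API.
from dataclasses import dataclass

@dataclass
class Completion:
    text: str
    tokens_used: int

class HostedLLM:
    """One HTTP call; batching, memory, and caching are the provider's problem."""
    def __init__(self, client):
        self.client = client

    def predict(self, prompt: str) -> Completion:
        raw = self.client.complete(prompt)  # assumed client method
        return Completion(raw["text"], raw["usage"]["total_tokens"])

class LocalLLM:
    """The same predict() now owns context limits, batching, and decoding."""
    def __init__(self, model, tokenizer, max_context: int = 4096):
        self.model = model
        self.tokenizer = tokenizer
        self.max_context = max_context

    def predict(self, prompt: str) -> Completion:
        ids = self.tokenizer.encode(prompt)
        if len(ids) > self.max_context:       # context handling is yours now
            ids = ids[-self.max_context:]
        # batching, GPU memory, quantization, and caching are also yours
        out_ids = self.model.generate(ids)
        return Completion(self.tokenizer.decode(out_ids), len(out_ids))
```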
Where are the standards?
One big reason for the current mess is the lack of solid standards for interoperability.
In other fields, we’ve figured this out:
- Containers → OCI, Docker
- APIs → OpenAPI
- Observability → OpenTelemetry
- Data formats → Parquet, JSON Schema, Avro
In AI? We’re not there yet. Most tools define their own contracts. Few agree on what’s universal. And as a result, reuse is hard, swapping is risky, and scaling becomes painful.
Look at where AI tooling stands today:
- There’s still no widely adopted standard for model I/O signatures.
- Prompt formats, context windows, and tokenizer behavior vary across providers.
- We do see promising efforts like MCP (Model Context Protocol) emerging, and that’s a good sign, but in practice, most RAG pipelines, agent tools, and vector store integrations still lack consistent, enforced contracts.
- Error handling? It’s mostly improvised: retries, timeouts, fallbacks, and silent failures become your responsibility.
So yes, standards like MCP are starting to show up, and they matter. But today, most teams are still stitching things together manually. Until these protocols become part of the common tooling stack, supported by vendors and respected across libraries, the glue will keep leaking.
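For a sense of what “a standard for model I/O signatures” could even mean, here is a hypothetical sketch of a shared request/response shape. It is not MCP or any existing spec, just the kind of shape a standard would have to pin down so that tools could agree on it.

```python
# Hypothetical, provider-agnostic I/O signature. Not an existing standard;
# just the kind of contract a standard would need to define.
from typing import Literal, TypedDict

class GenerationRequest(TypedDict):
    prompt: str
    max_tokens: int
    temperature: float

class GenerationResponse(TypedDict):
    text: str
    finish_reason: Literal["stop", "length", "error"]
    prompt_tokens: int
    completion_tokens: int
```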
Local glue ≠ global composability
It’s tempting to say: “But it worked in the notebook.”
Yes, and that’s the problem.
The glue logic that works for your demo, local prototype, or proof-of-concept often breaks down in production. Why?
- Notebooks aren’t production environments—they don’t have retries, monitoring, observability, or proper error surfaces.
- Chaining tools with Python functions is different from composing them with real-time latency constraints, concurrency, and scale in mind.
- Tools like LangChain often make it easy to compose components, until you hit race conditions, cascading failures, or subtle bugs in state management.
Much of today’s tooling is optimized for developer ergonomics during experimentation, not for durability in production. The result: we demo pipelines that look clean and modular, but behind the scenes are fragile webs of assumptions and implicit coupling.
Scaling this glue logic, making it testable, observable, and robust, requires more than clever wrappers. It requires system design, standards, and real engineering discipline.
The core problem: Illusion of modularity
What makes this even more dangerous is the illusion of modularity. On the surface, everything looks composable – API blocks, chain templates, toolkits – but the actual implementations are tightly coupled, poorly versioned, and frequently undocumented.
The AI stack doesn’t break because developers are careless. It breaks because the foundational abstractions are still immature, and the ecosystem hasn’t aligned on how to communicate, fail gracefully, or evolve in sync.
Until we address this, the glue will keep breaking, no matter how shiny the tools become.
Interface contracts, not SDK hype
Many AI tools offer SDKs filled with helper functions and syntactic sugar. But this often hides the actual interfaces and creates tight coupling between your code and a specific tool. Instead, composability means exposing formal interface contracts, like:
- OpenAPI for REST APIs
- Protocol Buffers for efficient, structured messaging
- JSON Schema for validating data structures
These contracts:
- Allow clear expectations for inputs/outputs.
- Enable automated validation, code generation, and testing.
- Make it easier to swap out models/tools without rewriting your code.
- Encourage tool-agnostic architecture rather than SDK lock-in.
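As a small illustration, a response contract expressed as JSON Schema can be checked mechanically at the boundary instead of trusted. The schema and field names below are made up; the validation itself uses the jsonschema package.

```python
# Validate a provider's response against an explicit contract instead of
# trusting an SDK's return type. Schema and field names are hypothetical.
from jsonschema import validate
from jsonschema.exceptions import ValidationError

RESPONSE_SCHEMA = {
    "type": "object",
    "required": ["text", "finish_reason"],
    "properties": {
        "text": {"type": "string"},
        "finish_reason": {"enum": ["stop", "length", "error"]},
        "usage": {
            "type": "object",
            "properties": {
                "prompt_tokens": {"type": "integer"},
                "completion_tokens": {"type": "integer"},
            },
        },
    },
}

def check_response(payload: dict) -> dict:
    """Fail loudly at the boundary if the response breaks the contract."""
    try:
        validate(instance=payload, schema=RESPONSE_SCHEMA)
    except ValidationError as err:
        raise ValueError(f"Response violates contract: {err.message}") from err
    return payload
```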
Build for failure, not just happy paths
Most current AI systems assume everything works smoothly (“happy path”). But in reality:
- Models time out
- APIs return vague errors
- Outputs may be malformed or unsafe
A truly composable system should:
- Provide explicit error types (e.g., `RateLimitError`, `ModelTimeout`, `ValidationFailed`)
- Expose retry and fallback mechanisms natively (not hand-rolled)
- Offer built-in observability—metrics, logs, traces
- Make failure handling declarative and modular (e.g., try model B if model A fails)
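Here is a rough sketch of what that could look like: typed errors plus a small fallback helper. The class names and the helper are hypothetical, not any framework’s API.

```python
# Hypothetical error taxonomy and fallback helper; not any framework's API.
class ModelError(Exception): ...
class RateLimitError(ModelError): ...
class ModelTimeout(ModelError): ...
class ValidationFailed(ModelError): ...

def with_fallback(primary, fallback):
    """Try the primary model; fall back only on declared, recoverable errors."""
    def call(prompt: str) -> str:
        try:
            return primary(prompt)
        except (RateLimitError, ModelTimeout):
            # Typed failures make the fallback decision explicit and testable.
            return fallback(prompt)
    return call

# e.g. generate = with_fallback(model_a.predict, model_b.predict)
```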
Shift toward declarative pipelines
Today, most AI workflows are written in procedural code:
```python
response = model.generate(prompt)
if response.score > 0.8:
    store(response)
```
But this logic is hard to:
- Reuse across tools
- Observe or debug
- Cache intermediate results
A declarative pipeline describes the what, not the how:
```yaml
pipeline:
  - step: generate
    model: gpt-4
    input: ${user_input}
  - step: filter
    condition: score > 0.8
  - step: store
    target: vector_database
```
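A spec like this only pays off if something interprets it. Below is a deliberately tiny, hypothetical runner, just to show that the steps become data and the handlers become swappable; the step names, handlers, and registry are illustrative, not a real framework.

```python
# Hypothetical runner for a spec shaped like the YAML above.
from typing import Callable

STEP_REGISTRY: dict[str, Callable[..., dict]] = {}

def step(name: str):
    """Register a handler for a named pipeline step."""
    def register(fn: Callable[..., dict]) -> Callable[..., dict]:
        STEP_REGISTRY[name] = fn
        return fn
    return register

@step("generate")
def generate(state: dict, model: str, input: str) -> dict:
    # Placeholder: a real handler would call the named provider here.
    state["response"] = f"[{model} output for {input!r}]"
    state["score"] = 0.9
    return state

@step("filter")
def filter_by_score(state: dict, condition: str) -> dict:
    # A real runner would parse `condition`; hard-coded here for brevity.
    state["passed"] = state.get("score", 0.0) > 0.8
    return state

@step("store")
def store(state: dict, target: str) -> dict:
    if state.get("passed"):
        print(f"storing response in {target}")
    return state

def run(pipeline: list[dict], state: dict) -> dict:
    """Execute each step in order, threading accumulated state through."""
    for spec in pipeline:
        kwargs = dict(spec)
        handler = STEP_REGISTRY[kwargs.pop("step")]
        state = handler(state, **kwargs)
    return state
```

Swapping gpt-4 for a local model then means editing the spec’s model field, not hunting down call sites.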
Benefits of declarative pipelines:
- Easier to optimize and cache
- Tool-agnostic, works across providers
- More maintainable and easier to reason about
- Supports dynamic reconfiguration instead of rewrites
Key takeaways for developers
1. Be skeptical of “seamless” tools without contracts
Be skeptical of tools that promise seamless plug-and-play but lack strong interface contracts.
If a tool markets itself as easy to integrate but doesn’t offer:
- A clear interface contract (OpenAPI, Protobuf, JSON Schema)
- Versioned APIs
- Validation rules for input/output
- Language-agnostic interfaces
Then the “plug-and-play” claim is misleading. These tools often lock you into an SDK and hide the true cost of integration.
2. Design defensively
Design your workflows defensively: isolate components, standardize formats, and expect things to break.
Good system design assumes things will fail.
- Isolate responsibilities: e.g., don’t mix prompting, retrieval, and evaluation in one block of code.
- Standardize formats: Use common schemas across tools (e.g., JSON-LD, shared metadata, or LangChain-style message objects).
- Handle failures: Build with fallbacks, timeouts, retries, and observability from the start.
Tip: Treat every tool like an unreliable network service, even if it’s running locally.
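One way to act on that tip is to put a single defensive wrapper around every tool call. A minimal sketch, with retry counts and delays as placeholders (timeouts still belong on the underlying client):

```python
# Hypothetical defensive wrapper: bounded retries with backoff and a log line,
# even for tools that run locally. Values are placeholders, not advice.
import logging
import random
import time

log = logging.getLogger("tools")

def call_defensively(fn, *args, retries: int = 3, base_delay: float = 1.0, **kwargs):
    """Retry a flaky tool call with exponential backoff and jitter."""
    for attempt in range(1, retries + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as err:  # tools fail in ways they never document
            log.warning("%s failed (attempt %d/%d): %s",
                        getattr(fn, "__name__", "tool"), attempt, retries, err)
            if attempt == retries:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random())
```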
3. Prefer declarative, interoperable pipelines
Embrace declarative and interoperable approaches: less code, more structure.
Declarative tools (e.g., YAML workflows, JSON pipelines) offer:
- Clarity: You describe what should happen, not how.
- Modularity: You can replace steps without rewriting everything.
- Tool-neutrality: Works across providers or frameworks.
This is the difference between wiring by hand and using a circuit board. Declarative systems give you predictable interfaces and reusable components.
Examples:
- LangGraph
- Flowise
- PromptLayer + OpenAPI specs
- Tools that use JSON as input/output with clear schemas
Conclusion
We’ve all seen what’s possible: modular pipelines, reusable components, and AI systems that don’t break every time you swap a model or change a backend. But let’s be honest, we’re not there yet. And we won’t get there just by waiting for someone else to fix it. If we want a future where AI workflows are truly composable, it’s on us, the people building and maintaining these systems, to push things forward.
That doesn’t mean reinventing everything. It means starting with what we already control: write clearer contracts, document your internal pipelines like someone else will use them (because someone will), choose tools that embrace interoperability, and speak up when things are too tightly coupled. The tooling landscape doesn’t change overnight, but with every decision we make, every PR we open, and every story we share, we move one step closer to infrastructure that’s built to last, not just duct-taped together.