Every multi-agent tutorial ends at the same place: the demo works. The agents coordinate. The research agent hands off to the writing agent. The output looks right. You push to main.
Then production happens.
The agent spends $40 on a single malformed request. The supervisor loops 200 times because the sub-agent's response doesn't match the expected format. Two agents update the same state simultaneously. The system produces confident-sounding wrong answers with no indication that anything went wrong. You don't find out for three days because you have no observability.
This is the guide for what comes after the tutorial: the hard problems in multi-agent production systems, what causes them, and the patterns that address them.
The Coordination Problem
Multi-agent systems fail to coordinate in predictable ways. Understanding the failure modes is the first step to preventing them.
State sharing conflicts. Multiple agents reading and writing shared state concurrently is the classic distributed systems problem, applied to agents. LangGraph's state model uses reducers (functions that define how to merge state updates from multiple nodes) to handle this. If you're not thinking about reducers, you'll hit silent data races.
```python
from typing import Annotated, TypedDict
import operator

class MultiAgentState(TypedDict):
    # Wrong: `findings: list[str]` (last writer wins, earlier
    # findings silently discarded).
    # Right: a reducer accumulates findings from all agents.
    findings: Annotated[list[str], operator.add]

    # Wrong: `last_message: str` (parallel agents overwrite
    # each other's message).
    # Right: define merge behavior explicitly.
    messages: Annotated[list[dict], operator.add]
```
The Annotated type with a reducer is how LangGraph handles parallel agent updates. Without it, you're relying on execution order for correctness, which you don't control.
Handoff ambiguity. Agents hand off tasks to each other via the supervisor routing logic. The handoff fails silently when the sub-agent's output doesn't clearly signal completion vs. needing more work. Define explicit completion signals.
```python
from enum import Enum
from pydantic import BaseModel

class AgentStatus(str, Enum):
    COMPLETE = "complete"
    NEEDS_REVIEW = "needs_review"
    FAILED = "failed"
    NEEDS_MORE_CONTEXT = "needs_more_context"

class AgentResult(BaseModel):
    status: AgentStatus
    output: str
    confidence: float  # 0.0 to 1.0
    reason: str  # why this status was chosen

# Force structured output in your sub-agents
def research_agent(state: State) -> dict:
    result = llm.with_structured_output(AgentResult).invoke(
        build_research_prompt(state)
    )
    return {"research_result": result, "status": result.status}
```
When every agent returns a typed result with explicit status, your routing logic becomes reliable. When agents return unstructured text that the supervisor has to interpret, routing becomes a guessing game.
Supervisor loop traps. The supervisor routes to agent A. Agent A returns with needs_more_context. Supervisor routes back to agent A (because it's the relevant agent). Agent A returns needs_more_context again. You've just built an infinite loop.
Every supervisor must have explicit termination conditions that don't depend on sub-agent cooperation:
```python
from langgraph.graph import END

MAX_ITERATIONS = 15

def supervisor_route(state: State) -> str:
    if state["iteration_count"] >= MAX_ITERATIONS:
        return "force_complete"  # hard stop, generate best-effort output
    if state["status"] == AgentStatus.COMPLETE:
        return END
    if state["status"] == AgentStatus.FAILED:
        if state["retry_count"] >= 3:
            return "error_handler"
        return "retry"
    return route_by_task_type(state)
```
Hard limits on iterations, retries, and time are not optional features. They are the minimum requirement for production safety.
The Cost Problem
LLM costs are the budget line that surprises everyone who hasn't run a multi-agent system at scale.
The math is unforgiving. A single agent run that involves a supervisor plus three sub-agents, each with a few tool calls, can easily consume 50,000 tokens. At $3/M input tokens (GPT-4o in 2026), that's $0.15 per run. A thousand runs a day: $150/day, $4,500/month, just for that one workflow.
Multi-agent systems multiply costs in non-obvious ways:
Context accumulation. Each sub-agent receives the full conversation history plus its task context. As the conversation grows, costs grow proportionally. An agent on step 10 of a multi-step workflow carries the weight of all previous steps.
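One way to contain this is to hand each sub-agent a bounded, task-scoped slice of history instead of the full accumulated conversation. A minimal sketch (the `build_subagent_context` name and the 6-message window are illustrative assumptions, not an established API):

```python
def build_subagent_context(
    messages: list[dict], task: str, max_messages: int = 6
) -> list[dict]:
    """Give a sub-agent its task plus only the most recent turns,
    instead of the entire conversation history."""
    recent = messages[-max_messages:]  # bounded slice of history
    return [{"role": "system", "content": f"Your task: {task}"}, *recent]
```

The window size is a tuning knob: too small and sub-agents lose needed context, too large and you are back to paying for the whole history on every hop.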
Redundant reasoning. Multiple agents independently reasoning about the same underlying information. The researcher summarizes. The writer reads the summary. The reviewer reads the summary again. If the summary is 2,000 tokens and three agents process it, that's 6,000 tokens before anyone does any new work.
Tool call inflation. Agents that aren't constrained on tool calls will use them exploratorily. "Let me check this one more time to be sure." Every exploratory tool call costs context tokens for the tool call, the result, and all subsequent steps that carry it.
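A rough sketch of capping that exploration (the state key, the cap value, and the route names here are hypothetical): count tool calls in shared state and have the routing function cut exploration off at a budget.

```python
MAX_TOOL_CALLS = 8  # illustrative per-run cap, tune per workflow

def record_tool_call(state: dict) -> dict:
    """State update that increments the per-run tool-call counter."""
    return {"tool_calls_used": state.get("tool_calls_used", 0) + 1}

def tool_budget_route(state: dict) -> str:
    """Once the budget is spent, stop exploring and synthesize
    from whatever has already been gathered."""
    if state.get("tool_calls_used", 0) >= MAX_TOOL_CALLS:
        return "synthesize"
    return "continue"
```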
The mitigation patterns:
```python
import operator
from typing import Annotated, TypedDict

# Track costs in agent state
class ProductionState(TypedDict):
    messages: Annotated[list[dict], operator.add]
    total_tokens_used: int
    estimated_cost_usd: float
    token_budget: int  # set per workflow based on task complexity

def check_budget(state: ProductionState) -> str:
    if state["total_tokens_used"] >= state["token_budget"]:
        return "over_budget"
    return "continue"

# Use cheaper models for sub-agents that don't need full capability
def triage_agent(state: State) -> dict:
    # A fast, cheap model is enough for classification
    response = cheap_llm.invoke(build_triage_prompt(state))
    return {"task_type": response.task_type}

def synthesis_agent(state: State) -> dict:
    # Use a capable model only for the final synthesis step
    response = capable_llm.invoke(build_synthesis_prompt(state))
    return {"final_output": response.content}
```
Model routing — using cheaper models for classification and routing, expensive models only for the steps that genuinely need their capability — is the highest-leverage cost optimization. A triage agent that decides which specialist to call doesn't need GPT-4o. It needs a reliable classifier.
The Failure Mode Taxonomy
Multi-agent systems fail in four categories. Knowing which failure mode you're dealing with determines the mitigation.
Hard failures. API errors, timeouts, network failures. The tool didn't respond. The LLM call threw an exception. These are detectable and handleable with standard retry logic.
```python
import tenacity
from openai import RateLimitError  # or your provider's equivalent exception

@tenacity.retry(
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_exponential(multiplier=1, min=4, max=10),
    retry=tenacity.retry_if_exception_type((TimeoutError, RateLimitError)),
)
async def call_llm_with_retry(messages: list) -> str:
    return await llm.ainvoke(messages)
```
Soft failures. The LLM responded, but the response is wrong in ways your code doesn't detect. Incorrect reasoning, false information presented confidently, misunderstood instructions. These are the dangerous failures. They pass all your assertions because your assertions check for valid format, not correct content.
The only mitigation for soft failures is evaluation. Not unit tests — evaluation. Does the agent produce correct outputs on representative test cases? This requires building evaluation infrastructure before you have production failures.
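That infrastructure can start very small. A minimal sketch of the shape (the `EvalCase` and `run_evals` names are hypothetical, not a library API): representative inputs paired with graders, run as a pass-rate check.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str
    check: Callable[[str], bool]  # grader: does the output satisfy this case?

def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run the agent over representative cases and return the pass rate.
    Gate deployments on this number dropping, not on unit tests passing."""
    passed = sum(1 for case in cases if case.check(agent(case.input)))
    return passed / len(cases)
```

The graders are the hard part: for some tasks a string or schema check suffices, for others you need an LLM judge, but either way the cases should come from real inputs, not invented ones.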
Cascade failures. Agent A produces a slightly wrong output. Agent B takes that output as ground truth and builds on it. Agent C builds on B's compounded error. By the time the supervisor sees the result, the error is deeply embedded and the output looks plausible. Multi-agent systems amplify errors across hops.
Mitigation: validate outputs at each hop, not just at the end. If agent B receives an input it considers suspicious, it should flag it rather than silently continuing.
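A minimal sketch of that pattern (the validators shown are deliberately crude stand-ins; a real system would use domain-specific checks or an LLM grader, and the function names are hypothetical):

```python
def validate_hop(output: str, min_len: int = 20) -> list[str]:
    """Cheap structural checks run on an upstream agent's output
    before building on it. Returns a list of detected issues."""
    issues = []
    if len(output.strip()) < min_len:
        issues.append("suspiciously short output")
    if "as an ai" in output.lower():
        issues.append("refusal/meta text leaked into output")
    return issues

def agent_b(state: dict) -> dict:
    issues = validate_hop(state["agent_a_output"])
    if issues:
        # Flag upstream rather than silently compounding the error
        return {"status": "needs_review", "issues": issues}
    return {"status": "continue", "issues": []}
```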
Emergent failures. The multi-agent system as a whole produces outputs that none of the individual agents would produce alone — and not in a good way. Two agents with slightly different information produce contradictory conclusions, and the synthesis agent averages them into confident nonsense.
Emergent failures are the hardest to anticipate. The best defense is extensive integration testing with real representative inputs before production deployment.
Observability: What You Actually Need
An unobserved multi-agent system is a liability. You need to know what happened at every step when something goes wrong.
Trace every agent invocation. Every agent call needs: a trace ID that links the whole multi-agent run together, which agent was called, what inputs it received, what outputs it produced, token usage, latency, and whether it succeeded.
```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AgentSpan:
    trace_id: str
    span_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    agent_name: str = ""
    started_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    ended_at: datetime | None = None
    input_tokens: int = 0
    output_tokens: int = 0
    status: str = "running"
    error: str | None = None
    metadata: dict = field(default_factory=dict)

    def complete(self, status: str = "success", error: str | None = None):
        self.ended_at = datetime.now(timezone.utc)
        self.status = status
        self.error = error

# Wrap your agent nodes
def traced_agent(agent_fn, agent_name: str):
    def wrapper(state: State) -> dict:
        span = AgentSpan(
            trace_id=state["trace_id"],
            agent_name=agent_name,
        )
        try:
            result = agent_fn(state)
            span.complete("success")
            return result
        except Exception as e:
            span.complete("error", str(e))
            raise
        finally:
            emit_span(span)  # send to your observability backend
    return wrapper
```
LangSmith for LangGraph systems. If you're using LangGraph, LangSmith (LangChain's observability platform) provides native tracing. You can see the full graph execution — which nodes ran, what state looked like at each step, where time was spent. Set LANGCHAIN_TRACING_V2=true and it works without code changes. Free tier covers most development use.
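Assuming a recent LangChain version, enabling it typically looks like the following (variable names per the LangSmith docs at the time of writing; verify against the current docs for your version):

```shell
# Enable LangSmith tracing for a LangGraph app
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="..."        # from your LangSmith account settings
export LANGCHAIN_PROJECT="my-agents"  # optional: group traces by project
```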
Alert on what matters. Not every metric is worth alerting on. The metrics that predict production problems:
- Average token usage per workflow run (sudden increases signal runaway loops)
- P95 latency (latency spikes correlate with LLM issues or infinite loops)
- Error rate by agent (which specific agent fails most)
- Cost per run over time (drift indicates behavioral changes)
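These checks are cheap to compute over a window of recent runs. A rough sketch (the thresholds and record shape are illustrative placeholders, not recommendations):

```python
def check_alerts(runs: list[dict], baseline_tokens: float) -> list[str]:
    """Scan a window of recent run records ({tokens, latency_s, error})
    for the drift patterns worth alerting on."""
    alerts = []
    avg_tokens = sum(r["tokens"] for r in runs) / len(runs)
    if avg_tokens > 2 * baseline_tokens:  # sudden increase: runaway loops
        alerts.append(
            f"token usage doubled: {avg_tokens:.0f} vs baseline {baseline_tokens:.0f}"
        )
    latencies = sorted(r["latency_s"] for r in runs)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    if p95 > 60:  # latency spike: LLM issues or infinite loops
        alerts.append(f"p95 latency {p95:.1f}s")
    error_rate = sum(1 for r in runs if r["error"]) / len(runs)
    if error_rate > 0.05:
        alerts.append(f"error rate {error_rate:.0%}")
    return alerts
```

In practice you would push these metrics to your existing monitoring stack and alert there rather than hand-rolling the scan, but the point stands: alert on deltas from baseline, not absolute values.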
The Debugging Workflow
When a multi-agent run produces wrong output, this is the sequence:
1. Get the trace ID. Every production run should emit a trace ID you can look up. Without this, debugging is archaeology.
2. Find where the error entered the system. Was agent A's output already wrong, or did it go wrong in the handoff to agent B? Walk the trace backward from the wrong output.
3. Isolate the failing agent. Extract the exact inputs the failing agent received and run it in isolation. Remove all the multi-agent complexity.
4. Is it a prompt problem or a reasoning problem? Prompt problem: the agent doesn't understand what it's supposed to do. Fix the system prompt. Reasoning problem: the agent understands but produces wrong outputs on certain inputs. Add few-shot examples, or add a validation step.
5. Add a regression test. Before shipping the fix, add the failing input to your evaluation suite. Don't let the same input fail twice.
Phase 6 of MindloomHQ's Agentic AI course covers multi-agent systems in depth — supervisor architectures, coordination patterns, production failure handling, and observability for real deployed systems. Explore the curriculum →