Optimizing LangGraph Agents: Cutting Latency and Cost Without Sacrificing Reasoning Quality
LangGraph makes it easy to build agents that are correct. It's much less obvious how to make them fast — and at scale, slow is expensive. The good news: in most agents, the latency isn't where teams assume it is, and the biggest wins come from architecture, not from compromising the model.
This guide walks through five patterns for reducing latency and cost in a LangGraph agent while keeping reasoning quality intact. Applied together, they can roughly halve end-to-end latency and meaningfully cut cost — without moving the answer-quality needle.
The real source of latency
The instinct is to blame the model. "The LLM is slow." Sometimes true. Usually not the whole story.
The first step is always to instrument every node in the graph and log per-node duration. A representative profile for a six-node agent — document Q&A with routing and validation — often looks like this:
```text
node: classify_intent   →  820ms  (LLM)
node: route_decision    →  910ms  (LLM)
node: should_retrieve   →  760ms  (LLM)
node: retrieve          →  240ms
node: generate_answer   → 1900ms  (LLM)
node: validate_output   →  680ms  (LLM)
─────────────────────────────────────
total                   → 5310ms
```
The pattern is striking. Five of the six nodes are LLM calls, and three of them are just making decisions: classify, route, should-we-retrieve. Each is a full network round trip to the model, ~800ms each, ~2.5s combined.
The actual reasoning — generate_answer — is only 1.9s. In a profile like this, more time goes into deciding what to do than into doing it.
The expensive part of an agent is rarely the thinking. It's asking a 70B-parameter model to make decisions a plain if statement could make.
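The per-node instrumentation described above could be sketched as a small decorator. This is a minimal illustration, assuming nodes are plain functions that take a state dict; `timed`, `NODE_TIMINGS`, and `report` are hypothetical names, not part of LangGraph, and a real deployment would more likely ship durations to a tracing or metrics backend.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory sink: node name -> list of durations in milliseconds.
NODE_TIMINGS = defaultdict(list)

def timed(name):
    """Wrap a node function so each call records its wall-clock duration."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(state):
            start = time.perf_counter()
            try:
                return fn(state)
            finally:
                # Record even when the node raises, so failures show up in the profile.
                elapsed_ms = (time.perf_counter() - start) * 1000
                NODE_TIMINGS[name].append(elapsed_ms)
        return wrapper
    return decorator

@timed("retrieve")
def retrieve(state):
    time.sleep(0.01)  # stand-in for a vector-store lookup
    return {"docs": ["stub document"]}

def report():
    """Return the mean duration per node in milliseconds."""
    return {name: sum(ms) / len(ms) for name, ms in NODE_TIMINGS.items()}
```

Decorating each function before it is registered as a node gives a per-node breakdown like the profile above without touching the node bodies.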
Pattern 1: Move routing out of the LLM
This is typically the biggest win. Most "routing" decisions in an agent graph are not fuzzy judgment calls — they are deterministic functions of state that's already available.
A common anti-pattern looks like this:
```python
# Anti-pattern: ask the LLM which path to take
def route_decision(state: AgentState) -> dict:
    prompt = f"Given this query, should we use tools or answer directly?\n{state['query']}"
    response = llm.invoke(prompt)  # ~900ms, every single request
    # Nodes return a state update, so the parsed route is written into state.
    return {"route": parse_route(response.content)}

graph.add_node("route_decision", route_decision)
```
LangGraph already provides conditional edges — routing that runs as plain Python, with zero model calls:
```python
# Deterministic routing: 0ms of LLM time
def route(state: AgentState) -> str:
    if state["intent"] == "tool_use":
        return "tools"
    if state["has_context"]:
        return "generate"
    return "retrieve"

graph.add_conditional_edges(
    "classify_intent",
    route,  # a function, not a model
    {"tools": "run_tools", "generate": "generate_answer", "retrieve": "retrieve"},
)
```
Intent classification itself, the one decision that genuinely needs a model, should keep an LLM (right-sized and cached; see below). But pure routing on top of a known intent is an if. It was never a reasoning task. In the profile above, removing route_decision alone recovers ~0.9s; counting the should_retrieve check as well (Pattern 5), LLM-based control flow accounts for ~1.7s.
Pattern 2: Run independent work in parallel
Many agent graphs are authored as a straight line even when steps have no dependency on one another. Those steps run sequentially purely as an artifact of how the graph was wired.
A classic example: fetching context from the vector store and pulling user metadata. Neither needs the other.
```python
# Fan out: both run concurrently, total time = max(a, b), not a + b
graph.add_edge("classify_intent", "retrieve_context")
graph.add_edge("classify_intent", "fetch_user_meta")

# Fan back in: generate waits for both
graph.add_edge("retrieve_context", "generate_answer")
graph.add_edge("fetch_user_meta", "generate_answer")
```
LangGraph executes nodes with satisfied dependencies in parallel automatically. Collapsing a + b into max(a, b) on two ~240ms steps is modest in isolation, but it stacks with everything else — and on heavier branches with three or four independent fetches, it's the difference between 1s and 300ms.
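The a + b versus max(a, b) effect can be seen outside LangGraph with a few lines of standard-library code. The sketch below is only an illustration of the timing principle using threads, not LangGraph's scheduler; `retrieve_context` and `fetch_user_meta` are stub functions that sleep in place of real I/O, and the 200ms delays are made-up values.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def retrieve_context(query):
    time.sleep(0.2)  # stand-in for a vector-store round trip
    return ["doc-1", "doc-2"]

def fetch_user_meta(query):
    time.sleep(0.2)  # stand-in for a user-metadata lookup
    return {"plan": "pro"}

def run_sequential(query):
    """Run both fetches one after the other: elapsed is roughly a + b."""
    start = time.perf_counter()
    docs = retrieve_context(query)
    meta = fetch_user_meta(query)
    return docs, meta, time.perf_counter() - start

def run_parallel(query):
    """Run both fetches concurrently: elapsed is roughly max(a, b)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        docs_future = pool.submit(retrieve_context, query)
        meta_future = pool.submit(fetch_user_meta, query)
        docs, meta = docs_future.result(), meta_future.result()
    return docs, meta, time.perf_counter() - start
```

In LangGraph itself, one thing to watch for is that parallel branches writing to the same state key generally need a reducer on that key; branches that write to distinct keys, as in the fan-out above, avoid the issue.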
Pattern 3: One model size does not fit all nodes
Using the same large model for everything (classification, generation, validation) is a frequent source of avoidable cost and latency. A frontier model deciding "is this a tool query or a chat query?" is a sledgehammer taken to a thumbtack.
A tiered approach assigns the right model to each job:
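```python
# ChatModel is a placeholder for your provider's chat model class.
fast_llm = ChatModel(model="small-fast-model", temperature=0)  # classify, validate
deep_llm = ChatModel(model="large-reasoning-model")            # generate, plan

def classify_intent(state):
    # .content extracts the text, so downstream routing can compare plain strings
    reply = fast_llm.invoke(CLASSIFY_PROMPT.format(q=state["query"]))
    return {"intent": reply.content.strip()}

def generate_answer(state):
    # reasoning stays on the big model
    return {"answer": deep_llm.invoke(build_prompt(state)).content}
```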
A small model can classify in ~200ms instead of ~820ms at a fraction of the token cost, and classification accuracy on a well-built eval set typically stays within ~1% of the large model. The reasoning node keeps the big model. The principle: never trade away the part that actually thinks.
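Before swapping a classification node to a smaller model, it helps to check the accuracy gap on a labeled eval set. The following is a minimal sketch of such a check; `accuracy`, `safe_to_downsize`, the eval examples, the intent labels, and both stub classifiers are hypothetical stand-ins, and the 0.01 threshold simply mirrors the ~1% figure above.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (query, expected intent label)

def accuracy(classify: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples where the classifier returns the expected label."""
    if not examples:
        raise ValueError("eval set is empty")
    correct = sum(1 for query, label in examples if classify(query) == label)
    return correct / len(examples)

def safe_to_downsize(small, large, examples: List[Example], max_drop: float = 0.01) -> bool:
    """True when the small model loses at most max_drop accuracy versus the large one."""
    return accuracy(large, examples) - accuracy(small, examples) <= max_drop

# Tiny made-up eval set; a real one would have hundreds of labeled queries.
EVAL_SET = [
    ("what's the weather in Oslo", "tool_use"),
    ("hi there", "chitchat"),
    ("summarize section 3 of the contract", "document_question"),
    ("book a meeting tomorrow", "tool_use"),
]

def stub_large(query: str) -> str:
    # Stand-in for the large model: always agrees with the labels.
    return dict(EVAL_SET)[query]

def stub_small(query: str) -> str:
    # Stand-in for the small model: misclassifies one example.
    return "chitchat" if query.startswith("book") else dict(EVAL_SET)[query]
```

With a realistically sized eval set, a check like this could run in CI whenever the classification prompt or model changes.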
Pattern 4: Cache the decisions that repeat
With temperature=0, intent classification for the same query is effectively deterministic, and so is the routing that follows. A small cache keyed on the normalized query removes redundant model calls entirely:
```python
from functools import lru_cache

@lru_cache(maxsize=2048)
def _classify(normalized_query: str) -> str:
    return fast_llm.invoke(CLASSIFY_PROMPT.format(q=normalized_query)).content

def classify_intent(state):
    key = " ".join(state["query"].lower().split())  # normalize whitespace + case
    return {"intent": _classify(key)}
```
On workloads with repeated or near-duplicate queries, the hit rate does real work — turning a share of classification calls into ~0ms lookups. Normalize the key (lowercase, collapsed whitespace) so trivial variants share an entry, and keep caching off the genuinely generative nodes, where identical inputs are rare and a stale answer would hurt.
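Whether the cache earns its keep depends on the measured hit rate, which `functools.lru_cache` exposes through `cache_info()`. The sketch below assumes a stub in place of the fast model; `stub_fast_model`, `classify_cached`, `hit_rate`, and the sample queries are hypothetical.

```python
from functools import lru_cache

CALLS = {"model": 0}  # counts how often the (stub) model is actually invoked

def stub_fast_model(prompt: str) -> str:
    CALLS["model"] += 1
    return "tool_use" if "weather" in prompt else "chitchat"

def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share a cache key."""
    return " ".join(query.lower().split())

@lru_cache(maxsize=2048)
def classify_cached(normalized_query: str) -> str:
    return stub_fast_model(normalized_query)

def hit_rate() -> float:
    """Share of lookups served from the cache so far."""
    info = classify_cached.cache_info()
    total = info.hits + info.misses
    return info.hits / total if total else 0.0

# Four queries, but only two distinct normalized keys.
for q in ["What's the weather?", "what's  the WEATHER?", "hello", "Hello "]:
    classify_cached(normalize(q))
```

Note that `lru_cache` is per-process and has no expiry, so a multi-worker deployment would see a lower hit rate per worker than a shared cache would.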
Pattern 5: Gate the expensive steps
It's common to use an LLM to decide whether to perform a step that is itself cheap. If retrieval costs ~240ms but the LLM deciding whether to retrieve costs ~760ms, you're paying three times the cost of the work just to decide whether to do it.
The fix is to gate on cheap signals first and escalate to the model only when a decision is genuinely ambiguous:
```python
def needs_retrieval(state) -> bool:
    if state["intent"] in ("chitchat", "tool_use"):
        return False  # obvious, no model needed
    if len(state["query"].split()) < 3:
        return False  # too thin to ground
    return True  # default to retrieving when unsure
```
This is the agentic-RAG principle in practice: skip retrieval when it can't help, and don't pay an LLM to tell you that.
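The gate above always resolves to a default and never escalates. One way to add the escalation step is a three-valued gate that returns None when the cheap signals do not settle the question. This is a sketch under assumptions: `cheap_gate`, `needs_retrieval_with_escalation`, the `ask_model` callable, and the "document_question" intent label are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional

def cheap_gate(state: dict) -> Optional[bool]:
    """Return True/False when cheap signals decide, None when the case is ambiguous."""
    if state["intent"] in ("chitchat", "tool_use"):
        return False  # obvious, no model needed
    if len(state["query"].split()) < 3:
        return False  # too thin to ground
    if state["intent"] == "document_question":
        return True   # clearly needs grounding
    return None       # ambiguous: let the caller escalate

def needs_retrieval_with_escalation(
    state: dict, ask_model: Callable[[dict], bool]
) -> bool:
    """Use the cheap gate first; call the model only for ambiguous cases."""
    decision = cheap_gate(state)
    if decision is None:
        return ask_model(state)
    return decision
```

The model call then becomes the exception rather than the rule, and its share of traffic can be tracked to see whether the cheap rules need tightening.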
What to leave untouched
This is the part most "speed up your agent" advice skips — and it's why so much of that advice quietly degrades quality. Some things should be left alone on purpose:
- Keep the large model on the reasoning node. Generation is where correctness lives; it should never be swapped down to save milliseconds.
- Keep the self-healing retry loop. When a tool call fails, the agent needs to recover. That loop occasionally costs an extra round trip, but reliability is not where to cut.
- Keep output validation — but it can run on the fast model, because validation is pattern-matching, not deep reasoning.
The governing rule: cut control-flow overhead, never cut reasoning depth. Every millisecond worth removing comes from decisions a computer can make without a language model — not from the model thinking less.
The results
Applied together, these patterns transform the representative profile above:
| Node | Before | After | How |
|---|---|---|---|
| classify_intent | 820ms | 200ms | small model + cache |
| route_decision | 910ms | 0ms | conditional edge |
| should_retrieve | 760ms | 5ms | cheap gate |
| retrieve / meta | 240ms | 240ms | now parallel |
| generate_answer | 1900ms | 1900ms | untouched (the reasoning) |
| validate_output | 680ms | 210ms | small model |
| Total | 5310ms | 2555ms | −52% |
That's roughly 52% faster and ~40% cheaper, with answer-quality evals moving by less than a point.
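The latency arithmetic behind the table can be checked in a few lines. The numbers below are copied from the before and after columns; only the variable names are introduced here.

```python
# Per-node latency in milliseconds, taken from the results table.
before = {
    "classify_intent": 820,
    "route_decision": 910,
    "should_retrieve": 760,
    "retrieve": 240,
    "generate_answer": 1900,
    "validate_output": 680,
}
after = {
    "classify_intent": 200,
    "route_decision": 0,
    "should_retrieve": 5,
    "retrieve": 240,
    "generate_answer": 1900,
    "validate_output": 210,
}

total_before = sum(before.values())  # 5310
total_after = sum(after.values())    # 2555
reduction = (total_before - total_after) / total_before

# Time spent purely on decisions (classify, route, should-we-retrieve).
decision_nodes = ("classify_intent", "route_decision", "should_retrieve")
decisions_before = sum(before[n] for n in decision_nodes)  # 2490
decisions_after = sum(after[n] for n in decision_nodes)    # 205

print(f"{total_before}ms -> {total_after}ms ({reduction:.1%} faster)")
```

The decision nodes shrink from 2490ms to 205ms, while generate_answer stays at 1900ms throughout.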
The takeaway
If a LangGraph agent feels slow, profile it before blaming the model. In most cases the latency lives in decision nodes, not reasoning nodes — and decisions are the cheapest thing in the world to move out of an LLM and into plain code.
The mental model worth keeping:
An LLM should reason. It should not be your router, your feature flag, or your if statement.
Move control flow into the graph where it belongs. Keep the model for the one thing it's irreplaceable at — thinking — and let it do that on the full-size model, every time.