The missing third node in LangGraph Plan → Execute → Validate (PEV) workflows — a structured Validator with confidence scoring, per-step retry, and automatic replanning.
I built and deployed multi-agent platforms in production inside a regulated life sciences environment — systems that automate scientific workflows, search tens of millions of research records, and route LLM decisions through quality gates before acting on them.
The standard LangGraph plan-and-execute pattern has two nodes: Plan and Execute. That's fine for demos. In production, we learned quickly that execution quality is not binary — an agent can technically complete a step while producing output that is incomplete, hallucinated, or missing a critical detail. Without a quality gate, those failures propagate silently to the next step.
We added a third node: a Validator that scores every step output (0.0–1.0) against the original intent. Steps below the threshold are retried with the validator's feedback injected into the next attempt. If retries are exhausted, the entire plan is regenerated with the failure context. Nothing moves forward until it passes.
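The validator returns structured output rather than prose to parse. A minimal sketch of the shape, assuming Pydantic field names (the template's real models live in src/pev/state.py; `score` and `feedback` here are assumptions):

```python
from pydantic import BaseModel, Field

class Validation(BaseModel):
    """Hypothetical validator output -- illustrative, not the template's exact model."""
    score: float = Field(ge=0.0, le=1.0,
                         description="Confidence that the step output satisfies the original intent")
    feedback: str = Field(description="One-sentence critique; injected into the next attempt on retry")
```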
This template packages that pattern. It is extracted from real production agent infrastructure, not assembled from documentation.
| Feature | Standard plan-execute | This template |
|---|---|---|
| Planning node | ✓ | ✓ |
| Execution node with tool calls | ✓ | ✓ |
| Validation + confidence score | ✗ | ✓ 0.0 – 1.0 |
| Per-step retry with feedback injection | ✗ | ✓ configurable |
| Automatic replanning on exhausted retries | ✗ | ✓ with failure context |
| Multi-model cost optimisation | ✗ | ✓ haiku/sonnet split |
| Full audit trail (every attempt) | ✗ | ✓ operator.add accumulator |
| Structured outputs (no string parsing) | ✗ | ✓ Pydantic models |
This project is built for AI Architects and Senior Engineers moving beyond demo-grade agents into reliable, production-ready autonomous systems.
Use this as a foundation for new agents. Define your domain-specific tools in PEVConfig, tune the prompts in prompts.py for your use case, and deploy as a microservice or MCP server.
If you have an existing LangGraph project, use this as a reference implementation to add a Validator + Router quality gate to your current Plan-Execute flows (a minimal wiring sketch follows below).
Use the benchmark_reliability.py pattern to test how different validator_model choices (e.g., Haiku vs. Sonnet) affect your "Correction Rate" and total token spend.
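For that retrofit case, the change is one new node and one conditional edge. Below is a minimal wiring sketch against the LangGraph API; the stub nodes, state shape, and 0.85 threshold are illustrative stand-ins, not the template's internals:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict, total=False):
    plan: list
    pending_result: str
    score: float
    feedback: str

# Stub nodes for illustration; in a real graph these call your LLMs.
def planner_node(state: State) -> dict: return {"plan": ["step 1"]}
def executor_node(state: State) -> dict: return {"pending_result": "..."}
def validator_node(state: State) -> dict: return {"score": 0.9, "feedback": "ok"}

def route(state: State) -> str:
    # Simplified gate; the full retry/replan decision tree is diagrammed below.
    return "end" if state["score"] >= 0.85 else "executor"

builder = StateGraph(State)
builder.add_node("planner", planner_node)
builder.add_node("executor", executor_node)
builder.add_node("validator", validator_node)  # the new quality gate
builder.set_entry_point("planner")
builder.add_edge("planner", "executor")
builder.add_edge("executor", "validator")      # every step output gets scored
builder.add_conditional_edges("validator", route, {"executor": "executor", "end": END})
graph = builder.compile()
```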
```bash
git clone https://github.com/ManjunathGovindaraju/langgraph-plan-execute-validate.git
cd langgraph-plan-execute-validate
uv sync
cp .env.example .env   # add your ANTHROPIC_API_KEY
```

Five lines to run:

```python
from pev import create_pev_graph, initial_state, PEVConfig
graph = create_pev_graph(PEVConfig(pass_threshold=0.85))
result = graph.invoke(initial_state("Research the top 3 vector databases"))
print(result["status"]) # "complete"
print(result["step_results"])   # scored audit trail for every step
```

Run an example:

```bash
python examples/research_agent.py # web search + validate
python examples/code_review_agent.py # strict threshold, shows retries
python examples/data_analysis_agent.py   # no tools, LLM reasoning only
```

How the nodes are wired:

```mermaid
graph TD
User(["👤 Caller"])
subgraph PEVGraph["PEV Graph — LangGraph StateGraph"]
direction TB
Planner["🧠 Planner\nclaude-haiku\nStructured JSON output"]
Executor["⚙️ Executor\nclaude-sonnet\nTool-call loop"]
Validator["🔍 Validator\nclaude-haiku\nConfidence score 0–1"]
Router["🔀 Router\nPure Python — no LLM"]
Planner --> Executor
Executor --> Validator
Validator --> Router
end
Tools["🛠️ Tools\n(Tavily, custom, etc.)"]
LangSmith["📊 LangSmith\nObservability"]
User -->|"initial_state(task)"| Planner
    Router -->|"score ≥ threshold · next step"| Executor
    Router -->|"score < threshold · retry_count < max"| Executor
Router -->|"retries exhausted"| Planner
Router -->|"complete / failed"| User
Executor <-->|"tool calls"| Tools
PEVGraph -.->|"traces"| LangSmith
```

The step lifecycle as a state machine:

```mermaid
stateDiagram-v2
[*] --> Planning : graph.invoke()
Planning --> Executing : plan generated
Executing --> Validating : step complete
Validating --> Executing : score ≥ threshold · more steps remain
Validating --> Complete : score ≥ threshold · final step
Validating --> Executing : score < threshold · retry_count < max_retries
Validating --> Planning : score < threshold · retry_count ≥ max_retries · replan_count < max_replans
Validating --> Failed : all limits exhausted
Complete --> [*]
Failed --> [*]
```

The router's decision tree (pure Python, no LLM call):

```mermaid
flowchart TD
A["Router reads score, retry_count,\nreplan_count, current_step_idx"]
B{"score >= pass_threshold?"}
C{"Last step in plan?"}
D["✅ complete\nroute → END"]
E["➡️ advance idx\nreset retry_count\nroute → executor"]
F{"retry_count < max_retries?"}
G["🔁 retry\nincrement retry_count\nroute → executor"]
H{"replan_count < max_replans?"}
I["🔄 replan\nroute → planner"]
J["❌ failed\nset error message\nroute → END"]
A --> B
B -->|"yes"| C
C -->|"yes"| D
C -->|"no"| E
B -->|"no"| F
F -->|"yes"| G
F -->|"no"| H
H -->|"yes"| I
H -->|"no"| J
```
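In code, that tree collapses to one small pure function. An illustrative reconstruction from the diagram (the real implementation lives in src/pev/graph.py; the state keys come from the flowchart, everything else is an assumption):

```python
def make_router(cfg: "PEVConfig"):
    """Build the routing function. Illustrative sketch, not the shipped code."""
    def route(state: dict) -> str:
        if state["score"] >= cfg.pass_threshold:
            if state["current_step_idx"] >= len(state["plan"]) - 1:
                return "end"        # complete: final step passed the gate
            return "executor"       # advance idx, reset retry_count
        if state["retry_count"] < cfg.max_retries:
            return "executor"       # retry the same step with feedback injected
        if state["replan_count"] < cfg.max_replans:
            return "planner"        # regenerate the plan with failure context
        return "end"                # failed: all limits exhausted
    return route
```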
A typical run, end to end:

```mermaid
sequenceDiagram
actor U as Caller
participant P as Planner (haiku)
participant E as Executor (sonnet)
participant T as Tools
participant V as Validator (haiku)
participant R as Router
U->>P: task
P-->>E: plan [step1, step2, step3]
loop Each step
E->>T: tool_call(args)
T-->>E: result
E-->>V: pending_result
V-->>R: score + feedback
alt score ≥ threshold
R-->>E: next step
else retry available
R-->>E: retry + feedback injected
else retries exhausted
R-->>P: replan + failure context
else all limits hit
R-->>U: status=failed
end
end
R-->>U: status=complete · step_results[]
```
This template is designed to work seamlessly with the FastMCP Production Template.
- The Brain (PEV): This repository. It handles high-level reasoning, multi-step planning, and quality-controlled execution.
- The Hands (MCP): Standardized tools, resources, and prompts exposed via FastMCP.
By separating Orchestration (PEV) from Capabilities (MCP), you can update your business logic and data integrations without redeploying your core agent logic.
```python
from langchain_mcp_adapters.tools import load_mcp_tools
from pev import create_pev_graph, PEVConfig
# Load tools from your FastMCP production server
tools = load_mcp_tools("uv", ["run", "path/to/server.py"])
# Orchestrate them with PEV's quality gates
graph = create_pev_graph(PEVConfig(tools=tools, pass_threshold=0.85))
```

Configuration:

```python
from pev import PEVConfig
cfg = PEVConfig(
# Model routing — cheap models for bookkeeping, capable for execution
planner_model = "claude-haiku-4-5-20251001", # structured JSON only
executor_model = "claude-sonnet-4-6", # tool calls + reasoning
validator_model = "claude-haiku-4-5-20251001", # score + one sentence
# Quality gate
pass_threshold = 0.80, # score ≥ this → step passes
# Loop guards
max_retries = 2, # retries per step before escalating to replan
max_replans = 1, # full replanning cycles before marking failed
# Tools available to the executor (planner and validator never see tools)
tools = [TavilySearchResults(max_results=3)],
)
```

| Parameter | Default | Description |
|---|---|---|
| `planner_model` | `claude-haiku-4-5-20251001` | Model for planning — structured output only |
| `executor_model` | `claude-sonnet-4-6` | Model for execution — tool calls + reasoning |
| `validator_model` | `claude-haiku-4-5-20251001` | Model for validation — scoring only |
| `pass_threshold` | `0.80` | Minimum score [0.0–1.0] for a step to pass |
| `max_retries` | `2` | Retries per step before triggering replan |
| `max_replans` | `1` | Full replanning cycles before marking failed |
| `tools` | `[]` | LangChain tools available to the executor |
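As a tuning illustration, a stricter gate for high-stakes work might raise the threshold and allow an extra retry (the values here are made up; the shipped code_review_agent.py example has its own settings):

```python
from pev import PEVConfig

# Illustrative only: tighter pass bar, one extra retry before escalating to replan.
strict_cfg = PEVConfig(pass_threshold=0.90, max_retries=3)
```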
Every attempt — including retries — is preserved in `step_results` via `operator.add`. Nothing is ever overwritten.

```python
result["step_results"]
# [
# StepResult(step="Search for X", score=0.55, attempts=1, feedback="Missing Y"),
# StepResult(step="Search for X", score=0.88, attempts=2, feedback="Good."),
# StepResult(step="Summarise X", score=0.92, attempts=1, feedback="Complete."),
# ]
```

This is the operational signal that matters: you can see exactly where the agent struggled, what feedback it received, and how many attempts each step took.
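The accumulation comes from LangGraph's reducer annotation on the state key. A minimal sketch of the mechanism (the template's actual PEVState in src/pev/state.py carries more fields):

```python
import operator
from typing import Annotated, TypedDict

class PEVState(TypedDict, total=False):
    # Lists returned by nodes are merged with operator.add, so every retry
    # appends a new StepResult instead of overwriting the previous attempt.
    step_results: Annotated[list, operator.add]
```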
The split-model design (cheap models for planning and validation, a capable model only for execution) cuts per-run cost by ~60–70% versus routing everything through the capable model.
```mermaid
graph LR
subgraph Cheap["claude-haiku ~$0.25 / 1M tokens"]
P["Planner\nJSON output only"]
V["Validator\nScore + one sentence"]
end
subgraph Capable["claude-sonnet ~$3 / 1M tokens"]
E["Executor\nTool calls + reasoning"]
end
```
Typical cost for a 3-step task: ~$0.01 vs ~$0.027 if using sonnet throughout.
```bash
# Unit tests — no API calls, runs in ~5 seconds
uv run pytest tests/ -m "not slow" -v
# Integration tests — requires ANTHROPIC_API_KEY
uv run pytest tests/ -m slow -v
```

Test coverage:
| File | What it tests |
|---|---|
| `test_planner.py` | First-plan vs replan, state resets, step injection |
| `test_executor.py` | Context injection, retry feedback, tool-call loop, unknown tools |
| `test_validator.py` | Score/feedback writing, score clamping, StepResult audit trail |
| `test_retry_replan.py` | Every router branch — 12 routing scenarios |
| `test_graph.py` | Config validation, graph compilation, initial_state, dispatch edge |
```text
langgraph-plan-execute-validate/
├── src/pev/
│ ├── __init__.py # Public API: create_pev_graph, initial_state, PEVConfig, PEVState
│ ├── graph.py # StateGraph wiring + router node + _dispatch edge
│ ├── state.py # PEVState TypedDict, StepResult, Status
│ ├── config.py # PEVConfig dataclass with validation
│ ├── prompts.py # All prompt templates (one place, easy to tune)
│ └── nodes/
│ ├── planner.py # Planner node — structured output, replan-aware
│ ├── executor.py # Executor node — tool-call loop, context injection
│ └── validator.py # Validator node — confidence scoring, audit append
├── examples/
│ ├── research_agent.py # Tavily web search
│ ├── code_review_agent.py # Strict threshold, shows retry flow
│ └── data_analysis_agent.py # No tools, LLM reasoning only
├── tests/
│ ├── conftest.py # Mock LLM fixtures, state builders
│ ├── test_planner.py
│ ├── test_executor.py
│ ├── test_validator.py
│ ├── test_retry_replan.py # Router decision tree — most critical tests
│ └── test_graph.py # Compilation, config validation, dispatch
├── docs/
│ └── architecture.md # Full architecture with 10 Mermaid diagrams
└── .github/workflows/
    └── ci.yml                 # Ruff + pytest (unit only) + Codecov
```
Custom validator — override the pass logic entirely:
```python
from pev.nodes.validator import make_validator_node
from pev.config import PEVConfig
# The validator prompt is in pev/prompts.py — edit VALIDATOR_SYSTEM to
# change scoring criteria without touching node logic.
```

Custom tools — any LangChain `BaseTool` or MCP tools:

```python
from langchain_mcp_adapters.tools import load_mcp_tools
# Connect to an MCP server (e.g., your FastMCP production server)
mcp_tools = load_mcp_tools("uv", ["run", "server.py"])
cfg = PEVConfig(tools=mcp_tools)
```

Async execution:

```python
result = await graph.ainvoke(initial_state("Your task"))
```

Manjunath Govindaraju — Principal Software Engineer with 23 years building production systems. Currently focused on AI platform engineering: multi-agent orchestration, MCP servers at scale, async data pipelines, and enterprise Kubernetes deployments.
LinkedIn · GitHub · fastmcp-production-template
MIT — use freely in production, no attribution required.