AgentSQL is a production-grade, asymmetric multi-agent framework designed to solve the Text-to-SQL dilemma: Balancing high Execution Accuracy (EX) with cost-efficiency.
By decoupling the high-volume Generation task from the complex Correction/Reasoning task, AgentSQL achieves state-of-the-art results on the BIRD benchmark while maintaining a significantly lower inference cost compared to monolithic frontier model approaches.
AgentSQL utilizes an Asymmetric Multi-Agent Architecture (MasterPipeline). The workflow strictly isolates offline pre-processing from online inference, allowing for specialized model selection and optimized token usage at each step.
Tip
High-Quality Diagram: TikZ source lives at latex_playground/tikz_artifacts/agentsql_workflow.tex.
- Phase 1: CHESS Pruning (
tools/chess_linker.py): Offline semantic filtering using lightweight embedding models (e.g.,bge-small) to isolate only the most relevant tables and eliminate schema noise. - Phase 2: MCI-SQL Enrichment (
tools/mci_sql_pipeline.py): Extracts precise metadata (cardinalities, min/max values, exact row samples) from the pruned schema to build a high-fidelity context. - Phase 4a/b: Generator & Reflector (
tools/master_pipeline.py): The core generation loop. An optimized open-source model (e.g.,gpt-oss-120borllama-4-scout-17b) generates the SQL, which is immediately evaluated by a Reflector for logical self-consistency via back-translation. - Phase 4c: Resilient Critic (
nodes/corrector.py): Activated only if the Execution Sandbox detects a syntax error or the Reflector detects a logical mismatch. Powered by a high-reasoning model (e.g.,gemini-2.5-flash), it performs targeted patching using the MAGIC checklist.
- 🛡️ Ephemeral Sandboxing: Native support for SQLite, MySQL, and PostgreSQL with automatic state reset and set-based result comparison.
- 🔄 Round-Robin Key Rotation: The
KeyRotatorabstraction supports multiple API keys per provider to prevent rate-limiting during large-scale evaluations. - 🔌 Resilient LLM Factory: Automatic fallback to local Ollama instances if all cloud API keys are exhausted or unavailable.
- 📊 Unified Research Suite: A centralized evaluation engine that calculates EX, VES, and Soft F1 metrics in a single pass.
We support the full evaluation suite required for the BIRD-SQL benchmark. To ensure robust mathematical alignment with the benchmark, our evaluation engine computes:
Execution Accuracy measures the proportion of questions where the predicted SQL query returns the exact same result set as the ground-truth SQL query.
Where:
-
$N$ is the total number of evaluation samples. -
$V(Y_i)$ is the execution result set of the ground-truth SQL$Y_i$ . -
$V(\hat{Y}_i)$ is the execution result set of the predicted SQL$\hat{Y}_i$ . -
$\mathbb{I}(\cdot)$ is the indicator function, returning$1$ if the condition is true and$0$ otherwise.
Valid Efficiency Score evaluates the computational efficiency of the valid generated SQL queries, measuring execution speed relative to the human-written ground truth.
Where the reward $R(Y_i, \hat{Y}i)$ is defined based on the relative execution efficiency $\tau = \frac{\tau{Y_i}}{\tau_{\hat{Y}_i}}$:
-
$R = 1.25$ if$\tau \geq 2$ (Predicted is at least 2x faster) -
$R = 1.00$ if$1 \leq \tau < 2$ (Predicted is faster or equal) -
$R = 0.75$ if$0.5 \leq \tau < 1$ (Predicted is slightly slower) -
$R = 0.50$ if$\tau < 0.5$ (Predicted is significantly slower)
Soft F1 acts as a proxy for partial correctness. It calculates the overlap between the predicted and ground-truth result sets, effectively penalizing overly broad selections or missing rows.
Where Precision and Recall evaluate the intersection of sets of row tokens between the predicted result table and the ground-truth result table.
Note
Recent evaluations of the MasterPipeline on the BIRD Mini-Dev dataset demonstrate highly competitive Execution Accuracy (EX) while significantly reducing API costs compared to monolithic GPT-4/Claude-3 setups.
Populate your .env file with multiple keys for high-concurrency evaluation:
cp .env.example .env
# Fill GEMINI_API_KEY_1, GEMINI_API_KEY_2, GROQ_API_KEY_1, etc.The framework is fully containerized for reproducibility:
make build
make up
make shellBuild the CHESS FAISS index once (required for MasterPipeline): make build-index. Optional smoke test for the LangGraph agent: make smoke.
MasterPipeline on Mini-Dev:
make eval-master NUM_SAMPLES=20LangGraph-only evaluation:
make eval-langgraph NUM_SAMPLES=20.
├── research/ # Evaluators (LangGraph + MasterPipeline), metrics, SoTA compare
├── src/
│ ├── text2sql_agent/ # LangGraph workflow, tools, MasterPipeline
│ ├── build_offline_index.py # CHESS FAISS index builder
│ └── smoke_test_agent.py # One-shot graph smoke test
├── scripts/ # Dataset download helpers
├── data_minidev/ # BIRD mini-dev (gitignored; use make pull-data)
├── Makefile
└── docker-compose.yml
Implemented with ❤️ by the HCMUS Underdogs team. Dedicated to scaling agentic AI workflows with rigor and resilience.
