🤖 AgentSQL: Asymmetric Multi-Agent Text-to-SQL

AgentSQL is a production-grade, asymmetric multi-agent framework designed to address the central Text-to-SQL dilemma: balancing high Execution Accuracy (EX) against cost-efficiency.

By decoupling the high-volume generation task from the more complex correction and reasoning task, AgentSQL achieves state-of-the-art results on the BIRD benchmark at a significantly lower inference cost than monolithic frontier-model approaches.

🏗️ Architecture: Asymmetric MasterPipeline

AgentSQL utilizes an Asymmetric Multi-Agent Architecture (MasterPipeline). The workflow strictly isolates offline pre-processing from online inference, allowing for specialized model selection and optimized token usage at each step.

Tip

High-Quality Diagram: TikZ source lives at latex_playground/tikz_artifacts/agentsql_workflow.tex.

Pipeline Phases

Phase 1: CHESS Pruning (tools/chess_linker.py): Offline semantic filtering using lightweight embedding models (e.g., bge-small) to isolate only the most relevant tables and eliminate schema noise.
Phase 2: MCI-SQL Enrichment (tools/mci_sql_pipeline.py): Extracts precise metadata (cardinalities, min/max values, exact row samples) from the pruned schema to build a high-fidelity context.
Phase 4a/b: Generator & Reflector (tools/master_pipeline.py): The core generation loop. An optimized open-source model (e.g., gpt-oss-120b or llama-4-scout-17b) generates the SQL, which is immediately evaluated by a Reflector for logical self-consistency via back-translation.
Phase 4c: Resilient Critic (nodes/corrector.py): Activated only if the Execution Sandbox detects a syntax error or the Reflector detects a logical mismatch. Powered by a high-reasoning model (e.g., gemini-2.5-flash), it performs targeted patching using the MAGIC checklist.
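
The control flow of Phases 4a to 4c can be sketched as below. This is a minimal illustration, not the code in tools/master_pipeline.py: the function `run_asymmetric_loop`, the `Attempt` dataclass, the four callables, and the `max_repairs` parameter are all hypothetical names chosen for the example. It only shows the asymmetry itself, where the inexpensive generator always runs and the expensive critic is called only after a sandbox error or a Reflector mismatch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    sql: str
    used_critic: bool


def run_asymmetric_loop(
    question: str,
    context: str,
    generate: Callable[[str, str], str],        # cheap generator model (assumed interface)
    reflect: Callable[[str, str], bool],        # True if the SQL matches the question
    execute: Callable[[str], Optional[str]],    # returns an error message, or None on success
    critique: Callable[[str, str, str, str], str],  # expensive critic model (assumed interface)
    max_repairs: int = 2,                       # hypothetical repair budget
) -> Attempt:
    """Generate SQL cheaply; call the critic only when a check fails."""
    sql = generate(question, context)
    used_critic = False
    for _ in range(max_repairs):
        error = execute(sql)
        if error is None and reflect(question, sql):
            break  # both checks passed, so the critic is never invoked
        feedback = error or "reflector: SQL does not match the question"
        sql = critique(question, context, sql, feedback)
        used_critic = True
    return Attempt(sql=sql, used_critic=used_critic)
```

With stub callables, a query that executes cleanly and passes reflection returns with `used_critic` set to `False`, which is where the cost saving is expected to come from.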

✨ Key Features

🛡️ Ephemeral Sandboxing: Native support for SQLite, MySQL, and PostgreSQL with automatic state reset and set-based result comparison.
🔄 Round-Robin Key Rotation: The KeyRotator abstraction supports multiple API keys per provider to prevent rate-limiting during large-scale evaluations.
🔌 Resilient LLM Factory: Automatic fallback to local Ollama instances if all cloud API keys are exhausted or unavailable.
📊 Unified Research Suite: A centralized evaluation engine that calculates EX, VES, and Soft F1 metrics in a single pass.
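
To illustrate the round-robin idea behind key rotation, here is a small sketch. The class `RoundRobinKeys` and its `next_key` method are hypothetical stand-ins, not the project's actual KeyRotator interface; the sketch assumes keys are plain strings and that callers may request keys from several threads.

```python
import itertools
import threading
from typing import Iterable


class RoundRobinKeys:
    """Hypothetical stand-in for KeyRotator: hands out API keys in a fixed cycle."""

    def __init__(self, keys: Iterable[str]) -> None:
        self._keys = [k for k in keys if k]  # drop empty entries
        if not self._keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(self._keys)
        self._lock = threading.Lock()  # keep the cycle consistent across threads

    def next_key(self) -> str:
        with self._lock:
            return next(self._cycle)


rotator = RoundRobinKeys(["key-a", "key-b", "key-c"])
picked = [rotator.next_key() for _ in range(4)]
print(picked)  # ['key-a', 'key-b', 'key-c', 'key-a']
```

Spreading consecutive requests across keys in this way keeps any single key's request rate at roughly 1/k of the total for k keys.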

📈 Evaluation Metrics

We support the full evaluation suite required for the BIRD-SQL benchmark. To stay aligned with the benchmark's definitions, our evaluation engine computes:

1. Execution Accuracy (EX)

Execution Accuracy measures the proportion of questions where the predicted SQL query returns the exact same result set as the ground-truth SQL query.

$$ \text{EX} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(V(Y_i) = V(\hat{Y}_i)\right) $$

Where:

$N$ is the total number of evaluation samples.
$V(Y_i)$ is the execution result set of the ground-truth SQL $Y_i$.
$V(\hat{Y}_i)$ is the execution result set of the predicted SQL $\hat{Y}_i$.
$\mathbb{I}(\cdot)$ is the indicator function, returning $1$ if the condition is true and $0$ otherwise.
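
The formula can be made concrete with a short sketch that runs both queries against an in-memory SQLite database and compares the results as sets. The helper names `result_set` and `execution_accuracy` and the toy table `t` are assumptions made for this example, not the project's evaluator; a predicted query that fails to execute is simply counted as a miss.

```python
import sqlite3
from typing import Iterable, Optional, Tuple


def result_set(conn: sqlite3.Connection, sql: str) -> Optional[frozenset]:
    """Return the rows of a query as a set, or None if it fails to execute."""
    try:
        return frozenset(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn: sqlite3.Connection, pairs: Iterable[Tuple[str, str]]) -> float:
    """EX over (ground_truth_sql, predicted_sql) pairs, using set comparison."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = 0
    for gold_sql, pred_sql in pairs:
        gold = result_set(conn, gold_sql)
        pred = result_set(conn, pred_sql)
        if gold is not None and gold == pred:
            hits += 1
    return hits / len(pairs)


conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE t (id INTEGER, name TEXT);"
    "INSERT INTO t VALUES (1, 'a'), (2, 'b');"
)
pairs = [
    ("SELECT name FROM t WHERE id = 1", "SELECT name FROM t WHERE id < 2"),  # same rows
    ("SELECT name FROM t", "SELECT name FROM t WHERE id = 1"),               # missing a row
    ("SELECT id FROM t", "SELECT id FROM missing"),                          # execution error
    ("SELECT id FROM t", "SELECT id FROM t ORDER BY id DESC"),               # same set, other order
]
ex = execution_accuracy(conn, pairs)
print(ex)  # 0.5
```

Because rows are compared as sets, row order and duplicate rows do not affect the outcome in this sketch.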

2. Valid Efficiency Score (VES)

Valid Efficiency Score evaluates the computational efficiency of the valid generated SQL queries, measuring execution speed relative to the human-written ground truth.

$$ \text{VES} = \frac{\sum_{i=1}^{N} \mathbb{I}\left(V(Y_i) = V(\hat{Y}_i)\right) \cdot R(Y_i, \hat{Y}_i)}{\sum_{i=1}^{N} \mathbb{I}\left(V(Y_i) = V(\hat{Y}_i)\right)} $$

Where the reward $R(Y_i, \hat{Y}_i)$ is defined based on the relative execution efficiency $\tau = \frac{\tau_{Y_i}}{\tau_{\hat{Y}_i}}$:

$R = 1.25$ if $\tau \geq 2$ (Predicted is at least 2x faster)
$R = 1.00$ if $1 \leq \tau < 2$ (Predicted is equally fast or up to 2x faster)
$R = 0.75$ if $0.5 \leq \tau < 1$ (Predicted is slightly slower)
$R = 0.50$ if $\tau < 0.5$ (Predicted is significantly slower)
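
As a worked example of the reward tiers and the VES formula as written above, consider the following sketch. The function names `reward` and `valid_efficiency_score` and the sample timings are illustrative assumptions; timings are taken as positive numbers of seconds, and the average runs over correct predictions only, matching the denominator in the formula.

```python
from typing import Iterable, Tuple


def reward(tau: float) -> float:
    """Map the time ratio tau = gold_time / predicted_time to a reward tier."""
    if tau >= 2:
        return 1.25
    if tau >= 1:
        return 1.00
    if tau >= 0.5:
        return 0.75
    return 0.50


def valid_efficiency_score(samples: Iterable[Tuple[bool, float, float]]) -> float:
    """samples: (is_correct, gold_seconds, predicted_seconds) per question."""
    rewards = [reward(gold / pred) for ok, gold, pred in samples if ok]
    return sum(rewards) / len(rewards) if rewards else 0.0


samples = [
    (True, 2.0, 1.0),   # tau = 2.0   -> 1.25
    (True, 1.0, 1.0),   # tau = 1.0   -> 1.00
    (True, 1.0, 1.6),   # tau = 0.625 -> 0.75
    (False, 1.0, 0.1),  # incorrect result: excluded from both sums
    (True, 1.0, 4.0),   # tau = 0.25  -> 0.50
]
ves = valid_efficiency_score(samples)
print(ves)  # 0.875
```

The four correct samples contribute 1.25 + 1.00 + 0.75 + 0.50 = 3.5, and dividing by the four correct predictions gives 0.875.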

3. Soft F1 Score (Semantic F1)

Soft F1 acts as a proxy for partial correctness. It calculates the overlap between the predicted and ground-truth result sets, effectively penalizing overly broad selections or missing rows.

$$ \text{Soft F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

Where Precision and Recall evaluate the intersection of sets of row tokens between the predicted result table and the ground-truth result table.
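
A simplified sketch of the calculation follows. It treats each whole row as one token and compares rows as sets, which is an assumption made for brevity; the project's evaluator and the official BIRD scripts may tokenize rows at a finer granularity. The function name `soft_f1` is hypothetical.

```python
from typing import Hashable, Iterable


def soft_f1(pred_rows: Iterable[Hashable], gold_rows: Iterable[Hashable]) -> float:
    """F1 over the overlap of predicted and ground-truth rows (rows as tokens)."""
    pred, gold = set(pred_rows), set(gold_rows)
    if not pred and not gold:
        return 1.0  # both results are empty, so they agree
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)  # share of predicted rows that are correct
    recall = overlap / len(gold)     # share of ground-truth rows that were found
    return 2 * precision * recall / (precision + recall)


gold = [(1,), (2,), (3,)]
too_broad = [(1,), (2,), (3,), (4,)]  # one extra row: precision 0.75, recall 1.0
too_narrow = [(1,), (2,)]             # one missing row: precision 1.0, recall 2/3
print(round(soft_f1(too_broad, gold), 3))   # 0.857
print(round(soft_f1(too_narrow, gold), 3))  # 0.8
```

An overly broad prediction lowers precision and a prediction with missing rows lowers recall, so both cases score below 1 while still receiving partial credit.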

Note

Recent evaluations of the MasterPipeline on the BIRD Mini-Dev dataset demonstrate highly competitive Execution Accuracy (EX) while significantly reducing API costs compared to monolithic GPT-4/Claude-3 setups.

🚀 Quick Start

1. Environment Setup

Populate your .env file with multiple keys for high-concurrency evaluation:

cp .env.example .env
# Fill GEMINI_API_KEY_1, GEMINI_API_KEY_2, GROQ_API_KEY_1, etc.
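
For readers wondering how numbered keys such as GEMINI_API_KEY_1 and GEMINI_API_KEY_2 might be gathered at runtime, here is one possible approach. The function `collect_keys` is a hypothetical helper, not part of the repository; it assumes the `<PROVIDER>_API_KEY_<n>` naming pattern shown above and skips blank values.

```python
import os
import re
from typing import List, Mapping


def collect_keys(provider: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """Return all non-empty <PROVIDER>_API_KEY_<n> values, ordered by n."""
    pattern = re.compile(rf"^{re.escape(provider)}_API_KEY_(\d+)$")
    found = []
    for name, value in env.items():
        match = pattern.match(name)
        if match and value.strip():
            found.append((int(match.group(1)), value.strip()))
    return [value for _, value in sorted(found)]  # numeric order, so 10 follows 2


example_env = {
    "GEMINI_API_KEY_2": "g2",
    "GEMINI_API_KEY_1": "g1",
    "GEMINI_API_KEY_10": "g10",
    "GEMINI_API_KEY_3": " ",  # left blank in .env, so it is skipped
    "GROQ_API_KEY_1": "q1",
}
gemini_keys = collect_keys("GEMINI", example_env)
print(gemini_keys)  # ['g1', 'g2', 'g10']
```

Sorting by the numeric suffix rather than by variable name keeps the order stable once there are ten or more keys.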

2. Launch with Docker

The framework is fully containerized for reproducibility:

make build
make up
make shell

3. Run Evaluation

Build the CHESS FAISS index once (required for MasterPipeline): make build-index. Optional smoke test for the LangGraph agent: make smoke.

MasterPipeline on Mini-Dev:

make eval-master NUM_SAMPLES=20

LangGraph-only evaluation:

make eval-langgraph NUM_SAMPLES=20

📁 Project Structure

.
├── research/                 # Evaluators (LangGraph + MasterPipeline), metrics, SoTA compare
├── src/
│   ├── text2sql_agent/        # LangGraph workflow, tools, MasterPipeline
│   ├── build_offline_index.py # CHESS FAISS index builder
│   └── smoke_test_agent.py    # One-shot graph smoke test
├── scripts/                   # Dataset download helpers
├── data_minidev/              # BIRD mini-dev (gitignored; use make pull-data)
├── Makefile
└── docker-compose.yml

👥 Authors

Implemented with ❤️ by the HCMUS Underdogs team. Dedicated to scaling agentic AI workflows with rigor and resilience.
