Schema-Aware Text-to-SQL Transformer

A research-grade implementation of a natural language to SQL query generator, fine-tuned on the Spider dataset using T5-Small with a custom schema linking layer.

Results

Model	Epochs	Schema Linking	Exact Match (Spider Dev)
Baseline	3	No	18.96%
Fine-tuned	10	No	25.34%
+ Schema Linking	20	Yes	28.24%

Evaluated on 1,034 examples from the Spider development set using exact match accuracy after SQL normalization.

Architecture

Natural Language Question
        +
Database Schema (tables.json)
        │
        ▼
┌─────────────────────────┐
│   Schema Linker          │  ← Rule-based token matching
│   src/schema_linker.py  │     Highlights matched tokens
└─────────────────────────┘
        │
        ▼
  "question: What is the average [age] of [students]?
   | context: [students] : student_id , name , [age] , gpa"
        │
        ▼
┌─────────────────────────┐
│   T5-Small Encoder       │  ← Bidirectional attention
│   (60M parameters)       │     Full input understanding
└─────────────────────────┘
        │
        ▼
┌─────────────────────────┐
│   T5-Small Decoder       │  ← Cross-attention over encoder
│                          │     Generates SQL token by token
└─────────────────────────┘
        │
        ▼
  SELECT avg(age) FROM students WHERE age > 20

Why Encoder-Decoder over GPT-style?

T5's encoder builds a complete bidirectional representation of the entire input before the decoder generates a single token. This means when generating students, the model has already processed the full schema and question — including what comes after students in the SQL. A decoder-only model like GPT generates left-to-right with no lookahead, making it structurally weaker for SQL generation where table and column choices depend on the full query structure.

Schema Linking

The key research contribution of this project is the schema linking pre-processing layer (src/schema_linker.py). Before the model sees the input:

Token matching — scan question tokens against schema tokens using exact string matching after normalization
Dual highlighting — matched tokens are wrapped in [brackets] on both the question side and schema side
Copy signal — the model learns to copy bracketed tokens into the output SQL rather than generating column names from memory

This directly addresses the hallucination problem in Text-to-SQL — the model copies column names from the provided schema rather than recalling them from pre-training weights.

Project Structure

text2sql-transformer/
│
├── data/
│   └── raw/
│       ├── train_spider.json    # Spider training set (7,000 examples)
│       ├── dev.json             # Spider dev set (1,034 examples)
│       └── tables.json          # Database schema definitions
│
├── models/
│   └── t5-text2sql-v1/          # Saved model weights and tokenizer
│
├── src/
│   ├── dataset.py               # SpiderDataset + schema serialization
│   ├── schema_linker.py         # Schema linking pre-processing layer
│   └── model.py                 # T5ForConditionalGeneration wrapper
│
├── notebooks/
│   └── 01_data_exploration.ipynb
│
├── configs/
│   └── config.yaml              # Hyperparameters
│
├── train.py                     # Training script
├── eval.py                      # Evaluation script (exact match)
├── predict.py                   # Interactive inference CLI
└── requirements.txt

Setup

Requirements: Python 3.8+, 4GB RAM minimum

git clone https://github.com/YOUR_USERNAME/text2sql-transformer.git
cd text2sql-transformer
pip install -r requirements.txt

Download Spider dataset from https://yale-lily.github.io/spider and place train_spider.json, dev.json, and tables.json in data/raw/.

Usage

Training

python train.py

Trains T5-Small on Spider for 20 epochs with schema linking enabled. On a T4 GPU, this takes approximately 2 hours.

Evaluation

python eval.py

Evaluates on the Spider dev set (1,034 examples) and reports exact match accuracy. Results saved to eval_results.json.

Interactive Inference

python predict.py

Enter Question: How many singers are there?
Enter Schema: table: singer | columns: singer_id, name, country, age

Predicted SQL: SELECT count(*) FROM singer

Enter Question: Find the average age of students older than 20.
Enter Schema: table: students | columns: student_id, name, age, gpa

Predicted SQL: SELECT avg(age) FROM students WHERE age > 20

Sample Predictions

Question	Predicted SQL	Correct
How many singers are there?	`SELECT count(*) FROM singer`	✅
What are the names of all students?	`SELECT name FROM students`	✅
Find the average age of students older than 20.	`SELECT avg(age) FROM students WHERE age > 20`	✅
What are the distinct countries singers are from?	`SELECT DISTINCT country FROM singer`	✅

Error Analysis

The primary failure mode is multi-table JOIN queries, which require reasoning across multiple tables simultaneously. T5-Small (60M parameters) lacks the capacity for this task — state-of-the-art models use T5-3B (3B parameters). Single-table queries (SELECT, WHERE, GROUP BY, aggregates) are handled reliably.

The strict exact match metric also penalizes semantically correct predictions with minor formatting differences (e.g., extra spaces, quote style), so real semantic accuracy is approximately 5-8% higher than reported exact match.

Hyperparameters

Parameter	Value
Model	T5-Small (60M params)
Optimizer	AdamW
Learning Rate	5e-5
Batch Size	4
Epochs	20
Max Input Length	512 tokens
Max Output Length	128 tokens
Beam Search	4 beams

Tech Stack

Model: HuggingFace Transformers (T5-Small)
Training: PyTorch + AdamW
Dataset: Spider (Yale Linguistic Group)
Evaluation: Exact Match Accuracy
Hardware: Google Colab T4 GPU

References

Yu et al. (2018). Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task
Raffel et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Guo et al. (2019). Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation

Author

Alahari Tarak Ram B.Tech, IIITDM Kancheepuram alaharitarak@gmail.com | LinkedIn

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
src		src
.gitignore		.gitignore
README.md		README.md
Text2SQL_Transformer.pdf		Text2SQL_Transformer.pdf
Text2SQL_Transformer.tex		Text2SQL_Transformer.tex
eval.py		eval.py
predict.py		predict.py
train.py		train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Schema-Aware Text-to-SQL Transformer

Results

Architecture

Why Encoder-Decoder over GPT-style?

Schema Linking

Project Structure

Setup

Usage

Training

Evaluation

Interactive Inference

Sample Predictions

Error Analysis

Hyperparameters

Tech Stack

References

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Schema-Aware Text-to-SQL Transformer

Results

Architecture

Why Encoder-Decoder over GPT-style?

Schema Linking

Project Structure

Setup

Usage

Training

Evaluation

Interactive Inference

Sample Predictions

Error Analysis

Hyperparameters

Tech Stack

References

Author

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages