🧠 Text-to-SQL

💬 Ask a question in plain English. 🔍 Get back a SQL query and real results from the database. No SQL knowledge required.

Audience: No prior knowledge of LLMs or vector databases is assumed.


📚 Table of Contents

  1. 🤔 What Does This Project Do?
  2. 😬 The Problem With "Typical" Text-to-SQL
  3. 🛠️ How This Project Solves It
  4. 🏗️ High-Level Architecture
  5. 🔄 Step-by-Step: How a Query Works
  6. ✅ What It Can (and Cannot) Do
  7. 📁 Project Structure
  8. 🗄️ Database Schema
  9. 🚀 Quick Start
  10. ⚙️ Environment Variables

1. 🤔 What Does This Project Do?

This system lets a non-technical user type a question like:

"What are the top 10 product categories by revenue this year?"

…and automatically:

  1. πŸ—ΊοΈ Figures out which database tables are relevant to the question.
  2. ✍️ Asks an LLM (GPT-4o) to write the correct SQL query.
  3. ⚑ Runs that SQL against a real database (Olist Brazilian E-Commerce dataset).
  4. πŸ“Š Returns the results in a clean table in the browser.

The underlying database is a star-schema data warehouse built on the Olist public dataset, which contains real Brazilian e-commerce orders, products, sellers, customers, and reviews.


2. 😬 The Problem With "Typical" Text-to-SQL

When someone first tries to build a text-to-SQL system, the obvious approach is usually:

"I'll just paste the entire database schema into the LLM prompt and ask it to write a query."

This works fine for toy databases with 3–4 tables. In the real world, it breaks down fast 💥:

  • 🌊 Context window overflow: A real warehouse can have 50–200 tables. The full schema (table names, column names, types, foreign keys) can easily exceed the LLM's context limit, even GPT-4o's 128k tokens.
  • 📢 Noise drowns out signal: Even if the schema fits, flooding the prompt with irrelevant tables confuses the model. It might join to the wrong table or use the wrong column because it's overwhelmed.
  • 🔍 Business meaning is lost: Column names like order_total_usd or is_active_member don't tell the LLM how to use them correctly. Should order_total_usd be summed or averaged? Should you filter on order_status = 'delivered' first? The schema alone doesn't say.
  • 👻 Hallucinated SQL: Without enough context, the LLM invents column names or table relationships that don't exist, producing queries that fail at runtime.
  • 🚨 No safety guardrails: A naive implementation has no check to stop the LLM from generating DELETE or DROP TABLE statements if the user's question is phrased ambiguously.

This project solves all of these problems. 🎯


3. πŸ› οΈ How This Project Solves It

Instead of dumping the whole schema into the prompt, we use four key techniques:

📖 Technique 1: Semantic Layer (the "data dictionary")

Every table and column is annotated with business-friendly descriptions in agent/semantic_layer.py. For example:

order_total_usd: "Final post-tax revenue in USD for this line item.
                  Always use this for GMV calculations.
                  Never use freight_value_usd as a revenue proxy."

This tells the LLM how to use each column, not just what type it is. Think of it as a data dictionary that the LLM reads before writing SQL. 📚
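To make this concrete, here is a minimal sketch of what a semantic layer can look like in Python. The dictionary shape and the describe_table helper below are illustrative, not the actual contents of agent/semantic_layer.py:

```python
# Hypothetical shape for a semantic layer: each table maps to a
# business-friendly description plus per-column usage guidance.
SEMANTIC_LAYER = {
    "fact_orders": {
        "description": (
            "One row per order line item. Central fact table for "
            "revenue, order-status, and trend analysis."
        ),
        "columns": {
            "order_total_usd": (
                "Final post-tax revenue in USD for this line item. "
                "Always use this for GMV calculations. Never use "
                "freight_value_usd as a revenue proxy."
            ),
            "order_status": "Lifecycle state, e.g. 'delivered' or 'canceled'.",
        },
    },
}

def describe_table(name: str) -> str:
    """Render one table's entry as prompt-ready text."""
    entry = SEMANTIC_LAYER[name]
    lines = [f"Table {name}: {entry['description']}"]
    for col, doc in entry["columns"].items():
        lines.append(f"  - {col}: {doc}")
    return "\n".join(lines)
```

Rendering the descriptions as plain text like this is what lets the same dictionary feed both the vector index and the LLM prompt.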

🔎 Technique 2: RAG (Retrieval-Augmented Generation)

Instead of sending all table descriptions at once, we:

  1. At startup: Embed every table description into a vector database (ChromaDB) using OpenAI embeddings. Each table becomes a point in high-dimensional space. 📌
  2. At query time: Embed the user's question using the same embedding model. Find the 3 most semantically similar table descriptions using cosine similarity. Inject only those 3 tables into the LLM prompt. 🎯

The user asks about "revenue by category" → we retrieve fact_orders and dim_products → the LLM only sees those tables → cleaner, more accurate SQL. ✨
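The retrieval step boils down to cosine similarity between embedding vectors. The sketch below uses made-up 3-dimensional vectors in place of real 1536-dimensional OpenAI embeddings, and a hand-rolled top-k function standing in for what ChromaDB does internally:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k_tables(question_vec, table_vecs, k=3):
    """Return the k table names whose description vectors sit closest
    to the question vector (this is what the vector store does for us)."""
    scored = sorted(
        table_vecs.items(),
        key=lambda item: cosine_similarity(question_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

# Toy "embeddings" standing in for OpenAI's text-embedding-3-small output.
tables = {
    "fact_orders":  [0.9, 0.1, 0.0],
    "dim_products": [0.8, 0.2, 0.1],
    "dim_reviews":  [0.0, 0.1, 0.9],
}
question = [0.85, 0.15, 0.05]  # pretend this embeds "revenue by category"
print(top_k_tables(question, tables, k=2))
# → ['fact_orders', 'dim_products']
```

The revenue-related tables win because their vectors point in nearly the same direction as the question's vector, while dim_reviews points elsewhere.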

💡 Technique 3: Few-Shot Examples

The prompt includes curated question→SQL example pairs (stored in agent/few_shot_examples.yaml). These teach the LLM the exact SQL dialect, join patterns, and aggregation idioms expected for this specific database, acting as in-context learning.
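In code, in-context learning is mostly string formatting: each stored pair is rendered into the prompt. The example pair and format_few_shot helper below are illustrative, not taken from few_shot_examples.yaml:

```python
# Hypothetical few-shot pairs like those stored in few_shot_examples.yaml.
EXAMPLES = [
    {
        "question": "Total revenue by product category",
        "sql": (
            "SELECT p.category_name, SUM(f.order_total_usd) AS revenue\n"
            "FROM fact_orders f\n"
            "JOIN dim_products p ON p.product_id = f.product_id\n"
            "GROUP BY p.category_name\n"
            "ORDER BY revenue DESC;"
        ),
    },
]

def format_few_shot(examples):
    """Render Q→SQL pairs into a prompt block the LLM can imitate."""
    blocks = [f"Q: {ex['question']}\nSQL:\n{ex['sql']}" for ex in examples]
    return "\n\n".join(blocks)
```

Because the examples demonstrate the exact join keys and aggregation columns of this schema, the model tends to reuse them instead of inventing its own.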

πŸ›‘οΈ Technique 4: HITL Safety Guard

Before any SQL is executed, a Human-In-The-Loop (HITL) guard scans for dangerous keywords (INSERT, UPDATE, DELETE, DROP, etc.). If one is found, execution is blocked and the user must explicitly type CONFIRM in a modal before anything runs. Pure SELECT queries pass through automatically. 🔒
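A guard like this can be as simple as a word-boundary regex scan. The sketch below is a plausible shape for such a check, not the literal contents of hitl_guard.py:

```python
import re

# Keywords that should never run without explicit human approval.
WRITE_KEYWORDS = ("INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "TRUNCATE")

# \b word boundaries ensure a column named last_updated is not flagged
# just because it contains the letters "update".
_WRITE_RE = re.compile(
    r"\b(" + "|".join(WRITE_KEYWORDS) + r")\b", re.IGNORECASE
)

def needs_approval(sql: str) -> bool:
    """True if the SQL contains a write keyword and must be confirmed."""
    return bool(_WRITE_RE.search(sql))
```

A guard this simple errs on the side of caution: it flags write keywords even inside string literals, which is an acceptable trade-off when the fallback is just asking a human to confirm.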


4. πŸ—οΈ High-Level Architecture

┌──────────────────────────────────────────────────────────────┐
│                        Browser (React)                       │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────┐   │
│  │  ChatWindow  │  │  SqlDisplay  │  │   ResultsTable    │   │
│  │ (ask a Q)    │  │ (show SQL)   │  │ (show rows)       │   │
│  └──────┬───────┘  └──────────────┘  └───────────────────┘   │
│         │ POST /query                                        │
└─────────┼────────────────────────────────────────────────────┘
          │
┌─────────▼────────────────────────────────────────────────────┐
│                    FastAPI Backend (Python)                  │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐    │
│  │              sql_chain.py (LCEL pipeline)            │    │
│  │                                                      │    │
│  │  Question                                            │    │
│  │     │                                                │    │
│  │     ▼                                                │    │
│  │  [1] retriever.py ──► ChromaDB (vector store)        │    │
│  │     │   (embed question, find top-3 relevant tables) │    │
│  │     ▼                                                │    │
│  │  [2] Load few_shot_examples.yaml                     │    │
│  │     │                                                │    │
│  │     ▼                                                │    │
│  │  [3] Build ChatPromptTemplate                        │    │
│  │     │   (schema + examples + question)               │    │
│  │     ▼                                                │    │
│  │  [4] GPT-4o (temperature=0) ◄── OpenAI API           │    │
│  │     │   (generate SQL)                               │    │
│  │     ▼                                                │    │
│  │  [5] hitl_guard.py                                   │    │
│  │     │   (block writes, require human approval)       │    │
│  │     ▼                                                │    │
│  │  [6] Execute SQL ──► SQLite / PostgreSQL             │    │
│  │     │                                                │    │
│  │     ▼                                                │    │
│  │  [7] Log to query_log table                          │    │
│  │     │                                                │    │
│  │     ▼                                                │    │
│  │  Return {sql, results, latency_ms}                   │    │
│  └──────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────┘
          │
┌─────────▼─────────────────────────────┐
│        ChromaDB (vector store)        │
│  Table descriptions stored as vectors │
│  Persisted to ./chroma_store/         │
└───────────────────────────────────────┘
          │
┌─────────▼─────────────────────────────┐
│     SQLite (./data/olist.db)          │
│  fact_orders, dim_users,              │
│  dim_products, dim_sellers,           │
│  dim_geography, dim_reviews,          │
│  query_log                            │
└───────────────────────────────────────┘

5. 🔄 Step-by-Step: How a Query Works

Here is exactly what happens when a user types "Which states have the most canceled orders?" and hits Enter:

Step 1 → Embed the question 🔢 The question is converted to a 1536-dimension vector using OpenAI's text-embedding-3-small model.

Step 2 → Retrieve relevant schema (RAG) 🗂️ ChromaDB compares the question vector against all stored table-description vectors and returns the 3 closest matches; here the top hits are fact_orders (which has order_status) and dim_users (which has state).

Step 3 → Build the prompt 📝 A ChatPromptTemplate is assembled with:

  • A system message containing the retrieved table schemas and a few Qβ†’SQL examples.
  • A user message containing the question.

Step 4 → Ask GPT-4o 🤖 The prompt is sent to GPT-4o with temperature=0 (greedy decoding, so the SQL is as close to deterministic as the API allows). The model returns a raw SQL string.

Step 5 → Clean the SQL 🧹 A small regex function strips any markdown code fences (```sql) or stray labels that the model might have included.
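Such a cleanup function might look like this (an illustrative sketch, not the project's actual code):

```python
import re

def strip_code_fences(raw: str) -> str:
    """Remove markdown fences like ```sql ... ``` that LLMs often
    wrap around generated SQL, leaving only the bare statement."""
    cleaned = re.sub(r"^```(?:sql)?\s*", "", raw.strip())
    cleaned = re.sub(r"\s*```$", "", cleaned)
    return cleaned.strip()
```

Plain SQL passes through untouched, so the function is safe to apply unconditionally to every model response.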

Step 6 → HITL safety check 🛡️ The SQL is scanned for dangerous keywords. This is a SELECT query, so it passes automatically.

Step 7 → Execute the SQL ⚡ The query runs against the SQLite database. A LIMIT 1000 clause is injected automatically if one is not already present, so the response stays manageable.
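An illustrative way to inject that LIMIT (a simplified sketch that checks for any LIMIT keyword; it does not distinguish LIMITs inside subqueries):

```python
import re

def ensure_limit(sql: str, max_rows: int = 1000) -> str:
    """Append a LIMIT clause if the query has none, capping result size."""
    stripped = sql.rstrip().rstrip(";")
    if re.search(r"\bLIMIT\s+\d+\b", stripped, re.IGNORECASE):
        return stripped + ";"
    return f"{stripped} LIMIT {max_rows};"
```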

Step 8 → Log and return 📦 The question, generated SQL, latency, and tables used are written to the query_log table for observability. The API returns {sql, results, latency_ms} to the frontend.

Step 9 → Display in the browser 🖥️ React renders the SQL in a syntax-highlighted box and the results as a pageable table.


6. ✅ What It Can (and Cannot) Do

✅ Can Do

  • πŸ’¬ Answer any analytical question about the Olist e-commerce dataset in plain English.
  • πŸ’° Revenue analysis: total, by category, by seller, by month, by state.
  • πŸ‘₯ Customer analysis: active members, geographic distribution, top spenders, cohorts.
  • πŸͺ Seller analysis: rankings, geographic distribution, freight costs.
  • πŸ“¦ Order analysis: status breakdown, cancellation rates, monthly trends.
  • ⭐ Review/NPS analysis: average scores by category, complaint rates.
  • πŸ”€ Complex queries: multi-table joins, CTEs (WITH clauses), window functions.
  • πŸ—£οΈ Explain what SQL it generated and why.
  • 🚧 Block dangerous write operations and ask for human confirmation.

❌ Cannot Do

  • 🚫 Modify data (INSERT/UPDATE/DELETE) without explicit human approval via the confirmation modal.
  • πŸ”’ Query tables or columns outside the defined semantic schema.
  • 🌐 Answer questions about data that isn't in the Olist dataset (e.g., live stock prices).
  • 🎲 Guarantee 100% correct SQL for every possible question β€” LLM output is probabilistic. Always review the generated SQL before trusting results.

7. πŸ“ Project Structure

text-to-sql/
│
├── agent/                      # Core AI pipeline
│   ├── sql_chain.py            # Main LCEL pipeline: question → SQL → results
│   ├── retriever.py            # RAG: embed question, query ChromaDB
│   ├── semantic_layer.py       # Business descriptions for every table/column
│   ├── build_index.py          # One-time script: embed schema into ChromaDB
│   ├── hitl_guard.py           # Safety: block write SQL, require human approval
│   └── few_shot_examples.yaml  # Curated Q→SQL examples for in-context learning
│
├── api/                        # FastAPI web server
│   ├── main.py                 # App factory, CORS, error handling
│   └── routes/
│       ├── query.py            # POST /query: runs the full pipeline
│       ├── schema.py           # GET /schema: returns table descriptions
│       └── health.py           # GET /health: liveness check
│
├── model/                      # SQLAlchemy ORM models
│   ├── database.py             # Engine + session factory
│   └── schema.py               # Table definitions (star schema + query_log)
│
├── frontend/                   # React + TypeScript UI
│   └── src/
│       ├── App.tsx             # Root component
│       ├── api.ts              # HTTP client
│       └── components/
│           ├── ChatWindow.tsx       # Question input box
│           ├── SqlDisplay.tsx       # Syntax-highlighted SQL output
│           ├── ResultsTable.tsx     # Pageable results grid
│           ├── SchemaExplorer.tsx   # Browse available tables/columns
│           └── ApprovalModal.tsx    # HITL confirmation dialog
│
├── data/
│   ├── raw/                    # Raw Olist CSV files
│   └── seed.py                 # Load CSVs → SQLite (run once)
│
├── infra/                      # Deployment scripts (Linux/nginx/systemd)
│
├── requirements.txt            # Python dependencies
└── .env.example                # Copy to .env and fill in your keys

8. πŸ—„οΈ Database Schema

The database uses a star schema, a design pattern common in data warehouses: one central "fact" table holds measurable events, and multiple "dimension" tables hold descriptive attributes. ⭐

                    ┌─────────────┐
                    │  dim_users  │
                    │  user_id PK │
                    │  city       │
                    │  state      │
                    └──────┬──────┘
                           │ FK
┌──────────────┐    ┌──────▼────────────┐    ┌───────────────┐
│ dim_products │    │   fact_orders     │    │  dim_sellers  │
│ product_id PK│◄───│   order_id PK     │───►│  seller_id PK │
│ category_name│    │   user_id FK      │    │  seller_city  │
│ photos_qty   │    │   product_id FK   │    │  seller_state │
└──────────────┘    │   seller_id FK    │    └───────────────┘
                    │   order_total_usd │
                    │   order_status    │    ┌───────────────┐
                    │   created_at      │───►│  dim_reviews  │
                    └───────────────────┘    │  review_id PK │
                                             │  order_id FK  │
                    ┌───────────────────┐    │  review_score │
                    │  dim_geography    │    └───────────────┘
                    │  geo_id PK        │
                    │  zip_code_prefix  │
                    │  city, state      │
                    │  lat, lng         │
                    └───────────────────┘
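To see the star schema in action, here is a self-contained example that builds a tiny fact/dimension pair in an in-memory SQLite database (with made-up rows, not Olist data) and runs a typical revenue-by-category join:

```python
import sqlite3

# Miniature star schema mirroring fact_orders / dim_products.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_products (product_id TEXT PRIMARY KEY, category_name TEXT);
    CREATE TABLE fact_orders (
        order_id TEXT PRIMARY KEY,
        product_id TEXT REFERENCES dim_products(product_id),
        order_total_usd REAL,
        order_status TEXT
    );
    INSERT INTO dim_products VALUES ('p1', 'toys'), ('p2', 'books');
    INSERT INTO fact_orders VALUES
        ('o1', 'p1', 50.0, 'delivered'),
        ('o2', 'p1', 30.0, 'delivered'),
        ('o3', 'p2', 20.0, 'canceled');
""")

# The canonical star-schema query: join the fact table to a dimension,
# filter on status, aggregate the revenue measure.
rows = conn.execute("""
    SELECT p.category_name, SUM(f.order_total_usd) AS revenue
    FROM fact_orders f
    JOIN dim_products p ON p.product_id = f.product_id
    WHERE f.order_status = 'delivered'
    GROUP BY p.category_name
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('toys', 80.0)]
```

The canceled 'books' order is filtered out before aggregation, which is exactly the kind of rule the semantic layer spells out for the LLM.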

9. 🚀 Quick Start

Prerequisites

  • 🐍 Python 3.11+
  • 🟒 Node.js 18+
  • πŸ”‘ An OpenAI API key

1. Clone and install Python dependencies

git clone https://github.com/nerdjerry/text-to-sql.git
cd text-to-sql
pip install -r requirements.txt

2. Configure environment variables

cp .env.example .env
# ✏️ Edit .env and set your OPENAI_API_KEY

3. Seed the database 🌱

Download the raw Olist CSVs (the Brazilian E-Commerce Public Dataset by Olist, available on Kaggle) into data/raw/, then run:

python -m data.seed

4. Build the vector index 🔢

This embeds all table descriptions into ChromaDB. Run once, or re-run whenever you update semantic_layer.py:

python -m agent.build_index

5. Start the API server 🖥️

uvicorn api.main:app --reload --port 8000

6. Start the frontend 🎨

cd frontend
npm install
npm run dev

Open http://localhost:5173 in your browser and start asking questions! 🎉


10. βš™οΈ Environment Variables

  • OPENAI_API_KEY (required): Your OpenAI API key. Get one at platform.openai.com.
  • OPENAI_MODEL (default: gpt-4o): The chat model used to generate SQL.
  • DATABASE_URL (default: sqlite:///./data/olist.db): SQLAlchemy connection string. Use postgresql://... for Postgres.
  • CHROMA_PERSIST_DIR (default: ./chroma_store): Directory where ChromaDB persists vector embeddings.
  • EMBEDDING_MODEL (default: text-embedding-3-small): OpenAI embedding model used for both indexing and retrieval.
  • LOG_LEVEL (default: INFO): Python logging level (DEBUG, INFO, WARNING, ERROR).
  • ALLOWED_ORIGINS (default: *): Comma-separated CORS origins. Set to your domain in production.
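A hypothetical settings loader mirroring the defaults above; the real project may read its configuration differently (for example via python-dotenv or pydantic):

```python
import os

def load_settings(env=None):
    """Read configuration from environment variables, applying the
    documented defaults. OPENAI_API_KEY has no default: it is required."""
    env = os.environ if env is None else env
    return {
        "openai_api_key": env.get("OPENAI_API_KEY"),  # required, no default
        "openai_model": env.get("OPENAI_MODEL", "gpt-4o"),
        "database_url": env.get("DATABASE_URL", "sqlite:///./data/olist.db"),
        "chroma_persist_dir": env.get("CHROMA_PERSIST_DIR", "./chroma_store"),
        "embedding_model": env.get("EMBEDDING_MODEL", "text-embedding-3-small"),
        "log_level": env.get("LOG_LEVEL", "INFO"),
        "allowed_origins": env.get("ALLOWED_ORIGINS", "*").split(","),
    }
```

Passing a plain dict as env makes the loader easy to unit-test without touching the process environment.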
