Ask a question in plain English. Get back a SQL query and real results from the database. No SQL knowledge required.
Audience: No prior knowledge of LLMs or vector databases is assumed.
- What Does This Project Do?
- The Problem With "Typical" Text-to-SQL
- How This Project Solves It
- High-Level Architecture
- Step-by-Step: How a Query Works
- What It Can (and Cannot) Do
- Project Structure
- Database Schema
- Quick Start
- Environment Variables
This system lets a non-technical user type a question like:
"What are the top 10 product categories by revenue this year?"
…and automatically:
- Figures out which database tables are relevant to the question.
- Asks an LLM (GPT-4o) to write the correct SQL query.
- Runs that SQL against a real database (the Olist Brazilian E-Commerce dataset).
- Returns the results in a clean table in the browser.
The underlying database is a star-schema data warehouse built on the Olist public dataset, which contains real Brazilian e-commerce orders, products, sellers, customers, and reviews.
When someone first tries to build a text-to-SQL system, the obvious approach is usually:
"I'll just paste the entire database schema into the LLM prompt and ask it to write a query."
This works fine for toy databases with 3–4 tables. In the real world, it breaks down fast:
| Problem | Why it happens |
|---|---|
| Context window overflow | A real warehouse can have 50–200 tables. The full schema (table names, column names, types, foreign keys) can easily exceed the LLM's context limit (even GPT-4o's 128k tokens). |
| Noise drowns out signal | Even if it fits, flooding the prompt with irrelevant tables confuses the model. It might join to the wrong table or use the wrong column because it's overwhelmed. |
| Business meaning is lost | Column names like order_total_usd or is_active_member don't tell the LLM how to use them correctly. Should order_total_usd be summed or averaged? Should you filter on order_status = 'delivered' first? The schema alone doesn't say. |
| Hallucinated SQL | Without enough context, the LLM invents column names or table relationships that don't exist, producing queries that fail at runtime. |
| No safety guardrails | A naive implementation has no check to stop the LLM from generating DELETE or DROP TABLE statements if the user's question is phrased ambiguously. |
This project solves all of these problems.
Instead of dumping the whole schema into the prompt, we use three key techniques:
Every table and column is annotated with business-friendly descriptions in agent/semantic_layer.py. For example:
order_total_usd: "Final post-tax revenue in USD for this line item.
Always use this for GMV calculations.
Never use freight_value_usd as a revenue proxy."
This tells the LLM how to use each column, not just what type it is. Think of it as a data dictionary that the LLM reads before writing SQL.
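As a sketch, a semantic layer like this can be nothing more than an annotated dictionary. The structure below is hypothetical (the real agent/semantic_layer.py may organize things differently); only the order_total_usd description is taken from the example above:

```python
# Hypothetical sketch of a semantic layer: plain data, no framework needed.
# The descriptions tell the LLM *how* to use each field, not just its type.
SEMANTIC_LAYER = {
    "fact_orders": {
        "description": "One row per order line item. Central fact table.",
        "columns": {
            "order_total_usd": (
                "Final post-tax revenue in USD for this line item. "
                "Always use this for GMV calculations. "
                "Never use freight_value_usd as a revenue proxy."
            ),
            "order_status": "Lifecycle status, e.g. 'delivered' or 'canceled'.",
        },
    },
}

def render_table_doc(table: str) -> str:
    """Flatten one table's annotations into a text block ready for embedding."""
    meta = SEMANTIC_LAYER[table]
    lines = [f"Table {table}: {meta['description']}"]
    for col, desc in meta["columns"].items():
        lines.append(f"  - {col}: {desc}")
    return "\n".join(lines)
```

Keeping the layer as plain data makes it trivial to render each table into one text blob, which is exactly the unit that gets embedded in the next step.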
Instead of sending all table descriptions at once, we:
- At startup: Embed every table description into a vector database (ChromaDB) using OpenAI embeddings. Each table becomes a point in high-dimensional space.
- At query time: Embed the user's question using the same embedding model, find the 3 most semantically similar table descriptions using cosine similarity, and inject only those tables into the LLM prompt.
The user asks about "revenue by category" → we retrieve fact_orders and dim_products → the LLM only sees those tables → cleaner, more accurate SQL.
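Under the hood, "most semantically similar" means highest cosine similarity between embedding vectors. Here is a self-contained sketch with toy 3-dimensional vectors standing in for real 1536-dimensional OpenAI embeddings; in the project this ranking is handled by ChromaDB, so the `top_k_tables` helper is illustrative only:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k_tables(question_vec: list[float],
                 table_vecs: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the k table names whose description vectors are closest."""
    scored = sorted(
        table_vecs.items(),
        key=lambda item: cosine_similarity(question_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

# Toy vectors: in reality these are 1536-dim OpenAI embeddings.
tables = {
    "fact_orders":   [0.9, 0.1, 0.0],
    "dim_products":  [0.7, 0.6, 0.1],
    "dim_geography": [0.0, 0.1, 0.9],
}
question = [0.8, 0.4, 0.0]  # pretend embedding of "revenue by category"
print(top_k_tables(question, tables, k=2))  # → ['dim_products', 'fact_orders']
```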
The prompt includes curated question→SQL example pairs (stored in agent/few_shot_examples.yaml). These teach the LLM the exact SQL dialect, join patterns, and aggregation idioms expected for this specific database, acting as in-context learning.
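A sketch of how such pairs can be folded into a chat prompt. The example pair and the `build_messages` helper below are hypothetical, not the project's actual YAML format or prompt code:

```python
# Hypothetical few-shot examples, mirroring what agent/few_shot_examples.yaml
# might contain once loaded (the exact file format is an assumption).
EXAMPLES = [
    {
        "question": "Total revenue by product category?",
        "sql": (
            "SELECT p.category_name, SUM(f.order_total_usd) AS revenue\n"
            "FROM fact_orders f JOIN dim_products p USING (product_id)\n"
            "GROUP BY p.category_name ORDER BY revenue DESC;"
        ),
    },
]

def build_messages(schema_block: str, question: str) -> list[dict]:
    """Assemble a chat prompt: system schema, worked examples, then the question."""
    messages = [{
        "role": "system",
        "content": f"You write SQLite SQL.\n\nSchema:\n{schema_block}",
    }]
    for ex in EXAMPLES:  # in-context learning: show question→SQL pairs
        messages.append({"role": "user", "content": ex["question"]})
        messages.append({"role": "assistant", "content": ex["sql"]})
    messages.append({"role": "user", "content": question})
    return messages
```

Presenting each example as a user/assistant turn pair lets the model imitate the exact dialect and join style it sees, rather than inventing its own.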
Before any SQL is executed, a Human-In-The-Loop (HITL) guard scans for dangerous keywords (INSERT, UPDATE, DELETE, DROP, etc.). If found, execution is blocked and the user must explicitly type CONFIRM in a modal before anything runs. Pure SELECT queries pass through automatically.
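A minimal sketch of such a keyword guard; the real keyword list and logic in agent/hitl_guard.py may differ:

```python
import re

# Keywords whose presence forces explicit human confirmation.
DANGEROUS = ("INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "TRUNCATE")

def needs_human_approval(sql: str) -> bool:
    """True if the SQL contains a write/DDL keyword as a whole word.

    Word boundaries matter: a column named deleted_flag must NOT trip
    the DELETE check, so we match \\bKEYWORD\\b rather than a substring.
    """
    pattern = r"\b(" + "|".join(DANGEROUS) + r")\b"
    return re.search(pattern, sql, flags=re.IGNORECASE) is not None
```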
┌──────────────────────────────────────────────────────────────┐
│                       Browser (React)                        │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────┐   │
│  │  ChatWindow  │  │  SqlDisplay  │  │   ResultsTable    │   │
│  │  (ask a Q)   │  │  (show SQL)  │  │   (show rows)     │   │
│  └──────┬───────┘  └──────────────┘  └───────────────────┘   │
│         │ POST /query                                        │
└─────────┼────────────────────────────────────────────────────┘
          │
┌─────────▼────────────────────────────────────────────────────┐
│                  FastAPI Backend (Python)                    │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  sql_chain.py (LCEL pipeline)                          │  │
│  │                                                        │  │
│  │  Question                                              │  │
│  │     │                                                  │  │
│  │     ▼                                                  │  │
│  │  [1] retriever.py ──► ChromaDB (vector store)          │  │
│  │     │  (embed question, find top-3 relevant tables)    │  │
│  │     ▼                                                  │  │
│  │  [2] Load few_shot_examples.yaml                       │  │
│  │     │                                                  │  │
│  │     ▼                                                  │  │
│  │  [3] Build ChatPromptTemplate                          │  │
│  │     │  (schema + examples + question)                  │  │
│  │     ▼                                                  │  │
│  │  [4] GPT-4o (temperature=0) ◄── OpenAI API             │  │
│  │     │  (generate SQL)                                  │  │
│  │     ▼                                                  │  │
│  │  [5] hitl_guard.py                                     │  │
│  │     │  (block writes, require human approval)          │  │
│  │     ▼                                                  │  │
│  │  [6] Execute SQL ──► SQLite / PostgreSQL               │  │
│  │     │                                                  │  │
│  │     ▼                                                  │  │
│  │  [7] Log to query_log table                            │  │
│  │     │                                                  │  │
│  │     ▼                                                  │  │
│  │  Return {sql, results, latency_ms}                     │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘
          │
┌─────────▼────────────────────────────┐
│  ChromaDB (vector store)             │
│  Table descriptions stored as vectors│
│  Persisted to ./chroma_store/        │
└──────────────────────────────────────┘
          │
┌─────────▼────────────────────────────┐
│  SQLite (./data/olist.db)            │
│  fact_orders, dim_users,             │
│  dim_products, dim_sellers,          │
│  dim_geography, dim_reviews,         │
│  query_log                           │
└──────────────────────────────────────┘
Here is exactly what happens when a user types "Which states have the most canceled orders?" and hits Enter:
Step 1: Embed the question
The question is converted to a 1536-dimension vector using OpenAI's text-embedding-3-small model.
Step 2: Retrieve relevant schema (RAG)
ChromaDB compares the question vector against all stored table-description vectors and returns the 3 closest matches; here the top hits are fact_orders (which has order_status) and dim_users (which has state).
Step 3: Build the prompt
A ChatPromptTemplate is assembled with:
- A system message containing the retrieved table schemas and a few question→SQL examples.
- A user message containing the question.
Step 4: Ask GPT-4o
The prompt is sent to GPT-4o with temperature=0 (greedy decoding, so the generated SQL varies as little as possible between runs). The model returns a raw SQL string.
Step 5: Clean the SQL
A small regex function strips any markdown code fences (```sql) or stray labels that the model might have included.
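A sketch of what such a cleanup function can look like; the project's actual regexes may differ:

```python
import re

def clean_sql(raw: str) -> str:
    """Strip markdown code fences and labels an LLM may wrap around SQL."""
    text = raw.strip()
    # Remove a leading fence like ```sql (or bare ```) and a trailing ```.
    text = re.sub(r"^```(?:sql)?\s*", "", text, flags=re.IGNORECASE)
    text = re.sub(r"\s*```$", "", text)
    # Drop a leading "SQL:" label if the model added one.
    text = re.sub(r"^sql\s*:\s*", "", text, flags=re.IGNORECASE)
    return text.strip()
```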
Step 6: HITL safety check
The SQL is scanned for dangerous keywords. This is a SELECT query, so it passes automatically.
Step 7: Execute the SQL
The query runs against the SQLite database. A LIMIT 1000 clause is injected automatically if one isn't already present, so the response stays manageable.
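The LIMIT injection can be as simple as a keyword check plus a string append. A sketch, not the project's exact implementation:

```python
import re

def ensure_limit(sql: str, max_rows: int = 1000) -> str:
    """Append a LIMIT clause unless the query already has one."""
    stripped = sql.rstrip().rstrip(";")
    if re.search(r"\bLIMIT\s+\d+\b", stripped, flags=re.IGNORECASE):
        return stripped + ";"  # respect the model's own, presumably smaller, limit
    return f"{stripped} LIMIT {max_rows};"
```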
Step 8: Log and return
The question, generated SQL, latency, and tables used are written to the query_log table for observability. The API returns {sql, results, latency_ms} to the frontend.
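A sketch of the logging step using the sqlite3 standard library directly; the query_log column names below are an illustrative guess (the real table is defined in model/schema.py via SQLAlchemy):

```python
import sqlite3
import time

def log_query(conn: sqlite3.Connection, question: str, generated_sql: str,
              latency_ms: float, tables_used: list[str]) -> None:
    """Append one observability row to query_log (hypothetical schema)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS query_log (
               ts REAL, question TEXT, generated_sql TEXT,
               latency_ms REAL, tables_used TEXT)"""
    )
    conn.execute(
        "INSERT INTO query_log VALUES (?, ?, ?, ?, ?)",
        (time.time(), question, generated_sql, latency_ms,
         ",".join(tables_used)),
    )
    conn.commit()
```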
Step 9: Display in the browser
React renders the SQL in a syntax-highlighted box and the results as a pageable table.
- Answer any analytical question about the Olist e-commerce dataset in plain English.
- Revenue analysis: total, by category, by seller, by month, by state.
- Customer analysis: active members, geographic distribution, top spenders, cohorts.
- Seller analysis: rankings, geographic distribution, freight costs.
- Order analysis: status breakdown, cancellation rates, monthly trends.
- Review/NPS analysis: average scores by category, complaint rates.
- Complex queries: multi-table joins, CTEs (WITH clauses), window functions.
- Explain what SQL it generated and why.
- Block dangerous write operations and ask for human confirmation.
- Modify data (INSERT/UPDATE/DELETE) without explicit human approval via the confirmation modal.
- Query tables or columns outside the defined semantic schema.
- Answer questions about data that isn't in the Olist dataset (e.g., live stock prices).
- Guarantee 100% correct SQL for every possible question; LLM output is probabilistic, so always review the generated SQL before trusting results.
text-to-sql/
│
├── agent/                     # Core AI pipeline
│   ├── sql_chain.py           # Main LCEL pipeline: question → SQL → results
│   ├── retriever.py           # RAG: embed question, query ChromaDB
│   ├── semantic_layer.py      # Business descriptions for every table/column
│   ├── build_index.py         # One-time script: embed schema into ChromaDB
│   ├── hitl_guard.py          # Safety: block write SQL, require human approval
│   └── few_shot_examples.yaml # Curated Q→SQL examples for in-context learning
│
├── api/                       # FastAPI web server
│   ├── main.py                # App factory, CORS, error handling
│   └── routes/
│       ├── query.py           # POST /query → runs the full pipeline
│       ├── schema.py          # GET /schema → returns table descriptions
│       └── health.py          # GET /health → liveness check
│
├── model/                     # SQLAlchemy ORM models
│   ├── database.py            # Engine + session factory
│   └── schema.py              # Table definitions (star schema + query_log)
│
├── frontend/                  # React + TypeScript UI
│   └── src/
│       ├── App.tsx            # Root component
│       ├── api.ts             # HTTP client
│       └── components/
│           ├── ChatWindow.tsx     # Question input box
│           ├── SqlDisplay.tsx     # Syntax-highlighted SQL output
│           ├── ResultsTable.tsx   # Pageable results grid
│           ├── SchemaExplorer.tsx # Browse available tables/columns
│           └── ApprovalModal.tsx  # HITL confirmation dialog
│
├── data/
│   ├── raw/                   # Raw Olist CSV files
│   └── seed.py                # Load CSVs → SQLite (run once)
│
├── infra/                     # Deployment scripts (Linux/nginx/systemd)
│
├── requirements.txt           # Python dependencies
└── .env.example               # Copy to .env and fill in your keys
The database uses a star schema, a design pattern common in data warehouses where one central "fact" table holds measurable events and multiple "dimension" tables hold descriptive attributes.
                     ┌─────────────┐
                     │  dim_users  │
                     │ user_id PK  │
                     │ city        │
                     │ state       │
                     └──────┬──────┘
                            │ FK
┌───────────────┐    ┌──────▼────────────┐    ┌───────────────┐
│ dim_products  │    │    fact_orders    │    │  dim_sellers  │
│ product_id PK │◄───┤ order_id PK       ├───►│ seller_id PK  │
│ category_name │    │ user_id FK        │    │ seller_city   │
│ photos_qty    │    │ product_id FK     │    │ seller_state  │
└───────────────┘    │ seller_id FK      │    └───────────────┘
                     │ order_total_usd   │
                     │ order_status      │    ┌───────────────┐
                     │ created_at        ├───►│  dim_reviews  │
                     └───────────────────┘    │ review_id PK  │
                                              │ order_id FK   │
┌───────────────────┐                         │ review_score  │
│   dim_geography   │                         └───────────────┘
│ geo_id PK         │
│ zip_code_prefix   │
│ city, state       │
│ lat, lng          │
└───────────────────┘
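To make the fact/dimension split concrete, here is a tiny runnable illustration: a miniature fact_orders and dim_products with made-up rows, plus the kind of revenue-by-category query the LLM is expected to generate against this schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_products (product_id INTEGER PRIMARY KEY,
                               category_name TEXT);
    CREATE TABLE fact_orders  (order_id INTEGER PRIMARY KEY,
                               product_id INTEGER REFERENCES dim_products,
                               order_total_usd REAL,
                               order_status TEXT);
    -- Made-up rows, purely for illustration.
    INSERT INTO dim_products VALUES (1, 'toys'), (2, 'books');
    INSERT INTO fact_orders  VALUES (10, 1, 25.0, 'delivered'),
                                    (11, 1, 15.0, 'delivered'),
                                    (12, 2, 30.0, 'canceled');
""")
# Revenue by category: sum the fact measure, group by the dimension attribute,
# and filter to delivered orders (per the order_total_usd usage note above).
rows = conn.execute("""
    SELECT p.category_name, SUM(f.order_total_usd) AS revenue
    FROM fact_orders f
    JOIN dim_products p ON p.product_id = f.product_id
    WHERE f.order_status = 'delivered'
    GROUP BY p.category_name
    ORDER BY revenue DESC;
""").fetchall()
print(rows)  # → [('toys', 40.0)]
```

The canceled books order is excluded by the status filter, which is exactly the kind of business rule the semantic layer encodes so the LLM applies it consistently.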
- Python 3.11+
- Node.js 18+
- An OpenAI API key
git clone https://github.com/nerdjerry/text-to-sql.git
cd text-to-sql
pip install -r requirements.txt
cp .env.example .env
# Edit .env and set your OPENAI_API_KEY
Download the raw Olist CSVs into data/raw/ (see the Kaggle link in "What Does This Project Do?"), then run:
python -m data.seed
Build the vector index. This embeds all table descriptions into ChromaDB. Run once, or re-run whenever you update semantic_layer.py:
python -m agent.build_index
Start the backend:
uvicorn api.main:app --reload --port 8000
Start the frontend:
cd frontend
npm install
npm run dev
Open http://localhost:5173 in your browser and start asking questions!
| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | (required) | Your OpenAI API key. Get one at platform.openai.com. |
| `OPENAI_MODEL` | `gpt-4o` | The chat model used to generate SQL. |
| `DATABASE_URL` | `sqlite:///./data/olist.db` | SQLAlchemy connection string. Use `postgresql://...` for Postgres. |
| `CHROMA_PERSIST_DIR` | `./chroma_store` | Directory where ChromaDB persists vector embeddings. |
| `EMBEDDING_MODEL` | `text-embedding-3-small` | OpenAI embedding model used for both indexing and retrieval. |
| `LOG_LEVEL` | `INFO` | Python logging level (DEBUG, INFO, WARNING, ERROR). |
| `ALLOWED_ORIGINS` | `*` | Comma-separated CORS origins. Set to your domain in production. |