Turn any SQL schema into realistic test data — instantly.
- PyPI package: `ddl2data`
- CLI command: `ddl2data`
- Python module: `ddl2data`
ddl2data turns SQL DDL, a live relational schema, or a live DynamoDB table schema into structured fake data for load tests, end-to-end pipeline checks, local development, and staging-environment seeding.
For relational schemas it parses tables and foreign keys, generates rows in parent-before-child order, applies type-aware defaults plus optional distributions, validates the generated data, and writes it out as SQL, JSON, CSV, or Parquet. It can also insert generated rows directly into a relational database through SQLAlchemy. For DynamoDB it can load key and index metadata from a real table and emit typed DynamoDB JSON payloads.
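"Parent-before-child order" is a topological sort of the foreign-key dependency graph. A minimal sketch of the idea using only the standard library (the table names are hypothetical, and this is not ddl2data's internal code):

```python
# Sketch of FK-aware generation order via topological sort
# (illustrative only; not ddl2data's internals).
from graphlib import TopologicalSorter

# Hypothetical schema: each table maps to the parent tables it references.
fk_parents = {
    "users": set(),
    "products": set(),
    "orders": {"users"},
    "order_items": {"orders", "products"},
}

# static_order() yields every table only after all of its parents,
# so FK columns can always be filled from already-generated parent keys.
print(list(TopologicalSorter(fk_parents).static_order()))
# e.g. ['users', 'products', 'orders', 'order_items']
```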
Use ddl2data when you need to:
- stress a pipeline or service with large synthetic datasets
- generate relational fixtures from existing DDL
- seed staging or local environments while preserving foreign-key order
- inspect a live schema and generate data without hand-writing metadata
- generate DynamoDB-shaped typed JSON from a real table definition
- export the same dataset shape in SQL, JSON, CSV, Parquet, or DynamoDB JSON
- produce validation reports before loading data elsewhere
```
Schema input
  -> parse DDL, inspect a live relational DB, or inspect a DynamoDB table
  -> build metadata and dependency order
  -> generate rows with Faker and optional distributions
  -> validate FK, null, unique, and optional CHECK constraints
  -> write SQL/JSON/CSV/Parquet/DynamoDB JSON or insert into a DB
```
Core modules:
- `ddl2data/parser/ddl.py`: SQL DDL parsing via `sqlglot`
- `ddl2data/parser/introspect.py`: relational schema introspection via SQLAlchemy
- `ddl2data/parser/dynamodb.py`: DynamoDB schema loading via `boto3`
- `ddl2data/generator/base.py`: main row generation engine
- `ddl2data/generator/dist.py`: distribution parsing and sampling
- `ddl2data/writer/`: SQL, JSON, CSV, Parquet, and DynamoDB JSON writers
- `ddl2data/validation.py` and `ddl2data/report.py`: validation and reporting
- Input sources:
  - SQL DDL via `--ddl`
  - live relational schema introspection via `--schema-from-db --db-url ...`
  - live DynamoDB table schema via `--schema-from-dynamodb --dynamodb-table ...`
- Relationship-aware generation:
  - foreign-key dependency graph
  - parent tables generated before child tables
  - FK-only tables can use the Polars generation path
- Output formats:
  - PostgreSQL `INSERT`
  - MySQL `INSERT`
  - SQLite `INSERT`
  - BigQuery `INSERT` and `INSERT ALL`
  - JSON
  - CSV
  - Parquet
  - DynamoDB typed JSON
- Validation and reporting:
  - FK integrity checks
  - non-null checks
  - unique collision checks
  - optional CHECK validation with `--strict-checks`
  - JSON report output via `--report-path`
- Generation controls:
  - per-table row overrides with `--table-rows`
  - reproducible runs with `--seed`
  - config-file driven runs with JSON, TOML, or YAML
  - per-column and per-table distribution overrides with `--dist`
  - optional `python` or `polars` engine selection
Recommended for CLI use:

```bash
pipx install ddl2data
```

Or install with pip:

```bash
pip install ddl2data
```

Optional Polars support:

```bash
pip install "ddl2data[polars]"
```

DynamoDB schema loading requires boto3:

```bash
pip install boto3
```

From source:

```bash
git clone https://github.com/hyangminj/ddl2data.git
cd ddl2data
python3 -m venv .venv
. .venv/bin/activate
.venv/bin/python -m pip install -e .
```

Contributor setup with test and Polars extras:

```bash
.venv/bin/python -m pip install -e ".[test,polars]"
```

Sanity check:

```bash
ddl2data --help
```

GitHub Releases also include wheel (`.whl`) and source (`.tar.gz`) artifacts.
Minimal schema:

```sql
CREATE TABLE users (
    id INT PRIMARY KEY,
    email VARCHAR(100) UNIQUE NOT NULL,
    age INT,
    tier VARCHAR(10)
);

CREATE TABLE orders (
    id INT PRIMARY KEY,
    user_id INT NOT NULL,
    amount NUMERIC,
    FOREIGN KEY (user_id) REFERENCES users(id)
);
```

Generate JSON:

```bash
ddl2data --ddl schema.sql --rows 100 --out json --output-path data.json
```

Generate PostgreSQL inserts:

```bash
ddl2data --ddl schema.sql --rows 100 --out postgres --output-path seed.sql
```

Generate CSV files, one per table:

```bash
ddl2data --ddl schema.sql --rows 100 --out csv --output-path ./csv_out
```

With this schema, `users` is generated first and `orders.user_id` is filled from generated `users.id` values.
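As a rough illustration of that FK fill, using Faker directly (this is a sketch of the idea, not the generator's actual code):

```python
# Rough sketch of parent-first generation and FK filling with Faker
# (illustrative; not ddl2data's generator code).
import random
from faker import Faker

fake = Faker()
users = [
    {"id": i, "email": fake.unique.email(), "age": fake.random_int(18, 90)}
    for i in range(1, 11)
]

# Child rows draw user_id from the already-generated parent keys.
user_ids = [u["id"] for u in users]
orders = [
    {"id": i, "user_id": random.choice(user_ids),
     "amount": round(random.uniform(1, 500), 2)}
    for i in range(1, 31)
]
```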
SQL dialect targets:

```bash
ddl2data --ddl schema.sql --rows 100 --out postgres
ddl2data --ddl schema.sql --rows 100 --out mysql
ddl2data --ddl schema.sql --rows 100 --out sqlite
ddl2data --ddl schema.sql --rows 100 --out bigquery
```

Introspect a live relational database:

```bash
ddl2data \
  --schema-from-db \
  --db-url postgresql+psycopg://user:pass@localhost:5432/mydb \
  --rows 100 \
  --out postgres
```

Limit introspection to selected tables:
```bash
ddl2data \
  --schema-from-db \
  --db-url postgresql+psycopg://user:pass@localhost:5432/mydb \
  --tables users,orders,events \
  --rows 100 \
  --out json
```

Insert generated rows directly into a database:

```bash
ddl2data \
  --ddl schema.sql \
  --rows 1000 \
  --insert \
  --db-url postgresql+psycopg://user:pass@localhost:5432/mydb
```

Load a live DynamoDB table schema and emit typed JSON:
```bash
ddl2data \
  --schema-from-dynamodb \
  --dynamodb-table users \
  --dynamodb-region us-east-1 \
  --dynamodb-extra-attr email:string \
  --dynamodb-extra-attr score:int \
  --rows 100 \
  --out dynamodb-json \
  --output-path users.jsonl
```

Notes:
- key attributes and GSI or LSI key attributes are inferred from the live table
- use `--dynamodb-extra-attr name:type` to add non-key fields to generated output
- supported extra attribute aliases include `string`, `uuid`, `date`, `datetime`, `numeric`, `float`, `int`, and `boolean`
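For reference, typed DynamoDB JSON wraps every attribute in a type descriptor. A small sketch with boto3's serializer (the item values are hypothetical, not actual ddl2data output):

```python
# What "typed DynamoDB JSON" means, shown with boto3's TypeSerializer
# (the item below is hypothetical, not actual ddl2data output).
from decimal import Decimal
from boto3.dynamodb.types import TypeSerializer

ser = TypeSerializer()
item = {"id": "u-001", "email": "a@example.com", "score": Decimal("42")}

# Each attribute becomes a type descriptor, e.g. {"S": ...} or {"N": ...}.
typed = {name: ser.serialize(value) for name, value in item.items()}
print(typed)
# {'id': {'S': 'u-001'}, 'email': {'S': 'a@example.com'}, 'score': {'N': '42'}}
```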
Per-table row counts:

```bash
ddl2data \
  --ddl schema.sql \
  --rows 100 \
  --table-rows users=20,orders=500 \
  --table-rows events=2000 \
  --out json
```

- `--rows` stays the global default
- `--table-rows table=count` overrides individual tables
- config files can use a `table_rows` map
Validation report:

```bash
ddl2data \
  --ddl schema.sql \
  --rows 500 \
  --strict-checks \
  --report-path report.json \
  --out json \
  --output-path data.json
```

The report includes counts and sample issues for:
- FK violations
- non-null violations
- unique collisions
- supported CHECK violations when `--strict-checks` is enabled
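As a sketch of what one of these checks amounts to, a unique-collision check is essentially a value count over the generated rows (illustrative, not the validation module's code):

```python
# Sketch of a unique-collision check over generated rows
# (illustrative; not ddl2data's validation code).
from collections import Counter

rows = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": "a@x.com"}]
counts = Counter(row["email"] for row in rows)

collisions = {value: n for value, n in counts.items() if n > 1}
print(collisions)  # {'a@x.com': 2} -> reported as a unique collision
```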
Parquet output:

```bash
ddl2data \
  --ddl schema.sql \
  --rows 100 \
  --out parquet \
  --parquet-compression zstd \
  --output-path ./parquet_out
```

Polars generation and write path:

```bash
ddl2data \
  --ddl schema.sql \
  --rows 100000 \
  --out csv \
  --engine polars \
  --output-path ./csv_out
```

Engine behavior:
- tables with CHECK constraints fall back to the Python row-wise generator
- tables with unique or primary-key constraints fall back to the Python row-wise generator
- FK-only tables can remain on the Polars path
- `--out parquet` requires the optional `polars` dependency regardless of `--engine`
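To make the FK-only fast path concrete, a rough sketch of what vectorized child-table generation looks like in Polars (assumed column names; not the engine's code):

```python
# Rough sketch of vectorized FK-only generation with Polars
# (assumed schema; not ddl2data's engine code).
import random
import polars as pl

parent_ids = list(range(1, 101))  # stand-in for generated users.id values

orders = pl.DataFrame({
    "id": list(range(1, 1001)),
    "user_id": random.choices(parent_ids, k=1000),  # FK sampled from parents
    "amount": [round(random.uniform(1, 500), 2) for _ in range(1000)],
})
orders.write_csv("orders.csv")
```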
BigQuery output:

```bash
ddl2data --ddl schema.sql --rows 100 --out bigquery --output-path out.sql
ddl2data --ddl schema.sql --rows 100 --out bigquery --bq-insert-all --output-path out_insert_all.sql
```

BigQuery output supports:
- dataset-qualified table names like `dataset.table`
- `INSERT ALL` rendering via `--bq-insert-all`
- typed `DATE` and `TIMESTAMP` literals
Config-file driven runs:

```bash
ddl2data --config ddl2data.toml
```

Example `ddl2data.toml`:

```toml
ddl = "schema.sql"
rows = 500
out = "postgres"
seed = 42
strict_checks = true
table_rows = { users = 100, orders = 500, events = 2000 }
dist = [
  "users.age:normal,mean=33,std=7",
  "orders.amount:pareto,alpha=1.7,xm=1",
]
```

CLI arguments override config file values.
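That precedence is the usual dict-merge pattern; a sketch with the standard-library TOML reader (illustrative, not the CLI's actual option handling):

```python
# Sketch of "CLI arguments override config file values"
# (illustrative; not ddl2data's actual option handling).
import tomllib  # Python 3.11+

with open("ddl2data.toml", "rb") as f:
    config = tomllib.load(f)

cli_args = {"rows": 1000}           # hypothetical values parsed from the CLI
effective = {**config, **cli_args}  # later dict wins, so CLI overrides config
print(effective["rows"])            # 1000, even though the file says 500
```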
Distribution overrides:

```
--dist <column_or_table.column>:<kind>,k=v,k=v
```

Supported distribution kinds: `normal`, `poisson`, `weighted`, `exponential`, `pareto`, `zipf`, `peak`
Examples:

```bash
# global column rule
--dist age:normal,mean=35,std=7

# table-qualified rule (higher priority than global)
--dist users.age:normal,mean=30,std=6

# poisson counts
--dist daily_orders:poisson,lambda=3

# weighted categorical values
--dist tier:weighted,A=60%,B=30%,C=10%

# heavy-tail behavior
--dist amount:pareto,alpha=1.5,xm=1
--dist category_rank:zipf,skew=1.8,n=200

# peak-hour timestamps
--dist created_at:peak,hours=9-11,18-20
```

Priority when both exist: a `table.column` rule overrides a `column` rule.
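Numerically, these kinds correspond to familiar samplers. A sketch using only the standard library (not ddl2data's dist module):

```python
# What a few distribution kinds map to numerically, via the stdlib
# (illustrative; not ddl2data/generator/dist.py).
import random

random.seed(42)
age = random.gauss(35, 7)               # normal,mean=35,std=7
amount = 1 * random.paretovariate(1.5)  # pareto,alpha=1.5,xm=1 (xm scales)
tier = random.choices(["A", "B", "C"], weights=[60, 30, 10])[0]  # weighted
print(round(age, 1), round(amount, 2), tier)
```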
ddl2data uses practical heuristics for common CHECK constraints during generation and can optionally validate them again with `--strict-checks`.

Supported forms include:
- numeric comparisons: `age >= 18`, `qty > 0`, `score <= 100`, `x != 0`
- ranges: `price BETWEEN 10 AND 20`
- enum lists: `status IN ('A', 'B', 'C')`
- regex-like checks: `code ~ '^[A-Z]{2}$'`, `REGEXP_LIKE(code, '^[0-9]+$')`
- nested compound forms built from supported expressions with `AND` and `OR`

This is intentionally best-effort, not a full SQL expression engine.
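A sketch of the flavor of heuristic involved, limited to simple numeric comparisons (illustrative only; the real handling is broader and this is not its code):

```python
# Flavor of a best-effort CHECK heuristic for simple numeric comparisons
# (illustrative only; not ddl2data's implementation).
import re

def satisfies_simple_check(expr: str, value: float) -> bool:
    """Evaluate forms like 'age >= 18' against a candidate value."""
    m = re.fullmatch(r"\s*\w+\s*(>=|<=|!=|>|<|=)\s*(-?\d+(?:\.\d+)?)\s*", expr)
    if m is None:
        return True  # unsupported form: best-effort means don't block generation
    op, bound = m.group(1), float(m.group(2))
    return {">=": value >= bound, "<=": value <= bound, ">": value > bound,
            "<": value < bound, "=": value == bound, "!=": value != bound}[op]

print(satisfies_simple_check("age >= 18", 21))  # True
print(satisfies_simple_check("qty > 0", 0))     # False
```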
Use `--seed` to stabilize Python `random` and Faker output:

```bash
ddl2data --ddl schema.sql --rows 100 --seed 42 --out json
```
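Conceptually, this amounts to seeding both the stdlib RNG and Faker (a sketch of the idea, not ddl2data's exact initialization code):

```python
# What --seed stabilizes, conceptually: stdlib random plus Faker
# (a sketch; not ddl2data's exact initialization code).
import random
from faker import Faker

random.seed(42)
Faker.seed(42)  # class-level seed shared by all Faker instances

fake = Faker()
print(fake.name(), fake.email())  # identical on every run with the same seed
```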
Development setup:

```bash
python3 -m venv .venv
. .venv/bin/activate
.venv/bin/python -m pip install -e ".[test,polars]"
```

If you only need the core package:

```bash
.venv/bin/python -m pip install -e .
```

If you need DynamoDB schema loading outside the test extra:

```bash
.venv/bin/python -m pip install boto3
```

Run the full suite:

```bash
.venv/bin/python -m pytest
```

Run only non-integration tests:

```bash
.venv/bin/python -m pytest -m "not integration"
```

Run one file:

```bash
.venv/bin/python -m pytest tests/test_parser_graph.py
```

Run one integration target:

```bash
.venv/bin/python -m pytest tests/test_integration_postgres.py
.venv/bin/python -m pytest tests/test_integration_dynamodb.py
.venv/bin/python -m pytest tests/test_integration_bigquery.py
```

Available markers: `integration`, `postgres`, `dynamodb`, `bigquery`
The repo includes `docker-compose.yml` for PostgreSQL and LocalStack DynamoDB:

```bash
docker compose up -d postgres localstack
docker compose ps
```

Service summary:
- PostgreSQL: `localhost:5432`
- LocalStack DynamoDB: `http://localhost:4566`
Suggested `.env.test`:

```bash
TEST_POSTGRES_URL=postgresql+psycopg2://testuser:testpass@localhost:5432/testdb
AWS_ACCESS_KEY_ID=test
AWS_SECRET_ACCESS_KEY=test
AWS_DEFAULT_REGION=us-east-1
DYNAMODB_ENDPOINT_URL=http://localhost:4566
TEST_BQ_PROJECT=your-gcp-project-id
TEST_BQ_DATASET=ddl2data_integration_test
# GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
```

`.env.test` is already ignored and is the right place for local test-only credentials.
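With those variables set, pointing boto3 at LocalStack is just an endpoint override; a sketch (the variable names match the suggested `.env.test`, but ddl2data's test fixtures may wire this up differently):

```python
# Pointing boto3 at LocalStack using the suggested .env.test variables
# (a sketch; ddl2data's test fixtures may wire this up differently).
import os
import boto3

dynamodb = boto3.client(
    "dynamodb",
    region_name=os.environ.get("AWS_DEFAULT_REGION", "us-east-1"),
    endpoint_url=os.environ.get("DYNAMODB_ENDPOINT_URL", "http://localhost:4566"),
)
print(dynamodb.list_tables()["TableNames"])
```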
For BigQuery credentials, two common options work for the integration tests.

Application Default Credentials:

```bash
gcloud auth application-default login
```

Service-account key file:

```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
```

If `TEST_BQ_PROJECT` is missing or credentials are unavailable, the BigQuery integration tests skip automatically.
CLI reference:

```
ddl2data [--ddl schema.sql
          | --schema-from-db --db-url URL [--tables t1,t2]
          | --schema-from-dynamodb --dynamodb-table NAME]
         [--dynamodb-region REGION]
         [--dynamodb-extra-attr name:type]
         [--config config.toml]
         [--rows 100]
         [--table-rows users=20,orders=500] [--table-rows events=2000]
         [--out postgres|mysql|sqlite|bigquery|json|csv|parquet|dynamodb-json]
         [--engine python|polars]
         [--bq-insert-all]
         [--output-path PATH]
         [--insert --db-url URL]
         [--dist ...] [--dist ...]
         [--seed INT]
         [--report-path report.json]
         [--strict-checks]
         [--parquet-compression snappy|zstd|lz4|gzip|none]
```

Flag summary:
- `--config`: load defaults from JSON, TOML, YAML, or YML
- `--ddl`: input DDL file
- `--schema-from-db`: inspect relational table metadata from a live database
- `--tables`: optional comma-separated table filter for relational introspection mode
- `--schema-from-dynamodb`: inspect a live DynamoDB table definition
- `--dynamodb-table`: DynamoDB table name for `--schema-from-dynamodb`
- `--dynamodb-region`: AWS region for DynamoDB schema loading
- `--dynamodb-extra-attr`: add synthetic non-key DynamoDB attributes, repeatable
- `--rows`: global default rows per table
- `--table-rows`: per-table row overrides
- `--out`: output format, default `postgres`
- `--engine`: `python` or `polars`
- `--bq-insert-all`: BigQuery `INSERT ALL ... SELECT 1;` mode
- `--output-path`: output file path or output directory depending on format
- `--db-url`: SQLAlchemy database URL for introspection or direct insert
- `--insert`: insert generated rows directly into the target relational database
- `--dist`: distribution overrides
- `--seed`: deterministic generation seed
- `--report-path`: write a JSON report with the validation summary
- `--strict-checks`: validate supported CHECK constraints after generation
- `--parquet-compression`: Parquet compression codec
Known limitations:
- DDL parsing and relational DB introspection cover common schemas well, but advanced dialect-specific features are still partial.
- DynamoDB schema loading models key and index attributes plus explicitly declared extra attributes; it does not infer full document structure from item samples.
- CHECK-aware generation and `--strict-checks` cover a practical subset, not arbitrary SQL expressions.
- Function-heavy or cross-column CHECK expressions are best-effort rather than fully modeled.
- `--engine polars` still falls back to Python for tables with CHECK, unique, or primary-key constraints.
Roadmap:
- More distribution types for realistic synthetic workloads
- Additional pipeline edge-case generation such as skew, timestamp boundaries, and duplicate-heavy batches
- Broader dialect coverage and richer schema inference for advanced database features
Apache License 2.0. See LICENSE.