ddl2data

Turn any SQL schema into realistic test data — instantly.

  • PyPI package: ddl2data
  • CLI command: ddl2data
  • Python module: ddl2data

ddl2data turns SQL DDL, a live relational schema, or a live DynamoDB table schema into fake but structured data that is useful for load tests, end-to-end pipeline checks, local development, and staging environment seeding.

For relational schemas it parses tables and foreign keys, generates rows in parent-before-child order, applies type-aware defaults plus optional distributions, validates the generated data, and writes it out as SQL, JSON, CSV, or Parquet. It can also insert generated rows directly into a relational database through SQLAlchemy. For DynamoDB it can load key and index metadata from a real table and emit typed DynamoDB JSON payloads.


Why use it

Use ddl2data when you need to:

  • stress a pipeline or service with large synthetic datasets
  • generate relational fixtures from existing DDL
  • seed staging or local environments while preserving foreign-key order
  • inspect a live schema and generate data without hand-writing metadata
  • generate DynamoDB-shaped typed JSON from a real table definition
  • export the same dataset shape in SQL, JSON, CSV, Parquet, or DynamoDB JSON
  • produce validation reports before loading data elsewhere

How it works

Schema input
  -> parse DDL, inspect a live relational DB, or inspect a DynamoDB table
  -> build metadata and dependency order
  -> generate rows with Faker and optional distributions
  -> validate FK, null, unique, and optional CHECK constraints
  -> write SQL/JSON/CSV/Parquet/DynamoDB JSON or insert into a DB

Core modules:

  • ddl2data/parser/ddl.py: SQL DDL parsing via sqlglot
  • ddl2data/parser/introspect.py: relational schema introspection via SQLAlchemy
  • ddl2data/parser/dynamodb.py: DynamoDB schema loading via boto3
  • ddl2data/generator/base.py: main row generation engine
  • ddl2data/generator/dist.py: distribution parsing and sampling
  • ddl2data/writer/: SQL, JSON, CSV, Parquet, and DynamoDB JSON writers
  • ddl2data/validation.py and ddl2data/report.py: validation and reporting

Core capabilities

  • Input sources:
    • SQL DDL via --ddl
    • live relational schema introspection via --schema-from-db --db-url ...
    • live DynamoDB table schema via --schema-from-dynamodb --dynamodb-table ...
  • Relationship-aware generation:
    • foreign-key dependency graph
    • parent tables generated before child tables
    • FK-only tables can use the Polars generation path
  • Output formats:
    • PostgreSQL INSERT
    • MySQL INSERT
    • SQLite INSERT
    • BigQuery INSERT and INSERT ALL
    • JSON
    • CSV
    • Parquet
    • DynamoDB typed JSON
  • Validation and reporting:
    • FK integrity checks
    • non-null checks
    • unique collision checks
    • optional CHECK validation with --strict-checks
    • JSON report output via --report-path
  • Generation controls:
    • per-table row overrides with --table-rows
    • reproducible runs with --seed
    • config-file driven runs with JSON, TOML, or YAML
    • per-column and per-table distribution overrides with --dist
    • optional python or polars engine selection
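
As an illustration, several of these controls can be combined in one run; the flag values below are arbitrary examples:

ddl2data \
  --ddl schema.sql \
  --rows 1000 \
  --table-rows users=50 \
  --seed 7 \
  --dist users.age:normal,mean=35,std=7 \
  --out csv \
  --output-path ./csv_out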

Install

Recommended for CLI use:

pipx install ddl2data

Or install with pip:

pip install ddl2data

Optional Polars support:

pip install "ddl2data[polars]"

DynamoDB schema loading requires boto3:

pip install boto3

From source:

git clone https://github.com/hyangminj/ddl2data.git
cd ddl2data
python3 -m venv .venv
. .venv/bin/activate
.venv/bin/python -m pip install -e .

Contributor setup with test and Polars extras:

.venv/bin/python -m pip install -e ".[test,polars]"

Sanity check:

ddl2data --help

GitHub Releases also include wheel (.whl) and source (.tar.gz) artifacts.


Quick start

Minimal schema:

CREATE TABLE users (
  id INT PRIMARY KEY,
  email VARCHAR(100) UNIQUE NOT NULL,
  age INT,
  tier VARCHAR(10)
);

CREATE TABLE orders (
  id INT PRIMARY KEY,
  user_id INT NOT NULL,
  amount NUMERIC,
  FOREIGN KEY (user_id) REFERENCES users(id)
);

Generate JSON:

ddl2data --ddl schema.sql --rows 100 --out json --output-path data.json

Generate PostgreSQL inserts:

ddl2data --ddl schema.sql --rows 100 --out postgres --output-path seed.sql

Generate CSV files, one per table:

ddl2data --ddl schema.sql --rows 100 --out csv --output-path ./csv_out

With this schema, users is generated first and orders.user_id is filled from generated users.id values.
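
You can spot-check that ordering in the emitted SQL; assuming the PostgreSQL writer produces plain INSERT INTO statements, the users statements in seed.sql from the example above appear before the orders statements:

grep -n "INSERT INTO" seed.sql | head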


Common workflows

Generate SQL for different targets

ddl2data --ddl schema.sql --rows 100 --out postgres
ddl2data --ddl schema.sql --rows 100 --out mysql
ddl2data --ddl schema.sql --rows 100 --out sqlite
ddl2data --ddl schema.sql --rows 100 --out bigquery

Read schema directly from a relational database

ddl2data \
  --schema-from-db \
  --db-url postgresql+psycopg://user:pass@localhost:5432/mydb \
  --rows 100 \
  --out postgres

Limit introspection to selected tables:

ddl2data \
  --schema-from-db \
  --db-url postgresql+psycopg://user:pass@localhost:5432/mydb \
  --tables users,orders,events \
  --rows 100 \
  --out json

Insert generated rows directly into a relational database

ddl2data \
  --ddl schema.sql \
  --rows 1000 \
  --insert \
  --db-url postgresql+psycopg://user:pass@localhost:5432/mydb

Generate DynamoDB typed JSON from a live table schema

ddl2data \
  --schema-from-dynamodb \
  --dynamodb-table users \
  --dynamodb-region us-east-1 \
  --dynamodb-extra-attr email:string \
  --dynamodb-extra-attr score:int \
  --rows 100 \
  --out dynamodb-json \
  --output-path users.jsonl

Notes:

  • key attributes and GSI or LSI key attributes are inferred from the live table
  • use --dynamodb-extra-attr name:type to add non-key fields to generated output
  • supported extra attribute aliases include string, uuid, date, datetime, numeric, float, int, and boolean
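
For orientation, DynamoDB typed JSON wraps every attribute value in a type tag (S for strings, N for numbers, and so on), so one generated line from the command above could look roughly like this, with illustrative values:

{"id": {"S": "a1b2c3"}, "email": {"S": "user1@example.com"}, "score": {"N": "42"}}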

Control row counts per table

ddl2data \
  --ddl schema.sql \
  --rows 100 \
  --table-rows users=20,orders=500 \
  --table-rows events=2000 \
  --out json

  • --rows stays the global default
  • --table-rows table=count overrides individual tables
  • config files can use a table_rows map

Generate a validation report

ddl2data \
  --ddl schema.sql \
  --rows 500 \
  --strict-checks \
  --report-path report.json \
  --out json \
  --output-path data.json

The report includes counts and sample issues for:

  • FK violations
  • non-null violations
  • unique collisions
  • supported CHECK violations when --strict-checks is enabled

Use Parquet or the Polars engine

Parquet output:

ddl2data \
  --ddl schema.sql \
  --rows 100 \
  --out parquet \
  --parquet-compression zstd \
  --output-path ./parquet_out

Polars generation and write path:

ddl2data \
  --ddl schema.sql \
  --rows 100000 \
  --out csv \
  --engine polars \
  --output-path ./csv_out

Engine behavior:

  • tables with CHECK constraints fall back to the Python row-wise generator
  • tables with unique or primary-key constraints fall back to the Python row-wise generator
  • FK-only tables can remain on the Polars path
  • --out parquet requires the optional polars dependency regardless of --engine
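
Assuming "FK-only" means a table whose only declared constraints are foreign keys, a link table like the sketch below can stay on the Polars path, while its parents fall back to the Python generator because of their primary keys:

CREATE TABLE user_tags (
  user_id INT REFERENCES users(id),
  tag_id INT REFERENCES tags(id)
);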

BigQuery-specific output

ddl2data --ddl schema.sql --rows 100 --out bigquery --output-path out.sql
ddl2data --ddl schema.sql --rows 100 --out bigquery --bq-insert-all --output-path out_insert_all.sql

BigQuery output supports:

  • dataset-qualified table names like dataset.table
  • INSERT ALL rendering via --bq-insert-all
  • typed DATE and TIMESTAMP literals
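
As a rough sketch based on the flag descriptions (the exact rendering may differ), the default mode emits plain INSERT statements, while --bq-insert-all emits an INSERT ALL ... SELECT 1; statement:

INSERT INTO mydataset.users (id, email) VALUES (1, 'a@example.com');

INSERT ALL
  INTO mydataset.users (id, email) VALUES (1, 'a@example.com')
  INTO mydataset.users (id, email) VALUES (2, 'b@example.com')
SELECT 1;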

Config file example

ddl2data --config ddl2data.toml

Example ddl2data.toml:

ddl = "schema.sql"
rows = 500
out = "postgres"
seed = 42
strict_checks = true
table_rows = { users = 100, orders = 500, events = 2000 }
dist = [
  "users.age:normal,mean=33,std=7",
  "orders.amount:pareto,alpha=1.7,xm=1",
]

CLI arguments override config file values.
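
For example, a flag given on the command line wins over the same key in the file:

ddl2data --config ddl2data.toml --rows 50   # rows = 500 in the file is ignored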


Distribution overrides

Syntax:

--dist <column_or_table.column>:<kind>,k=v,k=v

Supported distribution kinds:

  • normal
  • poisson
  • weighted
  • exponential
  • pareto
  • zipf
  • peak

Examples:

# global column rule
--dist age:normal,mean=35,std=7

# table-qualified rule (higher priority than global)
--dist users.age:normal,mean=30,std=6

# poisson counts
--dist daily_orders:poisson,lambda=3

# weighted categorical values
--dist tier:weighted,A=60%,B=30%,C=10%

# heavy-tail behavior
--dist amount:pareto,alpha=1.5,xm=1
--dist category_rank:zipf,skew=1.8,n=200

# peak-hour timestamps
--dist created_at:peak,hours=9-11,18-20

Priority when both exist:

  1. table.column
  2. column

CHECK-aware generation

ddl2data uses practical heuristics for common CHECK constraints during generation and can optionally validate them again with --strict-checks.

Supported forms include:

  • numeric comparison: age >= 18, qty > 0, score <= 100, x != 0
  • ranges: price BETWEEN 10 AND 20
  • enum lists: status IN ('A', 'B', 'C')
  • regex-like checks: code ~ '^[A-Z]{2}$', REGEXP_LIKE(code, '^[0-9]+$')
  • nested compound forms built from supported expressions with AND and OR

This is intentionally best-effort, not a full SQL expression engine.
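
For example, every constraint in this schema uses a supported form, so generation respects them and --strict-checks re-validates them afterward:

CREATE TABLE products (
  id INT PRIMARY KEY,
  price NUMERIC CHECK (price BETWEEN 10 AND 20),
  status VARCHAR(1) CHECK (status IN ('A', 'B', 'C')),
  qty INT CHECK (qty > 0)
);

ddl2data --ddl products.sql --rows 100 --strict-checks --out json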


Reproducible runs

Use --seed to stabilize Python random and Faker output:

ddl2data --ddl schema.sql --rows 100 --seed 42 --out json
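
Assuming all randomness flows through Python's random module and Faker, two runs with the same seed should produce identical files:

ddl2data --ddl schema.sql --rows 100 --seed 42 --out json --output-path run1.json
ddl2data --ddl schema.sql --rows 100 --seed 42 --out json --output-path run2.json
diff run1.json run2.json   # expect no output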

Development and testing

Local setup

python3 -m venv .venv
. .venv/bin/activate
.venv/bin/python -m pip install -e ".[test,polars]"

If you only need the core package:

.venv/bin/python -m pip install -e .

If you need DynamoDB schema loading outside the test extra:

.venv/bin/python -m pip install boto3

Useful test commands

Run the full suite:

.venv/bin/python -m pytest

Run only non-integration tests:

.venv/bin/python -m pytest -m "not integration"

Run one file:

.venv/bin/python -m pytest tests/test_parser_graph.py

Run one integration target:

.venv/bin/python -m pytest tests/test_integration_postgres.py
.venv/bin/python -m pytest tests/test_integration_dynamodb.py
.venv/bin/python -m pytest tests/test_integration_bigquery.py

Available markers:

  • integration
  • postgres
  • dynamodb
  • bigquery

Local integration services

The repo includes docker-compose.yml for PostgreSQL and LocalStack DynamoDB:

docker compose up -d postgres localstack
docker compose ps

Service summary:

  • PostgreSQL: localhost:5432
  • LocalStack DynamoDB: http://localhost:4566

Suggested .env.test:

TEST_POSTGRES_URL=postgresql+psycopg2://testuser:testpass@localhost:5432/testdb
AWS_ACCESS_KEY_ID=test
AWS_SECRET_ACCESS_KEY=test
AWS_DEFAULT_REGION=us-east-1
DYNAMODB_ENDPOINT_URL=http://localhost:4566
TEST_BQ_PROJECT=your-gcp-project-id
TEST_BQ_DATASET=ddl2data_integration_test
# GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

.env.test is already git-ignored and is the right place for local, test-only credentials.
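
One common way to export its contents into the current shell before running the integration tests (a standard POSIX idiom, not a project-specific loader):

set -a; . .env.test; set +a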

BigQuery authentication

Two common options work for the integration tests:

Application Default Credentials:

gcloud auth application-default login

Service-account key file:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

If TEST_BQ_PROJECT is missing or credentials are unavailable, the BigQuery integration tests skip automatically.


CLI reference

ddl2data [--ddl schema.sql
        | --schema-from-db --db-url URL [--tables t1,t2]
        | --schema-from-dynamodb --dynamodb-table NAME]
        [--dynamodb-region REGION]
        [--dynamodb-extra-attr name:type]
        [--config config.toml]
        [--rows 100]
        [--table-rows users=20,orders=500] [--table-rows events=2000]
        [--out postgres|mysql|sqlite|bigquery|json|csv|parquet|dynamodb-json]
        [--engine python|polars]
        [--bq-insert-all]
        [--output-path PATH]
        [--insert --db-url URL]
        [--dist ...] [--dist ...]
        [--seed INT]
        [--report-path report.json]
        [--strict-checks]
        [--parquet-compression snappy|zstd|lz4|gzip|none]

Flag summary:

  • --config: load defaults from a JSON, TOML, or YAML (.yaml or .yml) file
  • --ddl: input DDL file
  • --schema-from-db: inspect relational table metadata from a live database
  • --tables: optional comma-separated table filter for relational introspection mode
  • --schema-from-dynamodb: inspect a live DynamoDB table definition
  • --dynamodb-table: DynamoDB table name for --schema-from-dynamodb
  • --dynamodb-region: AWS region for DynamoDB schema loading
  • --dynamodb-extra-attr: add synthetic non-key DynamoDB attributes, repeatable
  • --rows: global default rows per table
  • --table-rows: per-table row overrides
  • --out: output format, default postgres
  • --engine: python or polars
  • --bq-insert-all: render BigQuery output in INSERT ALL ... SELECT 1; form
  • --output-path: output file path or output directory depending on format
  • --db-url: SQLAlchemy database URL for introspection or direct insert
  • --insert: insert generated rows directly into the target relational database
  • --dist: distribution overrides
  • --seed: deterministic generation seed
  • --report-path: write a JSON report with validation summary
  • --strict-checks: validate supported CHECK constraints after generation
  • --parquet-compression: Parquet compression codec

Limitations

  • DDL parsing and relational DB introspection cover common schemas well, but advanced dialect-specific features are still partial.
  • DynamoDB schema loading models key and index attributes plus explicitly declared extra attributes; it does not infer full document structure from item samples.
  • CHECK-aware generation and --strict-checks cover a practical subset, not arbitrary SQL expressions.
  • Function-heavy or cross-column CHECK expressions are best-effort rather than fully modeled.
  • --engine polars still falls back to Python for tables with CHECK, unique, or primary-key constraints.

Future work

  • More distribution types for realistic synthetic workloads
  • Additional pipeline edge-case generation such as skew, timestamp boundaries, and duplicate-heavy batches
  • Broader dialect coverage and richer schema inference for advanced database features

License

Apache License 2.0. See LICENSE.
