ddl2data

Turn any SQL schema into realistic test data — instantly.

  • PyPI package: ddl2data
  • CLI command: ddl2data
  • Python module: ddl2data

ddl2data turns SQL DDL, a live relational schema, or a live DynamoDB table schema into fake but structured data that is useful for load tests, end-to-end pipeline checks, local development, and staging environment seeding.

For relational schemas it parses tables and foreign keys, generates rows in parent-before-child order, applies type-aware defaults plus optional distributions, validates the generated data, and writes it out as SQL, JSON, CSV, or Parquet. It can also insert generated rows directly into a relational database through SQLAlchemy. For DynamoDB it can load key and index metadata from a real table and emit typed DynamoDB JSON payloads.


Why use it

Use ddl2data when you need to:

  • stress a pipeline or service with large synthetic datasets
  • generate relational fixtures from existing DDL
  • seed staging or local environments while preserving foreign-key order
  • inspect a live schema and generate data without hand-writing metadata
  • generate DynamoDB-shaped typed JSON from a real table definition
  • export the same dataset shape in SQL, JSON, CSV, Parquet, or DynamoDB JSON
  • produce validation reports before loading data elsewhere

How it works

Schema input
  -> parse DDL, inspect a live relational DB, or inspect a DynamoDB table
  -> build metadata and dependency order
  -> generate rows with Faker and optional distributions
  -> validate FK, null, unique, and optional CHECK constraints
  -> write SQL/JSON/CSV/Parquet/DynamoDB JSON or insert into a DB

Core modules:

  • ddl2data/parser/ddl.py: SQL DDL parsing via sqlglot
  • ddl2data/parser/introspect.py: relational schema introspection via SQLAlchemy
  • ddl2data/parser/dynamodb.py: DynamoDB schema loading via boto3
  • ddl2data/generator/base.py: main row generation engine
  • ddl2data/generator/dist.py: distribution parsing and sampling
  • ddl2data/writer/: SQL, JSON, CSV, Parquet, and DynamoDB JSON writers
  • ddl2data/validation.py and ddl2data/report.py: validation and reporting

Core capabilities

  • Input sources:
    • SQL DDL via --ddl
    • live relational schema introspection via --schema-from-db --db-url ...
    • live DynamoDB table schema via --schema-from-dynamodb --dynamodb-table ...
  • Relationship-aware generation:
    • foreign-key dependency graph
    • parent tables generated before child tables
    • FK-only tables can use the Polars generation path
  • Output formats:
    • PostgreSQL INSERT
    • MySQL INSERT
    • SQLite INSERT
    • BigQuery INSERT and INSERT ALL
    • JSON
    • CSV
    • Parquet
    • DynamoDB typed JSON
  • Validation and reporting:
    • FK integrity checks
    • non-null checks
    • unique collision checks
    • optional CHECK validation with --strict-checks
    • JSON report output via --report-path
  • Generation controls:
    • per-table row overrides with --table-rows
    • reproducible runs with --seed
    • config-file driven runs with JSON, TOML, or YAML
    • per-column and per-table distribution overrides with --dist
    • optional python or polars engine selection
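
As an illustration, several of these controls can be combined in one run; the flag values below are arbitrary examples:

ddl2data \
  --ddl schema.sql \
  --rows 1000 \
  --table-rows users=50 \
  --seed 7 \
  --dist users.age:normal,mean=35,std=7 \
  --out csv \
  --output-path ./csv_out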

Install

Recommended for CLI use:

pipx install ddl2data

Or install with pip:

pip install ddl2data

Optional Polars support:

pip install "ddl2data[polars]"

DynamoDB schema loading requires boto3:

pip install boto3

From source:

git clone https://github.com/hyangminj/ddl2data.git
cd ddl2data
python3 -m venv .venv
. .venv/bin/activate
.venv/bin/python -m pip install -e .

Contributor setup with test and Polars extras:

.venv/bin/python -m pip install -e ".[test,polars]"

Sanity check:

ddl2data --help

GitHub Releases also include wheel (.whl) and source (.tar.gz) artifacts.


Quick start

Minimal schema:

CREATE TABLE users (
  id INT PRIMARY KEY,
  email VARCHAR(100) UNIQUE NOT NULL,
  age INT,
  tier VARCHAR(10)
);

CREATE TABLE orders (
  id INT PRIMARY KEY,
  user_id INT NOT NULL,
  amount NUMERIC,
  FOREIGN KEY (user_id) REFERENCES users(id)
);

Generate JSON:

ddl2data --ddl schema.sql --rows 100 --out json --output-path data.json

Generate PostgreSQL inserts:

ddl2data --ddl schema.sql --rows 100 --out postgres --output-path seed.sql

Generate CSV files, one per table:

ddl2data --ddl schema.sql --rows 100 --out csv --output-path ./csv_out

With this schema, users is generated first and orders.user_id is filled from generated users.id values.
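
You can spot-check that ordering in the emitted SQL; assuming the PostgreSQL writer produces plain INSERT INTO statements, the users statements in seed.sql from the example above appear before the orders statements:

grep -n "INSERT INTO" seed.sql | head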


Common workflows

Generate SQL for different targets

ddl2data --ddl schema.sql --rows 100 --out postgres
ddl2data --ddl schema.sql --rows 100 --out mysql
ddl2data --ddl schema.sql --rows 100 --out sqlite
ddl2data --ddl schema.sql --rows 100 --out bigquery

Read schema directly from a relational database

ddl2data \
  --schema-from-db \
  --db-url postgresql+psycopg://user:pass@localhost:5432/mydb \
  --rows 100 \
  --out postgres

Limit introspection to selected tables:

ddl2data \
  --schema-from-db \
  --db-url postgresql+psycopg://user:pass@localhost:5432/mydb \
  --tables users,orders,events \
  --rows 100 \
  --out json

Insert generated rows directly into a relational database

ddl2data \
  --ddl schema.sql \
  --rows 1000 \
  --insert \
  --db-url postgresql+psycopg://user:pass@localhost:5432/mydb

Generate DynamoDB typed JSON from a live table schema

ddl2data \
  --schema-from-dynamodb \
  --dynamodb-table users \
  --dynamodb-region us-east-1 \
  --dynamodb-extra-attr email:string \
  --dynamodb-extra-attr score:int \
  --rows 100 \
  --out dynamodb-json \
  --output-path users.jsonl

Notes:

  • key attributes and GSI or LSI key attributes are inferred from the live table
  • use --dynamodb-extra-attr name:type to add non-key fields to generated output
  • supported extra attribute aliases include string, uuid, date, datetime, numeric, float, int, and boolean
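
For orientation, DynamoDB typed JSON wraps every attribute value in a type tag (S for strings, N for numbers, and so on), so one generated line from the command above could look roughly like this, with illustrative values:

{"id": {"S": "a1b2c3"}, "email": {"S": "user1@example.com"}, "score": {"N": "42"}}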

Control row counts per table

ddl2data \
  --ddl schema.sql \
  --rows 100 \
  --table-rows users=20,orders=500 \
  --table-rows events=2000 \
  --out json

  • --rows stays the global default
  • --table-rows table=count overrides individual tables
  • config files can use a table_rows map

Generate a validation report

ddl2data \
  --ddl schema.sql \
  --rows 500 \
  --strict-checks \
  --report-path report.json \
  --out json \
  --output-path data.json

The report includes counts and sample issues for:

  • FK violations
  • non-null violations
  • unique collisions
  • supported CHECK violations when --strict-checks is enabled

Use Parquet or the Polars engine

Parquet output:

ddl2data \
  --ddl schema.sql \
  --rows 100 \
  --out parquet \
  --parquet-compression zstd \
  --output-path ./parquet_out

Polars generation and write path:

ddl2data \
  --ddl schema.sql \
  --rows 100000 \
  --out csv \
  --engine polars \
  --output-path ./csv_out

Engine behavior:

  • tables with CHECK constraints fall back to the Python row-wise generator
  • tables with unique or primary-key constraints fall back to the Python row-wise generator
  • FK-only tables can remain on the Polars path
  • --out parquet requires the optional polars dependency regardless of --engine
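
Assuming "FK-only" means a table whose only declared constraints are foreign keys, a link table like the sketch below can stay on the Polars path, while its parents fall back to the Python generator because of their primary keys:

CREATE TABLE user_tags (
  user_id INT REFERENCES users(id),
  tag_id INT REFERENCES tags(id)
);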

BigQuery-specific output

ddl2data --ddl schema.sql --rows 100 --out bigquery --output-path out.sql
ddl2data --ddl schema.sql --rows 100 --out bigquery --bq-insert-all --output-path out_insert_all.sql

BigQuery output supports:

  • dataset-qualified table names like dataset.table
  • INSERT ALL rendering via --bq-insert-all
  • typed DATE and TIMESTAMP literals
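
As a rough sketch based on the flag descriptions (the exact rendering may differ), the default mode emits plain INSERT statements, while --bq-insert-all emits an INSERT ALL ... SELECT 1; statement:

INSERT INTO mydataset.users (id, email) VALUES (1, 'a@example.com');

INSERT ALL
  INTO mydataset.users (id, email) VALUES (1, 'a@example.com')
  INTO mydataset.users (id, email) VALUES (2, 'b@example.com')
SELECT 1;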

Config file example

ddl2data --config ddl2data.toml

Example ddl2data.toml:

ddl = "schema.sql"
rows = 500
out = "postgres"
seed = 42
strict_checks = true
table_rows = { users = 100, orders = 500, events = 2000 }
dist = [
  "users.age:normal,mean=33,std=7",
  "orders.amount:pareto,alpha=1.7,xm=1",
]

CLI arguments override config file values.
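
For example, a flag given on the command line wins over the same key in the file:

ddl2data --config ddl2data.toml --rows 50   # rows = 500 in the file is ignored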


Distribution overrides

Syntax:

--dist <column_or_table.column>:<kind>,k=v,k=v

Supported distribution kinds:

  • normal
  • poisson
  • weighted
  • exponential
  • pareto
  • zipf
  • peak

Examples:

# global column rule
--dist age:normal,mean=35,std=7

# table-qualified rule (higher priority than global)
--dist users.age:normal,mean=30,std=6

# poisson counts
--dist daily_orders:poisson,lambda=3

# weighted categorical values
--dist tier:weighted,A=60%,B=30%,C=10%

# heavy-tail behavior
--dist amount:pareto,alpha=1.5,xm=1
--dist category_rank:zipf,skew=1.8,n=200

# peak-hour timestamps
--dist created_at:peak,hours=9-11,18-20

Priority when both exist:

  1. table.column
  2. column

CHECK-aware generation

ddl2data uses practical heuristics for common CHECK constraints during generation and can optionally validate them again with --strict-checks.

Supported forms include:

  • numeric comparison: age >= 18, qty > 0, score <= 100, x != 0
  • ranges: price BETWEEN 10 AND 20
  • enum lists: status IN ('A', 'B', 'C')
  • regex-like checks: code ~ '^[A-Z]{2}$', REGEXP_LIKE(code, '^[0-9]+$')
  • nested compound forms built from supported expressions with AND and OR

This is intentionally best-effort, not a full SQL expression engine.
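
For example, every constraint in this schema uses a supported form, so generation respects them and --strict-checks re-validates them afterward:

CREATE TABLE products (
  id INT PRIMARY KEY,
  price NUMERIC CHECK (price BETWEEN 10 AND 20),
  status VARCHAR(1) CHECK (status IN ('A', 'B', 'C')),
  qty INT CHECK (qty > 0)
);

ddl2data --ddl products.sql --rows 100 --strict-checks --out json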


Reproducible runs

Use --seed to stabilize Python random and Faker output:

ddl2data --ddl schema.sql --rows 100 --seed 42 --out json
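
Assuming all randomness flows through Python's random module and Faker, two runs with the same seed should produce identical files:

ddl2data --ddl schema.sql --rows 100 --seed 42 --out json --output-path run1.json
ddl2data --ddl schema.sql --rows 100 --seed 42 --out json --output-path run2.json
diff run1.json run2.json   # expect no output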

Development and testing

Local setup

python3 -m venv .venv
. .venv/bin/activate
.venv/bin/python -m pip install -e ".[test,polars]"

If you only need the core package:

.venv/bin/python -m pip install -e .

If you need DynamoDB schema loading outside the test extra:

.venv/bin/python -m pip install boto3

Useful test commands

Run the full suite:

.venv/bin/python -m pytest

Run only non-integration tests:

.venv/bin/python -m pytest -m "not integration"

Run one file:

.venv/bin/python -m pytest tests/test_parser_graph.py

Run one integration target:

.venv/bin/python -m pytest tests/test_integration_postgres.py
.venv/bin/python -m pytest tests/test_integration_dynamodb.py
.venv/bin/python -m pytest tests/test_integration_bigquery.py

Available markers:

  • integration
  • postgres
  • dynamodb
  • bigquery

Local integration services

The repo includes docker-compose.yml for PostgreSQL and LocalStack DynamoDB:

docker compose up -d postgres localstack
docker compose ps

Service summary:

  • PostgreSQL: localhost:5432
  • LocalStack DynamoDB: http://localhost:4566

Suggested .env.test:

TEST_POSTGRES_URL=postgresql+psycopg2://testuser:testpass@localhost:5432/testdb
AWS_ACCESS_KEY_ID=test
AWS_SECRET_ACCESS_KEY=test
AWS_DEFAULT_REGION=us-east-1
DYNAMODB_ENDPOINT_URL=http://localhost:4566
TEST_BQ_PROJECT=your-gcp-project-id
TEST_BQ_DATASET=ddl2data_integration_test
# GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

.env.test is already git-ignored and is the right place for local, test-only credentials.
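
One common way to export its contents into the current shell before running the integration tests (a standard POSIX idiom, not a project-specific loader):

set -a; . .env.test; set +a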

BigQuery authentication

Two common options work for the integration tests:

Application Default Credentials:

gcloud auth application-default login

Service-account key file:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

If TEST_BQ_PROJECT is missing or credentials are unavailable, the BigQuery integration tests skip automatically.


CLI reference

ddl2data [--ddl schema.sql
        | --schema-from-db --db-url URL [--tables t1,t2]
        | --schema-from-dynamodb --dynamodb-table NAME]
        [--dynamodb-region REGION]
        [--dynamodb-extra-attr name:type]
        [--config config.toml]
        [--rows 100]
        [--table-rows users=20,orders=500] [--table-rows events=2000]
        [--out postgres|mysql|sqlite|bigquery|json|csv|parquet|dynamodb-json]
        [--engine python|polars]
        [--bq-insert-all]
        [--output-path PATH]
        [--insert --db-url URL]
        [--dist ...] [--dist ...]
        [--seed INT]
        [--report-path report.json]
        [--strict-checks]
        [--parquet-compression snappy|zstd|lz4|gzip|none]

Flag summary:

  • --config: load defaults from a JSON, TOML, or YAML (.yaml or .yml) file
  • --ddl: input DDL file
  • --schema-from-db: inspect relational table metadata from a live database
  • --tables: optional comma-separated table filter for relational introspection mode
  • --schema-from-dynamodb: inspect a live DynamoDB table definition
  • --dynamodb-table: DynamoDB table name for --schema-from-dynamodb
  • --dynamodb-region: AWS region for DynamoDB schema loading
  • --dynamodb-extra-attr: add synthetic non-key DynamoDB attributes, repeatable
  • --rows: global default rows per table
  • --table-rows: per-table row overrides
  • --out: output format, default postgres
  • --engine: python or polars
  • --bq-insert-all: render BigQuery output in INSERT ALL ... SELECT 1; form
  • --output-path: output file path or output directory depending on format
  • --db-url: SQLAlchemy database URL for introspection or direct insert
  • --insert: insert generated rows directly into the target relational database
  • --dist: distribution overrides
  • --seed: deterministic generation seed
  • --report-path: write a JSON report with validation summary
  • --strict-checks: validate supported CHECK constraints after generation
  • --parquet-compression: Parquet compression codec

Limitations

  • DDL parsing and relational DB introspection cover common schemas well, but advanced dialect-specific features are still partial.
  • DynamoDB schema loading models key and index attributes plus explicitly declared extra attributes; it does not infer full document structure from item samples.
  • CHECK-aware generation and --strict-checks cover a practical subset, not arbitrary SQL expressions.
  • Function-heavy or cross-column CHECK expressions are best-effort rather than fully modeled.
  • --engine polars still falls back to Python for tables with CHECK, unique, or primary-key constraints.

Future work

  • More distribution types for realistic synthetic workloads
  • Additional pipeline edge-case generation such as skew, timestamp boundaries, and duplicate-heavy batches
  • Broader dialect coverage and richer schema inference for advanced database features

License

Apache License 2.0. See LICENSE.
