A plug-and-play extension for TPC-DS benchmarking. Inspired by duckdb and hyrise, it eliminates the cumbersome TPC-DS setup — install the extension and start benchmarking immediately.
Supports Apache Cloudberry (MPP) distributed architecture with two distribution modes:
- original (default): all 24 tables hash-distributed
- replicated: 14 dimension tables replicated, 10 fact tables hash-distributed
- Apache Cloudberry (PostgreSQL 14 kernel) or PostgreSQL
- CMake 3.12+
- C++17 compiler (GCC 7+ or Clang 5+)
Make sure pg_config is in your PATH.
cd contrib/pg_tpcds
cmake -Bbuild
cmake --build build --target install-- Install the extension (creates metadata schema only, no data tables yet)
CREATE EXTENSION pg_tpcds;
-- Create tables with hash distribution (default)
SELECT tpcds_prepare();
-- Or create tables with replicated dimension tables
SELECT tpcds_prepare('replicated');
-- Generate data for scale factor 1
SELECT * FROM dsdgen(1);
-- Collect answer files for result validation (uses optimizer=off)
SELECT tpcds_collect_answers();
-- Run all 99 queries
SELECT * FROM tpcds;-- Hash distribution (all tables hash-distributed by primary key)
SELECT tpcds_prepare();
-- or explicitly:
SELECT tpcds_prepare('original');
-- Replicated distribution (dimension tables replicated, fact tables hash-distributed)
SELECT tpcds_prepare('replicated');Switching modes drops and recreates all 24 data tables (and clears query stats):
-- Switch from hash to replicated (data must be reloaded)
SELECT tpcds_prepare('replicated');
SELECT * FROM dsdgen(1);
SELECT tpcds_collect_answers();-- Run all 99 queries and show benchmark results
SELECT * FROM tpcds;
Qid | Stable(ms) | Current(ms) | Diff(%) | Result
------+-------------+-------------+---------+--------
01 | 243.99 | 251.74 | +3.18 | true
02 | 1297.74 | 1525.93 | +17.58 | true
...
99 | 984.72 | 1023.06 | +3.89 | true
---- | ----------- | ----------- | ------- |
Sum | 134660.86 | 139667.93 | +3.72 |
-- Run specific queries by ID
SELECT * FROM tpcds(1, 2, 3);
-- Run a single query directly
SELECT query FROM tpcds_queries(1); \gset
:queryColumn descriptions:
| Column | Description |
|---|---|
| Stable(ms) | Best historical duration (from tpcds.tpcds_query_stats) |
| Current(ms) | Duration of the current run |
| Diff(%) | Percentage change: (Current - Stable) / Stable * 100 |
| Result | Whether the query result matches the expected answer set |
- On the first run,
Stable(ms)is empty — results are recorded as the baseline. - On subsequent runs, if the current duration is faster, it replaces the stored baseline.
Resultvalidates query output against answer files generated bytpcds_collect_answers().
Answer files are generated per distribution mode:
-- Generate answer files (auto-detects current distribution mode)
SELECT tpcds_collect_answers();
-- Returns 99 (number of answer files generated)- Runs all 99 queries with
optimizer=offfor deterministic results - Saves sorted, pipe-delimited output to source tree:
src/tpcds/answers/sf1/for hash distributionsrc/tpcds/answers/sf1_replicated/for replicated distribution
RunTPCDSauto-detects the current distribution mode and picks the correct answer directory
-- List all 99 TPC-DS queries
SELECT * FROM tpcds_queries();
-- Get a specific query by ID (1-99)
SELECT * FROM tpcds_queries(42);
-- Truncate all TPC-DS table data
SELECT tpcds_cleanup();
-- Remove the extension (drops all tables, functions, and schema)
DROP EXTENSION pg_tpcds CASCADE;- Queries: 99 SQL statements are provided in
src/tpcds/queries/. Modify them before building to customize the benchmark. - Schema: Table DDL files are in
src/tpcds/schema/. Distribution is applied dynamically at runtime fromdistribution.txtanddistribution_original.txt. - Post-setup: Add index creation or other post-load statements to
src/post_prepare.sql, which runs after table creation. - Distribution: Edit
src/tpcds/distribution.txt(replicated mode) orsrc/tpcds/distribution_original.txt(hash mode) to customize table distribution.