feat: Update SoF benchmark runner to benchmark contract v2 by piotrszul · Pull Request #2641 · aehrc/pathling

piotrszul · 2026-07-01T10:54:17Z

Why

The SQL-on-FHIR benchmark contract advanced to contract v2 (submodule 343f398), splitting authored intent (the benchmark file) from generated facts (a committed sibling checkfile) and reshaping the report. Against v2 data the previous runner could no longer locate the data, silently lost its correctness check, and emitted a report rejected by benchmark-report.schema.json.

This brings both the Java (SofBenchmarkRunner) and Python (sof_runner.py) runners onto contract v2. Follows the accepted implementer feedback in sql-on-fhir #18/#20.

What changed

Deterministic data location at <dataRoot>/<name>/<version>/<size>/ from the dataset's explicit (name, version) identity — replaces ManifestLocator (deleted).
Checkfile-sourced correctness: expected row counts come from <benchmark>.check.json assertions[caseId][size], keyed by a stable case id; inline expectCount is gone.
countVariancePermitted honoured — a permitted case never reports count_mismatch.
Dataset integrity: loaded NDJSON is verified against the checkfile's per-file sha256 lock before benchmarking.
v2 report shape: structured implementation{engine: Pathling, binding: pathling-java|pathling-python}, benchmark{name,version} + dataset{name,version} provenance (scalar benchmarkVersion removed), required measurement.scenario = preloaded_repeated with sink = csv, stats = {mean, stddev, min, max, median}, per-size resourceCounts, and results keyed by the authored suite name with cases keyed by id.
Submodule pointer bumped to 343f398.

New/removed classes

New: Checkfile, DataLocator, ChecksumVerifier, CorrectnessGuard, Statistics (+ Python equivalents). Removed: ManifestLocator.

Testing

Java: mvn -pl sof-benchmark test — spotless clean, 0 checkstyle violations, 20 tests pass.
Python: pytest sof-benchmark/python/test_sof_runner.py — 10 tests pass.
Fixtures (v2 benchmark + checkfile + tiny NDJSON with real checksums) and the vendored report schema live under src/test/resources/contract-v2/.

Scope note: tests cover all pure logic (parsing, location, checkfile, sha256, correctness, report shape) without Spark. The end-to-end measurement path (Spark load → view execute → CSV extract over materialized data) is wired but not exercised by a test, since it needs a Synthea-materialized dataset (bun run data …) — worth a manual smoke-run before merge.

OpenSpec

Authored as change update-sof-benchmark-contract-v2 (proposal / design / spec delta / tasks); openspec validate --strict passes.

🤖 Generated with Claude Code

The SQL-on-FHIR benchmark contract advanced to contract v2, splitting authored intent (the benchmark file) from generated facts (a committed sibling checkfile) and reshaping the report. Against v2 data the previous runner could not locate the data, silently lost its correctness check, and emitted a report rejected by the v2 report schema. Bring both the Java and Python runners onto contract v2: resolve data deterministically by dataset name/version/size, source expected row counts from the checkfile keyed by a stable case id, honour countVariancePermitted, verify loaded NDJSON against the checkfile sha256 lock, and emit the v2 report shape (structured engine/binding identity, benchmark and dataset provenance, a required preloaded_repeated scenario with a csv sink, the fixed {mean,stddev,min,max,median} stats set, per-size resource counts, and results keyed by the authored suite name). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Post-implementation cleanup with no behaviour change: - Collapse the checkfile's three near-identical nested-map parsers into a single generic helper. - Use java.util.HexFormat for sha256 hex encoding instead of a hand-rolled encoder. - Resolve the Pathling version once in the Python runner rather than twice. - Simplify the Python checkfile sibling resolution with str.removesuffix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Move the click and pathling imports back to module top and un-nest the click command from main(), reversing a regression that placed imports inside a function. Restore the PathlingContext/DataSource type hints dropped from the runner's function signatures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

piotrszul · 2026-07-01T11:42:21Z

End-to-end smoke test — passed ✅

Ran both runners against clinical-flat.json over locally-materialized Synthea data (Synthea 3.2.0, sizes s and m).

Data reproducibility: materialized NDJSON is byte-identical to the committed checkfile lock — all four sha256s (Condition/Observation × s/m) match exactly, confirming the TZ=UTC pinning.

Runners (size s):

Deterministic location synthea-clinical/1/s resolved ✓
Checkfile sha256 integrity verified ✓
condition-flat → ok (6406, matches checkfile) ✓
observation-components → ok (4366, matches checkfile) ✓
Reports carry the v2 shape: implementation.engine=Pathling + binding (pathling-java/pathling-python), benchmark/dataset provenance, scenario=preloaded_repeated, sink=csv, stats={mean,stddev,min,max,median}, per-size resourceCounts, results keyed by suite name, cases by id.

Parity: the two reports are structurally identical apart from implementation.binding, environment, and timings — except the engine version string (9.9.0-SNAPSHOT in Java vs 9.9.0.dev0 in Python). Same engine build; the strings differ because each binding resolves its own ecosystem version (per design.md decision #4). Timings land within noise of each other (~155–175 ms/case), as expected for a shared JVM engine differing only by the py4j binding.

Note: the countVariancePermitted path is covered by unit tests (the active-conditions fixture case); the upstream clinical-flat.json has only the two projection cases.

Apply the delta spec into the main sof-benchmark-runner capability spec so it describes the implemented contract-v2 behaviour (deterministic data location, checkfile-sourced assertions and sha256 integrity, csv-sink preloaded_repeated timing, and the v2 report shape), and move the completed OpenSpec change to the archive. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Bring the Java and Python SQL-on-FHIR benchmark runners onto contract v2 (sql-on-fhir submodule 343f398): - Resolve materialized data deterministically at <dataRoot>/<name>/<version>/<size>/, replacing the manifest scan (ManifestLocator removed). - Source expected row counts from the sibling checkfile assertions keyed by a stable case id; honour countVariancePermitted (never flag count_mismatch). - Verify loaded NDJSON against the checkfile sha256 lock before benchmarking. - Emit the v2 report shape: structured implementation engine/binding identity, benchmark and dataset provenance (scalar benchmarkVersion removed), required preloaded_repeated scenario with a csv sink, the fixed {mean,stddev,min,max,median} stats set, per-size resourceCounts, and results keyed by the authored suite name. Includes new Checkfile/DataLocator/ChecksumVerifier/CorrectnessGuard/Statistics helpers with unit tests, mirrored Python runner and tests, updated docs, and the archived OpenSpec change synced into the sof-benchmark-runner capability spec. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

piotrszul · 2026-07-01T23:57:15Z

Squash-merged locally into spike/sof-benchmark as 7f24da4e9a (775ab41e22..7f24da4e9a). Closing since the local squash commit doesn't match the PR commits for GitHub auto-detection. All 4 branch commits are folded into that single commit; content is identical to this PR's head.

sonarqubecloud · 2026-07-02T00:24:02Z

Quality Gate failed

Failed conditions
72.1% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

piotrszul and others added 3 commits July 1, 2026 20:53

piotrszul closed this Jul 1, 2026

piotrszul deleted the feat/sof-benchmark-contract-v2 branch July 1, 2026 23:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Update SoF benchmark runner to benchmark contract v2#2641

feat: Update SoF benchmark runner to benchmark contract v2#2641
piotrszul wants to merge 4 commits into
spike/sof-benchmarkfrom
feat/sof-benchmark-contract-v2

piotrszul commented Jul 1, 2026

Uh oh!

piotrszul commented Jul 1, 2026

Uh oh!

piotrszul commented Jul 1, 2026

Uh oh!

sonarqubecloud Bot commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

piotrszul commented Jul 1, 2026

Why

What changed

New/removed classes

Testing

OpenSpec

Uh oh!

piotrszul commented Jul 1, 2026

End-to-end smoke test — passed ✅

Uh oh!

piotrszul commented Jul 1, 2026

Uh oh!

sonarqubecloud Bot commented Jul 2, 2026

Quality Gate failed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant