feat: Update SoF benchmark runner to benchmark contract v2 #2641
piotrszul wants to merge 4 commits into
The SQL-on-FHIR benchmark contract advanced to contract v2, splitting
authored intent (the benchmark file) from generated facts (a committed
sibling checkfile) and reshaping the report. Against v2 data the previous
runner could not locate the data, silently lost its correctness check, and
emitted a report rejected by the v2 report schema.
Bring both the Java and Python runners onto contract v2: resolve data
deterministically by dataset name/version/size, source expected row counts
from the checkfile keyed by a stable case id, honour countVariancePermitted,
verify loaded NDJSON against the checkfile sha256 lock, and emit the v2
report shape (structured engine/binding identity, benchmark and dataset
provenance, a required preloaded_repeated scenario with a csv sink, the fixed
{mean,stddev,min,max,median} stats set, per-size resource counts, and results
keyed by the authored suite name).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
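The deterministic data resolution described in this commit can be illustrated with a minimal Python sketch. The function name `locate_dataset` and the fail-fast behaviour are assumptions made for illustration, not the runner's actual API; only the `<dataRoot>/<name>/<version>/<size>/` layout comes from the change description.

```python
from pathlib import Path


def locate_dataset(data_root: Path, name: str, version: str, size: str) -> Path:
    """Return <dataRoot>/<name>/<version>/<size>/ (hypothetical helper).

    The path is computed from the dataset identity alone, so no directory
    scan or manifest lookup is needed.
    """
    candidate = data_root / name / version / size
    # Assumption: a missing directory is treated as a hard error.
    if not candidate.is_dir():
        raise FileNotFoundError(f"materialized dataset not found: {candidate}")
    return candidate
```

Because the location is a pure function of name, version and size, two runs against the same data root always resolve to the same directory.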
Post-implementation cleanup with no behaviour change:
- Collapse the checkfile's three near-identical nested-map parsers into a single
  generic helper.
- Use java.util.HexFormat for sha256 hex encoding instead of a hand-rolled encoder.
- Resolve the Pathling version once in the Python runner rather than twice.
- Simplify the Python checkfile sibling resolution with str.removesuffix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
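As a rough sketch of what sibling resolution with `str.removesuffix` might look like: the helper name `checkfile_for` is hypothetical, and the sketch assumes the benchmark file ends in `.json` and that the checkfile is the `<benchmark>.check.json` sibling in the same directory.

```python
from pathlib import Path


def checkfile_for(benchmark_file: Path) -> Path:
    """Map <benchmark>.json to its sibling <benchmark>.check.json (hypothetical)."""
    # str.removesuffix (Python 3.9+) strips the suffix only when it is present,
    # so a name without ".json" is left unchanged before ".check.json" is added.
    stem = benchmark_file.name.removesuffix(".json")
    return benchmark_file.with_name(f"{stem}.check.json")
```

Unlike slicing by length, `removesuffix` cannot accidentally truncate a name that lacks the expected extension.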
Move the click and pathling imports back to module top and un-nest the click command from main(), reversing a regression that placed imports inside a function. Restore the PathlingContext/DataSource type hints dropped from the runner's function signatures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
End-to-end smoke test: passed ✅
Ran both runners against
Data reproducibility: materialized NDJSON is byte-identical to the committed checkfile lock. All four sha256s (Condition/Observation × s/m) match exactly, confirming the TZ=UTC pinning.
Runners (size
Parity: the two reports are structurally identical apart from
Note: the
Apply the delta spec into the main sof-benchmark-runner capability spec so it describes the implemented contract-v2 behaviour (deterministic data location, checkfile-sourced assertions and sha256 integrity, csv-sink preloaded_repeated timing, and the v2 report shape), and move the completed OpenSpec change to the archive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
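The sha256 integrity check mentioned above could be sketched in Python roughly as follows. The function names and the shape of the lock (a mapping from NDJSON file name to hex digest) are assumptions for illustration; the actual checkfile layout is not shown in this PR text.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large NDJSON files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_against_lock(data_dir: Path, lock: dict[str, str]) -> None:
    """Raise if any file's digest differs from the lock (assumed shape: name -> hex)."""
    mismatched = [
        name for name, expected in lock.items()
        if sha256_of(data_dir / name) != expected.lower()
    ]
    if mismatched:
        raise ValueError(f"sha256 mismatch for: {', '.join(sorted(mismatched))}")
```

Running the check before benchmarking means a stale or regenerated dataset fails loudly instead of producing timings against unexpected data.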
Bring the Java and Python SQL-on-FHIR benchmark runners onto contract v2
(sql-on-fhir submodule 343f398):
- Resolve materialized data deterministically at <dataRoot>/<name>/<version>/<size>/,
replacing the manifest scan (ManifestLocator removed).
- Source expected row counts from the sibling checkfile assertions keyed by a
stable case id; honour countVariancePermitted (a permitted case never flags
count_mismatch).
- Verify loaded NDJSON against the checkfile sha256 lock before benchmarking.
- Emit the v2 report shape: structured implementation engine/binding identity,
benchmark and dataset provenance (scalar benchmarkVersion removed), required
preloaded_repeated scenario with a csv sink, the fixed {mean,stddev,min,max,median}
stats set, per-size resourceCounts, and results keyed by the authored suite name.
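The fixed stats set named in the last item can be sketched in Python as below. The function name `summarise` is hypothetical, and the use of the sample standard deviation (with 0.0 for a single sample) is an assumption; the PR text does not say which estimator the runners use.

```python
import statistics
from typing import Sequence


def summarise(samples: Sequence[float]) -> dict[str, float]:
    """Reduce timing samples to the fixed {mean, stddev, min, max, median} set."""
    if not samples:
        raise ValueError("at least one sample is required")
    return {
        "mean": statistics.fmean(samples),
        # Assumption: sample standard deviation; undefined for n == 1, so use 0.0.
        "stddev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
        "median": statistics.median(samples),
    }
```

Emitting exactly these five keys for every case keeps the report shape uniform regardless of how many repetitions were run.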
Includes new Checkfile/DataLocator/ChecksumVerifier/CorrectnessGuard/Statistics
helpers with unit tests, mirrored Python runner and tests, updated docs, and the
archived OpenSpec change synced into the sof-benchmark-runner capability spec.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
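The row-count check with countVariancePermitted described in this commit might look roughly like the Python sketch below. The function name, the `None` return for a passing case, and the assumption that `assertions[caseId][size]` holds a plain integer row count are all illustrative; only the `assertions[caseId][size]` keying and the `count_mismatch` label come from the PR description.

```python
from typing import Mapping, Optional


def check_count(
    assertions: Mapping[str, Mapping[str, int]],
    case_id: str,
    size: str,
    actual: int,
    count_variance_permitted: bool = False,
) -> Optional[str]:
    """Return "count_mismatch" when the row count is wrong, else None (hypothetical)."""
    # Assumption: the checkfile stores one expected integer per case id and size.
    expected = assertions[case_id][size]
    # A case that permits count variance is never flagged, whatever the count.
    if count_variance_permitted or actual == expected:
        return None
    return "count_mismatch"
```

Keying by a stable case id rather than by position means reordering cases in the benchmark file does not silently pair a case with the wrong expected count.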
Squash-merged locally into


Why

The SQL-on-FHIR benchmark contract advanced to contract v2 (submodule 343f398), splitting authored intent (the benchmark file) from generated facts (a committed sibling checkfile) and reshaping the report. Against v2 data the previous runner could no longer locate the data, silently lost its correctness check, and emitted a report rejected by benchmark-report.schema.json.

This brings both the Java (SofBenchmarkRunner) and Python (sof_runner.py) runners onto contract v2. Follows the accepted implementer feedback in sql-on-fhir #18/#20.

What changed

- Data is resolved at <dataRoot>/<name>/<version>/<size>/ from the dataset's explicit (name, version) identity; this replaces ManifestLocator (deleted).
- Expected row counts come from <benchmark>.check.json assertions[caseId][size], keyed by a stable case id; inline expectCount is gone.
- countVariancePermitted is honoured: a permitted case never reports count_mismatch.
- Loaded NDJSON is verified against the checkfile sha256 lock before benchmarking.
- The v2 report shape is emitted: implementation {engine: Pathling, binding: pathling-java|pathling-python}, benchmark {name, version} + dataset {name, version} provenance (scalar benchmarkVersion removed), required measurement.scenario = preloaded_repeated with sink = csv, stats = {mean, stddev, min, max, median}, per-size resourceCounts, and results keyed by the authored suite name with cases keyed by id.
- sql-on-fhir submodule at 343f398.

New/removed classes

New: Checkfile, DataLocator, ChecksumVerifier, CorrectnessGuard, Statistics (+ Python equivalents). Removed: ManifestLocator.

Testing

- mvn -pl sof-benchmark test: spotless clean, 0 checkstyle violations, 20 tests pass.
- pytest sof-benchmark/python/test_sof_runner.py: 10 tests pass.
- Test resources under src/test/resources/contract-v2/.

Scope note: tests cover all pure logic (parsing, location, checkfile, sha256, correctness, report shape) without Spark. The end-to-end measurement path (Spark load → view execute → CSV extract over materialized data) is wired but not exercised by a test, since it needs a Synthea-materialized dataset (bun run data …). It is worth a manual smoke-run before merge.

OpenSpec

Authored as change update-sof-benchmark-contract-v2 (proposal / design / spec delta / tasks); openspec validate --strict passes.

🤖 Generated with Claude Code