Skip to content

feat: Update SoF benchmark runner to benchmark contract v2#2641

Closed
piotrszul wants to merge 4 commits into
spike/sof-benchmarkfrom
feat/sof-benchmark-contract-v2
Closed

feat: Update SoF benchmark runner to benchmark contract v2#2641
piotrszul wants to merge 4 commits into
spike/sof-benchmarkfrom
feat/sof-benchmark-contract-v2

Conversation

@piotrszul

Copy link
Copy Markdown
Collaborator

Why

The SQL-on-FHIR benchmark contract advanced to contract v2 (submodule 343f398), splitting authored intent (the benchmark file) from generated facts (a committed sibling checkfile) and reshaping the report. Against v2 data the previous runner could no longer locate the data, silently lost its correctness check, and emitted a report rejected by benchmark-report.schema.json.

This brings both the Java (SofBenchmarkRunner) and Python (sof_runner.py) runners onto contract v2. Follows the accepted implementer feedback in sql-on-fhir #18/#20.

What changed

  • Deterministic data location at <dataRoot>/<name>/<version>/<size>/ from the dataset's explicit (name, version) identity — replaces ManifestLocator (deleted).
  • Checkfile-sourced correctness: expected row counts come from <benchmark>.check.json assertions[caseId][size], keyed by a stable case id; inline expectCount is gone.
  • countVariancePermitted honoured — a permitted case never reports count_mismatch.
  • Dataset integrity: loaded NDJSON is verified against the checkfile's per-file sha256 lock before benchmarking.
  • v2 report shape: structured implementation{engine: Pathling, binding: pathling-java|pathling-python}, benchmark{name,version} + dataset{name,version} provenance (scalar benchmarkVersion removed), required measurement.scenario = preloaded_repeated with sink = csv, stats = {mean, stddev, min, max, median}, per-size resourceCounts, and results keyed by the authored suite name with cases keyed by id.
  • Submodule pointer bumped to 343f398.

New/removed classes

New: Checkfile, DataLocator, ChecksumVerifier, CorrectnessGuard, Statistics (+ Python equivalents). Removed: ManifestLocator.

Testing

  • Java: mvn -pl sof-benchmark test — spotless clean, 0 checkstyle violations, 20 tests pass.
  • Python: pytest sof-benchmark/python/test_sof_runner.py10 tests pass.
  • Fixtures (v2 benchmark + checkfile + tiny NDJSON with real checksums) and the vendored report schema live under src/test/resources/contract-v2/.

Scope note: tests cover all pure logic (parsing, location, checkfile, sha256, correctness, report shape) without Spark. The end-to-end measurement path (Spark load → view execute → CSV extract over materialized data) is wired but not exercised by a test, since it needs a Synthea-materialized dataset (bun run data …) — worth a manual smoke-run before merge.

OpenSpec

Authored as change update-sof-benchmark-contract-v2 (proposal / design / spec delta / tasks); openspec validate --strict passes.

🤖 Generated with Claude Code

piotrszul and others added 3 commits July 1, 2026 20:53
The SQL-on-FHIR benchmark contract advanced to contract v2, splitting
authored intent (the benchmark file) from generated facts (a committed
sibling checkfile) and reshaping the report. Against v2 data the previous
runner could not locate the data, silently lost its correctness check, and
emitted a report rejected by the v2 report schema.

Bring both the Java and Python runners onto contract v2: resolve data
deterministically by dataset name/version/size, source expected row counts
from the checkfile keyed by a stable case id, honour countVariancePermitted,
verify loaded NDJSON against the checkfile sha256 lock, and emit the v2
report shape (structured engine/binding identity, benchmark and dataset
provenance, a required preloaded_repeated scenario with a csv sink, the fixed
{mean,stddev,min,max,median} stats set, per-size resource counts, and results
keyed by the authored suite name).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Post-implementation cleanup with no behaviour change:
- Collapse the checkfile's three near-identical nested-map parsers into a
  single generic helper.
- Use java.util.HexFormat for sha256 hex encoding instead of a hand-rolled
  encoder.
- Resolve the Pathling version once in the Python runner rather than twice.
- Simplify the Python checkfile sibling resolution with str.removesuffix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move the click and pathling imports back to module top and un-nest the click
command from main(), reversing a regression that placed imports inside a
function. Restore the PathlingContext/DataSource type hints dropped from the
runner's function signatures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@piotrszul

Copy link
Copy Markdown
Collaborator Author

End-to-end smoke test — passed ✅

Ran both runners against clinical-flat.json over locally-materialized Synthea data (Synthea 3.2.0, sizes s and m).

Data reproducibility: materialized NDJSON is byte-identical to the committed checkfile lock — all four sha256s (Condition/Observation × s/m) match exactly, confirming the TZ=UTC pinning.

Runners (size s):

  • Deterministic location synthea-clinical/1/s resolved ✓
  • Checkfile sha256 integrity verified
  • condition-flatok (6406, matches checkfile) ✓
  • observation-componentsok (4366, matches checkfile) ✓
  • Reports carry the v2 shape: implementation.engine=Pathling + binding (pathling-java/pathling-python), benchmark/dataset provenance, scenario=preloaded_repeated, sink=csv, stats={mean,stddev,min,max,median}, per-size resourceCounts, results keyed by suite name, cases by id.

Parity: the two reports are structurally identical apart from implementation.binding, environment, and timings — except the engine version string (9.9.0-SNAPSHOT in Java vs 9.9.0.dev0 in Python). Same engine build; the strings differ because each binding resolves its own ecosystem version (per design.md decision #4). Timings land within noise of each other (~155–175 ms/case), as expected for a shared JVM engine differing only by the py4j binding.

Note: the countVariancePermitted path is covered by unit tests (the active-conditions fixture case); the upstream clinical-flat.json has only the two projection cases.

Apply the delta spec into the main sof-benchmark-runner capability spec so it
describes the implemented contract-v2 behaviour (deterministic data location,
checkfile-sourced assertions and sha256 integrity, csv-sink preloaded_repeated
timing, and the v2 report shape), and move the completed OpenSpec change to the
archive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
piotrszul added a commit that referenced this pull request Jul 1, 2026
Bring the Java and Python SQL-on-FHIR benchmark runners onto contract v2
(sql-on-fhir submodule 343f398):

- Resolve materialized data deterministically at <dataRoot>/<name>/<version>/<size>/,
  replacing the manifest scan (ManifestLocator removed).
- Source expected row counts from the sibling checkfile assertions keyed by a
  stable case id; honour countVariancePermitted (never flag count_mismatch).
- Verify loaded NDJSON against the checkfile sha256 lock before benchmarking.
- Emit the v2 report shape: structured implementation engine/binding identity,
  benchmark and dataset provenance (scalar benchmarkVersion removed), required
  preloaded_repeated scenario with a csv sink, the fixed {mean,stddev,min,max,median}
  stats set, per-size resourceCounts, and results keyed by the authored suite name.

Includes new Checkfile/DataLocator/ChecksumVerifier/CorrectnessGuard/Statistics
helpers with unit tests, mirrored Python runner and tests, updated docs, and the
archived OpenSpec change synced into the sof-benchmark-runner capability spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@piotrszul

Copy link
Copy Markdown
Collaborator Author

Squash-merged locally into spike/sof-benchmark as 7f24da4e9a (775ab41e22..7f24da4e9a). Closing since the local squash commit doesn't match the PR commits for GitHub auto-detection. All 4 branch commits are folded into that single commit; content is identical to this PR's head.

@piotrszul piotrszul closed this Jul 1, 2026
@piotrszul piotrszul deleted the feat/sof-benchmark-contract-v2 branch July 1, 2026 23:59
@sonarqubecloud

sonarqubecloud Bot commented Jul 2, 2026

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
72.1% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant