perf(ci): increase add_measurements base case from 1 to 10 items by kaihowl · Pull Request #733 · kaihowl/git-perf

kaihowl · 2026-06-08T18:27:18Z

Summary

Replace the add_measurements benchmark's [1, 50, 100] item variants with a single 50-item variant, eliminating CI time spent on variants that are never audited
The 1-item variant had 13.8% between-runner CoV (subprocess startup variance dominated); the 50-item variant averages across 50 subprocess calls, reducing CoV to ~8%
Add criterion-json mode to benchmark-study.yml so Criterion benchmarks can be studied directly (previously the workflow only supported git perf measure wall-clock timing)
Serialize study instances with max-parallel: 1 to prevent concurrent runner VMs from competing for CPU on shared physical hosts, which inflated CoV by ~4× in the parallel study (34.8% vs 8.0% sequential)
Calibrate add_measurement/50::median thresholds from a sequential 10-instance study: CoV=8.0%, MAD%=4.6% → min_relative_deviation=12, max_cov=16

Study results

Two studies were run:

Parallel (10 concurrent instances): CoV=34.8% — inflated by runner resource contention
Sequential (10 instances, max-parallel: 1): CoV=8.0% ✅ — matches historical CI variance
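As a rough illustration of the two statistics quoted in these studies, the sketch below computes a between-runner CoV and a MAD% from a list of per-instance timings. The exact definitions git perf study uses are not stated here, so the choice of sample standard deviation over mean for CoV, and median absolute deviation over median for MAD%, is an assumption. The function names are hypothetical.

```python
import statistics


def cov_percent(values):
    """Coefficient of variation in percent.

    Assumption: sample standard deviation divided by the mean.
    """
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def mad_percent(values):
    """Median absolute deviation relative to the median, in percent.

    Assumption: unscaled MAD (no 1.4826 normal-consistency factor).
    """
    med = statistics.median(values)
    deviations = [abs(v - med) for v in values]
    return 100.0 * statistics.median(deviations) / med
```

Feeding one timing per study instance into both functions gives numbers comparable in spirit to the CoV and MAD% figures above. MAD% is much less sensitive to a single outlier runner than CoV.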

Test plan

CI passes with the new single add_measurement/50 variant
Audit activates after 3 CI runs accumulate past the epoch
Future benchmark studies use mode: criterion-json + max-parallel: 1 for accurate CoV measurement

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk

Per-element time with a single item is dominated by subprocess startup variance (CoV 13.8%). Using 10 items per iteration averages across 10 independent subprocess calls, reducing CoV by a factor of ~√10. Update the audit measurement name, reset its epoch, and set conservative thresholds (min_relative_deviation=10, max_cov=20) pending a new calibration study.

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk
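The √N argument in this commit message can be made concrete with a small calculation. This is an idealized sketch: it assumes each subprocess call contributes independent, identically distributed noise, so it says nothing about variance shared by all calls on one runner. The function name is hypothetical.

```python
import math


def expected_cov_after_averaging(single_item_cov_percent, n_items):
    """Idealized CoV of a mean over n_items independent samples.

    Averaging n independent samples divides the standard deviation
    by sqrt(n) while leaving the mean unchanged.
    """
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    return single_item_cov_percent / math.sqrt(n_items)


# 13.8% single-item CoV averaged over 10 items: roughly 4.4%
ten_item_estimate = expected_cov_after_averaging(13.8, 10)
```

This is a lower bound on what to expect in practice. Runner-to-runner hardware differences affect every call in a run equally and are not reduced by averaging within that run.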

github-actions · 2026-06-08T18:32:15Z

🧬 PR Mutation Testing Results

Status: ⚠️ No mutation report generated
Note: This may indicate no mutants were generated from the diff.

📦 Artifacts

Download full mutation report

Running [10,50,100] variants wastes CI time; only the single audited variant needs to run. Historical CI data shows the 50-item variant has ~3.3% cross-runner CoV (vs 13.8% for 1 item), so it is retained as the sole variant. Update audit target to add_measurement/50, tighten thresholds (min_relative_deviation=7, max_cov=10) accordingly, and reset epoch for a clean baseline.

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk

github-actions · 2026-06-08T18:36:12Z

Performance Report

⏱ Performance Results

Audit Results

Auditing measurement "add-benchmark" (os=ubuntu-22.04/rust=beta):
  ⚠️  WARNING: Change points detected in current epoch for 'add-benchmark':
     commit 6e3baa3 (-36.1%)
     commit ee82970 (+63.3%)
     commit 5533710 (-39.0%)
     commit 8ef210c (+53.1%)
     commit 3541c6a (-38.1%)
     Historical z-score comparison may be unreliable due to regime shift.
     Consider bumping epoch or investigating the change.
  ✅ 'add-benchmark'
  Aggregation: min
  z-score (mad): ↑ 1.04
  Head: μ: 23.7ms σ: N/A MAD: 0ns n: 1
  Tail: μ: 21.8ms σ: 4ms MAD: 1.9ms n: 31
   [-38.10% – +10.13%] █▇▇█▆██▇█▇▂▂█▂▇▇█▆▇█▆▁▁▁▇▆▄▁▄▄▄▇

Auditing measurement "add-benchmark" (os=ubuntu-22.04/rust=stable):
  ⚠️  WARNING: Change points detected in current epoch for 'add-benchmark':
     commit b194e1c (-63.3%)
     commit c2dc7a7 (+187.3%)
     Historical z-score comparison may be unreliable due to regime shift.
     Consider bumping epoch or investigating the change.
  ✅ 'add-benchmark'
  Aggregation: min
  z-score (mad): ↓ 2.94
  Head: μ: 15.6ms σ: N/A MAD: 0ns n: 1
  Tail: μ: 21.3ms σ: 4.6ms MAD: 1.9ms n: 28
   [-64.47% – +10.59%] ▄▇▇▄█▇▁▇█▇▇█▇▇▆▇█▄▇▄▇▇▇▃▆▆▆▃▄

Auditing measurement "bench::add_measurements/add_measurement/50::median" (os=ubuntu-22.04/rust=beta):
  ⏭️ 'bench::add_measurements/add_measurement/50::median'
  Only 0 historical measurements found. Less than requested min_measurements of 3. Skipping test.
  Head: μ: 860.6ms σ: N/A MAD: 0ns n: 1

Auditing measurement "bench::add_measurements/add_measurement/50::median" (os=ubuntu-22.04/rust=stable):
  ⏭️ 'bench::add_measurements/add_measurement/50::median'
  Only 0 historical measurements found. Less than requested min_measurements of 3. Skipping test.
  Head: μ: 559ms σ: N/A MAD: 0ns n: 1

Auditing measurement "bench::report/report_generation/10::median" (os=ubuntu-22.04/rust=beta):
  ✅ 'bench::report/report_generation/10::median'
  Aggregation: median
  z-score (mad): ↓ 2.43
  Head: μ: 21.8ms σ: N/A MAD: 0ns n: 1
  Tail: μ: 22.7ms σ: 840.5µs MAD: 355.4µs n: 7
   [-6.93% – +1.53%] ▇▁▃█▁▇▇▂

Auditing measurement "bench::report/report_generation/10::median" (os=ubuntu-22.04/rust=stable):
  ✅ 'bench::report/report_generation/10::median'
  Aggregation: median
  z-score (mad): ↓ 2.75
  Head: μ: 14.3ms σ: N/A MAD: 0ns n: 1
  Tail: μ: 18.2ms σ: 6.3ms MAD: 1.4ms n: 7
   [-67.12% – +6.48%] ▄██▄█▇▁▄

Auditing measurement "release-binary-size" (os=ubuntu-22.04/rust=stable):
  ✅ 'release-binary-size'
  Aggregation: min
  z-score (stddev): ↑ 1.04
  Head: μ: 8.3MB σ: N/A MAD: 0B n: 1
  Tail: μ: 8.2MB σ: 78.6kB MAD: 35.3kB n: 36
   [-2.57% – +0.61%] ████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▅▅▅▅▅▅▅▁▂▂▂▂▂▂█

Auditing measurement "report" (os=ubuntu-22.04/rust=beta):
  ⚠️  WARNING: Change points detected in current epoch for 'report':
     commit 6e3baa3 (-36.1%)
     commit ee82970 (+61.2%)
     commit 5533710 (-37.7%)
     commit 8ef210c (+53.8%)
     commit 3541c6a (-34.4%)
     commit dc90cee (+53.1%)
     Historical z-score comparison may be unreliable due to regime shift.
     Consider bumping epoch or investigating the change.
  ✅ 'report'
  Aggregation: min
  z-score (stddev): ↑ 0.50
  Head: μ: 27.8ms σ: N/A MAD: 0ns n: 1
  Tail: μ: 25.6ms σ: 4.2ms MAD: 2.3ms n: 33
   [-33.89% – +12.31%] █▆▇█▆█▇▆▇▆▁▁▇▁▆▇▇▇▆▇▆▁▁▁▇▆▅▁▄▄▄▅▅▆

Auditing measurement "report" (os=ubuntu-22.04/rust=stable):
  ⚠️  WARNING: Change points detected in current epoch for 'report':
     commit b194e1c (-63.8%)
     commit c2dc7a7 (+189.9%)
     commit d21b515 (+33.7%)
     Historical z-score comparison may be unreliable due to regime shift.
     Consider bumping epoch or investigating the change.
  ✅ 'report'
  Aggregation: min
  z-score (stddev): ↓ 1.26
  Head: μ: 18.8ms σ: N/A MAD: 0ns n: 1
  Tail: μ: 25.1ms σ: 5ms MAD: 2.1ms n: 30
   [-64.80% – +9.11%] ▄█▇▄█▇▁▇█▇██▇█▆▇█▄▇▄█▇█▄▆▆▇▄▆▆▄

Auditing measurement "report-benchmark" (os=ubuntu-22.04/rust=beta):
  ⚠️  WARNING: Change points detected in current epoch for 'report-benchmark':
     commit 6e3baa3 (-35.5%)
     commit ee82970 (+59.6%)
     commit 5533710 (-37.9%)
     commit 8ef210c (+53.9%)
     commit 3541c6a (-34.3%)
     commit dc90cee (+53.0%)
     Historical z-score comparison may be unreliable due to regime shift.
     Consider bumping epoch or investigating the change.
  ✅ 'report-benchmark'
  Aggregation: min
  z-score (stddev): ↑ 0.44
  Head: μ: 27.7ms σ: N/A MAD: 0ns n: 1
  Tail: μ: 25.8ms σ: 4.3ms MAD: 2.1ms n: 31
   [-33.53% – +11.44%] █▆▇█▆▇▇▆▇▆▁▁▇▁▆▇▇▆▆▇▆▁▁▁▇▆▅▁▅▄▅▆

Auditing measurement "report-benchmark" (os=ubuntu-22.04/rust=stable):
  ⚠️  WARNING: Change points detected in current epoch for 'report-benchmark':
     commit b194e1c (-64.1%)
     commit c2dc7a7 (+190.0%)
     Historical z-score comparison may be unreliable due to regime shift.
     Consider bumping epoch or investigating the change.
  ✅ 'report-benchmark'
  Aggregation: min
  z-score (stddev): ↓ 1.27
  Head: μ: 18.6ms σ: N/A MAD: 0ns n: 1
  Tail: μ: 25.2ms σ: 5.2ms MAD: 2ms n: 28
   [-65.10% – +10.16%] ▄▇▇▄█▇▁▇█▇▇█▇▇▆▇█▄▇▄█▇█▄▆▆▆▄▄

Auditing measurement "report-size" (os=ubuntu-22.04/rust=beta):
  ✅ 'report-size'
  Aggregation: min
  z-score (stddev): →
  Head: μ: 20.3kB σ: N/A MAD: 0B n: 1
  Tail: μ: 20.3kB σ: 0B MAD: 0B n: 33
   [+0.00% – +0.00%] ▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅

Auditing measurement "report-size" (os=ubuntu-22.04/rust=stable):
  ✅ 'report-size'
  Aggregation: min
  z-score (stddev): →
  Head: μ: 20.3kB σ: N/A MAD: 0B n: 1
  Tail: μ: 20.3kB σ: 0B MAD: 0B n: 30
   [+0.00% – +0.00%] ▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅

Auditing measurement "report-size-benchmark" (os=ubuntu-22.04/rust=beta):
  ✅ 'report-size-benchmark'
  Aggregation: median
  z-score (stddev): →
  Head: μ: 20.3kB σ: N/A MAD: 0B n: 1
  Tail: μ: 20.3kB σ: 0B MAD: 0B n: 33
   [+0.00% – +0.00%] ▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅

Auditing measurement "report-size-benchmark" (os=ubuntu-22.04/rust=stable):
  ✅ 'report-size-benchmark'
  Aggregation: median
  z-score (stddev): →
  Head: μ: 20.3kB σ: N/A MAD: 0B n: 1
  Tail: μ: 20.3kB σ: 0B MAD: 0B n: 30
   [+0.00% – +0.00%] ▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅

Overall: PASSED (15/15 groups passed)

Measurement Storage Size

Live Measurement Size Report
============================

⚠️  Shallow clone detected - measurement counts may be incomplete (see FAQ)

Number of commits with measurements: 368
Total measurement data size (on-disk (compressed)): 527.1kB

Repository Statistics (for context):
-------------------------------------
  Loose objects: 0 (0B)
  Packed objects: 20346 (3.9MB)
  Total repository size: 3.9MB

Created by git-perf

The study workflow previously only supported git perf measure (wall-clock subprocess timing). Criterion benchmarks produce per-item statistics via cargo criterion's JSON output, which must be imported with git perf import criterion-json rather than timed as a subprocess.

Add a mode input (measure / criterion-json, default: measure):

- measure: existing behaviour, where git perf measure -n $reps times the command
- criterion-json: pipes the command into git perf import criterion-json with --metadata group=$INSTANCE so git perf study can compute between-runner CoV; cargo-criterion is installed automatically

reps_per_instance is ignored in criterion-json mode since criterion collects its own samples internally.

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk

Cross-runner study (10 independent instances) measured:

- between-runner CoV = 34.8%
- within-runner MAD% = 1.4% (MAD/σ = 0.04, outlier runners detected)

The benchmark is precise within any given runner but GitHub shared-runner hardware heterogeneity dominates the between-run variance. Set min_relative_deviation=52 (1.5× CoV) and max_cov=70 (2× CoV) per the study recommendation.

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk

The first study run (parallel, 10 instances) measured CoV=34.8%, inflated by resource contention between concurrent runner VMs sharing physical hosts. Historical sequential CI data shows ~3-5% CoV, reflecting real conditions. Add max-parallel: 1 to the study measure job so instances run one at a time, eliminating contention artifacts and matching production CI behaviour. Reset min_relative_deviation=10 and max_cov=20 conservatively pending a fresh sequential study.

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk

Sequential cross-runner study (10 instances, max-parallel=1) measured CoV=8.0%, MAD%=4.6%, confirming the earlier 34.8% was an artifact of concurrent runner resource contention, not genuine benchmark noise. Apply study recommendations: aggregate_by=min, min_relative_deviation=12 (1.5× CoV), max_cov=16 (2× CoV), min_measurements=3.

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk
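The threshold rule quoted in these commit messages (min_relative_deviation at 1.5× CoV, max_cov at 2× CoV) is simple enough to check by hand. The sketch below reproduces the arithmetic for both studies. Rounding to the nearest integer is an assumption inferred from the quoted values, and the function name is hypothetical rather than part of git-perf.

```python
def recommend_thresholds(cov_percent):
    """Derive audit thresholds from a measured between-runner CoV.

    Uses the multipliers stated in the commit messages:
    min_relative_deviation = 1.5 x CoV, max_cov = 2 x CoV.
    Rounding to whole percent is an assumption.
    """
    return {
        "min_relative_deviation": round(1.5 * cov_percent),
        "max_cov": round(2.0 * cov_percent),
    }


parallel = recommend_thresholds(34.8)    # parallel study, contention-inflated
sequential = recommend_thresholds(8.0)   # sequential study, max-parallel: 1
```

With these inputs the parallel study yields 52 and 70, and the sequential study yields 12 and 16, matching the values applied in the respective commits.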


claude added 5 commits June 8, 2026 19:12

ci: document why benchmark study instances run serially

1731484

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk

kaihowl marked this pull request as ready for review June 9, 2026 11:13

kaihowl merged commit 01c84f0 into master Jun 9, 2026
23 checks passed

kaihowl deleted the claude/benchmark-stability-8lq5ck branch June 9, 2026 11:13

release-plz-token Bot mentioned this pull request Jun 8, 2026

chore: release #690

Open

kaihowl merged 7 commits into master from claude/benchmark-stability-8lq5ck
