perf(ci): increase add_measurements base case from 1 to 10 items#733

Merged
kaihowl merged 7 commits into master from claude/benchmark-stability-8lq5ck
Jun 9, 2026

@kaihowl kaihowl commented Jun 8, 2026

Owner

Summary

  • Replace the add_measurements benchmark's [1, 50, 100] item variants with a single 50-item variant, eliminating CI time spent on variants that are never audited
  • The 1-item variant had 13.8% between-runner CoV (subprocess startup variance dominated); 50 items averages across 50 subprocess calls, reducing CoV to ~8%
  • Add criterion-json mode to benchmark-study.yml so Criterion benchmarks can be studied directly (previously the workflow only supported git perf measure wall-clock timing)
  • Serialize study instances with max-parallel: 1 to prevent concurrent runner VMs from competing for CPU on shared physical hosts, which inflated CoV by ~4× in the parallel study (34.8% vs 8.0% sequential)
  • Calibrate add_measurement/50::median thresholds from a sequential 10-instance study: CoV=8.0%, MAD%=4.6% → min_relative_deviation=12, max_cov=16

Study results

Two studies were run:

  • Parallel (10 concurrent instances): CoV=34.8% — inflated by runner resource contention
  • Sequential (10 instances, max-parallel: 1): CoV=8.0% ✅ — matches historical CI variance
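
For readers unfamiliar with the two statistics quoted in these results, here is a minimal Python sketch of how a between-runner CoV and a MAD% could be computed from one aggregated value per runner. The exact definitions used by git perf study are not shown in this PR, so the formulas (sample standard deviation over mean, median absolute deviation over median), the function names, and the sample values are assumptions for illustration only.

```python
import statistics


def cov_percent(values):
    """Coefficient of variation: sample standard deviation / mean, in percent."""
    return statistics.stdev(values) / statistics.mean(values) * 100.0


def mad_percent(values):
    """Median absolute deviation relative to the median, in percent."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return mad / med * 100.0


# Hypothetical per-runner medians in milliseconds (not data from the study).
runner_medians_ms = [100.0, 110.0, 90.0, 105.0, 95.0]

print(f"CoV  = {cov_percent(runner_medians_ms):.1f}%")
print(f"MAD% = {mad_percent(runner_medians_ms):.1f}%")
```

CoV reacts strongly to a single outlier runner, while MAD% is robust to it, which is why a large gap between the two can hint at outlier runners rather than uniformly noisy measurements.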

Test plan

  • CI passes with the new single add_measurement/50 variant
  • Audit activates after 3 CI runs accumulate past the epoch
  • Future benchmark studies use mode: criterion-json + max-parallel: 1 for accurate CoV measurement

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk

Per-element time with a single item is dominated by subprocess startup
variance (CoV 13.8%). Using 10 items per iteration averages across 10
subprocess calls, which should reduce CoV by a factor of ~√10 if the calls
are independent. Update the audit measurement name, reset its epoch, and set
conservative thresholds (min_relative_deviation=10, max_cov=20) pending a
new calibration study.

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk
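
The √10 figure in the commit message above follows from the standard result that the mean of N independent, identically distributed samples has its standard deviation reduced by √N. A small Python sketch of that arithmetic, assuming the per-call timings really are independent (the function name is hypothetical):

```python
import math


def expected_cov_after_averaging(single_call_cov_percent, calls_per_iteration):
    """Idealized CoV of a mean over N independent calls: CoV / sqrt(N)."""
    return single_call_cov_percent / math.sqrt(calls_per_iteration)


# 13.8% is the 1-item CoV quoted in the commit message above.
for n in (1, 10, 50):
    print(n, round(expected_cov_after_averaging(13.8, n), 2))
```

This is a best case. Variance shared by every call in a run (for example, a slow runner) does not average out, which would be consistent with the higher CoV values reported for the 50-item variant elsewhere in this PR.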

github-actions Bot commented Jun 8, 2026

Contributor

🧬 PR Mutation Testing Results

Status: ⚠️ No mutation report generated
Note: This may indicate no mutants were generated from the diff.

📦 Artifacts

Download full mutation report

Running [10,50,100] variants wastes CI time; only the single audited
variant needs to run. Historical CI data shows the 50-item variant has
~3.3% cross-runner CoV (vs 13.8% for 1 item), so it is retained as the
sole variant. Update audit target to add_measurement/50, tighten
thresholds (min_relative_deviation=7, max_cov=10) accordingly, and
reset epoch for a clean baseline.

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk

github-actions Bot commented Jun 8, 2026

Contributor

Performance Report

Performance Results

Audit Results

Auditing measurement "add-benchmark" (os=ubuntu-22.04/rust=beta):
  ⚠️  WARNING: Change points detected in current epoch for 'add-benchmark':
     commit 6e3baa3 (-36.1%)
     commit ee82970 (+63.3%)
     commit 5533710 (-39.0%)
     commit 8ef210c (+53.1%)
     commit 3541c6a (-38.1%)
     Historical z-score comparison may be unreliable due to regime shift.
     Consider bumping epoch or investigating the change.
  ✅ 'add-benchmark'
  Aggregation: min
  z-score (mad): ↑ 1.04
  Head: μ: 23.7ms σ: N/A MAD: 0ns n: 1
  Tail: μ: 21.8ms σ: 4ms MAD: 1.9ms n: 31
   [-38.10% – +10.13%] █▇▇█▆██▇█▇▂▂█▂▇▇█▆▇█▆▁▁▁▇▆▄▁▄▄▄▇

Auditing measurement "add-benchmark" (os=ubuntu-22.04/rust=stable):
  ⚠️  WARNING: Change points detected in current epoch for 'add-benchmark':
     commit b194e1c (-63.3%)
     commit c2dc7a7 (+187.3%)
     Historical z-score comparison may be unreliable due to regime shift.
     Consider bumping epoch or investigating the change.
  ✅ 'add-benchmark'
  Aggregation: min
  z-score (mad): ↓ 2.94
  Head: μ: 15.6ms σ: N/A MAD: 0ns n: 1
  Tail: μ: 21.3ms σ: 4.6ms MAD: 1.9ms n: 28
   [-64.47% – +10.59%] ▄▇▇▄█▇▁▇█▇▇█▇▇▆▇█▄▇▄▇▇▇▃▆▆▆▃▄

Auditing measurement "bench::add_measurements/add_measurement/50::median" (os=ubuntu-22.04/rust=beta):
  ⏭️ 'bench::add_measurements/add_measurement/50::median'
  Only 0 historical measurements found. Less than requested min_measurements of 3. Skipping test.
  Head: μ: 860.6ms σ: N/A MAD: 0ns n: 1

Auditing measurement "bench::add_measurements/add_measurement/50::median" (os=ubuntu-22.04/rust=stable):
  ⏭️ 'bench::add_measurements/add_measurement/50::median'
  Only 0 historical measurements found. Less than requested min_measurements of 3. Skipping test.
  Head: μ: 559ms σ: N/A MAD: 0ns n: 1

Auditing measurement "bench::report/report_generation/10::median" (os=ubuntu-22.04/rust=beta):
  ✅ 'bench::report/report_generation/10::median'
  Aggregation: median
  z-score (mad): ↓ 2.43
  Head: μ: 21.8ms σ: N/A MAD: 0ns n: 1
  Tail: μ: 22.7ms σ: 840.5µs MAD: 355.4µs n: 7
   [-6.93% – +1.53%] ▇▁▃█▁▇▇▂

Auditing measurement "bench::report/report_generation/10::median" (os=ubuntu-22.04/rust=stable):
  ✅ 'bench::report/report_generation/10::median'
  Aggregation: median
  z-score (mad): ↓ 2.75
  Head: μ: 14.3ms σ: N/A MAD: 0ns n: 1
  Tail: μ: 18.2ms σ: 6.3ms MAD: 1.4ms n: 7
   [-67.12% – +6.48%] ▄██▄█▇▁▄

Auditing measurement "release-binary-size" (os=ubuntu-22.04/rust=stable):
  ✅ 'release-binary-size'
  Aggregation: min
  z-score (stddev): ↑ 1.04
  Head: μ: 8.3MB σ: N/A MAD: 0B n: 1
  Tail: μ: 8.2MB σ: 78.6kB MAD: 35.3kB n: 36
   [-2.57% – +0.61%] ████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▅▅▅▅▅▅▅▁▂▂▂▂▂▂█

Auditing measurement "report" (os=ubuntu-22.04/rust=beta):
  ⚠️  WARNING: Change points detected in current epoch for 'report':
     commit 6e3baa3 (-36.1%)
     commit ee82970 (+61.2%)
     commit 5533710 (-37.7%)
     commit 8ef210c (+53.8%)
     commit 3541c6a (-34.4%)
     commit dc90cee (+53.1%)
     Historical z-score comparison may be unreliable due to regime shift.
     Consider bumping epoch or investigating the change.
  ✅ 'report'
  Aggregation: min
  z-score (stddev): ↑ 0.50
  Head: μ: 27.8ms σ: N/A MAD: 0ns n: 1
  Tail: μ: 25.6ms σ: 4.2ms MAD: 2.3ms n: 33
   [-33.89% – +12.31%] █▆▇█▆█▇▆▇▆▁▁▇▁▆▇▇▇▆▇▆▁▁▁▇▆▅▁▄▄▄▅▅▆

Auditing measurement "report" (os=ubuntu-22.04/rust=stable):
  ⚠️  WARNING: Change points detected in current epoch for 'report':
     commit b194e1c (-63.8%)
     commit c2dc7a7 (+189.9%)
     commit d21b515 (+33.7%)
     Historical z-score comparison may be unreliable due to regime shift.
     Consider bumping epoch or investigating the change.
  ✅ 'report'
  Aggregation: min
  z-score (stddev): ↓ 1.26
  Head: μ: 18.8ms σ: N/A MAD: 0ns n: 1
  Tail: μ: 25.1ms σ: 5ms MAD: 2.1ms n: 30
   [-64.80% – +9.11%] ▄█▇▄█▇▁▇█▇██▇█▆▇█▄▇▄█▇█▄▆▆▇▄▆▆▄

Auditing measurement "report-benchmark" (os=ubuntu-22.04/rust=beta):
  ⚠️  WARNING: Change points detected in current epoch for 'report-benchmark':
     commit 6e3baa3 (-35.5%)
     commit ee82970 (+59.6%)
     commit 5533710 (-37.9%)
     commit 8ef210c (+53.9%)
     commit 3541c6a (-34.3%)
     commit dc90cee (+53.0%)
     Historical z-score comparison may be unreliable due to regime shift.
     Consider bumping epoch or investigating the change.
  ✅ 'report-benchmark'
  Aggregation: min
  z-score (stddev): ↑ 0.44
  Head: μ: 27.7ms σ: N/A MAD: 0ns n: 1
  Tail: μ: 25.8ms σ: 4.3ms MAD: 2.1ms n: 31
   [-33.53% – +11.44%] █▆▇█▆▇▇▆▇▆▁▁▇▁▆▇▇▆▆▇▆▁▁▁▇▆▅▁▅▄▅▆

Auditing measurement "report-benchmark" (os=ubuntu-22.04/rust=stable):
  ⚠️  WARNING: Change points detected in current epoch for 'report-benchmark':
     commit b194e1c (-64.1%)
     commit c2dc7a7 (+190.0%)
     Historical z-score comparison may be unreliable due to regime shift.
     Consider bumping epoch or investigating the change.
  ✅ 'report-benchmark'
  Aggregation: min
  z-score (stddev): ↓ 1.27
  Head: μ: 18.6ms σ: N/A MAD: 0ns n: 1
  Tail: μ: 25.2ms σ: 5.2ms MAD: 2ms n: 28
   [-65.10% – +10.16%] ▄▇▇▄█▇▁▇█▇▇█▇▇▆▇█▄▇▄█▇█▄▆▆▆▄▄

Auditing measurement "report-size" (os=ubuntu-22.04/rust=beta):
  ✅ 'report-size'
  Aggregation: min
  z-score (stddev): →
  Head: μ: 20.3kB σ: N/A MAD: 0B n: 1
  Tail: μ: 20.3kB σ: 0B MAD: 0B n: 33
   [+0.00% – +0.00%] ▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅

Auditing measurement "report-size" (os=ubuntu-22.04/rust=stable):
  ✅ 'report-size'
  Aggregation: min
  z-score (stddev): →
  Head: μ: 20.3kB σ: N/A MAD: 0B n: 1
  Tail: μ: 20.3kB σ: 0B MAD: 0B n: 30
   [+0.00% – +0.00%] ▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅

Auditing measurement "report-size-benchmark" (os=ubuntu-22.04/rust=beta):
  ✅ 'report-size-benchmark'
  Aggregation: median
  z-score (stddev): →
  Head: μ: 20.3kB σ: N/A MAD: 0B n: 1
  Tail: μ: 20.3kB σ: 0B MAD: 0B n: 33
   [+0.00% – +0.00%] ▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅

Auditing measurement "report-size-benchmark" (os=ubuntu-22.04/rust=stable):
  ✅ 'report-size-benchmark'
  Aggregation: median
  z-score (stddev): →
  Head: μ: 20.3kB σ: N/A MAD: 0B n: 1
  Tail: μ: 20.3kB σ: 0B MAD: 0B n: 30
   [+0.00% – +0.00%] ▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅

Overall: PASSED (15/15 groups passed)
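
As a reading aid for the z-score lines in the report, the sketch below shows the conventional calculation: the distance between the head value and the tail mean, divided by the tail's dispersion (σ or MAD). Whether git-perf applies exactly this formula, or a scaling constant for MAD, is not stated in this PR, so treat it as an assumption; the function name is hypothetical.

```python
def z_score(head, tail_mean, tail_dispersion):
    """Absolute distance of head from the tail mean, in units of dispersion."""
    if tail_dispersion == 0:
        raise ValueError("dispersion is zero; z-score is undefined")
    return abs(head - tail_mean) / tail_dispersion


# Rounded values from the 'add-benchmark' (rust=beta) entry: head 23.7 ms,
# tail mean 21.8 ms, tail MAD 1.9 ms. The report prints 1.04; the inputs here
# are already rounded, so an exact match is not expected.
print(round(z_score(23.7, 21.8, 1.9), 2))
```

The zero-dispersion guard corresponds to entries such as report-size, where σ and MAD are both 0 and the report shows an arrow without a number.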

Measurement Storage Size

Live Measurement Size Report
============================

⚠️  Shallow clone detected - measurement counts may be incomplete (see FAQ)

Number of commits with measurements: 368
Total measurement data size (on-disk (compressed)): 527.1kB

Repository Statistics (for context):
-------------------------------------
  Loose objects: 0 (0B)
  Packed objects: 20346 (3.9MB)
  Total repository size: 3.9MB

Created by git-perf

claude added 5 commits June 8, 2026 19:12
The study workflow previously only supported git perf measure (wall-clock
subprocess timing).  Criterion benchmarks produce per-item statistics via
cargo criterion's JSON output, which must be imported with
git perf import criterion-json rather than timed as a subprocess.

Add a mode input (measure / criterion-json, default: measure):
- measure: existing behaviour — git perf measure -n $reps times the command
- criterion-json: pipes the command into git perf import criterion-json
  with --metadata group=$INSTANCE so git perf study can compute
  between-runner CoV; cargo-criterion is installed automatically

reps_per_instance is ignored in criterion-json mode since criterion
collects its own samples internally.

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk

Cross-runner study (10 independent instances) measured:
  between-runner CoV = 34.8%
  within-runner MAD% = 1.4%  (MAD/σ = 0.04 — outlier runners detected)

The benchmark is precise within any given runner but GitHub shared-runner
hardware heterogeneity dominates the between-run variance.  Set
min_relative_deviation=52 (1.5× CoV) and max_cov=70 (2× CoV) per the
study recommendation.

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk

The first study run (parallel, 10 instances) measured CoV=34.8% — inflated
by resource contention between concurrent runner VMs sharing physical hosts.
Historical sequential CI data shows ~3-5% CoV, reflecting real conditions.

Add max-parallel: 1 to the study measure job so instances run one at a time,
eliminating contention artifacts and matching production CI behaviour.

Reset min_relative_deviation=10 and max_cov=20 conservatively pending a
fresh sequential study.

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk

Sequential cross-runner study (10 instances, max-parallel=1) measured
CoV=8.0%, MAD%=4.6% — confirming the earlier 34.8% was an artifact of
concurrent runner resource contention, not genuine benchmark noise.

Apply study recommendations: aggregate_by=min, min_relative_deviation=12
(1.5× CoV), max_cov=16 (2× CoV), min_measurements=3.

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk
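
The commit messages above quote thresholds as 1.5× CoV for min_relative_deviation and 2× CoV for max_cov. The Python sketch below reproduces that arithmetic for both studies. Rounding to the nearest integer is an assumption inferred from the quoted values, and the function name is hypothetical rather than part of git-perf.

```python
def recommend_thresholds(cov_percent):
    """Return (min_relative_deviation, max_cov) as 1.5x and 2x CoV, rounded."""
    return round(1.5 * cov_percent), round(2.0 * cov_percent)


print(recommend_thresholds(34.8))  # parallel study CoV
print(recommend_thresholds(8.0))   # sequential study CoV
```

With these inputs the function yields (52, 70) and (12, 16), matching the values set in the respective commits.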
@kaihowl kaihowl marked this pull request as ready for review June 9, 2026 11:13
@kaihowl kaihowl merged commit 01c84f0 into master Jun 9, 2026
23 checks passed
@kaihowl kaihowl deleted the claude/benchmark-stability-8lq5ck branch June 9, 2026 11:13
@release-plz-token release-plz-token Bot mentioned this pull request Jun 8, 2026

2 participants