perf(ci): increase add_measurements base case from 1 to 10 items #733
Merged
Per-element time with a single item is dominated by subprocess startup variance (CoV 13.8%). Using 10 items per iteration averages across 10 independent subprocess calls, reducing CoV by a factor of ~√10. Update the audit measurement name, reset its epoch, and set conservative thresholds (min_relative_deviation=10, max_cov=20) pending a new calibration study. https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk
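The ~√10 figure follows from the standard error of a mean, assuming the 10 subprocess calls are independent and identically distributed. A minimal sketch of that arithmetic (the function name `cov_of_mean` is illustrative and not part of git-perf):

```python
import math


def cov_of_mean(single_item_cov_pct: float, n_items: int) -> float:
    """Expected CoV (%) of the mean of n independent, identically distributed samples."""
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    # Averaging n independent samples shrinks the standard deviation by sqrt(n)
    # while leaving the mean unchanged, so the CoV shrinks by the same factor.
    return single_item_cov_pct / math.sqrt(n_items)


# 13.8% is the single-item CoV quoted above; 10 is the new base case.
expected = cov_of_mean(13.8, 10)
print(f"expected per-element CoV with 10 items: {expected:.1f}%")
```

Under these assumptions the 10-item base case should land at roughly 4.4% CoV. Real runs may deviate if subprocess startup costs are correlated across calls.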
Running [10,50,100] variants wastes CI time; only the single audited variant needs to run. Historical CI data shows the 50-item variant has ~3.3% cross-runner CoV (vs 13.8% for 1 item), so it is retained as the sole variant. Update audit target to add_measurement/50, tighten thresholds (min_relative_deviation=7, max_cov=10) accordingly, and reset epoch for a clean baseline. https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk
The study workflow previously only supported git perf measure (wall-clock subprocess timing). Criterion benchmarks produce per-item statistics via cargo criterion's JSON output, which must be imported with git perf import criterion-json rather than timed as a subprocess.

Add a mode input (measure / criterion-json, default: measure):
- measure: existing behaviour; git perf measure -n $reps times the command
- criterion-json: pipes the command into git perf import criterion-json with --metadata group=$INSTANCE so git perf study can compute between-runner CoV; cargo-criterion is installed automatically

reps_per_instance is ignored in criterion-json mode since criterion collects its own samples internally.

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk
Cross-runner study (10 independent instances) measured:
- between-runner CoV = 34.8%
- within-runner MAD% = 1.4% (MAD/σ = 0.04, outlier runners detected)

The benchmark is precise within any given runner but GitHub shared-runner hardware heterogeneity dominates the between-run variance. Set min_relative_deviation=52 (1.5× CoV) and max_cov=70 (2× CoV) per the study recommendation.

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk
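The threshold rule quoted above (1.5× CoV and 2× CoV) can be sketched as a small helper. The function name and the rounding to the nearest integer are assumptions made for illustration; the source only states the multipliers and the resulting values.

```python
def recommend_thresholds(cov_pct: float) -> dict[str, int]:
    """Derive audit thresholds from a measured CoV (%) using the 1.5x / 2x rule."""
    return {
        # Flag a change only if it deviates by more than 1.5x the observed noise.
        "min_relative_deviation": round(1.5 * cov_pct),
        # Tolerate up to 2x the observed noise before treating the data as too noisy.
        "max_cov": round(2.0 * cov_pct),
    }


# 34.8% is the between-runner CoV from the parallel study.
print(recommend_thresholds(34.8))
```

Applied to the later sequential study's CoV of 8.0%, the same rule gives 12 and 16.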
The first study run (parallel, 10 instances) measured CoV=34.8% — inflated by resource contention between concurrent runner VMs sharing physical hosts. Historical sequential CI data shows ~3-5% CoV, reflecting real conditions. Add max-parallel: 1 to the study measure job so instances run one at a time, eliminating contention artifacts and matching production CI behaviour. Reset min_relative_deviation=10 and max_cov=20 conservatively pending a fresh sequential study. https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk
Sequential cross-runner study (10 instances, max-parallel=1) measured CoV=8.0%, MAD%=4.6% — confirming the earlier 34.8% was an artifact of concurrent runner resource contention, not genuine benchmark noise. Apply study recommendations: aggregate_by=min, min_relative_deviation=12 (1.5× CoV), max_cov=16 (2× CoV), min_measurements=3. https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk
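For readers unfamiliar with the statistics named here, the sketch below shows one common way to compute CoV and MAD% and why a min aggregation resists slow outliers. The definitions (sample standard deviation over mean, median absolute deviation over median) are assumptions and may differ from what git perf study computes internally; the sample values are hypothetical, not study data.

```python
import statistics


def cov_pct(values: list[float]) -> float:
    """Coefficient of variation: sample standard deviation as a percentage of the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def mad_pct(values: list[float]) -> float:
    """Median absolute deviation as a percentage of the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return 100.0 * mad / med


# Hypothetical per-instance timings (arbitrary units).
samples = [100.0, 102.0, 98.0, 100.0]
print(f"CoV = {cov_pct(samples):.2f}%, MAD% = {mad_pct(samples):.2f}%")

# One contended run inflates the mean but leaves the minimum untouched,
# which is the motivation for an aggregate_by=min style of aggregation.
noisy = samples + [140.0]
print(f"mean: {statistics.mean(samples):.1f} -> {statistics.mean(noisy):.1f}")
print(f"min:  {min(samples):.1f} -> {min(noisy):.1f}")
```

Taking the minimum assumes noise only ever slows a run down, which is typical for contention on shared runners but is not guaranteed for every benchmark.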
Summary
- Replace the add_measurements benchmark's [1, 50, 100] item variants with a single 50-item variant, eliminating CI time spent on variants that are never audited
- Add criterion-json mode to benchmark-study.yml so Criterion benchmarks can be studied directly (previously the workflow only supported git perf measure wall-clock timing)
- Add max-parallel: 1 to prevent concurrent runner VMs from competing for CPU on shared physical hosts, which inflated CoV by ~4× in benchmarks
- Set add_measurement/50::median thresholds from a sequential 10-instance study: CoV=8.0%, MAD%=4.6% → min_relative_deviation=12, max_cov=16

Study results
Two studies were run:
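- Parallel (10 instances): CoV=34.8%, inflated by resource contention between concurrent runner VMs
- Sequential (max-parallel: 1): CoV=8.0% ✅, matches historical CI variance

Test plan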
- add_measurement/50 variant
- mode: criterion-json + max-parallel: 1 for accurate CoV measurement

https://claude.ai/code/session_01JmV4MgScaGHVQ5RarfoFNk