
Add benchmark default matrix reporting #206

Merged
ictechgy merged 1 commit into main from g005-paired-replay-default-matrix on Jun 15, 2026

Conversation

@ictechgy

Owner

Summary

  • add a report-level default_matrix to context-guard-bench covering six token-reduction lanes: trimming, artifact escrow, tool pruning, cache advice, adaptive-k, and optional compression
  • classify each lane as default-on, advisory, experimental, or reject/rework based on existing matched-pair evidence, and record its lane match method, policy ceiling/clamp, reason codes, and report-only claim boundaries
  • render the matrix in benchmark dashboards and update the benchmark docs, sample report, and tests
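
For readers without the diff open, the shape described above could look roughly like the following. This is an illustrative sketch only, not the code merged in this PR: the lane identifiers, the `public_claim` and `lanes` keys, the `classify_lane` helper, and every threshold are assumptions made up for the example.

```python
from dataclasses import dataclass, field, asdict

# Lane names paraphrase the PR summary; the exact identifiers are assumptions.
LANES = (
    "trimming", "artifact_escrow", "tool_pruning",
    "cache_advice", "adaptive_k", "optional_compression",
)


@dataclass
class LaneEntry:
    lane: str
    classification: str
    reason_codes: list = field(default_factory=list)


def classify_lane(matched_pairs, mean_reduction, quality_regressions):
    """Hypothetical rule; the thresholds are illustrative, not the PR's."""
    if quality_regressions > 0:
        return "reject/rework", ["quality_regression"]
    if matched_pairs == 0:
        return "experimental", ["no_matched_pairs"]
    if matched_pairs >= 10 and mean_reduction >= 0.10:
        return "default-on", ["sufficient_matched_evidence"]
    return "advisory", ["limited_matched_evidence"]


def build_default_matrix(evidence):
    """evidence maps lane name -> (matched_pairs, mean_reduction, regressions)."""
    lanes = []
    for lane in LANES:
        # Lanes with no evidence fall back to zero matched pairs.
        pairs, reduction, regressions = evidence.get(lane, (0, 0.0, 0))
        classification, reasons = classify_lane(pairs, reduction, regressions)
        lanes.append(asdict(LaneEntry(lane, classification, reasons)))
    # Report-only: this sketch never asserts a public claim.
    return {"public_claim": False, "lanes": lanes}
```

Every lane always appears in the output, so a dashboard renderer can iterate the matrix without checking for missing lanes.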

Ralplan evidence

  • Context snapshot: .omx/context/g005-paired-replay-default-matrix-20260615T042329Z.md
  • Plan: .omx/plans/ralplan-g005-paired-replay-default-matrix.md
  • Architect: APPROVE, blockers none — .omx/artifacts/ralplan-g005-architect-20260615T042403Z.md
  • Critic: APPROVE, blockers none — .omx/artifacts/ralplan-g005-critic-20260615T042527Z.md

Validation

  • python3 scripts/sync_plugin_copies.py --check
  • python3 -m py_compile context-guard-kit/benchmark_runner.py plugins/context-guard/bin/context-guard-bench tests/test_context_guard_kit.py scripts/release_smoke.py
  • PYTHONDONTWRITEBYTECODE=1 python3 -m unittest -k benchmark tests.test_context_guard_kit.BenchmarkRunnerTests (35 tests) ✅
  • evidence replay smoke for docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl emitted the default_matrix JSON and the ## Default matrix dashboard section ✅
  • python3 scripts/release_smoke.py --timeout 20
  • PYTHONDONTWRITEBYTECODE=1 python3 scripts/prepublish_check.py (697 tests) ✅
  • git diff --check
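
The replay smoke check in the list above could be approximated with a small helper like the one below. It is a sketch under stated assumptions: only the `default_matrix` name, the six-lane count, the four classification labels, and the `## Default matrix` header come from this PR; the `lanes` and `classification` keys and the `check_replay_output` function are hypothetical.

```python
# Classification labels come from the PR summary; the JSON key names
# ("lanes", "classification") are assumed for illustration.
ALLOWED = {"default-on", "advisory", "experimental", "reject/rework"}
EXPECTED_LANE_COUNT = 6
DASHBOARD_HEADER = "## Default matrix"


def check_replay_output(report, dashboard_text):
    """Return a list of problems; an empty list means the smoke check passed."""
    matrix = report.get("default_matrix")
    if not isinstance(matrix, dict):
        return ["report has no default_matrix object"]

    problems = []
    lanes = matrix.get("lanes", [])
    if len(lanes) != EXPECTED_LANE_COUNT:
        problems.append(f"expected {EXPECTED_LANE_COUNT} lanes, found {len(lanes)}")
    for entry in lanes:
        if entry.get("classification") not in ALLOWED:
            problems.append(f"unknown classification: {entry.get('classification')!r}")
    # The header must appear on a line of its own in the rendered dashboard.
    if not any(line.strip() == DASHBOARD_HEADER for line in dashboard_text.splitlines()):
        problems.append("dashboard is missing the Default matrix header")
    return problems
```

Returning a list of problems instead of raising on the first failure lets one smoke run report every mismatch at once.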

@ictechgy

Owner Author

G005 quad-review + validation evidence before merge:

  • Local validation passed:
    • python3 scripts/sync_plugin_copies.py --check
    • python3 -m py_compile context-guard-kit/benchmark_runner.py plugins/context-guard/bin/context-guard-bench tests/test_context_guard_kit.py scripts/release_smoke.py
    • PYTHONDONTWRITEBYTECODE=1 python3 -m unittest -k benchmark tests.test_context_guard_kit.BenchmarkRunnerTests (35 tests)
    • 12-task replay smoke for docs/benchmark-fixtures/token-savings-12task.* verified the default_matrix schema, public-claim false, 6 lanes, the lane classifications, and the dashboard header
    • python3 scripts/release_smoke.py --timeout 20
    • PYTHONDONTWRITEBYTECODE=1 python3 scripts/prepublish_check.py (697 tests)
    • git diff --check
  • PR CI passed: test-and-prepublish on Python 3.11, Python 3.12, and macOS 3.12 (run 27524671842).
  • Quad review loop: APPROVE / no blockers from Codex, Claude, Agy, and Forge.
    • Codex: .omx/artifacts/quad-review-pr206-codex-20260615T050757Z.md
    • Claude: .omx/artifacts/quad-review-pr206-claude-20260615T050757Z.md
    • Agy: .omx/artifacts/quad-review-pr206-agy-20260615T050757Z.md
    • Forge fallback: .omx/artifacts/quad-review-pr206-forge-fallback-20260615T064824Z.md

Nonblocking follow-ups captured from review: lane attribution still relies on key/name heuristics, which future fixtures may outgrow; policy ceilings other than the current one may need explicit clamp handling. Neither blocks the current G005 scope.

@ictechgy merged commit be54adc into main on Jun 15, 2026
3 checks passed
@ictechgy deleted the g005-paired-replay-default-matrix branch on June 15, 2026 06:49
