
Add Terminal Benchmark example drill #25

Merged
Abiorh001 merged 1 commit into main from codex/terminal-benchmark-example
Jun 21, 2026

Abiorh001 (Collaborator) commented Jun 21, 2026

Chunk

Examples-only Terminal Benchmark reference drill.

Goal

Move the useful Terminal Benchmark real API drill material out of dirty local state and into an explicit examples/terminal_benchmark/ boundary.

Human-Approved Intent

This branch is not a Workstream product chunk. The user directed that Terminal Benchmark test material should not be applied into backend runtime or main product code, but may live under an examples/ directory as reference/demo material.

What Changed

  • Added examples/terminal_benchmark/terminal_benchmark_api_e2e.py.
  • Added examples/terminal_benchmark/README.md.
  • Added examples/terminal_benchmark/LOCAL_VALIDATION_NOTES.md.

Why It Changed

The earlier uncommitted local work contained useful real-world Terminal Benchmark validation logic. Keeping it under examples/ preserves that logic for reference without turning it into runtime code, required CI, or formal zero-trust loop evidence.

Scope Control

Allowed Files Changed

  • examples/terminal_benchmark/**

Files Outside Contract

  • None

Product Behavior

  • [x] No Workstream product behavior changed.
  • [ ] Product behavior changed and is explained here:

This does not touch backend app code, migrations, workflows, auth, checker runtime behavior, or frontend product code.

Evidence

Commands Run

python3 -m py_compile examples/terminal_benchmark/terminal_benchmark_api_e2e.py
cd backend && .venv/bin/python -m ruff check ../examples/terminal_benchmark
cd backend && .venv/bin/python - <<'PY'
# database guard probe: accepts only local async Postgres test DB URLs and rejects unsafe URLs including ?host=example.com
PY
python3 scripts/check_internal_review_evidence.py
python3 scripts/workstream_agent_gate.py --base origin/main --head HEAD --format json
python3 scripts/check_markdown_links.py
python3 scripts/check_stale_workstream_wording.py
git diff --check
git diff --cached --check

Result Summary

compile: pass
ruff: pass
database guard probe: pass
internal review evidence gate: no evidence required for examples-only change
agent static gate: WARN only for large examples-only diff size
Markdown links: pass
stale wording: pass
diff checks: pass
private path staged scan: pass
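
The PR does not list the command behind the private path staged scan, so the following is only a minimal sketch of what such a check might look like. The function name, the regular expression, and the set of path prefixes treated as private are all assumptions made for illustration, not the project's actual scanner.

```python
import re

# Hypothetical pattern: absolute per-user home directories on macOS, Linux,
# and Windows. The real scan may look for different prefixes entirely.
PRIVATE_PATH_PATTERN = re.compile(
    r"(/Users/[^/\s]+/|/home/[^/\s]+/|[A-Za-z]:\\Users\\[^\\\s]+\\)"
)


def find_private_paths(text: str) -> list[str]:
    """Return every per-user absolute path prefix found in the given text."""
    return [match.group(0) for match in PRIVATE_PATH_PATTERN.finditer(text)]


if __name__ == "__main__":
    staged_line = 'FIXTURE = "/Users/alice/fixtures/terminal-bench"'
    print(find_private_paths(staged_line))
```

A scan like this can produce false positives (for example a repository directory that happens to be named home), so a reported hit would still need a manual look.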

Acceptance Criteria Proof

  • Terminal Benchmark material is under examples/terminal_benchmark/.
  • No backend runtime files changed.
  • No CI workflow files changed.
  • README says this is not runtime code, not required CI, not formal internal review evidence, and not canonical checker implementation.
  • Local validation notes avoid formal zero-trust evidence wording.
  • The example rejects non-local or query-param database URLs.
  • No private absolute fixture paths are staged.
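
The database URL rejection listed above could be sketched roughly as follows. This is not the guard from terminal_benchmark_api_e2e.py, whose code is not shown in this PR page. The function name, the `postgresql+asyncpg` scheme, the set of local hosts, and the `_test` database-name suffix are assumptions chosen for illustration. The query-string rejection reflects the `?host=example.com` case named in the probe: some Postgres drivers honor a `host` query parameter, which could redirect a URL that otherwise looks local.

```python
from urllib.parse import urlsplit

# Assumed values for this sketch; the real guard may use different ones.
ALLOWED_SCHEME = "postgresql+asyncpg"
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}
TEST_DB_SUFFIX = "_test"


def is_safe_local_test_db_url(url: str) -> bool:
    """Accept only local async Postgres test URLs with no query or fragment."""
    parts = urlsplit(url)
    if parts.scheme != ALLOWED_SCHEME:
        return False
    # Reject any query string or fragment outright, so that parameters such
    # as ?host=example.com cannot override the host in the URL itself.
    if parts.query or parts.fragment:
        return False
    if parts.hostname not in LOCAL_HOSTS:
        return False
    database_name = parts.path.lstrip("/")
    return database_name.endswith(TEST_DB_SUFFIX)


if __name__ == "__main__":
    for candidate in (
        "postgresql+asyncpg://user:pw@localhost:5432/workstream_test",
        "postgresql+asyncpg://user:pw@localhost/workstream_test?host=example.com",
        "postgresql+asyncpg://user:pw@db.example.com/workstream_test",
    ):
        print(candidate, is_safe_local_test_db_url(candidate))
```

Rejecting every query string is stricter than filtering individual parameters, but it keeps the check short and avoids having to track which parameters a given driver treats as connection overrides.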

Test Delta

Tests Added

  • None. This is an example script, not required CI.

Tests Modified

  • None.

Tests Removed Or Skipped

  • None.

Internal Reviewer Results

Reviewed code SHA: 0a6b66f

Reviewed at: 2026-06-21

Reviewer run IDs: 019eea24-2f3f-7c30-9102-569b3fa680c2, 019eea25-448b-7701-b0a7-5e774b3146c2, 019eea26-fe76-75c0-a847-7740446e2ff3, 019eea29-7398-7c31-a218-a31477bdaed3, 019eea46-6e49-7891-9984-33590a730724, 019eea47-911c-7750-b965-213a43eedbbd, 019eea55-2889-7d00-8773-64ef565a7c1a, 019eea61-e60c-7140-933a-33c6cf274689

| Reviewer | Result | Blocking Findings | Notes |
| --- | --- | --- | --- |
| Senior engineering | PASS WITH LOW RISKS | None remaining | Large diff is isolated example material; indentation wart fixed. |
| QA/test | PASS | None | Example is not wired into required CI; local validation commands are appropriate. |
| Security/auth | PASS | None | Private path and DB query-param bypass findings fixed; staged path scan clean. |
| Product/ops | PASS | None | README and notes clearly separate examples from runtime, CI, and formal loop evidence. |
| Architecture | N/A - with approved reason | None | Examples-only location; no architecture contract changed. |
| CI integrity | N/A - with approved reason | None | No workflow or required CI files changed. |
| Docs | PASS | None | README and local notes reviewed. |
| Reuse/dedup | N/A - with approved reason | None | No shared implementation added. |
| Test delta | N/A - with approved reason | None | No test suite changes; example script remains manual. |

External Review

External review response file: None for this examples-only PR.

| Source | Status | Notes |
| --- | --- | --- |
| CodeRabbit | Pending | |
| GitHub checks | Pending | |

CI And Gate Integrity

  • No workflow weakening.
  • No lint/test/docstring gate weakening.
  • No coverage threshold weakening.
  • No package script weakening.
  • No unpinned new GitHub Action.
  • Checkout credential persistence disabled where checkout is used.

Remaining Risks

  • The example script is intentionally large because it preserves a full real API drill. It is isolated under examples/ and is not runtime or CI behavior.

Summary by CodeRabbit

  • Documentation

    • Added setup guide for Terminal Benchmark example, including local requirements and usage instructions.
    • Added validation notes documenting local validation procedures and results.
  • Tests

    • Added end-to-end API validation script for Terminal Benchmark, running complete Week 1 + Week 2 workflow scenarios against a local test database with fixture data.

coderabbitai (Bot) commented Jun 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 63bc68d5-8254-44fd-af5c-0095a284794a

📥 Commits

Reviewing files that changed from the base of the PR and between 53a0339 and 0a6b66f.

📒 Files selected for processing (3)
  • examples/terminal_benchmark/LOCAL_VALIDATION_NOTES.md
  • examples/terminal_benchmark/README.md
  • examples/terminal_benchmark/terminal_benchmark_api_e2e.py

Walkthrough

Three files are added under examples/terminal_benchmark/: a README, a LOCAL_VALIDATION_NOTES document, and a 1114-line Python e2e drill script. The script loads a real fixture directory, builds Workstream API payloads, starts a local API server, runs complete and revision submission scenarios, and performs deep Postgres invariant checks.
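
To make the fixture-loading step in the walkthrough more concrete, here is a minimal sketch of required-file validation and SHA256-based stable ID derivation. It is not the script's code: the dataclass name `SketchFixtureFile`, its fields, the ID length, and the way the per-file hashes are combined are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SketchFixtureFile:
    """Hypothetical manifest entry; the real FixtureFile fields are not shown."""

    relative_path: str
    sha256: str
    size_bytes: int


def build_manifest(root: Path, required: tuple[str, ...]) -> list[SketchFixtureFile]:
    """Validate required files, then hash every file under the fixture root."""
    missing = [name for name in required if not (root / name).is_file()]
    if missing:
        raise FileNotFoundError(f"fixture is missing required files: {missing}")
    manifest = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        data = path.read_bytes()
        manifest.append(
            SketchFixtureFile(
                relative_path=path.relative_to(root).as_posix(),
                sha256=hashlib.sha256(data).hexdigest(),
                size_bytes=len(data),
            )
        )
    return manifest


def stable_fixture_id(manifest: list[SketchFixtureFile]) -> str:
    """Derive an ID that depends only on file paths and contents, not on order."""
    digest = hashlib.sha256()
    for entry in sorted(manifest, key=lambda item: item.relative_path):
        digest.update(f"{entry.relative_path}\0{entry.sha256}\n".encode())
    return digest.hexdigest()[:16]
```

Sorting by relative path before hashing is what makes the ID stable across machines and directory-walk orders; any change to a file's bytes or name changes the ID.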

Changes

Terminal Benchmark example directory

| Layer / File(s) | Summary |
| --- | --- |
| README and validation notes: examples/terminal_benchmark/README.md, examples/terminal_benchmark/LOCAL_VALIDATION_NOTES.md | README documents directory intent, env requirements (WORKSTREAM_DATABASE_URL, WORKSTREAM_TERMINAL_BENCH_FIXTURE), and sample invocation. LOCAL_VALIDATION_NOTES records historical shell commands, quality check targets, and pass/fail outcomes. |
| Module setup, fixture dataclasses, and loading: examples/terminal_benchmark/terminal_benchmark_api_e2e.py (lines 1–286) | Injects backend import paths; imports Week 1/Week 2 helpers and DB models; defines constants; introduces FixtureFile and TerminalBenchmarkFixture dataclasses; implements fixture root resolution, required-file validation, SHA256-based stable fixture ID derivation, and manifest/hashing helpers. |
| Payload builders and persistence projections: examples/terminal_benchmark/terminal_benchmark_api_e2e.py (lines 289–462) | Builds task, guide, and submission payloads from fixture metadata, including optional static_guard.txt omission to simulate an invalid submission; derives expected artifact manifest and evidence field projections for later invariant assertions. |
| API orchestration and assertion helpers: examples/terminal_benchmark/terminal_benchmark_api_e2e.py (lines 464–740) | Implements checker-run aggregate assertions, a precheck no-persist safety check, project/guide creation with v1 activation, task create/screen/release/claim/start with demo worker-profile bootstrap, submission workflow, task-status poll helpers, and assert_database_invariants that queries and validates all persisted rows (task, submission, checker runs, evidence, checker results, audit events). |
| Full e2e scenario runner and entrypoint: examples/terminal_benchmark/terminal_benchmark_api_e2e.py (lines 802–1114) | exercise_terminal_benchmark_api runs complete and revision scenarios end-to-end with precheck, checker status/routing/count assertions, task-status waits, DB invariant verification, and structured summary output. The __main__ entrypoint enforces a local-only DB URL, upgrades the schema via Alembic, starts the API server via the Week 2 harness, runs the drill, and terminates the server. |
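
The task-status waits mentioned in the summary above are typically built on a poll-until-deadline loop. The sketch below shows that general pattern only; the function name, parameters, and the status strings in the usage example are hypothetical and are not taken from the script.

```python
import time
from typing import Callable


def wait_for_status(
    fetch_status: Callable[[], str],
    expected: str,
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status until it returns expected or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status == expected:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"status stayed at {status!r}; expected {expected!r} "
                f"within {timeout_s} seconds"
            )
        sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical status sequence standing in for repeated API calls.
    observed = iter(["claimed", "in_progress", "submitted"])
    print(wait_for_status(lambda: next(observed), "submitted", interval_s=0.0))
```

Using time.monotonic for the deadline keeps the wait correct if the wall clock is adjusted mid-run, and the injectable sleep makes the helper easy to exercise without real delays.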

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐇 A fixture of files, hashed with care,
Submissions precheck'd, scenarios laid bare.
The checker runs tick, the DB rows align,
Complete and revision both pass the design.
With static_guard gone—revision ensues,
But all invariants pass, so I hop with good news! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title 'Add Terminal Benchmark example drill' accurately summarizes the main change: adding a Terminal Benchmark example directory with supporting documentation and an e2e script. |
| Description check | ✅ Passed | The PR description is comprehensive and addresses all major template sections including Goal, Human-Approved Intent, What Changed, Why It Changed, Scope Control, Product Behavior, Evidence, Acceptance Criteria, Test Delta, Internal Reviewer Results with complete sign-offs, CI And Gate Integrity checks, and Remaining Risks. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@Abiorh001 Abiorh001 merged commit e36a5fe into main Jun 21, 2026
5 checks passed
@Abiorh001 Abiorh001 deleted the codex/terminal-benchmark-example branch June 21, 2026 16:57