Add Terminal Benchmark example drill by Abiorh001 · Pull Request #25 · Flow-Research/workstream

Abiorh001 · 2026-06-21T13:45:12Z

Chunk

Examples-only Terminal Benchmark reference drill.

Goal

Move the useful Terminal Benchmark real API drill material out of the dirty (uncommitted) local working tree and into an explicit examples/terminal_benchmark/ boundary.

Human-Approved Intent

This branch is not a Workstream product chunk. The user directed that Terminal Benchmark test material should not be applied into backend runtime or main product code, but may live under an examples/ directory as reference/demo material.

What Changed

Added examples/terminal_benchmark/terminal_benchmark_api_e2e.py.
Added examples/terminal_benchmark/README.md.
Added examples/terminal_benchmark/LOCAL_VALIDATION_NOTES.md.

Why It Changed

The old dirty work contained useful real-world Terminal Benchmark validation logic. Keeping it under examples/ preserves the learning without turning it into runtime code, required CI, or formal zero-trust loop evidence.

Scope Control

Allowed Files Changed

examples/terminal_benchmark/**

Files Outside Contract

None

Product Behavior

[x] No Workstream product behavior changed.
[ ] Product behavior changed and is explained here:

This does not touch backend app code, migrations, workflows, auth, checker runtime behavior, or frontend product code.

Evidence

Commands Run

python3 -m py_compile examples/terminal_benchmark/terminal_benchmark_api_e2e.py
cd backend && .venv/bin/python -m ruff check ../examples/terminal_benchmark
cd backend && .venv/bin/python - <<'PY'
# database guard probe: accepts only local async Postgres test DB URLs and rejects unsafe URLs including ?host=example.com
PY
python3 scripts/check_internal_review_evidence.py
python3 scripts/workstream_agent_gate.py --base origin/main --head HEAD --format json
python3 scripts/check_markdown_links.py
python3 scripts/check_stale_workstream_wording.py
git diff --check
git diff --cached --check

Result Summary

compile: pass
ruff: pass
database guard probe: pass
internal review evidence gate: no evidence required for examples-only change
agent static gate: WARN only for large examples-only diff size
Markdown links: pass
stale wording: pass
diff checks: pass
private path staged scan: pass
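
The "private path staged scan" result has no matching entry under Commands Run, so the exact check is not shown in this PR. As a rough illustration only, a scan of that kind might look like the sketch below. The function name `find_private_paths` and the path prefixes it looks for are assumptions, not the project's actual tooling.

```python
import re

# Assumption: "private" means an absolute path under a user home directory.
# The real scan may use a different definition.
PRIVATE_PATH_PATTERN = re.compile(
    r"(?:/home/|/Users/|[A-Za-z]:\\Users\\)[^\s\"']+"
)


def find_private_paths(text: str) -> list[str]:
    """Return every home-directory-style absolute path found in the text."""
    return PRIVATE_PATH_PATTERN.findall(text)
```

One plausible use is to feed it the output of `git diff --cached` and fail the check if the returned list is non-empty.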

Acceptance Criteria Proof

Terminal Benchmark material is under examples/terminal_benchmark/.
No backend runtime files changed.
No CI workflow files changed.
README says this is not runtime code, not required CI, not formal internal review evidence, and not canonical checker implementation.
Local validation notes avoid formal zero-trust evidence wording.
The example rejects non-local or query-param database URLs.
No private absolute fixture paths are staged.
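
The database URL criterion above can be pictured with a minimal sketch. It is not the guard from terminal_benchmark_api_e2e.py: the function name, the allowed scheme, the local host list, and the "test" naming rule are all assumptions chosen for illustration.

```python
from urllib.parse import urlsplit

# Assumptions for illustration: the async driver scheme and the local hosts.
ALLOWED_SCHEMES = {"postgresql+asyncpg"}
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_safe_local_test_db_url(url: str) -> bool:
    """Return True only for a local async Postgres URL with no query string."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    # Reject any query string or fragment outright. A parameter such as
    # ?host=example.com could redirect the connection away from the
    # host written in the URL itself.
    if parts.query or parts.fragment:
        return False
    if parts.hostname not in LOCAL_HOSTS:
        return False
    # Hypothetical naming convention: test databases contain "test".
    return "test" in parts.path.lstrip("/")
```

Rejecting every query string is stricter than filtering individual parameters, but it keeps the guard simple to audit, which suits an example script that should never reach a shared database.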

Test Delta

Tests Added

None. This is an example script, not required CI.

Tests Modified

None.

Tests Removed Or Skipped

None.

Internal Reviewer Results

Reviewed code SHA: 0a6b66f

Reviewed at: 2026-06-21

Reviewer run IDs: 019eea24-2f3f-7c30-9102-569b3fa680c2, 019eea25-448b-7701-b0a7-5e774b3146c2, 019eea26-fe76-75c0-a847-7740446e2ff3, 019eea29-7398-7c31-a218-a31477bdaed3, 019eea46-6e49-7891-9984-33590a730724, 019eea47-911c-7750-b965-213a43eedbbd, 019eea55-2889-7d00-8773-64ef565a7c1a, 019eea61-e60c-7140-933a-33c6cf274689

| Reviewer | Result | Blocking Findings | Notes |
| --- | --- | --- | --- |
| Senior engineering | PASS WITH LOW RISKS | None remaining | Large diff is isolated example material; indentation wart fixed. |
| QA/test | PASS | None | Example is not wired into required CI; local validation commands are appropriate. |
| Security/auth | PASS | None | Private path and DB query-param bypass findings fixed; staged path scan clean. |
| Product/ops | PASS | None | README and notes clearly separate examples from runtime, CI, and formal loop evidence. |
| Architecture | N/A - with approved reason | None | Examples-only location; no architecture contract changed. |
| CI integrity | N/A - with approved reason | None | No workflow or required CI files changed. |
| Docs | PASS | None | README and local notes reviewed. |
| Reuse/dedup | N/A - with approved reason | None | No shared implementation added. |
| Test delta | N/A - with approved reason | None | No test suite changes; example script remains manual. |

External Review

External review response file: None for this examples-only PR.

| Source | Status | Notes |
| --- | --- | --- |
| CodeRabbit | Pending | |
| GitHub checks | Pending | |

CI And Gate Integrity

No workflow weakening.
No lint/test/docstring gate weakening.
No coverage threshold weakening.
No package script weakening.
No unpinned new GitHub Action.
Checkout credential persistence disabled where checkout is used.

Remaining Risks

The example script is intentionally large because it preserves a full real API drill. It is isolated under examples/ and is not runtime or CI behavior.

Summary by CodeRabbit

Documentation
- Added setup guide for Terminal Benchmark example, including local requirements and usage instructions.
- Added validation notes documenting local validation procedures and results.
Tests
- Added end-to-end API validation script for Terminal Benchmark, running complete Week 1 + Week 2 workflow scenarios against a local test database with fixture data.

coderabbitai · 2026-06-21T13:45:25Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 63bc68d5-8254-44fd-af5c-0095a284794a

📥 Commits

Reviewing files that changed from the base of the PR and between 53a0339 and 0a6b66f.

📒 Files selected for processing (3)

examples/terminal_benchmark/LOCAL_VALIDATION_NOTES.md
examples/terminal_benchmark/README.md
examples/terminal_benchmark/terminal_benchmark_api_e2e.py

📝 Walkthrough

Three files are added under examples/terminal_benchmark/: a README, a LOCAL_VALIDATION_NOTES document, and a 1114-line Python e2e drill script. The script loads a real fixture directory, builds Workstream API payloads, starts a local API server, runs complete and revision submission scenarios, and performs deep Postgres invariant checks.

Changes

Terminal Benchmark example directory

| Layer / File(s) | Summary |
| --- | --- |
| README and validation notes `examples/terminal_benchmark/README.md`, `examples/terminal_benchmark/LOCAL_VALIDATION_NOTES.md` | README documents directory intent, env requirements (`WORKSTREAM_DATABASE_URL`, `WORKSTREAM_TERMINAL_BENCH_FIXTURE`), and sample invocation. LOCAL_VALIDATION_NOTES records historical shell commands, quality check targets, and pass/fail outcomes. |
| Module setup, fixture dataclasses, and loading `examples/terminal_benchmark/terminal_benchmark_api_e2e.py` (lines 1–286) | Injects backend import paths; imports Week 1/Week 2 helpers and DB models; defines constants; introduces `FixtureFile` and `TerminalBenchmarkFixture` dataclasses; implements fixture root resolution, required-file validation, SHA256-based stable fixture ID derivation, and manifest/hashing helpers. |
| Payload builders and persistence projections `examples/terminal_benchmark/terminal_benchmark_api_e2e.py` (lines 289–462) | Builds task, guide, and submission payloads from fixture metadata, including optional `static_guard.txt` omission to simulate an invalid submission; derives expected artifact manifest and evidence field projections for later invariant assertions. |
| API orchestration and assertion helpers `examples/terminal_benchmark/terminal_benchmark_api_e2e.py` (lines 464–740) | Implements checker-run aggregate assertions, a precheck no-persist safety check, project/guide creation with `v1` activation, task create/screen/release/claim/start with demo worker-profile bootstrap, submission workflow, task-status poll helpers, and `assert_database_invariants` that queries and validates all persisted rows (task, submission, checker runs, evidence, checker results, audit events). |
| Full e2e scenario runner and entrypoint `examples/terminal_benchmark/terminal_benchmark_api_e2e.py` (lines 802–1114) | `exercise_terminal_benchmark_api` runs complete and revision scenarios end-to-end with precheck, checker status/routing/count assertions, task-status waits, DB invariant verification, and structured summary output. The `__main__` entrypoint enforces a local-only DB URL, upgrades the schema via Alembic, starts the API server via the Week 2 harness, runs the drill, and terminates the server. |
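
The required-file validation and SHA256-based stable fixture ID mentioned in the table can be sketched roughly as follows. This is a guess at the general shape, not the script's code: the `FixtureFile` fields, the helper names `build_manifest` and `stable_fixture_id`, and the 16-character truncation are assumptions.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FixtureFile:
    """One manifest entry. The field names here are assumed, not confirmed."""

    relative_path: str
    sha256: str
    size_bytes: int


def build_manifest(root: Path, required: tuple[str, ...]) -> list[FixtureFile]:
    """Validate required files, then hash every file under the fixture root."""
    missing = [name for name in required if not (root / name).is_file()]
    if missing:
        raise FileNotFoundError(f"fixture is missing required files: {missing}")
    files = [p for p in root.rglob("*") if p.is_file()]
    # Sort by POSIX-style relative path so the order is deterministic.
    files.sort(key=lambda p: p.relative_to(root).as_posix())
    manifest = []
    for path in files:
        data = path.read_bytes()
        manifest.append(
            FixtureFile(
                relative_path=path.relative_to(root).as_posix(),
                sha256=hashlib.sha256(data).hexdigest(),
                size_bytes=len(data),
            )
        )
    return manifest


def stable_fixture_id(manifest: list[FixtureFile]) -> str:
    """Derive an ID that changes only when a file path or its content changes."""
    digest = hashlib.sha256()
    for entry in manifest:
        digest.update(f"{entry.relative_path}\0{entry.sha256}\n".encode("utf-8"))
    return digest.hexdigest()[:16]  # truncation length is an arbitrary choice
```

Hashing the sorted manifest instead of the directory path keeps the ID independent of where the fixture lives on disk, which fits the PR's requirement that no private absolute paths leak into the example.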

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐇 A fixture of files, hashed with care,
Submissions precheck'd, scenarios laid bare.
The checker runs tick, the DB rows align,
Complete and revision both pass the design.
With static_guard gone—revision ensues,
But all invariants pass, so I hop with good news! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title 'Add Terminal Benchmark example drill' accurately summarizes the main change: adding a Terminal Benchmark example directory with supporting documentation and an e2e script. |
| Description check | ✅ Passed | The PR description is comprehensive and addresses all major template sections including Goal, Human-Approved Intent, What Changed, Why It Changed, Scope Control, Product Behavior, Evidence, Acceptance Criteria, Test Delta, Internal Reviewer Results with complete sign-offs, CI And Gate Integrity checks, and Remaining Risks. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Add Terminal Benchmark example drill

0a6b66f

Abiorh001 merged commit e36a5fe into main Jun 21, 2026
5 checks passed

Abiorh001 deleted the codex/terminal-benchmark-example branch June 21, 2026 16:57
