AI dev engine with cross-model auditing and persistent learning.
Throw an idea. Claude Code plans. Codex audits. Claude Code builds. Codex audits again. Lessons carry forward. Ships.
Phage is an AI development engine designed to drive coding agents toward a finished product with minimal manual intervention. Lessons from each run persist and carry forward to subsequent runs.
You describe what you want built. Phage manages the pipeline: requirements, planning, implementation, auditing, and learning. It pauses for human input when external access or credentials are needed (see Limitations).
Requirement → Design → [Codex audit] → Build → [Codex final audit] → Learn → Ship
|
↓
Inherit lessons to next gen
When multiple runs share the same state directory, lessons accumulate and propagate across projects and sessions.
Named after bacteriophages — biological agents that detect and eliminate specific targets. Phage does the same to code defects, and learns from every cycle.
Phage is for developers who already use Claude Code or Codex CLI and want to add structured auditing and learning to their AI-assisted workflow.
Prerequisites:
- Claude Code (Anthropic) — active subscription
- Codex CLI (OpenAI) — active subscription
- A Unix-like shell (macOS, Linux, or WSL on Windows)
No additional dependencies. No API keys beyond your existing subscriptions.
Phage is not a generic framework. It is built for Claude Code + Codex CLI. This is an opinionated choice.
- Claude Code excels at planning and building. It writes code, manages files, runs tests, and drives projects to completion.
- Codex CLI serves as the auditor. Using a different model for review means different blind spots, which is the premise behind cross-model auditing.
- In our SWE-bench evaluation (50 problems), the cross-model setup solved 3 problems that same-model audit missed, while breaking 2 others (net +1). See the benchmark section for details.
No API billing. Both Claude Code and Codex CLI run on your existing subscriptions. No per-token costs.
Want to use different models? Fork it. MIT license. The adapter and reviewer interfaces are documented in CONTRIBUTING.md.
"I want something like this" is enough. You don't need a perfect spec.
Phage asks structured questions to dig into your idea and distill it into a requirements document. 3-5 questions per round, up to 5 rounds. Vague input becomes binary (pass/fail) acceptance criteria.
You don't need to be an engineer. Just say what you want in your own words. Phage turns it into a buildable spec.
Self-review has a known blind spot: the same model that wrote the code tends to overlook the same issues when reviewing it.
Phage uses a different AI model as the auditor. Claude Code writes. Codex reviews. Different model, different blind spots. Exactly twice per task — once on the plan, once on the output. No more, no less.
| Role | Implementation |
|---|---|
| Worker (plan + build) | Claude Code (Anthropic) |
| Reviewer (audit) | Codex CLI (OpenAI) |
When you let Codex audit code, it invents new edge cases on every pass. New criteria appear. The bar keeps rising. The audit never ends. Implementation stalls. The goalposts move.
Phage solves this with three structural constraints:
1. Frozen criteria. Acceptance criteria are locked after user approval. Nobody — not Codex, not Claude Code — can modify them afterward.
2. Scoped audits. The Codex review prompt explicitly states: "Judge ONLY against the frozen acceptance criteria. Do NOT block for issues outside them." Non-criteria observations go to an observations field: reference material, not blocking reasons.
3. No re-audits. Exactly two audits per task: plan audit and final audit. Regardless of the verdict, there is no re-audit. revise → fix and proceed. block → fix and proceed. This design eliminates the infinite review loop by construction.
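The three constraints can be pictured with a small Python sketch. This is an illustration under stated assumptions, not Phage's actual implementation: the names `FrozenCriteria`, `AuditResult`, `scope_findings`, and `run_task` are hypothetical, and the reviewer is modeled as a plain callable rather than a real Codex CLI invocation.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from typing import Callable, List, Optional, Tuple

# Hypothetical shape: (criterion the finding violates, or None; reviewer note)
Finding = Tuple[Optional[str], str]


@dataclass(frozen=True)
class FrozenCriteria:
    """Acceptance criteria locked after user approval (constraint 1)."""
    items: Tuple[str, ...]


@dataclass
class AuditResult:
    verdict: str  # "go" or "revise" in this sketch
    blocking: List[str] = field(default_factory=list)
    observations: List[str] = field(default_factory=list)


def scope_findings(criteria: FrozenCriteria, findings: List[Finding]) -> AuditResult:
    """Constraint 2: only findings tied to a frozen criterion may block."""
    result = AuditResult(verdict="go")
    for criterion, note in findings:
        if criterion in criteria.items:
            result.blocking.append(note)
        else:
            result.observations.append(note)  # reference material only
    if result.blocking:
        result.verdict = "revise"
    return result


def run_task(criteria: FrozenCriteria,
             reviewer: Callable[[str], List[Finding]]) -> List[AuditResult]:
    """Constraint 3: exactly two audits, with no loop back after a verdict."""
    audits = []
    for stage in ("plan", "output"):
        audits.append(scope_findings(criteria, reviewer(stage)))
        # Whatever the verdict, the worker fixes and proceeds; no re-audit.
    return audits
```

Because the stage list is fixed and nothing branches back, the number of audits cannot grow, and the frozen dataclass rejects any later attempt to reassign the criteria.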
Six checks fire on every task, regardless of what the acceptance criteria say. They are designed to be non-configurable and non-optional — they trigger before the system considers whether the action might be acceptable.
| # | Gate | What it blocks |
|---|---|---|
| 1 | Secrets | API keys, tokens, credentials in output |
| 2 | Destructive ops | File/branch deletion without human approval |
| 3 | External sends | Emails, webhooks, external API calls |
| 4 | Uncommitted changes | Starting work on a dirty git tree |
| 5 | Doc drift | Finishing without updating documentation |
| 6 | No rollback | Irreversible operations without a recovery path |
Safety gates don't just stop. They stop, then auto-implement a local
alternative. In the Twin Run benchmark, Task 6 required an external
API call. The external_send gate blocked it. Phage automatically
switched to schema-based local validation and completed the task with
zero outbound network calls.
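A minimal Python sketch of the "stop, then fall back locally" behavior described above, plus a toy secrets check. Everything here is assumed for illustration: `Action`, `execute`, `secrets_gate`, and the two regex patterns are hypothetical and far simpler than a real gate would need to be; see docs/SAFETY_GATES.md for the actual rules.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative patterns only; a real secrets scanner needs many more.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key ID shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
]

BLOCKED_KINDS = {"external_send"}  # action kinds that are always gated


def secrets_gate(output: str) -> Optional[str]:
    """Return the gate name if the output looks like it contains a secret."""
    for pattern in SECRET_PATTERNS:
        if pattern.search(output):
            return "secrets"
    return None


@dataclass
class Action:
    kind: str                                   # e.g. "external_send" or "local"
    run: Callable[[], str]                      # the requested operation
    local_fallback: Optional[Callable[[], str]] = None


def execute(action: Action) -> dict:
    """Check the gate before running; a gated action is never executed."""
    if action.kind in BLOCKED_KINDS:
        if action.local_fallback is None:
            return {"gate": action.kind, "result": None}  # stop and wait for a human
        return {"gate": action.kind, "result": action.local_fallback()}
    return {"gate": None, "result": action.run()}
```

The check on `action.kind` happens before `action.run` is ever called, which mirrors the idea that the gate fires before the system considers whether the action might be acceptable.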
Details: docs/SAFETY_GATES.md
Phage supports persistent learning across projects and sessions when
multiple runs share the same PHAGE_STATE_DIR. Under this
configuration, lessons persist and propagate across runs.
For the Twin Run benchmark, this behavior is intentionally constrained: each lane gets an isolated state directory so learning stays run-scoped and the A/B comparison remains fair.
In other words:
- Phage as a system supports cross-project persistence
- Twin Run as a benchmark uses deliberate run-scoped isolation
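The difference between the two modes comes down to where `PHAGE_STATE_DIR` points. The Python sketch below only builds environment dictionaries; the helper name `lane_env` and the per-lane subdirectory layout are assumptions made for illustration, not documented Phage behavior.

```python
from pathlib import Path
from typing import Dict, Mapping


def lane_env(base_env: Mapping[str, str], state_root: str,
             lane: str, isolated: bool) -> Dict[str, str]:
    """Return a copy of base_env with PHAGE_STATE_DIR set for one lane.

    isolated=True  -> each lane gets its own directory (benchmark-style).
    isolated=False -> all lanes share state_root (lessons can accumulate).
    """
    env = dict(base_env)
    root = Path(state_root)
    env["PHAGE_STATE_DIR"] = str(root / lane if isolated else root)
    return env


# Benchmark-style: Frozen and Learning lanes never see each other's lessons.
frozen_env = lane_env({}, "state", "frozen", isolated=True)
learning_env = lane_env({}, "state", "learning", isolated=True)

# Everyday use: two projects share one state directory.
shared_a = lane_env({}, "state", "project-a", isolated=False)
shared_b = lane_env({}, "state", "project-b", isolated=False)
```

An environment built this way could be passed to whatever process launches a run, for example through the `env` argument of `subprocess.run`.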
Layer 1: Episodes Raw lessons from each task (always recorded)
Layer 2: Patterns Auto-compressed when the same lesson appears 3+ times
Layer 3: Promotions Proposed as permanent rules at 5+ occurrences
(human approval required before promotion)
Learning data grows continuously. Loading everything every time would slow things down. The three-layer architecture solves this:
- Layer 1 (Episodes) is written to both local and global storage
- Layer 2 (Patterns) auto-compresses 3+ same-category lessons — 10 raw entries become 1 pattern
- Layer 3 (Promotions) proposes patterns with 5+ occurrences as permanent rules. But only if the human approves. Never auto-promoted.
At the planning phase (Phase 2), Phage reads: local lessons (all) + global patterns (compressed) + global lessons (same-category + last 20, deduplicated). Because compressed patterns are prioritized, even 100 raw entries result in a lightweight read.
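The thresholds and the planning-phase read can be sketched in Python as follows. This is a simplified model, not the real storage format: `Episode`, `compress`, `promotion_candidates`, `promote`, and `planning_context` are hypothetical names, patterns are reduced to per-category counts, and "same-category + last 20" is read here as the union of the two sets, which is an interpretation of the text above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Set

PATTERN_THRESHOLD = 3    # Layer 2: compress at 3+ occurrences
PROMOTION_THRESHOLD = 5  # Layer 3: propose as a rule at 5+ occurrences
RECENT_LIMIT = 20        # planning read: last 20 global lessons


@dataclass(frozen=True)
class Episode:
    """Layer 1: one raw lesson from one task."""
    category: str
    lesson: str


def compress(episodes: List[Episode]) -> Dict[str, int]:
    """Layer 2: categories seen at least PATTERN_THRESHOLD times."""
    counts = Counter(e.category for e in episodes)
    return {c: n for c, n in counts.items() if n >= PATTERN_THRESHOLD}


def promotion_candidates(patterns: Dict[str, int]) -> List[str]:
    """Layer 3: patterns frequent enough to be proposed as permanent rules."""
    return sorted(c for c, n in patterns.items() if n >= PROMOTION_THRESHOLD)


def promote(candidates: List[str], human_approved: Set[str]) -> List[str]:
    """Nothing is promoted without explicit human approval."""
    return [c for c in candidates if c in human_approved]


def planning_context(local: List[Episode], global_episodes: List[Episode],
                     category: str) -> dict:
    """What the planner reads: compressed patterns plus a bounded lesson list."""
    same_category = [e for e in global_episodes if e.category == category]
    recent = global_episodes[-RECENT_LIMIT:]
    seen, lessons = set(), []
    for episode in list(local) + same_category + recent:
        if episode not in seen:  # deduplicate while preserving order
            seen.add(episode)
            lessons.append(episode)
    return {"patterns": compress(global_episodes), "lessons": lessons}
```

The point of the design shows up in `planning_context`: the raw list is capped and deduplicated, so the read stays small even as Layer 1 keeps growing.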
In benchmark mode, these layers accumulate within the current run only. Outside the benchmark, shared state allows them to persist across runs.
In the Twin Run benchmark, an XSS fix lesson from Task 1 ("use
textContent, not innerHTML — the reviewer will reject it") was
automatically applied in Task 4's form validator. The earlier failure
prevented a planning rework cycle in the later task.
In this 4-task benchmark, the Learning lane completed 15% faster (1546s vs 1820s) — consistent with reduced planning rework when prior lessons are available.
No npm install. No pip install. No build step.
Phage runs on skill files and CLI tools. You need Claude Code and Codex CLI. That's it.
No environment setup. No dependency hell. Start in minutes.
Twin Run is Phage's built-in A/B test. It runs the same tasks on two lanes side by side:
- Frozen: No learning. Every task starts from zero.
- Learning: Lessons accumulate. Later tasks inherit earlier lessons.
Same tasks, same audits. The comparison shows whether accumulated lessons affect completion time and audit outcomes.
| Metric | Frozen | Learning | Delta |
|---|---|---|---|
| Tasks succeeded | 4/4 | 4/4 | — |
| Total time | 1820s | 1546s | -15% |
| Codex final GO | 2/4 | 3/4 | +1 |
| Safety gate fires | 0 | 1 | external_send blocked |
| Quality checklist | — | 100% | — |
Evidence root: results/evidence/mandatory-4-baseline.json

Source run: results/runs/20260330-072752
Highlights:
- Task 3 (CLI Report): Learning lane achieved final GO; Frozen stayed at revise
- Task 6 (Config Sanitizer): Safety gate correctly blocked an external API send and auto-implemented a local alternative
- Learning lane was 15% faster overall (4-task run, single seed) — inherited lessons reduced planning rework cycles
Extended run with additional tasks for stability and safety evidence.
| Metric | What it measures |
|---|---|
| Codex findings | Issues caught by the cross-model reviewer |
| Rework cycles | Fix-review loops after audit |
| Quality checklist | Per-task binary pass/fail checks |
| Completion time | Start to final audit GO |
| Safety gate fires | Correct activation of immutable safety gates |
| Learning transfer | Lessons from earlier tasks applied in later ones |
| Pattern compression | Automatic compression of recurring lessons |
We ran Phage against SWE-bench Verified — a curated subset of real-world GitHub issues used to evaluate coding agents. Four arms, 50 problems each, single seed.
This is a descriptive comparison, not a statistical claim. Single seed, n=50. No hypothesis tests were run. Results are not comparable to public SWE-bench leaderboard scores (different subset, different harness, different conditions).
| Arm | Description | Solve Rate | Conditional Rate |
|---|---|---|---|
| A | Claude Code solo (no audit) | 74.0% (37/50) | 77.1% |
| B | Codex solo (self-review) | 76.0% (38/50) | 82.6% |
| C | Claude Code + self-audit | 80.0% (40/50) | 81.6% |
| D | Phage (cross-model audit) | 82.0% (41/50) | 85.4% |
Cross-model audit (D) salvaged 3 problems that self-audit (C) missed, while breaking 2 that C solved. Net: +1.
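As a quick consistency check, the solve rates and the net figure can be recomputed from the counts in the table. The Python snippet below uses only numbers stated above; the Conditional Rate column is not recomputed because its denominators are not given in this README.

```python
TOTAL = 50  # problems per arm

# Solved counts taken from the table above.
solved = {"A": 37, "B": 38, "C": 40, "D": 41}


def solve_rate(count: int, total: int = TOTAL) -> float:
    """Solve rate as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


rates = {arm: solve_rate(n) for arm, n in solved.items()}

# D relative to C: 3 problems salvaged, 2 broken.
salvaged, broken = 3, 2
net = salvaged - broken

# C's count plus the net change should equal D's count.
consistent = solved["C"] + net == solved["D"]
```

With a net difference of a single problem on one seed, the gap between C and D is within what a rerun could plausibly reverse, which is why the comparison is presented as descriptive.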
Full methodology and per-problem breakdown: docs/BENCHMARK_EXTERNAL.md
We attempted to measure whether Phage's learning system improves solve rates across grouped SWE-bench problems. It did not. Three groups (Django, Sympy, Sphinx) showed 0pp delta between Frozen and Learning lanes.
Root causes: ceiling effect (Sphinx: 100% both lanes), floor effect (Sympy: 0% both lanes), and task independence — SWE-bench bugs don't share failure patterns that learning could exploit.
We report this as a null result. Learning accumulation remains validated on the Twin Run benchmark (15% speed improvement) but is not measurable on SWE-bench Verified. Details: docs/BENCHMARK_EXTERNAL.md
+--------------------------------------+
| Phage Core |
| 6 phases + safety gates (always) |
+------------------+-------------------+
| Claude Code | Codex CLI |
| plan + build | audit (×2) |
+------------------+-------------------+
- Brainstorm — Clarify requirements through structured questions
- Acceptance criteria — Define binary pass/fail conditions, then freeze them
- Plan + Codex audit #1 — Claude Code plans, Codex audits against frozen criteria
- Execute — Claude Code implements the plan
- Codex final audit #2 — Codex audits the output against frozen criteria
- Lessons — Extract lessons, compress patterns, propose rule promotions, inherit to the next generation
Full spec: docs/OPERATING_SPEC.md
Phage's 3-layer learning architecture was designed, implemented, and audited entirely through the Phage pipeline itself.
Input: "Upgrade the learning system to a 3-layer architecture"
Tool: Phage (the same pipeline being improved)
Output: 3-layer learning with pattern compression and rule promotion
The existing AI memory landscape (Mem0, Vestige, DreamContext, ACE) was surveyed. The missing piece — automatic pattern compression and rule promotion — was the gap Phage filled.
Details: docs/CASE_STUDY.md
The author uses Phage daily on production work — the design-audit-build-learn cycle on real projects. Your mileage may vary depending on project complexity and model behavior.
- Clone this repo
- Open results/latest/ and browse the gallery
- Compare Frozen vs Learning side-by-side
- Read the audit cards to see what Codex caught
See examples/quick-start/ to run a single-task preflight on your own machine. You need Claude Code and Codex CLI. Nothing else.
Full setup: CONTRIBUTING.md
Phage's 6-phase pipeline, cross-model auditing, safety gates, and learning system are all operational. The author uses them daily on production work.
Benchmark evidence:
- Twin Run (built-in A/B): 4 mandatory + 5 extended tasks completed. Learning lane 15% faster. All safety gates verified.
- SWE-bench Verified (external): Stage 2 complete (50 problems). Cross-model audit arm (D) achieved 82.0% solve rate. Learning accumulation on SWE-bench was a null result (0pp delta).
- Frozen lane JSON artifact coverage is partial in the first run (plan_audit.json and quality_checklist.json may be missing). Adapter prompts have been strengthened with inline schemas and per-phase write instructions.
- Benchmark currently supports Claude Code + Codex CLI only. For other models, fork and adapt the adapter/reviewer interfaces.
- Learning accumulation is not measurable on SWE-bench Verified due to task independence and ceiling/floor effects. See Stage 4 results.
Phage is not omnipotent. Here's what it can't do.
It stops when it needs you. When API keys, external service authentication, or environment-specific configuration is required, Phage pauses and waits for user input. By design, it never accesses external services without human approval. This is not a limitation — it's an intentional safety decision.
No E2E testing built in. Phage does not integrate browser automation such as Playwright. It can run unit tests and CLI tests, but browser-based E2E testing requires separate setup.
Claude Code + Codex CLI only. For other models, fork and adapt the interfaces yourself.
Phage remains the product name.
As of 2026-04-01, the umbrella brand decision is Blastrum. Spore
remains the sibling product-in-development under the same future parent
brand.
The GitHub organization has been renamed to Blastrum.
Any lingering phage-dev references should be treated as legacy naming,
not current brand architecture.
The naming decision is recorded in the internal brand log.