Blastrum/phage

Phage

AI dev engine with cross-model auditing and persistent learning.

Throw an idea. Claude Code plans. Codex audits. Claude Code builds. Codex audits again. Lessons carry forward. Ships.


What is Phage?

Phage is an AI development engine designed to drive coding agents toward a finished product with minimal manual intervention. Lessons from each run persist and carry forward to subsequent runs.

You describe what you want built. Phage manages the pipeline: requirements, planning, implementation, auditing, and learning. It pauses for human input when external access or credentials are needed (see Limitations).

Requirement → Design → [Codex audit] → Build → [Codex final audit] → Learn → Ship
                                                                        |
                                                                        ↓
                                                          Inherit lessons to next gen

When multiple runs share the same state directory, lessons accumulate and propagate across projects and sessions.

Why "Phage"?

Named after bacteriophages: viruses that seek out and destroy specific bacteria. Phage does the same to code defects, and learns from every cycle.


Who is this for?

Phage is for developers who already use Claude Code or Codex CLI and want to add structured auditing and learning to their AI-assisted workflow.

Prerequisites:

  • Claude Code (Anthropic) — active subscription
  • Codex CLI (OpenAI) — active subscription
  • A Unix-like shell (macOS, Linux, or WSL on Windows)

No additional dependencies. No API keys beyond your existing subscriptions.


Why Claude Code + Codex CLI?

Phage is not a generic framework. It is built for Claude Code + Codex CLI. This is an opinionated choice.

  • Claude Code excels at planning and building. It writes code, manages files, runs tests, and drives projects to completion.
  • Codex CLI serves as the auditor. Using a different model for review means different blind spots, which is the premise behind cross-model auditing.
  • In our SWE-bench evaluation (50 problems), the cross-model setup solved 3 problems that same-model audit missed, while breaking 2 others (net +1). See the benchmark section for details.

No API billing. Both Claude Code and Codex CLI run on your existing subscriptions. No per-token costs.

Want to use different models? Fork it. MIT license. The adapter and reviewer interfaces are documented in CONTRIBUTING.md.


Four pillars

Pillar 1: Socratic requirements definition

"I want something like this" is enough. You don't need a perfect spec.

Phage asks structured questions to dig into your idea and distill it into a requirements document. 3-5 questions per round, up to 5 rounds. Vague input becomes binary (pass/fail) acceptance criteria.

You don't need to be an engineer. Just say what you want in your own words. Phage turns it into a buildable spec.
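The bounded question loop described above could look roughly like the sketch below. It is an illustration only, not Phage's implementation: `elicit`, `ask_round`, and `answer` are hypothetical names, and only the limits (at most 5 questions per round, at most 5 rounds) come from this README.

```python
MAX_ROUNDS = 5             # "up to 5 rounds"
MAX_QUESTIONS = 5          # upper end of "3-5 questions per round"


def elicit(ask_round, answer):
    """Run a bounded question/answer loop and return the transcript.

    ask_round(round_no, transcript) -> list of questions (empty list = done)
    answer(question) -> the user's reply
    Both callables are hypothetical stand-ins for the model and the user.
    """
    transcript = []
    for round_no in range(1, MAX_ROUNDS + 1):
        questions = ask_round(round_no, transcript)[:MAX_QUESTIONS]
        if not questions:
            break  # nothing left to clarify, stop early
        transcript.extend((q, answer(q)) for q in questions)
    return transcript
```

The hard caps mean the loop always terminates, even if the question generator never runs out of things to ask.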

Pillar 2: Cross-model auditing

Self-review has a known blind spot: the same model that wrote the code tends to overlook the same issues when reviewing it.

Phage uses a different AI model as the auditor. Claude Code writes. Codex reviews. Different model, different blind spots. Exactly twice per task — once on the plan, once on the output. No more, no less.

Role                    Implementation
Worker (plan + build)   Claude Code (Anthropic)
Reviewer (audit)        Codex CLI (OpenAI)

Solving the moving goalposts problem

When you let Codex audit code, it invents new edge cases on every pass. New criteria appear. The bar keeps rising. The audit never ends. Implementation stalls. The goalposts move.

Phage solves this with three structural constraints:

1. Frozen criteria. Acceptance criteria are locked after user approval. Nobody — not Codex, not Claude Code — can modify them afterward.

2. Scoped audits. The Codex review prompt explicitly states: "Judge ONLY against the frozen acceptance criteria. Do NOT block for issues outside them." Non-criteria observations go to an observations field — reference material, not blocking reasons.

3. No re-audits. Exactly two audits per task. Plan audit. Final audit. Regardless of the verdict, there is no re-audit. revise → fix and proceed. block → fix and proceed. This design eliminates the infinite review loop by construction.
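The three constraints can be sketched in a few lines of Python. This is an illustrative sketch, not Phage's actual code: `FrozenCriteria`, `AuditResult`, `scope_findings`, and `run_task` are hypothetical names, and the real audit is a prompt sent to Codex CLI rather than a Python function. For brevity the sketch only produces `go` and `revise` verdicts.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FrozenCriteria:
    """Acceptance criteria, immutable once the user has approved them."""
    items: Tuple[str, ...]


@dataclass
class AuditResult:
    verdict: str             # "go" or "revise" in this sketch
    blocking: List[str]      # findings tied to a frozen criterion
    observations: List[str]  # everything else: reference material only


def scope_findings(criteria: FrozenCriteria,
                   findings: List[Tuple[Optional[str], str]]) -> AuditResult:
    """Split (criterion, message) findings into blocking vs. observations."""
    blocking, observations = [], []
    for criterion, message in findings:
        if criterion in criteria.items:
            blocking.append(message)
        else:
            observations.append(message)  # out of scope, never blocks
    return AuditResult("revise" if blocking else "go", blocking, observations)


def run_task(criteria, plan_findings, final_findings):
    """Exactly two audits. A non-go verdict means fix and proceed, never re-audit."""
    log = []
    for stage, findings in (("plan", plan_findings), ("final", final_findings)):
        result = scope_findings(criteria, findings)
        action = "proceed" if result.verdict == "go" else "fix and proceed"
        log.append((stage, result.verdict, action))
    return log
```

Because the audit count is fixed by the loop rather than by the verdict, no reviewer output can extend the review cycle.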

Pillar 3: Six immutable safety gates

Six checks fire on every task, regardless of what the acceptance criteria say. They are designed to be non-configurable and non-optional — they trigger before the system considers whether the action might be acceptable.

#  Gate                 What it blocks
1  Secrets              API keys, tokens, credentials in output
2  Destructive ops      File/branch deletion without human approval
3  External sends       Emails, webhooks, external API calls
4  Uncommitted changes  Starting work on a dirty git tree
5  Doc drift            Finishing without updating documentation
6  No rollback          Irreversible operations without a recovery path

Safety gates don't just stop. They stop, then auto-implement a local alternative. In the Twin Run benchmark, Task 6 required an external API call. The external_send gate blocked it. Phage automatically switched to schema-based local validation and completed the task with zero outbound network calls.
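As a rough mental model, the gates behave like a fixed checklist that runs before any action, with an optional local fallback when a gate fires. The sketch below is an assumption-laden illustration: the action dictionary, its keys, the secret-detection regex, and every gate name except `external_send` are hypothetical and not taken from Phage's source.

```python
import re
from typing import Optional, Tuple

# Hypothetical, deliberately simple pattern for secret-looking output.
SECRET_PATTERN = re.compile(r"(api[_-]?key|token|password)\s*[:=]\s*\S+", re.I)


def check_gates(action: dict) -> Optional[str]:
    """Return the name of the first gate that fires, or None if all pass."""
    if SECRET_PATTERN.search(action.get("output", "")):
        return "secrets"
    if action.get("kind") == "delete" and not action.get("human_approved", False):
        return "destructive_ops"
    if action.get("kind") == "external_send":
        return "external_send"
    if action.get("kind") == "start_task" and action.get("git_dirty", False):
        return "uncommitted_changes"
    if action.get("kind") == "finish_task" and not action.get("docs_updated", False):
        return "doc_drift"
    if action.get("irreversible", False) and not action.get("rollback_plan"):
        return "no_rollback"
    return None


def guard(action: dict, local_alternative: Optional[dict] = None) -> Tuple[str, Optional[dict]]:
    """Block the action if a gate fires and hand back the local alternative, if any."""
    gate = check_gates(action)
    if gate is None:
        return ("run", action)
    return ("blocked:" + gate, local_alternative)
```

In the Task 6 scenario, this corresponds to calling `guard` on an `external_send` action with a local schema-validation step as the alternative.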

Details: docs/SAFETY_GATES.md

Pillar 4: Persistent memory and learning across projects

Phage supports persistent learning across projects and sessions when multiple runs share the same PHAGE_STATE_DIR. Under this configuration, lessons persist and propagate across runs.

For the Twin Run benchmark, this behavior is intentionally constrained: each lane gets an isolated state directory so learning stays run-scoped and the A/B comparison remains fair.

In other words:

  • Phage as a system supports cross-project persistence
  • Twin Run as a benchmark uses deliberate run-scoped isolation

Layer 1: Episodes     Raw lessons from each task (always recorded)
Layer 2: Patterns     Auto-compressed when the same lesson appears 3+ times
Layer 3: Promotions   Proposed as permanent rules at 5+ occurrences
                      (human approval required before promotion)

Why three layers?

Learning data grows continuously. Loading everything every time would slow things down. The three-layer architecture solves this:

  • Layer 1 (Episodes) is written to both local and global storage
  • Layer 2 (Patterns) auto-compresses 3+ same-category lessons — 10 raw entries become 1 pattern
  • Layer 3 (Promotions) proposes patterns with 5+ occurrences as permanent rules. But only if the human approves. Never auto-promoted.
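The thresholds in the three layers can be expressed as a small sketch. The function names and the `(category, lesson)` tuple shape are assumptions made for illustration; only the 3+ and 5+ thresholds and the human-approval requirement come from the description above.

```python
from collections import defaultdict

PATTERN_THRESHOLD = 3     # Layer 2: compress at 3+ same-category lessons
PROMOTION_THRESHOLD = 5   # Layer 3: propose a rule at 5+ occurrences


def compress(episodes):
    """Layer 1 -> Layer 2: collapse recurring lessons into one pattern per category.

    `episodes` is a list of (category, lesson) tuples.
    """
    by_category = defaultdict(list)
    for category, lesson in episodes:
        by_category[category].append(lesson)
    return {
        category: {"count": len(lessons), "examples": sorted(set(lessons))}
        for category, lessons in by_category.items()
        if len(lessons) >= PATTERN_THRESHOLD
    }


def propose_promotions(patterns):
    """Layer 2 -> Layer 3 proposals: patterns seen often enough to become rules."""
    return sorted(c for c, p in patterns.items() if p["count"] >= PROMOTION_THRESHOLD)


def promote(proposals, approved_by_human):
    """Only proposals the human explicitly approved become permanent rules."""
    return [c for c in proposals if c in approved_by_human]
```

Keeping `promote` separate from `propose_promotions` makes the approval step impossible to skip by accident: with an empty approval set, nothing is ever promoted.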

At the planning phase (Phase 3 in the six-phase pipeline), Phage reads: local lessons (all) + global patterns (compressed) + global lessons (same-category + last 20, deduplicated). Because compressed patterns are prioritized, even 100 raw entries result in a lightweight read.
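One plausible reading of that selection rule is sketched below. The function name, argument shapes, and the interpretation of "same-category + last 20" as a union of two selections are assumptions, not a description of Phage's internals.

```python
def planning_context(local_lessons, global_patterns, global_lessons, category, recent=20):
    """Assemble what the planner reads before planning a task.

    Lessons are (category, text) tuples; `global_lessons` is ordered oldest first.
    """
    same_category = [lesson for lesson in global_lessons if lesson[0] == category]
    most_recent = global_lessons[-recent:]

    # Deduplicate while preserving order: same-category lessons first.
    seen, selected = set(), []
    for lesson in same_category + most_recent:
        if lesson not in seen:
            seen.add(lesson)
            selected.append(lesson)

    return {
        "local_lessons": list(local_lessons),      # all of them
        "global_patterns": list(global_patterns),  # already compressed
        "global_lessons": selected,                # bounded, deduplicated
    }
```

The global lesson read is bounded by the size of the current category plus `recent`, so it does not grow with the total number of raw entries in other categories.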

In benchmark mode, these layers accumulate within the current run only. Outside the benchmark, shared state allows them to persist across runs.
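The difference between shared and isolated state comes down to which directory `PHAGE_STATE_DIR` points at. The sketch below illustrates the idea only: `resolve_state_dir`, `isolated_lane_env`, and the fallback path `~/.phage-state` are hypothetical, and how Phage itself reads the variable is not documented here.

```python
import os
import tempfile
from pathlib import Path


def resolve_state_dir(env=None, default="~/.phage-state"):
    """Return the directory named by PHAGE_STATE_DIR, else a hypothetical default."""
    env = os.environ if env is None else env
    return Path(env.get("PHAGE_STATE_DIR", default)).expanduser()


def isolated_lane_env(lane, root):
    """Benchmark-style isolation: give each lane its own state directory."""
    lane_dir = Path(root) / lane
    lane_dir.mkdir(parents=True, exist_ok=True)
    return {"PHAGE_STATE_DIR": str(lane_dir)}


# Two benchmark lanes never see each other's lessons.
with tempfile.TemporaryDirectory() as root:
    frozen_env = isolated_lane_env("frozen", root)
    learning_env = isolated_lane_env("learning", root)
    lanes_isolated = resolve_state_dir(frozen_env) != resolve_state_dir(learning_env)

# Two everyday runs with the same value share one lesson store.
shared_env = {"PHAGE_STATE_DIR": "/tmp/phage-shared"}
runs_share_state = resolve_state_dir(shared_env) == resolve_state_dir(dict(shared_env))
```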

Proven: learning inheritance

In the Twin Run benchmark, an XSS fix lesson from Task 1 ("use textContent, not innerHTML — the reviewer will reject it") was automatically applied in Task 4's form validator. The earlier failure prevented a planning rework cycle in the later task.

In this 4-task benchmark, the Learning lane completed 15% faster (1546s vs 1820s) — consistent with reduced planning rework when prior lessons are available.
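For reference, the 15% figure follows directly from the two totals reported above:

```python
frozen_s, learning_s = 1820, 1546   # total seconds per lane, from the benchmark

saved = frozen_s - learning_s       # 274 seconds
reduction = saved / frozen_s        # fraction of Frozen-lane time saved

print(f"{saved}s saved, {reduction:.1%} less wall-clock time")
# prints: 274s saved, 15.1% less wall-clock time
```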


Zero external dependencies

No npm install. No pip install. No build step.

Phage runs on skill files and CLI tools. You need Claude Code and Codex CLI. That's it.

No environment setup. No dependency hell. Start in minutes.


Twin Run benchmark

Twin Run is Phage's built-in A/B test. It runs the same tasks on two lanes side by side:

  • Frozen: No learning. Every task starts from zero.
  • Learning: Lessons accumulate. Later tasks inherit earlier lessons.

Same tasks, same audits. The comparison shows whether accumulated lessons affect completion time and audit outcomes.

Initial benchmark (mandatory 4 tasks)

Metric             Frozen  Learning  Delta
Tasks succeeded    4/4     4/4
Total time         1820s   1546s     -15%
Codex final GO     2/4     3/4       +1
Safety gate fires  0       1         external_send blocked
Quality checklist 100%

Evidence root: results/evidence/mandatory-4-baseline.json
Source run: results/runs/20260330-072752

Highlights:

  • Task 3 (CLI Report): Learning lane achieved final GO; Frozen stayed at revise
  • Task 6 (Config Sanitizer): Safety gate correctly blocked an external API send and auto-implemented a local alternative
  • Learning lane was 15% faster overall (4-task run, single seed) — inherited lessons reduced planning rework cycles

Supplementary benchmark (extended-v1, 5 tasks)

Extended run with additional tasks for stability and safety evidence.

Metrics tracked

Metric               What it measures
Codex findings       Issues caught by the cross-model reviewer
Rework cycles        Fix-review loops after audit
Quality checklist    Per-task binary pass/fail checks
Completion time      Start to final audit GO
Safety gate fires    Correct activation of immutable safety gates
Learning transfer    Lessons from earlier tasks applied in later ones
Pattern compression  Automatic compression of recurring lessons

External benchmark: SWE-bench Verified

We ran Phage against SWE-bench Verified — a curated subset of real-world GitHub issues used to evaluate coding agents. Four arms, 50 problems each, single seed.

This is a descriptive comparison, not a statistical claim. Single seed, n=50. No hypothesis tests were run. Results are not comparable to public SWE-bench leaderboard scores (different subset, different harness, different conditions).

Stage 2 results (50 problems × 4 arms)

Arm  Description                  Solve Rate     Conditional Rate
A    Claude Code solo (no audit)  74.0% (37/50)  77.1%
B    Codex solo (self-review)     76.0% (38/50)  82.6%
C    Claude Code + self-audit     80.0% (40/50)  81.6%
D    Phage (cross-model audit)    82.0% (41/50)  85.4%

Cross-model audit (D) salvaged 3 problems that self-audit (C) missed, while breaking 2 that C solved. Net: +1.
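The solve rates and the net difference between arms C and D can be recomputed from the raw counts. The variable names below are illustrative; the numbers are the ones reported in the table.

```python
TOTAL = 50
solved = {"A": 37, "B": 38, "C": 40, "D": 41}   # problems solved per arm

rates = {arm: n / TOTAL for arm, n in solved.items()}
for arm, rate in rates.items():
    print(f"Arm {arm}: {rate:.1%} ({solved[arm]}/{TOTAL})")

salvaged, broken = 3, 2      # D vs. C: newly solved, newly broken
net = salvaged - broken      # matches the difference in solved counts
print(f"D vs C net: {net:+d}")
```

A net of +1 out of 50 problems on a single seed is small, which is why the comparison is presented as descriptive rather than statistical.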

Full methodology and per-problem breakdown: docs/BENCHMARK_EXTERNAL.md

Stage 4: Learning accumulation — null result

We attempted to measure whether Phage's learning system improves solve rates across grouped SWE-bench problems. It did not. Three groups (Django, Sympy, Sphinx) showed 0pp delta between Frozen and Learning lanes.

Root causes: ceiling effect (Sphinx: 100% both lanes), floor effect (Sympy: 0% both lanes), and task independence — SWE-bench bugs don't share failure patterns that learning could exploit.

We report this as a null result. Learning accumulation is still supported by the Twin Run benchmark (15% faster on a 4-task, single-seed run) but is not measurable on SWE-bench Verified.

Details: docs/BENCHMARK_EXTERNAL.md


Six-phase pipeline

+--------------------------------------+
|           Phage Core                 |
|   6 phases + safety gates (always)   |
+------------------+-------------------+
|    Claude Code   |    Codex CLI      |
|   plan + build   |   audit (×2)      |
+------------------+-------------------+

  1. Brainstorm — Clarify requirements through structured questions
  2. Acceptance criteria — Define binary pass/fail conditions, then freeze them
  3. Plan + Codex audit #1 — Claude Code plans, Codex audits against frozen criteria
  4. Execute — Claude Code implements the plan
  5. Codex final audit #2 — Codex audits the output against frozen criteria
  6. Lessons — Extract lessons, compress patterns, propose rule promotions, inherit to the next generation

Full spec: docs/OPERATING_SPEC.md


Dogfooding: Phage built its own learning system

Phage's 3-layer learning architecture was designed, implemented, and audited entirely through the Phage pipeline itself.

Input:  "Upgrade the learning system to a 3-layer architecture"
Tool:   Phage (the same pipeline being improved)
Output: 3-layer learning with pattern compression and rule promotion

The existing AI memory landscape (Mem0, Vestige, DreamContext, ACE) was surveyed. The missing piece — automatic pattern compression and rule promotion — was the gap Phage filled.

Details: docs/CASE_STUDY.md


Battle-tested

The author uses Phage daily on production work — the design-audit-build-learn cycle on real projects. Your mileage may vary depending on project complexity and model behavior.


Getting started

See the results (5 minutes)

  1. Clone this repo
  2. Open results/latest/ and browse the gallery
  3. Compare Frozen vs Learning side-by-side
  4. Read the audit cards to see what Codex caught

Run it yourself (30 minutes)

See examples/quick-start/ to run a single-task preflight on your own machine. You need Claude Code and Codex CLI. Nothing else.

Full setup: CONTRIBUTING.md


Current status

Phage's 6-phase pipeline, cross-model auditing, safety gates, and learning system are all operational. The author uses them daily on production work.

Benchmark evidence:

  • Twin Run (built-in A/B): 4 mandatory + 5 extended tasks completed. Learning lane 15% faster. All safety gates verified.
  • SWE-bench Verified (external): Stage 2 complete (50 problems). Cross-model audit arm (D) achieved 82.0% solve rate. Learning accumulation on SWE-bench was a null result (0pp delta).

Known issues

  • Frozen lane JSON artifact coverage is partial in the first run (plan_audit.json and quality_checklist.json may be missing). Adapter prompts have been strengthened with inline schemas and per-phase write instructions.
  • Benchmark currently supports Claude Code + Codex CLI only. For other models, fork and adapt the adapter/reviewer interfaces.
  • Learning accumulation is not measurable on SWE-bench Verified due to task independence and ceiling/floor effects. See Stage 4 results.

Limitations

Phage is not omnipotent. Here's what it can't do.

It stops when it needs you. When API keys, external service authentication, or environment-specific configuration is required, Phage pauses and waits for user input. By design, it never accesses external services without human approval. This is not a limitation — it's an intentional safety decision.

No E2E testing built in. Phage does not integrate browser automation such as Playwright. It can run unit tests and CLI tests, but browser-based E2E testing requires separate setup.

Claude Code + Codex CLI only. For other models, fork and adapt the interfaces yourself.


Parent Brand Update

Phage remains the product name.

As of 2026-04-01, the umbrella brand decision is Blastrum. Spore remains the sibling product-in-development under the same future parent brand.

The GitHub organization has been renamed to Blastrum. Any lingering phage-dev references should be treated as legacy naming, not current brand architecture.

The naming decision is recorded in the internal brand log.


License

MIT
