Skip to content

Latest commit

 

History

History
291 lines (220 loc) · 13.1 KB

File metadata and controls

291 lines (220 loc) · 13.1 KB

loopforge

Prompt engineering is one good instruction. Loop Engineering is the system around it. loopforge tells you which of the six parts your "loop" is missing — then scaffolds and runs one that isn't.

CI Version Python License: MIT

loopforge linting a 'nightly-cleanup' loop that is really just a cron job: it flags a critical no-brake runaway plus missing verification, memory, and handback (grade F), then a scaffolded ci-green loop passes with all six blocks present (grade A).

loopforge is a linter, scaffolder, and runner for agent loops — the autonomous systems that wake up, do work, verify it, remember it, and decide whether to continue. The catch (per Google's Addy Osmani, who named the practice): a real loop is not a cron job that calls an agent. It has six moving parts, and most homemade loops are missing the pieces that stop budget burn, duplicated work, and unverified changes. loopforge makes those parts checkable. Deterministic, zero-dependency, no API key, no LLM call.

Star this if

  • You are about to let an agent run unattended and want brakes before budget burns.
  • You want missing verify, memory, isolation, and handback caught in CI.
  • You want a complete loop scaffold instead of another cron job with a prompt.

What you'd use it for

  • You're building an autonomous loop (a CI-fixer, a content-curation bot, a nightly refactor). Lint it first: loopforge tells you it has no brake, no memory, and reviews its own work — before you let it run unattended. → loopforge lint .
  • You want a correct loop to start from. Scaffold one with all six blocks already wired, so you fill in the work instead of rediscovering the architecture. → loopforge init my-loop
  • You want to run a loop safely. The runner enforces the limits it can actually measure (iterations, wall-clock), records every step to a ledger, and refuses to even start a loop that has no way to stop.loopforge run my-loop/loop.toml

The six blocks of a real loop

A loop has to answer six questions. Miss one and it quietly stops being a loop:

# Block The question it answers loopforge checks
1 trigger Who wakes it up? (schedule / event / until-goal) not self-starting → L001
2 isolation Can parallel agents avoid clobbering each other? (worktrees) parallel, no isolation → L006
3 skills How does it know your conventions? (durable project knowledge) no skills → L007
4 act What does it actually do? (the agent invocation + its tools) no command → L010
5 verify Who checks it — not itself? (a second command / different model) missing or self-review → L004 / L005
6 memory How does it remember? (the model forgets; the repo doesn't) no ledger → L003

…plus the two disciplines the article keeps hammering: a cost brake (L002/L008) so it can't run away, and a human brake (L009) so judgment, acceptance, and the stop button stay with you.

Quickstart

pip install git+https://github.com/yingchen-coding/loopforge

loopforge init ci-green          # scaffold a complete loop (passes lint out of the box)
loopforge lint ci-green --score  # grade any loop A–F on the six blocks
loopforge run ci-green/loop.toml --dry-run   # show exactly what it would do

Point it at someone else's loop, or a whole directory of them:

loopforge lint path/to/loops/     # finds every loop.toml / *.loop.toml and grades each

📖 Hands-on tour: docs/walkthrough.md — scaffold a loop, break it to see each block flagged, wire a real verify step, run it safely, and gate it in CI.

What it catches

A loop that looks fine — it's on a schedule and it calls an agent — but isn't:

$ loopforge lint nightly-cleanup/
nightly-cleanup
  ✖ critical  L002  No hard stop — no iteration cap and no token/time/cost ceiling. A goal alone
                    (`until`) does not count: if the goal is never met, the loop runs forever.
        ↳ fix: Add trigger.max_iterations, or a [budget] with max_seconds / max_cost_usd.
  ✖ major     L004  No verification step — the agent grades its own homework.
  ✖ major     L003  No memory ledger — the model forgets between iterations.
  ✖ major     L009  No handback — nothing ever returns control to a human.

✖ 7 findings — completeness: F (0/100)

And the runner won't let you run that:

$ loopforge run nightly-cleanup/loop.toml
error: refusing to run nightly-cleanup: L002 — the loop has no brake and could run away.

How a loop is defined

One loop.toml, six tables (loopforge init writes a complete one for you):

name = "ci-green"
goal = "Keep main green: when CI fails, find the cause, fix it, verify, and record the fix."

[trigger]                       # 1. who wakes it
type = "schedule"
cron = "*/30 * * * *"
max_iterations = 15             #    ...and what makes it stop

[isolation]                     # 2. parallel-safe workspace
mode = "worktree"

[skills]                        # 3. durable project knowledge
files = ["skills/project.md"]

[act]                           # 4. the agent invocation (harness-agnostic)
command = "agent-cli run {prompt}"
prompt_file = "prompts/act.md"

[verify]                        # 5. an INDEPENDENT check — never the same command as act
command = "pytest -q"

[memory]                        # 6. the ledger the repo keeps
file = "memory/ledger.md"
trace_file = "memory/trace.jsonl" # full commands, outputs, verification, owner, and transitions

[budget]                        # cost brake
max_tokens = 200000
max_cost_usd = 5.0

[handback]                      # human brake
owner = "platform-oncall"       # a concrete person/team owns acceptance and responsibility
on = ["budget-exceeded", "verify-failed-twice", "goal-reached", "needs-human"]
notify = "echo"

{prompt} is assembled from your skills + the memory ledger + the prompt file, so every iteration reloads what the project is and what's already been done. The act command is any agent CLI — loopforge is harness-agnostic.

The rule catalog

$ loopforge list-rules
Code Sev What it catches
L001 major trigger missing or manual — a loop you start by hand isn't a loop
L002 critical no hard stop — no iteration cap or token/time/cost ceiling (a goal alone can loop forever)
L011 minor/major trigger declared but not wired — schedule with no cron, event with no source, typo'd type
L003 major no memory ledger — the model forgets every iteration
L004 major no verification — the agent grades its own homework
L005 major self-review — verify runs the same command/model as act
L006 minor/major no isolation (major if it runs in parallel)
L007 minor no skills/knowledge — re-onboards a new hire every iteration
L008 major iterations capped but no per-run token/cost ceiling
L009 major no handback — nothing returns control to a human
L010 major no act command — the loop does nothing
L012 minor no goal — can't tell progress from motion
L013 major referenced skills/memory/prompt file doesn't exist on disk
L014 major no named owner — the loop can finish, but nobody owns acceptance

Only the runaway is critical, on purpose: a loop that can't stop is the one failure that turns "unattended" into "expensive." Everything else degrades quality; that one burns money.

Cost & boundaries (read this)

Loops are not free, and loopforge won't pretend otherwise — the comments under every Loop Engineering post are people who got a surprise bill. Two honest limits:

  • Token cost is real. A loop re-reads context, retries, and re-verifies every iteration; several agents in parallel multiply it. loopforge enforces the brakes it can measure — iteration count and wall-clock — and surfaces your token/cost ceilings in the plan, but it can't count tokens it never sees. Don't loop a one-off task or one with no stable feedback signal.
  • The loop moves work; it can't hold responsibility. "Done" from an agent isn't done, and "tests pass" isn't "the logic is right." That's why verify must be independent and handback must exist. The point of a loop is to pull you out of the repetitive parts — while judgment, acceptance, and the brake stay in your hands.

Gate it in CI

A ready-made GitHub Action ships in this repo (action.yml) — lint your loop definitions on every PR so a loop can't regress into a runaway unnoticed:

name: loops
on: [push, pull_request]
jobs:
  loopforge:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: yingchen-coding/loopforge@v0.5.3
        with:
          path: loops/        # dir of loop.toml files
          fail-at: major
          score: "true"

Or as a plain step:

- run: pip install git+https://github.com/yingchen-coding/loopforge
- run: loopforge lint loops/ --score

Eval — score predictions against reality

The strongest verify isn't "the command exited 0" — it's "the prediction matched what actually happened." loopforge eval scores a CSV of predictions once their real outcomes are known:

loopforge eval predictions.csv
# predictions: 4  ·  resolved: 3  ·  pending: 1
# accuracy: 67%  (2/3 correct)
# calibration (Brier, lower=better): 0.228
# P&L: +3.20  ·  staked: 12.00  ·  ROI: +26.7%

It's domain-agnostic — soccer bets, stock calls, anything with a predicted value and a later actual (plus optional prob for calibration and stake/odds/result for P&L).

Auto-validate with the latest data. eval is the scoring half; close the loop by pairing it:

  • act = a resolver that fetches the latest real outcomes and fills in actual/result (a results API, yfinance for stock calls, your chart for medical predictions — the domain's ground truth).
  • verify = loopforge eval predictions.csv --min-accuracy 0.5 — the loop fails its own gate when its predictions stop beating the bar.
  • schedule it, and every prediction you make validates itself against reality on a cadence.

See examples/eval-soccer/.

Schedule it

A [trigger].cron is just metadata until something fires it. loopforge schedule installs it into your crontab so the loop runs on its own:

loopforge schedule install my-loop/loop.toml --dry-run   # preview the resulting crontab
loopforge schedule install my-loop/loop.toml             # add it (Mondays 9am, etc.)
loopforge schedule list                                  # show loopforge-managed entries
loopforge schedule remove my-loop                        # take it off the schedule

Entries are tagged # loopforge:<name> and managed idempotently — re-installing replaces, it never duplicates. A no-clobber guard refuses to write if your current crontab can't be read, so your other cron jobs are never wiped. Only type = "schedule" loops (with a cron) can be installed. (macOS: cron needs Full Disk Access to run jobs that touch protected folders like ~/Documents.)

Compose with a reviewer

loopforge is the loop engine; the verify step is any command, so a real reviewer can be the independent check. For agent definitions, pair it with agentguard (a deterministic security linter for agents) — loopforge runs the loop, agentguard reviews the work:

[act]
command = "agent-cli run {prompt}"      # the agent does the work
[verify]
command = "agentguard agents/ --fail-at critical"   # an independent reviewer gates it

Point one loop at a whole fleet of definition sets (agentguard takes multiple paths), and you have a loop that keeps every agent you own clean — the loop engine driving the reviewer, on a schedule.

Why this exists

Loop Engineering is useful because the failure modes are predictable: runaway budget, duplicated work, missing memory, self-review, and no human owner. Predictable failure modes should be lintable, the same way a type checker catches the bugs you'd otherwise find in production. That's all loopforge is: the six blocks, made checkable, with a scaffolder and a safe runner attached.

Install

pip install git+https://github.com/yingchen-coding/loopforge
# or for development:
git clone https://github.com/yingchen-coding/loopforge && cd loopforge && pip install -e ".[dev]"

Python ≥ 3.11, zero runtime dependencies.

License

MIT © Ying Chen