DSPy ReAct agent for the BitGN PAC1 benchmark. The agent operates inside a sandboxed PCM filesystem VM through a small set of tools (tree, list_dir, read, write, delete, mkdir, move, find, search, context) and emits a structured outcome (OUTCOME_OK, OUTCOME_DENIED_SECURITY, OUTCOME_NONE_CLARIFICATION, OUTCOME_NONE_UNSUPPORTED, OUTCOME_ERR_INTERNAL) plus a summary and file refs for each task.
Scored 0.42 (42%) on bitgn/pac1-prod — 104 trials, 0 errors, ~105 min wall time.
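The structured outcome described above can be modelled as a small enum plus a result record. This is a minimal sketch, not the repo's actual types: `Outcome`, `TaskResult`, and `parse_outcome` are hypothetical names, and falling back to OUTCOME_ERR_INTERNAL for unrecognized strings is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(str, Enum):
    """The five outcome codes the agent may emit."""

    OK = "OUTCOME_OK"
    DENIED_SECURITY = "OUTCOME_DENIED_SECURITY"
    NONE_CLARIFICATION = "OUTCOME_NONE_CLARIFICATION"
    NONE_UNSUPPORTED = "OUTCOME_NONE_UNSUPPORTED"
    ERR_INTERNAL = "OUTCOME_ERR_INTERNAL"


@dataclass
class TaskResult:
    """Hypothetical container for one task's answer (not the repo's type)."""

    outcome: Outcome
    summary: str
    refs: list[str] = field(default_factory=list)  # file refs backing the answer


def parse_outcome(raw: str) -> Outcome:
    """Map a model-produced string to an Outcome.

    Assumption: anything unrecognized is treated as an internal error.
    """
    try:
        return Outcome(raw.strip())
    except ValueError:
        return Outcome.ERR_INTERNAL
```

Validating the string before submitting it keeps a malformed model output from reaching the benchmark as an unknown code.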
src/
├── main.py # entrypoint: run all (or filtered) benchmark tasks
├── optimize.py # entrypoint: BootstrapFewShot → MIPROv2 → save state
├── agent.py # PAC1Signature + PAC1Agent (dspy.ReAct wrapper)
├── harness.py # BitGN client, trial lifecycle, parallel batching (size 8)
├── tools.py # PCM runtime tools (filesystem ops bound to a vm client)
├── lm.py # DSPy LM factory (run vs optimize models)
└── config.py # .env loader, typed settings
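The "parallel batching (size 8)" noted for harness.py could look roughly like the sketch below. It assumes a thread pool and a per-task callable; whether the real harness uses threads, and the names `batched` and `run_in_batches`, are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")

BATCH_SIZE = 8  # batch size stated in the layout above


def batched(items: list[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(
    task_ids: list[T],
    run_one: Callable[[T], R],
    size: int = BATCH_SIZE,
) -> list[R]:
    """Run `run_one` over all tasks, one batch at a time, preserving order."""
    results: list[R] = []
    for batch in batched(task_ids, size):
        # One worker per task in the batch; the batch finishes before the next starts.
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            results.extend(pool.map(run_one, batch))
    return results
```

Bounding each batch caps the number of concurrent trials and model calls, which helps stay under rate limits.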
Two-model split (defaults from .env.template):
- RUN_MODEL_ID=openai/gpt-5.4 via OpenRouter (RUN_MODEL_API_BASE=https://openrouter.ai/api/v1). Runs every task; this is what gets optimized. Any OpenAI-compatible endpoint works; use the openai/ prefix.
- OPTIMIZE_MODEL_ID=anthropic/claude-sonnet-4-6. Teacher in BootstrapFewShot and prompt_model in MIPROv2. Only used during optimize.
Optimized prompts are saved to OPTIMIZED_AGENT_PATH (default optimized_agent.json) and auto-loaded on the next run.
Trial loop (harness.run_trial): take a fresh PcmRuntimeClientSync, rebind tools to it via agent.update_tools, build a grounding snapshot (tree + AGENTS.md + UTC date context), call ReAct, then vm.answer(...) with the predicted outcome/summary/refs.
- Requires Python ≥ 3.14 and uv.
- Copy .env.template → .env and fill in:
  - BITGN_API_KEY: BitGN benchmark key
  - RUN_MODEL_API_KEY: OpenRouter key (or any OpenAI-compatible endpoint paired with RUN_MODEL_API_BASE)
  - ANTHROPIC_API_KEY: only needed for optimize
  - BENCHMARK_ID: bitgn/pac1-dev for optimization, bitgn/pac1-prod for final scoring
- Install deps: make sync
Always optimize on bitgn/pac1-dev, never on prod. MIPROv2 needs many trial calls and would leak signal from the held-out prod set. Switch BENCHMARK_ID to bitgn/pac1-prod only for the final make run.
# install / refresh deps
make sync
# 1. optimize on DEV (set BENCHMARK_ID=bitgn/pac1-dev in .env) → writes optimized_agent.json
make optimize
# optionally, optimize on a subset of dev tasks
make optimize-task TASKS='t01 t02 t03'
# 2. score on PROD (set BENCHMARK_ID=bitgn/pac1-prod in .env)
make run
# run a subset of tasks
make run-task TASKS='t01 t03'
Typical flow: make sync → make optimize (dev) → flip BENCHMARK_ID to prod → make run.
To start fresh (ignore saved prompts), delete optimized_agent.json before make run.
