Multi-agent orchestration framework for code implementation (Commit0) and paper reproduction (PaperBench) benchmarks. Built on OpenHands SDK.
# Clone the repository (with submodules)
git clone --recursive https://github.com/dreamyang-liu/STORM.git
cd STORM/STORM
# Run the setup script (installs deps + builds Docker images)
bash setup.sh
# Set your API key
source .env # edit .env first to fill in LLM_API_KEY and OPENROUTER_API_KEY
To set up manually instead of using setup.sh:
cd STORM/STORM
# Install Python dependencies
uv sync
# Build Docker image
cd ../software-agent-sdk
docker build \
-f openhands-agent-server/openhands/agent_server/docker/Dockerfile \
--target source-minimal-storm \
--platform linux/amd64 \
-t agent-server:storm-base \
.
cd ../STORM
# Agent API (DashScope or OpenRouter)
export LLM_API_KEY=<your-api-key>
export LLM_BASE_URL=https://openrouter.ai/api/v1 # or https://dashscope.aliyuncs.com/compatible-mode/v1
# Judge API (OpenRouter, for PaperBench evaluation)
export OPENROUTER_API_KEY=<your-openrouter-key>
# SDK path
export SDK_SOURCE_DIR=<path-to>/software-agent-sdk
Download the commit0_combined dataset:
# Place at STORM/data/commit0/commit0_combined_disk/
Place the PaperBench data from frontier-evals at:
STORM/data/paperbench/papers/
├── rice/
│ ├── config.yaml
│ ├── paper.pdf
│ ├── paper.md
│ ├── rubric.json
│ ├── addendum.md
│ └── blacklist.txt
└── ...
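Before a PaperBench run, it can help to confirm that a paper directory matches the layout above. The following is a minimal sketch, assuming every paper directory carries the same six files shown for rice; the helper name missing_paper_files is hypothetical and not part of STORM.

```python
from pathlib import Path

# File names taken from the directory listing above.
# Assumption: every paper directory follows the same layout as rice/.
EXPECTED_FILES = (
    "config.yaml",
    "paper.pdf",
    "paper.md",
    "rubric.json",
    "addendum.md",
    "blacklist.txt",
)


def missing_paper_files(papers_root, paper_id):
    """Return the expected file names that are absent from papers_root/paper_id."""
    paper_dir = Path(papers_root) / paper_id
    return [name for name in EXPECTED_FILES if not (paper_dir / name).is_file()]


if __name__ == "__main__":
    # Path from the layout above, relative to the repository root.
    missing = missing_paper_files("STORM/data/paperbench/papers", "rice")
    print("missing:", missing or "none")
```

An empty list means the directory has every file in the listing; a directory that does not exist reports all six as missing.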
The PaperBench judge requires additional packages:
uv pip install -e ../frontier-evals/project/paperbench
uv pip install -e ../frontier-evals/project/common/preparedness_turn_completer
bash scripts/run_single.sh
bash scripts/run_multi.sh
bash scripts/run_batch.sh
Edit the parameters at the top of each script (model, task, paper_id/repo, etc.) before running.
| Parameter | Description |
|---|---|
| task | "commit0" or "paperbench" |
| model | LiteLLM model identifier (e.g., openai/deepseek-v4-pro) |
| max_subagents | Number of parallel engineer subagents |
| max_iterations | Maximum LLM iterations for the manager |
| sub_iterations | Maximum LLM iterations per subagent |
| rounds_of_chat | Maximum rounds of task assignment per engineer |
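The parameters in the table can be collected into a small validated structure before a run is launched. This is only a sketch: the run scripts read these values from the parameters at the top of each file, and the RunConfig class, the positive-integer check, and the numeric values used here are hypothetical rather than part of STORM.

```python
from dataclasses import dataclass

# The two task names listed in the parameter table.
VALID_TASKS = ("commit0", "paperbench")


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container mirroring the parameter table."""

    task: str
    model: str
    max_subagents: int
    max_iterations: int
    sub_iterations: int
    rounds_of_chat: int

    def __post_init__(self):
        if self.task not in VALID_TASKS:
            raise ValueError(f"task must be one of {VALID_TASKS}, got {self.task!r}")
        # Assumption: every numeric limit must be a positive integer.
        for name in ("max_subagents", "max_iterations", "sub_iterations", "rounds_of_chat"):
            value = getattr(self, name)
            if not isinstance(value, int) or value < 1:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")


# Illustrative values only; the model identifier is the example from the table.
example = RunConfig(
    task="paperbench",
    model="openai/deepseek-v4-pro",
    max_subagents=2,
    max_iterations=50,
    sub_iterations=30,
    rounds_of_chat=3,
)
```

Validating early catches a mistyped task name before any Docker container or LLM call is started.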
Results are saved to outputs/<task>/<model>/<identifier>/<mode>/<params>/:
- cost.json: token usage and cost breakdown
- runtime.txt: wall-clock runtime in seconds
- outputs.jsonl: structured event log
- grade.json: (PaperBench) judge evaluation results
- report.json: (Commit0) pytest results
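The result files in a run directory can be gathered with a short script. The sketch below assumes that runtime.txt contains a single number and that outputs.jsonl holds one event per non-empty line; the JSON schemas are not documented here, so the parsed contents are returned unchanged. The helper name load_run is hypothetical.

```python
import json
from pathlib import Path


def load_run(run_dir):
    """Collect whichever result files exist in one run directory."""
    run_dir = Path(run_dir)
    summary = {}

    runtime_file = run_dir / "runtime.txt"
    if runtime_file.is_file():
        # Assumption: the file holds one number, the runtime in seconds.
        summary["runtime_seconds"] = float(runtime_file.read_text().strip())

    # Keep the parsed JSON as-is, since the schemas are not specified.
    for name in ("cost.json", "grade.json", "report.json"):
        path = run_dir / name
        if path.is_file():
            summary[name] = json.loads(path.read_text())

    events_file = run_dir / "outputs.jsonl"
    if events_file.is_file():
        with events_file.open() as fh:
            # Count non-empty lines, one event per line.
            summary["num_events"] = sum(1 for line in fh if line.strip())

    return summary
```

Files that are absent are skipped, so the same function works for both tasks: a Commit0 run yields report.json and a PaperBench run yields grade.json.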
bash scripts/rejudge.sh <output_dir> [paper1 paper2 ...]
We thank the following open-source projects that STORM builds upon:
- OpenHands for the agent SDK framework
- Commit0 for the code implementation benchmark
- PaperBench for the paper reproduction benchmark
@misc{liu2026multiagentcollaborationstatemanagement,
title={Multi-agent Collaboration with State Management},
author={Mengyang Liu and Taozhi Chen and Zhenhua Xu and Xue Jiang and Yihong Dong},
year={2026},
eprint={2605.20563},
archivePrefix={arXiv},
primaryClass={cs.MA},
url={https://arxiv.org/abs/2605.20563},
}