SwarmBench is a structured framework for designing, executing, and evaluating multi-agent AI systems on complex, decomposable tasks.
Core hypothesis: Certain classes of problems are structurally better solved by coordinated agent swarms than by a single agent.
This repository implements a full pipeline for:
- Task design
- Agent decomposition
- Parallel execution
- Result synthesis
- Deterministic verification
Traditional single-agent systems struggle with:
- Large context windows
- Multi-step coordination
- Coverage across many artifacts
- Consistency during synthesis
SwarmBench addresses these limitations by enabling:
Task → Decomposition → Parallel Agents → Reducer → Verified Output
Where:
- Independent subproblems are solved in parallel
- Outputs are reconciled into a final result
- A verifier enforces correctness and completeness
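The flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SwarmBench's implementation: `decompose`, `solve_subtask`, `reduce_results`, `verify`, and `run_pipeline` are hypothetical names, and each "agent" is a plain function that counts the words in one document.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(task: dict[str, str]) -> list[tuple[str, str]]:
    """Split a task into independent (name, text) subproblems."""
    return sorted(task.items())


def solve_subtask(subtask: tuple[str, str]) -> tuple[str, int]:
    """Stand-in for one agent: count the words in a single document."""
    name, text = subtask
    return name, len(text.split())


def reduce_results(partials: list[tuple[str, int]]) -> dict:
    """Reconcile the per-agent outputs into one final result."""
    counts = dict(partials)
    return {"per_file": counts, "total": sum(counts.values())}


def verify(result: dict, task: dict[str, str]) -> bool:
    """Check completeness (every input covered) and internal consistency."""
    complete = set(result["per_file"]) == set(task)
    consistent = result["total"] == sum(result["per_file"].values())
    return complete and consistent


def run_pipeline(task: dict[str, str]) -> tuple[dict, bool]:
    """Task -> decomposition -> parallel agents -> reducer -> verified output."""
    subtasks = decompose(task)
    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(solve_subtask, subtasks))
    result = reduce_results(partials)
    return result, verify(result, task)
```

The verifier here only checks coverage and arithmetic consistency; a real task would compare against task-specific expectations.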
.
├── instruction.md        # Task definition (agent-facing, role-neutral)
├── task.toml             # Metadata, constraints, execution config
├── decomposition.yaml    # Multi-agent orchestration plan
├── environment/
│   └── Dockerfile        # Runtime environment
├── tests/
│   ├── test.sh           # Entry point for verification
│   ├── verify.py         # Deterministic scoring logic
│   └── judge.py          # (Optional) LLM-based evaluation
└── solution/
    └── solve.sh          # Oracle solution (ground truth)
SwarmBench runs through a controlled pipeline:
- Build container from environment/Dockerfile
- Load task from instruction.md
- Execute agent(s) inside container
- Run verification via tests/test.sh
- Compute reward score (0.0 to 1.0)
- Store logs and outputs
SwarmBench supports structured decomposition strategies:
Fan-out / Synthesize
- Independent agents solve isolated subproblems
- A reducer merges and validates outputs
Map / Reduce
- Work is distributed across shards
- Aggregation ensures global consistency
The strategy for a task is defined in decomposition.yaml. Each sub-task specifies:
- Scope of responsibility
- Input artifacts
- Expected output
- Dependencies
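The dependency field is what determines which sub-tasks can run in parallel. The sketch below groups sub-tasks into execution waves with the standard library's `graphlib`. The schema of decomposition.yaml is not shown here, so the plan is written as a Python dict with hypothetical field names (`scope`, `inputs`, `output`, `depends_on`) and hypothetical sub-task names.

```python
from graphlib import TopologicalSorter

# Hypothetical plan; the real decomposition.yaml schema may differ.
PLAN = {
    "extract_a": {"scope": "parse source A", "inputs": ["a.csv"],
                  "output": "a.json", "depends_on": []},
    "extract_b": {"scope": "parse source B", "inputs": ["b.csv"],
                  "output": "b.json", "depends_on": []},
    "reconcile": {"scope": "merge A and B", "inputs": ["a.json", "b.json"],
                  "output": "merged.json",
                  "depends_on": ["extract_a", "extract_b"]},
    "report": {"scope": "summarize merged data", "inputs": ["merged.json"],
               "output": "report.md", "depends_on": ["reconcile"]},
}


def execution_waves(plan: dict) -> list[list[str]]:
    """Group sub-tasks into waves; tasks within a wave have no mutual
    dependencies and can run in parallel."""
    graph = {name: spec["depends_on"] for name, spec in plan.items()}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For the plan above, `execution_waves(PLAN)` yields three waves: the two extractors, then `reconcile`, then `report`.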
SwarmBench enforces objective scoring:
Executable Verifier
- Deterministic checks (files, outputs, structure)
- Weighted scoring system
- Produces reward.txt
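A weighted, deterministic verifier of this kind can be sketched as follows. The checks, the weights, and the `report.json` output file are assumptions made for illustration; the source only states that checks are deterministic, weighted, and that the result is written to reward.txt.

```python
import json
from pathlib import Path


def _load_report(out: Path):
    """Return (ok, data) for the hypothetical report.json in out."""
    try:
        return True, json.loads((out / "report.json").read_text())
    except (OSError, ValueError):  # missing file, bad encoding, bad JSON
        return False, None


def report_exists(out: Path) -> bool:
    return (out / "report.json").is_file()


def report_is_valid_json(out: Path) -> bool:
    return _load_report(out)[0]


def report_has_total(out: Path) -> bool:
    ok, data = _load_report(out)
    return ok and isinstance(data, dict) and "total" in data


# (check, weight) pairs; the weights are illustrative.
CHECKS = [
    (report_exists, 0.2),
    (report_is_valid_json, 0.3),
    (report_has_total, 0.5),
]


def score(out: Path, reward_path: Path) -> float:
    """Run every check, normalize by total weight, write the reward file."""
    total_weight = sum(weight for _, weight in CHECKS)
    earned = sum(weight for check, weight in CHECKS if check(out))
    reward = round(earned / total_weight, 4)
    reward_path.write_text(f"{reward}\n")
    return reward
```

Normalizing by the total weight keeps the reward in the 0.0 to 1.0 range even if the weights are later changed and no longer sum to 1.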
LLM Judge (Optional)
- Evaluates open-ended outputs
- Uses rubric + evidence validation
- Assigns partial credit
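One way to combine a rubric, evidence validation, and partial credit is shown below. It makes no LLM call: it assumes the judge has already returned one verdict per criterion, each with a credit fraction and a quoted piece of evidence, and it only counts a verdict when that quote appears verbatim in the evaluated output. `Verdict` and `rubric_score` are hypothetical names, and this is a sketch of the idea rather than how judge.py works.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    criterion: str   # rubric criterion this verdict refers to
    credit: float    # fraction of the criterion met, 0.0 to 1.0
    evidence: str    # quote the judge claims supports the credit


def rubric_score(rubric: dict[str, float], verdicts: list[Verdict],
                 output: str) -> float:
    """Weighted partial credit over the rubric.

    A verdict is ignored if its criterion is unknown, already scored,
    or if its evidence is not a literal substring of the output.
    """
    total = sum(rubric.values())
    earned = 0.0
    seen: set[str] = set()
    for v in verdicts:
        if v.criterion not in rubric or v.criterion in seen:
            continue
        if not v.evidence or v.evidence not in output:
            continue  # unsupported claim earns nothing
        seen.add(v.criterion)
        earned += rubric[v.criterion] * min(max(v.credit, 0.0), 1.0)
    return round(earned / total, 4)
```

Requiring a literal quote is a deliberately strict choice: it keeps the judge from awarding credit for content the output does not contain, at the cost of rejecting paraphrased evidence.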
The oracle defines the ground truth solution path via solution/solve.sh.
It must:
- Produce the expected output
- Pass all verification checks
- Score 1.0
This guarantees:
- Task validity
- Verifier correctness
- End-to-end consistency
Each task runs in a containerized environment:
- Reproducible execution
- Controlled dependencies
- No hidden state
The environment is defined in environment/Dockerfile.
git clone https://github.com/Leli254/swarmbench
cd harbor
Ensure Docker is running.
uv run harbor run \
-p /path/to/task \
-a oracle \
-k 1 \
-n 1 \
--job-name "oracle" \
--jobs-dir /path/to/logs
Expected result: reward = 1.0
SwarmBench is built around tasks that naturally require:
- Decomposition
- Parallel execution
- Aggregation
- Consistency validation
Strong task patterns include:
- Multi-file code fixes
- Multi-source data extraction
- Cross-dataset reconciliation
- Constraint-heavy planning
Tasks that do not benefit from multi-agent systems:
- Single bug fixes
- Linear workflows
- One-step computations
- Small isolated problems
A well-designed task produces a measurable performance gap driven by structure, not trickery:
| System Type | Expected Score |
|---|---|
| Oracle | 1.0 |
| Multi-Agent | ~1.0 |
| Single-Agent | ≤ 0.60 |
Build new tasks by:
- Defining a complex, decomposable problem
- Creating clear sub-task boundaries
- Implementing a verifier with partial scoring
- Providing an oracle solution
- Validating end-to-end execution
Technology stack:
- Python 3.11+
- Docker
- Shell scripting
- Structured YAML / TOML configs
- Optional LLM-based evaluation
Use cases:
- Benchmarking multi-agent systems
- Evaluating orchestration strategies
- Stress-testing LLM coordination limits
- Research on distributed AI problem solving
SwarmBench is not about making tasks harder.
It is about making them structurally realistic: the kind of problems that naturally demand coordination, not just intelligence.
MIT License
Michael Leli
- GitHub: github.com/Leli254
- LinkedIn: linkedin.com/in/michael-leli
- Email: lelisoftware@gmail.com