Leli254/swarmbench

🚀 SwarmBench: Multi-Agent Orchestration & Benchmark Framework

SwarmBench is a structured framework for designing, executing, and evaluating multi-agent AI systems on complex, decomposable tasks.

Core hypothesis: Certain classes of problems are structurally better solved by coordinated agent swarms than by a single agent.

This repository implements a full pipeline for:

  • Task design
  • Agent decomposition
  • Parallel execution
  • Result synthesis
  • Deterministic verification

🧠 Core Concept

Traditional single-agent systems struggle with:

  • Large context windows
  • Multi-step coordination
  • Coverage across many artifacts
  • Consistency during synthesis

SwarmBench addresses this by enabling:

Task → Decomposition → Parallel Agents → Reducer → Verified Output

Where:

  • Independent subproblems are solved in parallel
  • Outputs are reconciled into a final result
  • A verifier enforces correctness and completeness
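
The flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`decompose`, `agent`, `reduce_results`, `verify`) are hypothetical and not part of SwarmBench, and the "agent" is a stand-in that does trivial work instead of calling a model.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(artifacts: list[str], n_shards: int) -> list[list[str]]:
    """Split the artifacts into round-robin shards (hypothetical strategy)."""
    return [artifacts[i::n_shards] for i in range(n_shards)]


def agent(shard: list[str]) -> dict[str, int]:
    """Stand-in for one agent: records the length of each artifact name."""
    return {name: len(name) for name in shard}


def reduce_results(partials: list[dict[str, int]]) -> dict[str, int]:
    """Reducer: merge the per-agent outputs into a single result."""
    merged: dict[str, int] = {}
    for partial in partials:
        merged.update(partial)
    return merged


def verify(artifacts: list[str], result: dict[str, int]) -> float:
    """Verifier: fraction of artifacts covered by the merged result."""
    if not artifacts:
        return 0.0
    return sum(1 for name in artifacts if name in result) / len(artifacts)


def run_pipeline(artifacts: list[str], n_shards: int = 3) -> tuple[dict[str, int], float]:
    """Task -> Decomposition -> Parallel Agents -> Reducer -> Verified Output."""
    shards = decompose(artifacts, n_shards)
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        partials = list(pool.map(agent, shards))
    result = reduce_results(partials)
    return result, verify(artifacts, result)
```

Calling `run_pipeline(["a.py", "bb.py", "ccc.py", "d.py"])` returns the merged result together with a completeness score. In a real task the verifier would check content, not just coverage.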

⚙️ System Architecture

.
├── instruction.md        # Task definition (agent-facing, role-neutral)
├── task.toml             # Metadata, constraints, execution config
├── decomposition.yaml    # Multi-agent orchestration plan
├── environment/
│   └── Dockerfile        # Runtime environment
├── tests/
│   ├── test.sh           # Entry point for verification
│   ├── verify.py         # Deterministic scoring logic
│   └── judge.py          # (Optional) LLM-based evaluation
└── solution/
    └── solve.sh          # Oracle solution (ground truth)
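
A quick way to check that a task directory matches this layout is a small script like the one below. It is a sketch, not a SwarmBench tool: `missing_files` and the `REQUIRED_FILES` list are hypothetical names, and `tests/judge.py` is left out of the required set because the layout marks it as optional.

```python
from pathlib import Path

# Paths taken from the layout above; tests/judge.py is optional and not checked.
REQUIRED_FILES = [
    "instruction.md",
    "task.toml",
    "decomposition.yaml",
    "environment/Dockerfile",
    "tests/test.sh",
    "tests/verify.py",
    "solution/solve.sh",
]


def missing_files(task_dir: str) -> list[str]:
    """Return the required files that are absent from task_dir."""
    root = Path(task_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
```

An empty list from `missing_files("/path/to/task")` means every required file is present.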

🔁 Execution Flow

SwarmBench runs through a controlled pipeline:

  1. Build container from environment/Dockerfile
  2. Load task from instruction.md
  3. Execute agent(s) inside container
  4. Run verification via tests/test.sh
  5. Compute reward score (0.0 to 1.0)
  6. Store logs and outputs

🧩 Multi-Agent Coordination

SwarmBench supports structured decomposition strategies:

Fan-out / Synthesize

  • Independent agents solve isolated subproblems
  • A reducer merges and validates outputs

Map / Reduce

  • Work is distributed across shards
  • Aggregation ensures global consistency

The strategy is defined in decomposition.yaml. Each sub-task specifies:

  • Scope of responsibility
  • Input artifacts
  • Expected output
  • Dependencies
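
To make the role of the dependency field concrete, here is a sketch of how sub-tasks could be grouped into waves that run in parallel. The `SubTask` fields mirror the four items listed above, but the field names and the `execution_waves` helper are assumptions for illustration. The actual schema of decomposition.yaml is not shown here.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class SubTask:
    name: str
    scope: str                      # scope of responsibility
    inputs: list[str]               # input artifacts
    output: str                     # expected output
    depends_on: list[str] = field(default_factory=list)


def execution_waves(subtasks: list[SubTask]) -> list[list[str]]:
    """Group sub-tasks into waves; tasks in one wave have no unmet dependencies."""
    sorter = TopologicalSorter({t.name: set(t.depends_on) for t in subtasks})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical fan-out / synthesize plan: two extractors feed one reducer.
plan = [
    SubTask("extract_a", "source A", ["a.csv"], "a.json"),
    SubTask("extract_b", "source B", ["b.csv"], "b.json"),
    SubTask("reduce", "merge", ["a.json", "b.json"], "report.json",
            depends_on=["extract_a", "extract_b"]),
]
```

For this plan, `execution_waves(plan)` yields the two extractors as the first wave and the reducer as the second, which matches the fan-out / synthesize shape.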

🧪 Verification Model

SwarmBench enforces objective scoring:

Executable Verifier

  • Deterministic checks (files, outputs, structure)
  • Weighted scoring system
  • Produces reward.txt
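
A weighted verifier along these lines might look like the sketch below. The check names, the weights, and the assumption that reward.txt holds a single number are illustrative. The real tests/verify.py may be organized differently.

```python
from pathlib import Path


def weighted_score(checks: list[tuple[str, float, bool]]) -> float:
    """Each check is (name, weight, passed); return earned weight / total weight."""
    total = sum(weight for _, weight, _ in checks)
    if total <= 0:
        raise ValueError("total weight must be positive")
    earned = sum(weight for _, weight, passed in checks if passed)
    return earned / total


def write_reward(score: float, path: str = "reward.txt") -> None:
    """Write the score as a single number (assumed reward.txt format)."""
    Path(path).write_text(f"{score:.4f}\n")


# Hypothetical checks: two pass, one fails.
example_checks = [
    ("output_file_exists", 0.5, True),
    ("schema_is_valid", 0.3, True),
    ("all_records_present", 0.2, False),
]
```

Because every check is a plain boolean with a fixed weight, the same submission always gets the same score, and a partially correct submission still earns partial credit.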

LLM Judge (Optional)

  • Evaluates open-ended outputs
  • Uses rubric + evidence validation
  • Assigns partial credit

🧬 Oracle System

The oracle defines the ground truth solution path via solution/solve.sh.

It must:

  • Produce the expected output
  • Pass all verification checks
  • Score 1.0

This guarantees:

  • Task validity
  • Verifier correctness
  • End-to-end consistency
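
The oracle requirement can be turned into an automated sanity check. The sketch below assumes the verifier leaves a reward.txt containing a single floating-point number. The helper names are hypothetical.

```python
from pathlib import Path


def read_reward(path: str) -> float:
    """Parse the reward file, assumed to contain one float."""
    return float(Path(path).read_text().strip())


def oracle_is_valid(path: str, tol: float = 1e-9) -> bool:
    """True when the oracle run scored 1.0 (within a small tolerance)."""
    return abs(read_reward(path) - 1.0) <= tol
```

If `oracle_is_valid` returns False for an oracle run, either the oracle solution or the verifier is wrong, and the task should not be used until that is resolved.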

🐳 Environment Isolation

Each task runs in a containerized environment:

  • Reproducible execution
  • Controlled dependencies
  • No hidden state

The environment is defined in environment/Dockerfile.


🚀 Running the System

1. Setup

git clone https://github.com/Leli254/swarmbench
cd swarmbench

Ensure Docker is running.

2. Run Oracle

uv run harbor run \
  -p /path/to/task \
  -a oracle \
  -k 1 \
  -n 1 \
  --job-name "oracle" \
  --jobs-dir /path/to/logs

Expected result: reward = 1.0


🎯 Design Principles

SwarmBench is built around tasks that naturally require:

  • Decomposition
  • Parallel execution
  • Aggregation
  • Consistency validation

Strong task patterns include:

  • Multi-file code fixes
  • Multi-source data extraction
  • Cross-dataset reconciliation
  • Constraint-heavy planning

⚠️ Anti-Patterns

Tasks that do not benefit from multi-agent systems:

  • Single bug fixes
  • Linear workflows
  • One-step computations
  • Small isolated problems

📊 Scoring Philosophy

A well-designed task produces a measurable performance gap driven by structure, not trickery:

System Type     Expected Score
Oracle          1.0
Multi-Agent     ~1.0
Single-Agent    ≤ 0.60

🧱 Extending SwarmBench

Build new tasks by:

  1. Defining a complex, decomposable problem
  2. Creating clear sub-task boundaries
  3. Implementing a verifier with partial scoring
  4. Providing an oracle solution
  5. Validating end-to-end execution

🛠 Tech Stack

  • Python 3.11+
  • Docker
  • Shell scripting
  • Structured YAML / TOML configs
  • Optional LLM-based evaluation

📌 Use Cases

  • Benchmarking multi-agent systems
  • Evaluating orchestration strategies
  • Stress-testing LLM coordination limits
  • Research on distributed AI problem solving

🧭 Philosophy

SwarmBench is not about making tasks harder.

It is about making them structurally realistic: the kind of problems that naturally demand coordination, not just intelligence.


📄 License

MIT License


👤 Author

Michael Leli

About

A benchmarking framework for designing and evaluating multi-agent AI systems. Implements structured task decomposition (Map-Reduce/Fan-out), containerized execution, and deterministic verification for complex AI orchestration
