@software{goyal2026two-ai-agents-one-macbook-zero-api-keys,
title = {Two AI Agents, One MacBook, Zero API Keys},
author = {Goyal, Naman},
year = {2026},
month = {Mar},
doi = {10.5281/zenodo.19002113},
url = {https://namangoyal.com/blog/2026/multi-agent-factory/}
}Blog post: Two AI Agents, One MacBook, Zero API Keys — interactive demo, scenario walkthroughs, and design lessons learned.
Two AI agents sit on your MacBook. One writes code, the other runs it and tells the first what broke. They go back and forth until the code works — or until a hard limit stops them.
The whole thing runs locally on Apple Silicon using a single MLX inference server. No API keys, no cloud, no data leaves your machine. I built it to see whether two small local models could actually collaborate the way cloud-based agents do with GPT-4. The short answer is yes — with some interesting caveats about what 7B models can and can't self-correct.
Here's what it looks like in practice:
| Scenario | What it does | Attempts | Time |
|---|---|---|---|
| Calendar generation | Format March 2026 without the calendar module |
1 | 22s |
| CSV analytics pipeline | Generate 200 rows, compute revenue, format as $12,345.67 |
3 | 159s |
| Neural net from scratch | Train a 2-layer MLP with manual backprop (numpy only) | 2 | 50s |
Total wall time: 3 min 50 sec. Total cost: $0.00.
You need macOS with Apple Silicon, Python 3.12+, and uv.
git clone https://github.com/thenamangoyal/multi-agent-mlx.git
cd multi-agent-mlx
uv sync
# Start the local MLX server (downloads ~4 GB model on first run)
uv run factory server start
# In another terminal, give it a task
uv run factory run "Write a Python script that prints the first 20 Fibonacci numbers"Or skip the server management — factory run auto-starts and auto-stops it:
uv run factory run "Write a script that trains a simple MLP on MNIST using MLX"Stop the server when you're done (~4 GB RAM):
uv run factory server stopsrc/factory/
├── cli.py # Typer CLI (factory run, factory server start/stop)
├── config.py # All defaults in one dataclass
├── models.py # Task, ExecutionResult, RunResult
├── orchestrator.py # The Coder → execute → Sheriff loop
├── server.py # MLX server lifecycle
├── agents/
│ ├── base.py # Agno + OpenAILike pointed at localhost
│ ├── coder.py # "Write code for this task"
│ └── sheriff.py # "Here's what happened when I ran it"
└── tools/
├── file_io.py # Sandboxed file read/write
└── executor.py # subprocess.run with timeout + output capture
The important files if you want to change behavior:
orchestrator.py— the main loop. This is where the Coder and Sheriff take turns, where stagnation detection lives, and where the retry logic happens. Start here.agents/coder.pyandagents/sheriff.py— the system prompts that define each agent's personality. Change these to tweak how they write or review code.config.py—max_attempts,execution_timeout,stagnation_threshold, token budgets. All the knobs.tools/executor.py— how scripts get executed in a sandboxed subprocess.
Run the tests with:
uv run pytest -vBoth agents share a single mlx_lm.server instance. They're just two different system prompts hitting the same local OpenAI-compatible endpoint. A Python orchestrator — not an LLM — drives the coordination.
This is the key design constraint for 16 GB machines. Running two separate 7B models would blow past memory. One model loaded once, two identities, sequential turns.
User Task Description
|
v
+-----------+ +-----------+
| Coder | writes code | Sheriff |
| (Agent A)| -------------> | (Agent B)|
| | | |
| System: | error report / | System: |
| "Write | <------------- | "Run & |
| code" | fix requests | test" |
+-----------+ +-----------+
| |
+------ same mlx_lm.server ------+
localhost:8080
(one model, two personas)
Why the orchestrator executes code directly: Small quantized models (4-bit 7B) are unreliable at tool calling — they often dump code in markdown instead of invoking tools. So the orchestrator handles the mechanical part (extraction, execution), and the LLMs focus on what they're actually good at: generating code and analyzing errors.
Each iteration of the loop:
- Coder turn — receives the task (plus the Sheriff's error report on retries) and writes a Python script.
- Execution — the orchestrator (not an LLM) runs the script in a sandboxed subprocess with a timeout, capturing stdout, stderr, and the exit code.
- Sheriff turn — reads the script and execution output, returns a PASS/FAIL verdict with analysis and a suggested fix.
If the Sheriff says PASS, the loop ends. If FAIL, the error report goes back to step 1.
Three safety layers prevent runaway loops:
- Hard limits:
max_attempts=5,execution_timeout=60s,llm_timeout=120s - Stagnation detection: Hashes the last 5 lines of each error. If the same error appears 3 times in a row, the system knows the Coder is stuck and stops.
- Token budget: Hard cap (default 50K) on total tokens across all turns.
| Component | RAM |
|---|---|
| macOS + system | ~3–4 GB |
| Qwen2.5-Coder-7B-4bit | ~4 GB |
| KV cache (2048 prompt) | ~1–2 GB |
| Python + sandbox | ~0.5 GB |
| Total | ~9–10 GB |
That leaves 6–7 GB of headroom on a 16 GB machine.
Override defaults with CLI flags or a factory.yaml in the project root:
server:
model: mlx-community/Qwen2.5-Coder-7B-Instruct-4bit
host: 127.0.0.1
port: 8080
agent:
max_attempts: 5
execution_timeout: 60
llm_timeout: 120
max_tokens_per_turn: 8192
total_token_budget: 100000
temperature: 0.0
stagnation_threshold: 3factory run [DESCRIPTION]
--task, -t PATH Task YAML file
--model, -m TEXT Override model (HuggingFace ID)
--max-attempts, -n INT Max Coder↔Sheriff cycles
--timeout INT Execution timeout per script (seconds)
--no-server Skip auto-starting server
factory server start Start MLX server
factory server status Check if running
factory server stop Stop the server
name: mnist-mlx
description: |
Write a complete Python script that trains a 2-layer MLP on MNIST using MLX.
Print training loss each epoch and final test accuracy.
constraints:
- Use mlx and mlx.nn for the model
- Must complete training in under 120 seconds
max_attempts: 5
timeout: 180Any mlx_lm.server-compatible model works. These are pre-trained checkpoints from HuggingFace — no fine-tuning needed.
| Model | RAM | Notes |
|---|---|---|
| Qwen2.5-Coder-7B-Instruct-4bit | ~4 GB | Default, best for 16 GB |
| Qwen2.5-Coder-1.5B-4bit | ~1 GB | Fast but less capable |
| Qwen2.5-Coder-14B-Instruct-4bit | ~8 GB | Better quality, needs 32 GB+ |
| Qwen2.5-Coder-32B-8bit | ~18 GB | Best quality, needs 64 GB+ |
uv run factory run --model mlx-community/Qwen2.5-Coder-1.5B-4bit "Write hello world"MLX and mlx-lm for inference, Agno for agent orchestration, Typer for the CLI, Rich for terminal output, Pydantic for validation, uv for package management.
MIT