Deploy a functional local AI agent with maximum capabilities and minimum friction. Auto-profiles hardware, selects optimal models, installs infrastructure, generates agent scripts, and verifies everything works.
Time to working agent: ~15 minutes (Level 1 / SmolAgents) / ~25 minutes (Level 2 / CrewAI)
v2 is a ground-up redesign based on a retrospective of 28 errors across 6 categories from v1 deployments. The most critical new category was multi-agent cross-contamination — agents clobbering each other's configs, ports, and shared infrastructure.
- Phase 0 DISCOVER — scans the environment before touching anything. Detects existing agents, running services, port conflicts, and WSL2 issues. Never existed in v1.
- Guest mode — deploying alongside an existing agent automatically activates strict isolation: own directory, own ports, additive-only shared configs, no mutation of existing infrastructure.
- Crash-resumable state — every phase writes an append-only `events` array. Resume mid-deployment from where it left off.
- Fixed model string normalization — v1 silently used wrong model string formats. v2 enforces: `ollama_chat/` for SmolAgents, `ollama/` for CrewAI, `ChatOllama(model=...)` for LangGraph.
- Fixed CrewAI tool format — v1 generated plain functions that CrewAI silently ignored. v2 uses `@tool` decorated functions (`from crewai.tools import tool`) for CrewAI >= 1.14.
- Cross-contamination audit in VERIFY — dedicated Phase 4 check: file boundary scan, existing agent health recheck, shared config integrity, port conflict recheck.
- KV cache tiers — q8_0 default (~2x context), q4_0 standard with sanity check, tq3_0 experimental with mandatory 5-test quality gate and auto-revert.
- Background polling — long installs run as background processes polled by Claude. No more script handoff to the user.
- Persistent registries — `~/qad-agents/.registry.toml` and `~/qad-agents/.ports.json` track all deployed agents across sessions.
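As an illustration of how a port registry can prevent collisions between agents, here is a minimal sketch. The schema (a flat `{"agent name": port}` mapping), the function names, and the 8100-8199 range are assumptions made for this example; the actual layout of `.ports.json` is not specified here.

```python
import json
import socket
from pathlib import Path


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.2)
        return sock.connect_ex((host, port)) != 0


def reserve_port(registry_path, agent, start=8100, end=8200, is_free=port_is_free):
    """Reserve the first port in [start, end) that is neither registered nor in use."""
    # Assumed schema: {"<agent name>": <port>, ...}
    registry = json.loads(registry_path.read_text()) if registry_path.exists() else {}
    if agent in registry:
        return registry[agent]  # idempotent: a re-run keeps the same port
    taken = set(registry.values())
    for port in range(start, end):
        if port not in taken and is_free(port):
            registry[agent] = port
            registry_path.parent.mkdir(parents=True, exist_ok=True)
            registry_path.write_text(json.dumps(registry, indent=2))
            return port
    raise RuntimeError(f"no free port in {start}-{end - 1}")
```

Checking both the registry and the live socket matters: the registry catches agents that are deployed but stopped, and the socket probe catches services that were never registered.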
Phase 0 DISCOVER -> Phase 1 DETECT -> Phase 2 INSTALL
|
Phase 5 CONNECT <- Phase 4 VERIFY <- Phase 3 SCAFFOLD
Standalone ops (any time, non-destructive):
Model Swap | Mode Swap (Primary <-> Guest) | KV Cache Swap
| Phase | What happens |
|---|---|
| DISCOVER | Registry fast path, environment fingerprinting, existing agent detection, WSL2 pre-flight, mode decision |
| DETECT | GPU/RAM probe, user profiling, model selection (Executor + Reasoner), KV cache recommendation |
| INSTALL | Runtime setup, model pulls with background polling, llama-swap config, quality gates |
| SCAFFOLD | agent-config.json, agent.py or crew.py + YAML, tool implementations, memory hub, launch/stop scripts |
| VERIFY | Health checks, warm-up + benchmark, cross-contamination audit, verify report |
| CONNECT | Extensions menu, orchestrator hookup (OpenAI-compatible + Zora SSE), OpenWebUI pipe, DEPLOYMENT.md |
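The file boundary scan in VERIFY can be pictured as a containment check over every path the deployment wrote. The sketch below is one possible shape, not the skill's actual implementation; `files_outside_boundary` and its arguments are hypothetical names.

```python
from pathlib import Path


def files_outside_boundary(written, agent_dir):
    """Return the written paths that resolve outside agent_dir."""
    root = Path(agent_dir).expanduser().resolve()
    offenders = []
    for raw in written:
        # resolve() collapses ".." segments and symlinks before comparing
        path = Path(raw).expanduser().resolve()
        if path != root and root not in path.parents:
            offenders.append(raw)
    return offenders
```

An empty result means every write stayed inside the agent's own directory. Comparing resolved paths rather than string prefixes avoids false passes such as `agent-a/../agent-b/config.json`.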
| Mode | When | Isolation |
|---|---|---|
| Primary | First agent on this machine | Full ownership of shared services |
| Guest | Existing agent(s) already present | Own dir, own ports, read-only shared services, additive-only configs |
| Connected | Joining an existing orchestrator | Registered as a service, full port map |
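The mode decision in the table above reduces to a small function. This is a sketch under one assumption that the table does not state: joining an orchestrator takes precedence over Guest when both conditions hold. The names `Mode` and `decide_mode` are illustrative.

```python
from enum import Enum


class Mode(Enum):
    PRIMARY = "primary"
    GUEST = "guest"
    CONNECTED = "connected"


def decide_mode(existing_agents, joining_orchestrator=False):
    """Pick a deployment mode from what DISCOVER found."""
    if joining_orchestrator:
        return Mode.CONNECTED  # assumed precedence, see lead-in
    if existing_agents:
        return Mode.GUEST  # safe default: never take over shared services
    return Mode.PRIMARY
```

For example, `decide_mode(["some-existing-agent"])` returns `Mode.GUEST`, and `decide_mode([])` returns `Mode.PRIMARY`.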
| Layer | Component | Purpose |
|---|---|---|
| Inference (standard) | Ollama | Models that fit fully in memory |
| Inference (stretch) | llama-server | Precise tensor splitting; required for MoE models (--cpu-moe) |
| Model routing | llama-swap | On-demand Reasoner loading without keeping it hot |
| Interface | OpenWebUI | Browser chat UI, auto-configured pipe file |
| Agent L1 | SmolAgents | Single-agent, code-generation, fast iteration |
| Agent L2 | CrewAI >= 1.14 | Multi-agent with role-based delegation |
| Cloud fallback | Gemini 2.5 Flash | Optional overflow for reasoning-heavy tasks |
| Web search | SearXNG | Local, privacy-preserving search (Docker) |
| VRAM Tier | Executor | Reasoner | Level |
|---|---|---|---|
| <=4GB | SmolLM2-1.7B, Qwen3-0.6B | Gemma3-1B | L1 |
| 6-8GB | Qwen3-8B, Gemma3-4B | Qwen3-8B | L1 |
| 10-12GB | Qwen3-8B (q4_K_M) | Gemma3-12B, Qwen3-14B | L1-L2 |
| 16-20GB | Qwen3-14B | Gemma4-27B, DeepSeek-R1-32B | L2 |
| 24GB+ | Qwen3-14B, Phi-4 Mini | Gemma4-27B, Qwen3-32B | L2 |
| 48GB+ / unified | Gemma4-27B | DeepSeek-R1-70B | L2 |
MoE models (Qwen3-30B-A3B, DeepSeek-R1-671B) require --cpu-moe via llama-server.
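A tier lookup over the table above might look like the following. The table leaves gaps between tiers (5 GB, 9 GB, 14 GB, and so on); this sketch assumes such values round down to the nearest lower tier, which is a conservative guess rather than documented behavior.

```python
from bisect import bisect_right

# (minimum VRAM in GB, tier label, agent level), copied from the table above.
TIERS = [
    (0, "<=4GB", "L1"),
    (6, "6-8GB", "L1"),
    (10, "10-12GB", "L1-L2"),
    (16, "16-20GB", "L2"),
    (24, "24GB+", "L2"),
    (48, "48GB+ / unified", "L2"),
]


def tier_for(vram_gb):
    """Return (tier label, level) for the largest tier whose floor fits vram_gb."""
    floors = [floor for floor, _, _ in TIERS]
    idx = max(bisect_right(floors, vram_gb) - 1, 0)
    _, label, level = TIERS[idx]
    return label, level
```

Rounding down means a 14 GB card is treated as a 10-12GB machine, so the selected models always fit.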
Quick-Agent-Deployer/
+-- README.md
+-- CHANGELOG.md
+-- LICENSE
+-- docs/
|   +-- audits/
|       +-- v1.0-initial-audit.md
|       +-- v1.1-post-remediation.md
|       +-- v1.2-audit-prompt.md
+-- .claude/skills/
    +-- quick-agent-deployer/
        +-- SKILL.md                          Coordinator (~250 lines)
        +-- INSTALL.md                        Setup guide
        +-- references/
            +-- phase-discover.md             Phase 0: Environment scan
            +-- phase-detect.md               Phase 1: Hardware + model selection
            +-- phase-install.md              Phase 2: Runtime + quality gates
            +-- phase-scaffold.md             Phase 3: Agent code generation
            +-- phase-verify.md               Phase 4: Health + contamination audit
            +-- phase-connect.md              Phase 5: Extensions + orchestration
            +-- standalone-ops.md             Model swap, mode swap, KV cache swap
            +-- hardware-profiles.md          VRAM tiers -> model decision matrix
            +-- optimization-playbook.md      llama-server flags, KV cache, edge cases
            +-- agent-scaffolds.md            Complete code templates (all frameworks)
            +-- environment-signatures.md     Agent fingerprints for DISCOVER
- Open a Cowork session with this project folder selected
- Say "Deploy a local agent" — DISCOVER runs first, then routes based on what's found
- Follow the phase-by-phase walkthrough
- Resume any time — state is on disk in JSON event logs
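To make the resume behavior concrete, here is a minimal sketch of an append-only event log and the lookup that finds where to pick up. The file layout (`{"events": [...]}` with `phase` and `status` fields) and the status values are assumptions for illustration; only the phase names come from this README.

```python
import json
from pathlib import Path

PHASES = ["DISCOVER", "DETECT", "INSTALL", "SCAFFOLD", "VERIFY", "CONNECT"]


def record(state_path, phase, status):
    """Append one event; earlier events are never rewritten."""
    state = json.loads(state_path.read_text()) if state_path.exists() else {"events": []}
    state["events"].append({"phase": phase, "status": status})
    state_path.write_text(json.dumps(state, indent=2))


def next_phase(state_path):
    """Return the first phase without a 'completed' event, or None if all are done."""
    if not state_path.exists():
        return PHASES[0]
    events = json.loads(state_path.read_text())["events"]
    done = {e["phase"] for e in events if e["status"] == "completed"}
    for phase in PHASES:
        if phase not in done:
            return phase
    return None
```

A phase that logged `started` but never `completed` is returned again by `next_phase`, which is what lets a crashed INSTALL rerun instead of being skipped.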
deploy agent - local AI - run models locally - set up Ollama - agent deployment - quick agent - deploy a local agent - local LLM setup - SmolAgents - CrewAI setup
- "Swap my model" or "switch to a faster model" -> model swap procedure
- "Switch to guest mode" -> mode swap
- "Enable q4_0 cache" or "maximize context" -> KV cache swap
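For the Ollama path, a KV cache swap comes down to two environment variables on the server process: `OLLAMA_KV_CACHE_TYPE` and `OLLAMA_FLASH_ATTENTION` (Ollama requires flash attention for a quantized cache). The helper below is a hypothetical sketch that builds that environment; it deliberately rejects `tq3_0`, since the experimental tier and its quality gate are outside what this example covers.

```python
import os

# Values accepted by OLLAMA_KV_CACHE_TYPE; f16 is Ollama's unquantized default.
SUPPORTED = {"f16", "q8_0", "q4_0"}


def ollama_env_for(cache_type, base_env=None):
    """Return a copy of the environment configured for the given KV cache type."""
    if cache_type not in SUPPORTED:
        raise ValueError(f"unsupported KV cache type for this sketch: {cache_type!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_FLASH_ATTENTION"] = "1"  # quantized KV cache needs flash attention
    env["OLLAMA_KV_CACHE_TYPE"] = cache_type
    return env
```

The returned mapping can be passed as `env=` to `subprocess.Popen(["ollama", "serve"], ...)`. The server has to be restarted for the change to take effect.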
Discover before deploying. v1 jumped straight to hardware detection. v2 checks for existing agents first — cross-contamination was the most common real-world failure.
Guest by default when agents exist. If any existing agent is detected, Guest mode activates automatically. The user can promote to Primary, but the safe default protects existing work.
Additive only in Guest mode. Shared configs (llama-swap, systemd, port maps) are never overwritten — only appended.
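One way to enforce the additive-only rule is to make the merge itself refuse overwrites. This sketch assumes the shared config has already been parsed into a flat mapping (for example, the per-model entries of a llama-swap config); `merge_additive` is an illustrative name.

```python
def merge_additive(shared, additions):
    """Return a copy of shared with new keys added; refuse to change existing ones."""
    conflicts = sorted(k for k in additions if k in shared and shared[k] != additions[k])
    if conflicts:
        raise ValueError(f"refusing to overwrite existing keys: {conflicts}")
    merged = dict(shared)  # the caller's mapping is left untouched
    merged.update(additions)
    return merged
```

Raising instead of silently skipping a conflicting key surfaces the collision during deployment, before an existing agent starts misbehaving.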
Model strings are per-framework. SmolAgents requires ollama_chat/, CrewAI requires ollama/. These look similar but produce silent failures when wrong.
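A normalizer along these lines keeps the prefixes from drifting. It is a sketch, not the generated code: the function name is hypothetical, and the LangGraph branch assumes the bare tag is what gets passed to `ChatOllama(model=...)`.

```python
PREFIXES = {"smolagents": "ollama_chat/", "crewai": "ollama/"}


def model_string(framework, model):
    """Return the model identifier in the form the given framework expects."""
    bare = model
    for prefix in ("ollama_chat/", "ollama/"):
        if bare.startswith(prefix):
            bare = bare[len(prefix):]  # strip whichever prefix was supplied
            break
    key = framework.lower()
    if key == "langgraph":
        return bare  # assumed to be passed as ChatOllama(model=bare)
    if key not in PREFIXES:
        raise ValueError(f"unknown framework: {framework!r}")
    return PREFIXES[key] + bare
```

With this, `model_string("smolagents", "qwen3:8b")` gives `"ollama_chat/qwen3:8b"`, and passing that result to `model_string("crewai", ...)` gives `"ollama/qwen3:8b"`.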
Background polling. A `start_process` call followed by a polling loop replaces the v1 pattern of handing the user a script to run; Claude handles the wait.
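The pattern can be sketched with the standard library, using `subprocess.Popen` as a stand-in for `start_process` (which this README names but does not define). The function name, poll interval, and timeout below are illustrative choices.

```python
import subprocess
import sys
import time


def run_in_background(cmd, poll_interval=2.0, timeout=1800.0):
    """Start cmd without blocking, then poll until it exits; return its exit code."""
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    deadline = time.monotonic() + timeout
    while True:
        code = proc.poll()  # None while the process is still running
        if code is not None:
            return code
        if time.monotonic() >= deadline:
            proc.kill()
            proc.wait()
            raise TimeoutError(f"{cmd[0]} did not finish within {timeout}s")
        time.sleep(poll_interval)


if __name__ == "__main__":
    # A short Python sleep stands in for a long-running install step.
    print(run_in_background([sys.executable, "-c", "import time; time.sleep(1)"], poll_interval=0.2))
```

Between polls the caller is free to report progress or write a state event, which is the practical advantage over a single blocking call.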
MIT