A local-first AI routing gateway for macOS. One endpoint, seven backends, automatic intent detection. Queries arrive at a single FastAPI server on port 34750 and get dispatched to whichever local AI engine is best suited for the task -- coding goes to the code model, reasoning goes to the reasoning model, images go to the image generator. No manual model selection required.
Written by Jordan Koch (kochj23).
```
+-------------------------------+
|    Your App / curl / Nova     |
+---------------+---------------+
                |
            HTTP POST
                |
                v
+---------------------------------------------+
|        Nova-NextGen Gateway  :34750         |
|                                             |
|  Intent Router  -->  Fallback Cascade       |
|  Context Bus    -->  SQLite (aiosqlite)     |
|  Consensus      -->  Cosine Similarity      |
|  Analytics      -->  Query Log              |
+----+------+------+------+------+------+-----+
     |      |      |      |      |      |
     v      v      v      v      v      v
+---------+ +--------+ +--------+ +--------+ +--------+ +--------+
| TinyChat| | MLXCode| | MLX    | | OpenWeb| | Ollama | | SwarmUI|
|  :8000  | | :37422 | | Chat   | | UI     | | :11434 | | :7801  |
| gpt-oss | |  ANE   | | :5000  | | RAG    | |deepseek| |Jugger- |
|  :20b   | | Swift  | | Qwen   | | :3000  | | r1:8b  | |naut XL |
+---------+ +--------+ |2.5-7B  | +--------+ +--------+ +---+----+
                       +--------+                           |
                                                       (fallback)
                                                            |
                                                            v
                                                       +--------+
                                                       |ComfyUI |
                                                       | :8188  |
                                                       |workflow|
                                                       +--------+
```
The gateway inspects each incoming prompt, classifies the task type (coding, reasoning, image generation, etc.), selects the best available backend, and returns the result. If the preferred backend is down, the router cascades through configured fallbacks automatically.
- Automatic intent routing -- keyword analysis maps prompts to the right backend without manual `task_type` selection
- Seven backend integrations -- TinyChat, MLXCode, MLX Chat, OpenWebUI, Ollama, SwarmUI, ComfyUI
- Fallback cascading -- if the primary backend is unreachable, the router tries the next best option
- Cross-model consensus validation -- run a prompt through multiple backends and compare outputs using cosine similarity scoring
- Shared context bus -- SQLite-backed key/value store with TTL, injected into prompts automatically
- Session tracking and analytics -- every query is logged with backend, model, latency, and fallback status
- Drop-in Swift client -- `AIService.swift` gives any Xcode project async/await access to the gateway
- LaunchAgent integration -- installs as a macOS service that starts on login and auto-restarts on crash
- Zero external dependencies at runtime -- all backends are local; no cloud calls from the gateway itself
- Loopback-only by default -- binds to 127.0.0.1; unreachable from other machines unless explicitly configured
| Backend | Port | Model | Strength | Fallback |
|---|---|---|---|---|
| TinyChat | 8000 | gpt-oss:20b | Quick responses, classification, low latency | Ollama |
| MLXCode | 37422 | mlx-local (custom) | Swift, coding, debugging on Apple Neural Engine | MLX Chat, then Ollama |
| MLX Chat | 5000 | Qwen2.5-7B-Instruct-4bit | Fast general text on Apple Silicon | Ollama |
| OpenWebUI | 3000 | qwen3-vl:4b | RAG document retrieval, vision, multimodal | Ollama |
| Ollama | 11434 | deepseek-r1:8b | Reasoning, analysis, generalist default | -- |
| SwarmUI | 7801 | Juggernaut XL | Stable Diffusion SDXL image generation | ComfyUI |
| ComfyUI | 8188 | (workflow-defined) | Node-based image pipelines | -- |
Ollama also serves qwen3-coder:30b (coding fallback), qwen3-vl:4b (vision fallback), gpt-oss:20b (quick fallback), and deepseek-v3.1:671b-cloud (long context).
When task_type is "auto" (the default), the gateway scans the prompt for keywords and resolves a task type. You can also set task_type explicitly.
| Task Type | Primary Backend | Model | Fallback |
|---|---|---|---|
| `quick` | TinyChat | gpt-oss:20b | Ollama |
| `coding` | MLXCode | mlx-local | MLX Chat, then Ollama qwen3-coder:30b |
| `swift` | MLXCode | mlx-local | Ollama qwen3-coder:30b |
| `general` | Ollama | deepseek-r1:8b | -- |
| `summarize` | Ollama | deepseek-r1:8b | -- |
| `creative` | Ollama | deepseek-r1:8b | -- |
| `reasoning` | Ollama | deepseek-r1:8b | -- |
| `analysis` | Ollama | deepseek-r1:8b | -- |
| `vision` | OpenWebUI | qwen3-vl:4b | Ollama qwen3-vl:4b |
| `document` | OpenWebUI | qwen3-vl:4b | Ollama deepseek-r1:8b |
| `research` | OpenWebUI | qwen3-vl:4b | Ollama deepseek-r1:8b |
| `image` | SwarmUI | Juggernaut XL | ComfyUI |
| `long_context` | Ollama | deepseek-v3.1:671b-cloud | -- |
| Keywords | Resolved Task |
|---|---|
| "yes or no", "classify", "tag this", "one word answer" | `quick` |
| "swift", "swiftui", "xcode", "uikit", "homekit" | `swift` |
| "write code", "debug", "function", "algorithm", ".py" | `coding` |
| "in this document", "based on the file", "this pdf" | `document` |
| "research", "find information about", "background on" | `research` |
| "why does", "explain why", "step by step", "tradeoffs" | `reasoning` |
| "analyze", "root cause", "patterns", "diagnosis" | `analysis` |
| "generate image", "draw me", "render a", "stable diffusion" | `image` |
| "what is in this image", "describe this photo" | `vision` |
| "summarize", "tldr", "key points", "main takeaways" | `summarize` |
| "write a story", "poem", "brainstorm", "creative writing" | `creative` |
| "entire codebase", "full document", "complete transcript" | `long_context` |
| (no match) | `general` |
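The first-match-wins scan can be sketched in a few lines. The rule list below is an abbreviated, hypothetical subset of the table above, and `detect_task` is an illustrative name, not the gateway's actual source:

```python
# Ordered keyword rules: earlier rules win. Abbreviated subset for illustration.
KEYWORD_RULES = [
    ("swift", ["swift", "swiftui", "xcode", "uikit", "homekit"]),
    ("coding", ["write code", "debug", "function", "algorithm", ".py"]),
    ("image", ["generate image", "draw me", "render a", "stable diffusion"]),
    ("summarize", ["summarize", "tldr", "key points", "main takeaways"]),
]

def detect_task(prompt: str) -> str:
    """Resolve a task type by scanning the prompt; first matching rule wins."""
    text = prompt.lower()
    for task, keywords in KEYWORD_RULES:
        if any(kw in text for kw in keywords):
            return task
    return "general"  # default when nothing matches
```

Because the scan is ordered, a prompt like "debug this SwiftUI view" resolves to `swift`, not `coding`.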
- macOS (Apple Silicon recommended; Intel compatible)
- Python 3.12 (not 3.13+; pydantic-core Rust bindings require 3.12)
- At least one backend running (Ollama is the simplest to start)
Install Python 3.12 if needed:
```bash
brew install python@3.12
```

```bash
git clone https://github.com/kochj23/Nova-NextGen.git
cd Nova-NextGen
bash install.sh
```

The installer:
- Creates a Python 3.12 virtual environment at `~/.nova_gateway/venv`
- Installs all pinned dependencies from `requirements.txt`
- Generates a LaunchAgent plist (`com.nova.gateway.plist`)
- Copies the plist to `~/Library/LaunchAgents/` and loads it
The gateway starts immediately and auto-starts on every login.
Verify it is running:
```bash
curl http://localhost:34750/health
```

```json
{"status": "ok", "uptime_seconds": 5, "version": "2.0.0"}
```

```bash
curl -X POST http://localhost:34750/api/ai/query \
  -H "Content-Type: application/json" \
  -d '{"query": "Write a Swift function that debounces a Combine publisher"}'
```

The gateway detects "swift" and "Combine" in the prompt, routes to MLXCode (or Ollama qwen3-coder:30b as fallback), and returns:
```json
{
  "response": "import Combine\n\nextension Publisher { ... }",
  "backend_used": "mlxcode",
  "model_used": "mlx-local",
  "task_type": "swift",
  "tokens_per_second": 42.1,
  "fallback_used": false
}
```

```bash
# Quick classification
curl -X POST http://localhost:34750/api/ai/query \
  -d '{"query": "Is this a valid IPv4: 192.168.1.999", "task_type": "quick"}'

# Deep reasoning
curl -X POST http://localhost:34750/api/ai/query \
  -d '{"query": "Analyze the tradeoffs of actor isolation vs GCD", "task_type": "reasoning"}'

# Document-grounded RAG query
curl -X POST http://localhost:34750/api/ai/query \
  -d '{"query": "What does section 4.2 specify?", "task_type": "document"}'

# Image generation
curl -X POST http://localhost:34750/api/ai/query \
  -d '{"query": "A fox in snow at dusk, cinematic lighting", "task_type": "image"}'
```

```bash
curl -X POST http://localhost:34750/api/ai/query \
  -d '{"query": "Explain monads", "preferred_backend": "ollama", "model": "deepseek-r1:8b"}'
```

Primary endpoint. Routes a prompt to the best available backend.
Request body:
| Field | Type | Default | Description |
|---|---|---|---|
| `query` | string | required | Prompt text, 1--100,000 characters |
| `task_type` | string | `"auto"` | One of: quick, coding, swift, general, creative, summarize, document, research, reasoning, analysis, vision, image, long_context, auto |
| `preferred_backend` | string | null | Force a backend: tinychat, mlxcode, mlxchat, openwebui, ollama, swarmui, comfyui |
| `model` | string | null | Override the model name (backend-specific) |
| `session_id` | string | null | Session identifier for context tracking |
| `context_keys` | string[] | `[]` | Keys to inject from the shared context bus |
| `validate_with` | int | null | 2 or 3: run consensus validation across that many backends |
| `stream` | bool | `false` | Enable streaming (Ollama only) |
| `options` | object | `{}` | Backend-specific options: temperature, max_tokens, system, negative_prompt, width, height, steps, cfg_scale |
Response body:
| Field | Type | Description |
|---|---|---|
| `response` | string | Model output text |
| `backend_used` | string | Which backend handled the request |
| `model_used` | string | Specific model within that backend |
| `task_type` | string | Resolved task type (useful when `"auto"` was sent) |
| `session_id` | string | Session identifier |
| `tokens_per_second` | float | Generation speed (when available) |
| `token_count` | int | Output token count (when available) |
| `validated` | bool | Whether consensus validation ran |
| `consensus_score` | float | Agreement score 0.0--1.0 (if validated) |
| `fallback_used` | bool | Whether routing fell back to a secondary backend |
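For non-Swift callers, a minimal Python client needs only the standard library. The helper names here (`build_query`, `ask`) are illustrative, not part of the gateway's codebase:

```python
import json
import urllib.request

GATEWAY = "http://localhost:34750"  # default loopback port

def build_query(prompt, task_type="auto", **options):
    """Build the JSON body for POST /api/ai/query."""
    body = {"query": prompt, "task_type": task_type}
    if options:
        body["options"] = options  # e.g. temperature, max_tokens
    return body

def ask(prompt, task_type="auto"):
    """Send a prompt to the gateway and return the parsed response dict."""
    data = json.dumps(build_query(prompt, task_type)).encode()
    req = urllib.request.Request(
        f"{GATEWAY}/api/ai/query",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)
```

`ask("Explain monads")["backend_used"]` then tells you which backend actually served the request.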
Full gateway snapshot: uptime, version, all backend health with latency, session count, total queries.
```json
{
  "status": "running",
  "version": "2.0.0",
  "port": 34750,
  "uptime_seconds": 3842,
  "backends": [
    {"name": "tinychat", "available": true, "url": "http://localhost:8000", "latency_ms": 2.1},
    {"name": "mlxcode", "available": true, "url": "http://localhost:37422", "latency_ms": 4.5},
    {"name": "mlxchat", "available": true, "url": "http://localhost:5000", "latency_ms": 3.2},
    {"name": "openwebui", "available": true, "url": "http://localhost:3000", "latency_ms": 8.4},
    {"name": "ollama", "available": true, "url": "http://localhost:11434", "latency_ms": 9.1},
    {"name": "swarmui", "available": false, "url": "http://localhost:7801", "latency_ms": null},
    {"name": "comfyui", "available": true, "url": "http://localhost:8188", "latency_ms": 5.3}
  ],
  "active_sessions": 2,
  "total_queries": 341
}
```

Returns the backend availability array only (same format as the `backends` field in `/api/ai/status`).
Force cross-model consensus. Same request body as `/api/ai/query`. The prompt is sent to multiple backends and responses are compared via cosine similarity on word-frequency vectors.

```json
{
  "consensus": true,
  "score": 0.83,
  "responses": ["Response from backend 1...", "Response from backend 2..."],
  "backends_used": ["ollama", "mlxchat"],
  "recommended": "The longer, more detailed response..."
}
```

The `consensus_threshold` (default 0.7) is configurable in `config.yaml`. If the score is below the threshold, the result still returns but `consensus` is `false`.
Lightweight liveness probe. Returns {"status": "ok"} with uptime and version.
The context bus is a SQLite-backed key/value store scoped by session. You can write context entries, then inject them into future queries automatically.
```bash
curl -X POST http://localhost:34750/api/context/write \
  -H "Content-Type: application/json" \
  -d '{"session_id": "s1", "key": "project_goal", "value": "Build a HomeKit dashboard", "ttl_seconds": 3600}'
```

```bash
curl "http://localhost:34750/api/context/read?session_id=s1&key=project_goal"
```

```bash
curl "http://localhost:34750/api/context/session?session_id=s1"
```

```bash
curl -X DELETE "http://localhost:34750/api/context/session?session_id=s1"
```

```bash
curl -X POST http://localhost:34750/api/ai/query \
  -d '{
    "query": "What Swift frameworks should I use?",
    "session_id": "s1",
    "context_keys": ["project_goal"]
  }'
```

The gateway prepends `[Context: project_goal] Build a HomeKit dashboard` to the prompt before sending it to the backend.

Limits: key max 256 chars, value max 50,000 chars, TTL 1--86,400 seconds. Expired entries are cleaned up every 5 minutes.
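Conceptually, the injection step amounts to prefixing one context line per requested key. The function below is a hypothetical illustration of that behavior, not the gateway's actual code:

```python
def inject_context(prompt: str, entries: dict) -> str:
    """Prefix the prompt with one "[Context: key] value" line per entry."""
    lines = [f"[Context: {key}] {value}" for key, value in entries.items()]
    return "\n".join(lines + [prompt]) if lines else prompt
```

With the example above, the backend receives the context line followed by the original question on the next line.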
```bash
# Last 20 queries with backend, model, latency, and fallback flag
curl "http://localhost:34750/api/analytics/recent?limit=20"

# Aggregate totals (active sessions, total queries, uptime)
curl http://localhost:34750/api/analytics/stats
```

Every query routed through the gateway is logged to the `query_log` table in SQLite with: session ID, task type, backend used, model used, prompt/response lengths, latency, fallback status, and validation status.
For high-stakes queries where accuracy matters more than speed, you can run a prompt through multiple backends and compare the outputs.
```bash
curl -X POST http://localhost:34750/api/ai/validate \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the security implications of disabling CSRF protection?", "validate_with": 3}'
```

The validator:
- Sends the prompt to the primary backend (selected by routing rules)
- Sends the same prompt to N-1 additional backends in parallel
- Computes pairwise cosine similarity on word-frequency vectors across all responses
- Returns the average similarity as the consensus score
- Recommends the longest response (more detail is typically better)
Image generation backends (SwarmUI, ComfyUI) are excluded from validation. The timeout for each validator backend is configurable (default 30 seconds).
Drop AIService.swift into any Xcode project to get async/await access to the gateway.
```swift
import Foundation

// Basic query with auto-routing
let result = try await AIService.shared.query(
    "Write a Swift extension to validate email addresses",
    taskType: .swift
)
print(result.response)
print("Backend: \(result.backendUsed), \(result.tokensPerSecond ?? 0) tok/s")

// Quick classification
let answer = try await AIService.shared.query(
    "Is this a valid IPv4 address: 192.168.1.999",
    taskType: .auto
)

// Shared context injection
try await AIService.shared.writeContext(
    session: "s1",
    key: "goal",
    value: "Build HomeKit dashboard"
)
let advice = try await AIService.shared.query(
    "Which Swift frameworks should I use?",
    session: "s1",
    contextKeys: ["goal"]
)

// Check gateway availability
guard await AIService.shared.isAvailable() else {
    print("Gateway is not running")
    return
}

// Full gateway status
let status = try await AIService.shared.status()
for backend in status.backends {
    print("\(backend.name): \(backend.available ? "up" : "down")")
}
```

The Swift client handles:
- Automatic JSON encoding/decoding with snake_case key mapping
- 120-second request timeout, 300-second resource timeout
- Typed error handling (`AIServiceError.gatewayUnavailable`, `.backendError`, `.networkError`)
- Thread-safe singleton via `@MainActor`
All runtime behavior is controlled by config.yaml in the project root.
```yaml
gateway:
  port: 34750
  host: "127.0.0.1"   # Change to 0.0.0.0 for LAN access
  log_level: "INFO"
  db_path: "~/.nova_gateway/context.db"
  version: "2.1.0"

backends:
  tinychat:
    url: "http://localhost:8000"
    enabled: true
    default_model: "gpt-oss:20b"
  mlxcode:
    url: "http://localhost:37422"
    enabled: true
  mlxchat:
    url: "http://localhost:5000"
    enabled: true
    default_model: "mlx-community/Qwen2.5-7B-Instruct-4bit"
  openwebui:
    url: "http://localhost:3000"
    enabled: true
    default_model: "qwen3-vl:4b"
  ollama:
    url: "http://localhost:11434"
    enabled: true
    default_model: "deepseek-r1:8b"
  swarmui:
    url: "http://localhost:7801"
    enabled: true
    default_model: "Juggernaut XL"
  comfyui:
    url: "http://localhost:8188"
    enabled: true

routing:
  default_backend: "ollama"
  default_model: "deepseek-r1:8b"
  rules:
    - task_type: "quick"
      preferred: "tinychat"
      fallback: "ollama"
    - task_type: "coding"
      preferred: "mlxcode"
      fallback: "mlxchat"
      fallback2: "ollama"
    # ... (see config.yaml for full rule set)

context:
  ttl_seconds: 3600
  max_entries_per_session: 200
  cleanup_interval_seconds: 300

validation:
  enabled: true
  consensus_threshold: 0.7
  max_validators: 2
  timeout_seconds: 30
```

Restart the gateway after changes:
```bash
launchctl stop com.nova.gateway && launchctl start com.nova.gateway
```

```bash
# Check if running
launchctl list | grep com.nova.gateway

# Stop
launchctl stop com.nova.gateway

# Start
launchctl start com.nova.gateway

# Disable autostart
launchctl unload ~/Library/LaunchAgents/com.nova.gateway.plist

# Re-enable autostart
launchctl load ~/Library/LaunchAgents/com.nova.gateway.plist

# View logs
tail -f ~/.nova_gateway/gateway.log
tail -f ~/.nova_gateway/gateway.error.log

# Manual start with hot reload (development)
./run.sh --reload --debug
```

```
Nova-NextGen/
|
|-- nova_gateway/
|   |-- __init__.py          Package init, version string
|   |-- main.py              FastAPI application, all HTTP routes, lifespan
|   |-- router.py            Intent detection, backend selection, fallback cascade
|   |-- config.py            YAML config loader, typed accessors
|   |-- models.py            Pydantic request/response models, TaskType enum
|   |
|   |-- backends/
|   |   |-- __init__.py      Backend exports
|   |   |-- base.py          Abstract base class (httpx async client, health_check)
|   |   |-- tinychat.py      TinyChat -- OpenAI-compatible proxy, SSE streaming
|   |   |-- mlxcode.py       MLXCode -- Apple Neural Engine coding app
|   |   |-- mlxchat.py       MLX Chat -- Apple Silicon general inference
|   |   |-- openwebui.py     OpenWebUI -- RAG pipeline, vision, auth support
|   |   |-- ollama.py        Ollama -- reasoning, coding, vision, embeddings
|   |   |-- swarmui.py       SwarmUI -- Stable Diffusion image generation
|   |   |-- comfyui.py       ComfyUI -- node-based image workflows
|   |
|   |-- context/
|   |   |-- store.py         SQLite context bus (aiosqlite), session tracking, analytics
|   |
|   |-- validation/
|       |-- consensus.py     Cosine-similarity cross-model consensus scoring
|
|-- AIService.swift          Drop-in Swift client for Xcode projects
|-- config.yaml              Runtime configuration (backends, routing, context, validation)
|-- requirements.txt         Pinned Python dependencies
|-- install.sh               One-shot setup: venv, deps, LaunchAgent registration
|-- run.sh                   Manual start script with --reload and --debug flags
|-- com.nova.gateway.plist   Generated LaunchAgent plist (created by install.sh)
|-- SECURITY.md              Threat model and hardening guide
|-- LICENSE                  MIT License
```
The Router class in router.py follows this decision order:
- Explicit backend override -- if `preferred_backend` is set and that backend is healthy, use it directly
- Explicit task type -- if `task_type` is not `"auto"`, look up the matching rule in `config.yaml`
- Keyword auto-detection -- scan the prompt against ordered keyword lists; first match wins
- Availability cascade -- if the preferred backend for a rule is down, try `fallback`, then `fallback2`
- Last resort -- if all rules are exhausted, try backends in priority order: mlxchat, tinychat, ollama, openwebui, mlxcode, comfyui, swarmui
- No backends available -- raise HTTP 503 with an actionable error message
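The availability cascade and last-resort sweep boil down to a small selection function. This is a hypothetical sketch of the logic described above (`select_backend` is not the actual name in `router.py`):

```python
def select_backend(rule: dict, healthy: set, priority_order: list) -> str:
    """Walk preferred -> fallback -> fallback2, then the global priority order."""
    for slot in ("preferred", "fallback", "fallback2"):
        name = rule.get(slot)
        if name is not None and name in healthy:
            return name
    for name in priority_order:  # last-resort sweep over all backends
        if name in healthy:
            return name
    raise RuntimeError("503: no backends available")
```

For example, with the coding rule and only Ollama healthy, the cascade skips MLXCode and MLX Chat and lands on Ollama.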
Each backend implements its own health_check() method tailored to its API:
- Ollama: `GET /api/tags` (200 = healthy)
- MLXCode: `GET /api/status` (checks `modelLoaded` field)
- MLX Chat: `GET /health` or `GET /v1/models`
- OpenWebUI: `GET /api/version`, then `GET /api/models` (401 = up but needs auth, still considered available)
- TinyChat: `GET /api/health`, then root fallback
- SwarmUI: `GET /API/GetServerStatus`
- ComfyUI: `GET /system_stats`
Health checks use a 3-second timeout. Latency is measured and reported in the status endpoint.
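The real implementation uses httpx async clients (see `backends/base.py`); a simplified synchronous stdlib sketch of the same probe-and-time idea looks like this (`probe` is an illustrative name):

```python
import time
import urllib.error
import urllib.request

def probe(url: str, path: str, timeout: float = 3.0):
    """Return (available, latency_ms) for one backend health endpoint."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url + path, timeout=timeout) as resp:
            latency_ms = round((time.perf_counter() - start) * 1000, 1)
            return resp.status == 200, latency_ms
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: backend is down.
        return False, None
```

A down backend yields `(False, None)`, which matches the `"available": false, "latency_ms": null` shape in the status response.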
The context bus uses three SQLite tables:
- `context_entries` -- key/value pairs per session with optional TTL and UPSERT on (session_id, key)
- `query_log` -- every routed query with backend, model, latency, and fallback/validation flags
- `sessions` -- session metadata with creation time, last activity, and query count
A background cleanup task runs every 5 minutes (configurable) to prune expired context entries and stale sessions older than 24 hours.
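The TTL and UPSERT behavior of `context_entries` can be reproduced in a few lines of stdlib SQLite. The exact column names below are inferred from the description above, so treat this as an illustrative sketch rather than the store's actual schema:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE context_entries (
        session_id TEXT NOT NULL,
        key        TEXT NOT NULL,
        value      TEXT NOT NULL,
        expires_at REAL,                   -- NULL means no TTL
        PRIMARY KEY (session_id, key)      -- enables UPSERT per (session, key)
    )
""")

def write_context(session_id, key, value, ttl_seconds=None):
    """Insert or overwrite one entry, stamping an absolute expiry time."""
    expires = time.time() + ttl_seconds if ttl_seconds else None
    conn.execute(
        """INSERT INTO context_entries (session_id, key, value, expires_at)
           VALUES (?, ?, ?, ?)
           ON CONFLICT (session_id, key) DO UPDATE
           SET value = excluded.value, expires_at = excluded.expires_at""",
        (session_id, key, value, expires),
    )

def cleanup_expired():
    """What the background task does on each cleanup interval."""
    conn.execute(
        "DELETE FROM context_entries WHERE expires_at IS NOT NULL AND expires_at < ?",
        (time.time(),),
    )
```

Storing an absolute `expires_at` (rather than the TTL itself) makes the cleanup a single indexed `DELETE`.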
The validator computes pairwise cosine similarity on word-frequency vectors (bag-of-words). This requires no ML dependencies -- just collections.Counter and basic linear algebra. The average pairwise score across all responses becomes the consensus score. Image backends are excluded from validation.
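The bag-of-words scoring described above fits in a dozen lines of stdlib Python. This sketch mirrors the approach (Counter-based word-frequency vectors, average pairwise cosine), though function names are illustrative:

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts as word-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0

def consensus_score(responses: list) -> float:
    """Average pairwise similarity across all responses."""
    pairs = list(combinations(responses, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0
```

Identical responses score 1.0, responses with no shared vocabulary score 0.0, and the gateway compares the average against `consensus_threshold`.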
| Package | Version | Purpose |
|---|---|---|
| fastapi | 0.115.6 | HTTP framework and route handling |
| uvicorn[standard] | 0.34.0 | ASGI server |
| httpx | 0.28.1 | Async HTTP client for backend communication |
| aiosqlite | 0.20.0 | Async SQLite driver for context store |
| pyyaml | 6.0.2 | Configuration file parsing |
| pydantic | 2.10.4 | Request/response validation and serialization |
| python-multipart | 0.0.20 | Form data handling |
| aiofiles | 24.1.0 | Async file operations |
All versions are pinned in requirements.txt.
Gateway will not start:
```bash
cat ~/.nova_gateway/gateway.error.log
```

Common causes: port 34750 already in use (change in `config.yaml`), wrong Python version (must be 3.12, not 3.13+).
A backend shows as unavailable:
Check that the backend process is running and serving on its expected port:
```bash
curl http://localhost:11434/api/tags             # Ollama
curl http://localhost:5000/health                # MLX Chat
curl http://localhost:37422/api/status           # MLXCode
curl http://localhost:3000/api/version           # OpenWebUI
curl http://localhost:8000/api/health            # TinyChat
curl http://localhost:7801/API/GetServerStatus   # SwarmUI
curl http://localhost:8188/system_stats          # ComfyUI
```

First Ollama query is slow:
Large models (30B+) take 30-60 seconds to load from disk on first invocation. Subsequent queries are fast once the model is resident in memory. The gateway sets a 300-second timeout for Ollama to accommodate this.
OpenWebUI RAG returns no document results:
Documents must be uploaded and indexed through the OpenWebUI web interface at http://localhost:3000 before RAG queries can retrieve them. The gateway sends the query but document indexing is managed by OpenWebUI.
- Binds to `127.0.0.1` by default -- unreachable from other machines on the network
- CORS restricted to `http://localhost` and `http://127.0.0.1` origins
- All SQLite queries use parameterized statements -- no string interpolation
- All input validated by Pydantic schemas before processing
- No credentials, API keys, or secrets stored anywhere in this project
- No telemetry, no analytics reporting, no outbound network calls from the gateway
See SECURITY.md for the full threat model, prompt injection considerations, and hardening recommendations for LAN deployment.
MIT License -- Copyright (c) 2026 Jordan Koch. See LICENSE for full text.