A comprehensive local AI management platform for running LLMs, image generation, speech, music, 3D modeling, and more on consumer GPU hardware. All inference runs locally — no cloud APIs, no data leaves your machine.
Built for an NVIDIA RTX 4080 (16GB VRAM) / 32GB RAM system, but adaptable to other hardware configurations.
- Architecture
- Repository Structure
- Prerequisites
- Quick Start
- Apps
- Packages
- Models
- VRAM Management
- API Reference
- Configuration
- Scripts
- Development
- Tech Stack
- Roadmap
- Documentation
- License
┌──────────┐  ┌──────────┐  ┌───────────┐  ┌──────────┐  ┌──────────┐
│  Web UI  │  │  Tauri   │  │  C++ App  │  │ JUCE App │  │   CLI    │
│ (React)  │  │ (React)  │  │(SDL/ImGui)│  │ (Audio)  │  │  (Node)  │
└────┬─────┘  └────┬─────┘  └─────┬─────┘  └────┬─────┘  └────┬─────┘
     └─────────────┴──────────────┼─────────────┴─────────────┘
                                  ▼
        ┌────────────────────────────────────────────────────┐
        │            Fastify API Gateway (:3000)             │
        │        Routes / WebSocket Hub / VRAM Budget        │
        └─────┬───────────────┬───────────────┬──────────────┘
              ▼               ▼               ▼
        ┌────────────┐  ┌───────────┐  ┌─────────────┐
        │   Ollama   │  │  Express  │  │   FastAPI   │
        │  (:11434)  │  │ Orchestr. │  │  ML Server  │
        │ LLM/Vision │  │  (:3001)  │  │   (:8000)   │
        └────────────┘  │ Jobs/Queue│  │ SD/Whisper/ │
                        └─────┬─────┘  │  Music/3D   │
                        ┌─────▼─────┐  └─────────────┘
                        │   Redis   │
                        └───────────┘
All frontends communicate exclusively through the Fastify gateway on port 3000. No frontend ever contacts Ollama, Express, or FastAPI directly. This provides a single point for authentication, rate limiting, VRAM budget enforcement, and request routing.
| Service | Technology | Port | Role |
|---|---|---|---|
| Gateway | Fastify | 3000 | API gateway, WebSocket hub, VRAM budget enforcement, static file serving |
| Orchestrator | Express | 3001 | Model lifecycle management, BullMQ job queue, multi-step workflows |
| ML Server | FastAPI (Python) | 8000 | Non-Ollama ML inference: image generation, speech, music, 3D |
| Ollama | Ollama | 11434 | LLM and vision model inference |
| Redis | Redis | 6379 | VRAM state coordination, job queue backend |
kong-local-llm-setup/
├── package.json # Root workspace config
├── pnpm-workspace.yaml # pnpm workspace definition
├── turbo.json # Turborepo build orchestration
├── .gitignore
│
├── packages/
│ └── shared/ # @kong/shared — TypeScript types and constants
│ ├── src/
│ │ ├── types/
│ │ │ ├── chat.ts # ChatMessage, ChatRequest, ChatResponse, ChatStreamChunk
│ │ │ ├── models.ts # ModelInfo, ModelCategory, VramStatus, ModelLoadRequest
│ │ │ └── system.ts # SystemStatus, GpuInfo, BackendStatus
│ │ └── index.ts # Re-exports + service URL constants
│ ├── package.json
│ └── tsconfig.json
│
├── apps/
│ ├── gateway/ # @kong/gateway — Fastify API gateway
│ │ ├── src/
│ │ │ ├── server.ts # Entry point — registers plugins and routes
│ │ │ ├── routes/
│ │ │ │ ├── chat.ts # POST /api/chat, GET /api/chat/ws
│ │ │ │ ├── models.ts # GET /api/models, POST /api/models/pull, DELETE /api/models/:name
│ │ │ │ └── system.ts # GET /api/system
│ │ │ └── services/
│ │ │ └── ollama-client.ts # Ollama HTTP client (chat, list, pull, delete)
│ │ ├── package.json
│ │ └── tsconfig.json
│ │
│ ├── web/ # @kong/web — React web UI
│ │ ├── index.html
│ │ ├── vite.config.ts # Vite config with API proxy to gateway
│ │ ├── src/
│ │ │ ├── main.tsx # React entry point
│ │ │ ├── index.css # Tailwind CSS imports + dark theme base
│ │ │ ├── App.tsx # Root component — sidebar nav, model selector, page router
│ │ │ ├── pages/
│ │ │ │ ├── Chat.tsx # Chat interface with streaming, stop, clear
│ │ │ │ ├── ModelManager.tsx # Model list with status, size, category
│ │ │ │ └── SystemMonitor.tsx # GPU stats, VRAM bar, temperature, backend health
│ │ │ └── hooks/
│ │ │ ├── useChat.ts # SSE streaming chat hook
│ │ │ ├── useModels.ts # Model list fetching hook
│ │ │ └── useSystem.ts # System status polling hook
│ │ ├── package.json
│ │ └── tsconfig.json
│ │
│ ├── cli/ # @kong/cli — command-line interface
│ │ ├── bin/
│ │ │ └── kong.js # Entry point
│ │ ├── src/commands/
│ │ │ ├── index.ts # Commander.js program definition
│ │ │ ├── chat.ts # kong chat [prompt] — interactive or single-shot
│ │ │ ├── models.ts # kong models list | kong models pull <name>
│ │ │ ├── system.ts # kong system status
│ │ │ └── serve.ts # kong serve — start gateway + web UI
│ │ ├── package.json
│ │ └── tsconfig.json
│ │
│ ├── orchestrator/ # (Phase 2) Express workflow engine
│ ├── ml-server/ # (Phase 3) Python FastAPI ML backends
│ ├── desktop/ # (Phase 5) Tauri desktop app
│ ├── native/ # (Phase 5) C++ SDL+BGFX+Dear ImGui app
│ └── juce-audio/ # (Phase 4) JUCE 8 audio app
│
├── config/
│ ├── models.yaml # Model registry — all models across all backends
│ ├── backends.yaml # Backend service definitions and health checks
│ └── vram-profiles.yaml # VRAM budget profiles for different workloads
│
├── scripts/
│ ├── setup.sh # First-time setup: check prerequisites, install deps, pull models
│ └── dev.sh # Start Ollama + all dev servers
│
└── docs/
├── 2026-04-10-STATUS-REPORT.md
└── POSSIBILITIES-AND-AVENUES-OF-AI-RESEARCH.md
| Requirement | Minimum | Recommended |
|---|---|---|
| OS | Linux (Ubuntu 22.04+) | Ubuntu 24.04 |
| GPU | NVIDIA with 8GB VRAM | RTX 4080 16GB |
| RAM | 16GB | 32GB |
| NVIDIA Driver | 535+ | Latest stable |
| CUDA | 12.0+ | 13.0 |
| Node.js | 20.x | 24.x |
| pnpm | 9.x | 10.x |
| Ollama | 0.20+ | Latest |
# Node.js (via nvm)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.0/install.sh | bash
nvm install 24
# pnpm
npm install -g pnpm
# Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Clone
git clone https://github.com/datajango/kong-local-llm-setup.git
cd kong-local-llm-setup
# Setup (installs dependencies + pulls default model)
./scripts/setup.sh
# Start everything
./scripts/dev.sh
# Open in browser
# http://localhost:5173
Or manually:
# Install dependencies
pnpm install
# Ensure Ollama is running
ollama serve &
# Pull the coding model (9GB download)
ollama pull qwen2.5-coder:14b-instruct-q4_K_M
# Start gateway + web UI
pnpm run dev
# Open http://localhost:5173
Package: @kong/gateway | Port: 3000 | Path: apps/gateway/
The central API gateway. Every frontend and CLI command communicates through this service. It proxies requests to Ollama and (in later phases) to the Express orchestrator and Python ML server.
Key files:
- src/server.ts — Fastify app setup, plugin registration, route mounting
- src/routes/chat.ts — Chat endpoint with SSE streaming and WebSocket support
- src/routes/models.ts — Model CRUD operations proxied to Ollama
- src/routes/system.ts — GPU monitoring via nvidia-smi, backend health checks
- src/services/ollama-client.ts — Typed HTTP client for the Ollama REST API
Features:
- CORS enabled for all origins (development)
- WebSocket support via @fastify/websocket
- Server-Sent Events (SSE) for streaming chat responses
- Automatic model category inference from model names
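A minimal sketch of what this name-based inference can look like, using the ModelCategory type from @kong/shared. The inferCategory helper and its match patterns are illustrative assumptions, not the gateway's actual rules:

```ts
import type { ModelCategory } from "@kong/shared";

// Hypothetical helper: map an Ollama model tag to a category by substring.
// The real gateway logic may use different patterns.
function inferCategory(modelName: string): ModelCategory {
  const name = modelName.toLowerCase();
  if (name.includes("coder")) return "coding";
  if (name.includes("minicpm-v") || name.includes("llava")) return "vision";
  return "chat";
}

inferCategory("qwen2.5-coder:14b-instruct-q4_K_M"); // "coding"
```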
Run independently:
pnpm --filter @kong/gateway dev
Package: @kong/web | Port: 5173 | Path: apps/web/
Dark-themed single-page application with three pages accessible from the sidebar.
Pages:
| Page | Description |
|---|---|
| Chat | Send messages, see streaming responses with typing cursor, stop generation, clear history |
| Model Manager | View all available models with category, parameter count, quantization, VRAM size, and load status |
| System Monitor | Live GPU stats refreshed every 3 seconds — VRAM usage bar (color-coded), temperature, utilization percentage, backend health indicators |
Custom Hooks:
| Hook | Purpose |
|---|---|
| useChat(model) | Manages message state, SSE streaming, abort control |
| useModels() | Fetches model list from gateway, provides refresh |
| useSystem(pollMs) | Polls /api/system at a configurable interval |
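The real streaming logic lives in apps/web/src/hooks/useChat.ts. As a rough sketch of the pattern, assuming only the /api/chat SSE contract documented in the API Reference (the useChatSketch name and internal details are assumptions):

```ts
import { useCallback, useRef, useState } from "react";
import type { ChatMessage, ChatStreamChunk } from "@kong/shared";

// Sketch only: the real hook is apps/web/src/hooks/useChat.ts.
export function useChatSketch(model: string) {
  const [messages, setMessages] = useState<ChatMessage[]>([]);
  const abortRef = useRef<AbortController | null>(null);

  const send = useCallback(
    async (content: string) => {
      const history: ChatMessage[] = [...messages, { role: "user", content }];
      setMessages([...history, { role: "assistant", content: "" }]);

      abortRef.current = new AbortController();
      const res = await fetch("/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model, messages: history, stream: true }),
        signal: abortRef.current.signal,
      });

      const reader = res.body!.getReader();
      const decoder = new TextDecoder();
      let buffer = "";
      for (;;) {
        const { value, done } = await reader.read();
        if (done || !value) break;
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split("\n");
        buffer = lines.pop() ?? ""; // keep any partial line for the next read
        for (const line of lines) {
          if (!line.startsWith("data: ")) continue;
          const payload = line.slice("data: ".length);
          if (payload === "[DONE]") return;
          const chunk: ChatStreamChunk = JSON.parse(payload);
          // Append the streamed token to the assistant placeholder added above.
          setMessages((prev) => {
            const next = [...prev];
            const last = next[next.length - 1];
            next[next.length - 1] = { ...last, content: last.content + chunk.message.content };
            return next;
          });
        }
      }
    },
    [model, messages]
  );

  const stop = useCallback(() => abortRef.current?.abort(), []);
  return { messages, send, stop };
}
```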
Tech:
- React 19 with TypeScript
- Tailwind CSS v4 (via the @tailwindcss/vite plugin)
- Lucide React for icons
- Vite 6 with dev proxy to gateway (/api -> localhost:3000)
Run independently:
pnpm --filter @kong/web dev
Package: @kong/cli | Path: apps/cli/
Command-line interface powered by Commander.js. All commands communicate with the gateway API.
Commands:
kong chat [prompt]        Chat with a model (interactive REPL or single prompt)
  -m, --model <model>     Model to use (default: qwen2.5-coder:14b-instruct-q4_K_M)
kong models list          List all available models with status and size
kong models pull <name>   Pull a model with streaming progress display
kong system status        Show GPU info, VRAM usage, backend health
kong serve                Start gateway + web UI dev servers
  -p, --port <port>       Gateway port (default: 3000)
Examples:
# Single prompt
pnpm --filter @kong/cli dev -- chat "Write a C++ hello world"
# Interactive mode
pnpm --filter @kong/cli dev -- chat
# Check system
pnpm --filter @kong/cli dev -- system status
# List models
pnpm --filter @kong/cli dev -- models list
Environment variables:
- KONG_GATEWAY — Gateway URL (default: http://localhost:3000)
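As an illustration of how a command might resolve that URL (a sketch, not the actual apps/cli source):

```ts
import { GATEWAY_URL } from "@kong/shared";

// Fall back to the shared default (http://localhost:3000) when KONG_GATEWAY is unset.
const gateway = process.env.KONG_GATEWAY ?? GATEWAY_URL;

const status = await fetch(`${gateway}/api/system`).then((res) => res.json());
console.log(status.gpu.name, `${status.vram.freeMb} MB free`);
```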
Package: @kong/orchestrator | Port: 3001 | Path: apps/orchestrator/
Status: Phase 2 — not yet implemented.
Planned responsibilities:
- Model lifecycle management (load, unload, swap)
- BullMQ job queue for async tasks (image generation, audio processing)
- Multi-step workflow engine (chain multiple models together)
- Ollama process supervision (start, stop, restart, health monitoring)
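Since none of this exists yet, the following only sketches how a BullMQ queue backed by the Redis service on :6379 might accept and process an async job; the queue name, job name, and payload shape are assumptions:

```ts
import { Queue, Worker } from "bullmq";

// Illustrative only: the orchestrator is not implemented yet.
const connection = { host: "localhost", port: 6379 };

const jobs = new Queue("kong-jobs", { connection });

// Enqueue an async task, e.g. from a gateway route.
await jobs.add("image-generate", { prompt: "a robot", modelId: "sdxl-turbo" });

// Process tasks; in Phase 3 this would call the FastAPI ML server on :8000.
const worker = new Worker(
  "kong-jobs",
  async (job) => {
    console.log(`processing ${job.name}`, job.data);
  },
  { connection }
);
```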
Package: @kong/ml-server | Port: 8000 | Path: apps/ml-server/
Status: Phase 3 — not yet implemented.
Planned responsibilities:
- Image generation via diffusers (SDXL Turbo, Stable Diffusion XL)
- Speech-to-text via faster-whisper (Whisper Large v3 Turbo)
- Text-to-speech via piper-tts
- Music generation via audiocraft (MusicGen, AudioGen)
- 3D mesh generation via TripoSR
- VRAM coordination with gateway via Redis
Package: @kong/desktop | Path: apps/desktop/
Status: Phase 5 — not yet implemented.
Tauri 2.0 desktop application wrapping the shared React frontend. Adds native capabilities:
- System tray with quick actions
- Native file dialogs for saving generated content
- Local file system access for model management
- Native menus and keyboard shortcuts
- Auto-start and background operation
Package: N/A (CMake) | Path: apps/native/
Status: Phase 5 — not yet implemented.
High-performance native application for real-time workloads:
- SDL2 for window management and input
- BGFX for cross-platform GPU rendering
- Dear ImGui for immediate-mode UI panels
- libcurl for HTTP communication with the gateway
- Custom WebSocket client for streaming
- 3D viewport for rendering generated meshes (BGFX shaders)
- Real-time GPU monitoring panels
- Audio capture via SDL for voice input
Package: N/A (CMake) | Path: apps/juce-audio/
Status: Phase 4 — not yet implemented.
Standalone audio application built with JUCE 8:
- Audio engine with real-time processing pipeline
- Speech panel — microphone capture, STT via Whisper, TTS playback
- Music panel — MusicGen control, melody input, generation parameters
- Sound effects panel — AudioGen text-to-SFX generation
- HTTP client connecting to the gateway for all model requests
- Audio file management (save/load/export generated audio)
Package: @kong/shared | Path: packages/shared/
TypeScript type definitions and constants shared across all Node.js apps.
Types:
// Chat
ChatMessage // { role, content, images? }
ChatRequest // { model, messages, stream?, temperature?, maxTokens? }
ChatResponse // { model, message, done, totalDuration?, evalCount? }
ChatStreamChunk // { model, message, done }
// Models
ModelInfo // { id, name, category, backend, vramMb, loaded, ... }
ModelCategory // "coding" | "chat" | "vision" | "image-gen" | "stt" | "tts" | ...
ModelLoadRequest // { modelId, priority? }
ModelListResponse // { models, vram }
VramStatus // { totalMb, usedMb, freeMb, loadedModels }
// System
SystemStatus // { gpu, vram, backends, uptime }
GpuInfo // { name, totalMemoryMb, usedMemoryMb, temperature?, utilization? }
BackendStatus // { name, url, healthy, lastCheck }
Constants:
OLLAMA_URL = "http://localhost:11434"
GATEWAY_URL = "http://localhost:3000"
ORCHESTRATOR_URL = "http://localhost:3001"
ML_SERVER_URL = "http://localhost:8000"
Build:
pnpm --filter @kong/shared build
Models served by Ollama for text, code, and vision inference.
| Model | Category | Parameters | Quantization | VRAM | Description |
|---|---|---|---|---|---|
| qwen2.5-coder:14b-instruct-q4_K_M | coding | 14.8B | Q4_K_M | ~9 GB | Expert coding — C++, Python, TypeScript, Rust, Go |
| qwen2.5:14b-instruct-q4_K_M | chat | 14.8B | Q4_K_M | ~9 GB | General-purpose chat and reasoning |
| minicpm-v:8b-2.6-q4_K_M | vision | 8B | Q4_K_M | ~5 GB | Image understanding, technical diagrams, OCR |
Pull additional models:
ollama pull qwen2.5:14b-instruct-q4_K_M
ollama pull minicpm-v:8b-2.6-q4_K_M
Profiles combine a base Ollama model with a specialized system prompt. They reuse the same model weights — no additional VRAM required.
| Profile | Base Model | Domain |
|---|---|---|
| Electronics Expert | qwen2.5-coder:14b | ESP32, Raspberry Pi, Arduino, I2C/SPI/UART, circuit design |
| CAD Expert | qwen2.5-coder:14b | OpenSCAD, CadQuery, FreeCAD, mechanical engineering |
Profiles are defined in config/models.yaml and include the full system prompt.
Models that will be served by the FastAPI ML server in Phase 3+.
| Model | Category | VRAM | Backend Library |
|---|---|---|---|
| SDXL Turbo | Image generation | ~5 GB | diffusers |
| Whisper Large v3 Turbo | Speech-to-text | ~1.5 GB | faster-whisper |
| Piper TTS | Text-to-speech | ~100 MB (CPU) | piper-tts |
| MusicGen Small | Music generation | ~2 GB | audiocraft |
| AudioGen Medium | Sound effects | ~4 GB | audiocraft |
| TripoSR | Image-to-3D mesh | ~4 GB | triposr |
The RTX 4080 has 16,376 MB of VRAM. After system overhead (~400 MB at idle) and a safety margin, the effective budget is approximately 15,500 MB. Only one large model (~9 GB) can be loaded at a time, or 2-3 smaller models concurrently.
Predefined model combinations optimized for specific workflows:
| Profile | Models | Total VRAM | Use Case |
|---|---|---|---|
| coding | Qwen 2.5 Coder 14B + Whisper | 10.5 GB | Code generation with voice input |
| creative | SDXL Turbo + MusicGen + Piper | 7.1 GB | Image, music, and speech generation |
| vision-chat | MiniCPM-V 8B | 5.0 GB | Image analysis and visual Q&A |
| 3d-pipeline | TripoSR + MiniCPM-V 8B | 9.0 GB | Image-to-3D with vision analysis |
The VRAM manager (planned for Phase 2) will:
- Track what models are loaded and their VRAM consumption via Redis
- Enforce a VRAM budget — reject requests if insufficient VRAM
- Auto-evict least-recently-used models when new models are requested
- Never evict a model that is actively streaming a response
- Support preemptive loading hints from frontends
- Coordinate between Ollama (Node.js) and ML Server (Python) via shared Redis state
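A sketch of the assumed budget-check and LRU-eviction logic; none of this is implemented yet, and the planLoad helper below is hypothetical:

```ts
// Assumed logic for the planned VRAM manager; nothing below exists yet.
interface LoadedModel {
  id: string;
  vramMb: number;
  lastUsed: number;   // ms timestamp, updated on every request
  streaming: boolean; // models actively streaming a response are never evicted
}

const BUDGET_MB = 15_500; // effective budget on the 16 GB RTX 4080

// Returns the model IDs to evict so the request fits, or null to reject it.
function planLoad(loaded: LoadedModel[], requestVramMb: number): string[] | null {
  let used = loaded.reduce((sum, m) => sum + m.vramMb, 0);
  const evict: string[] = [];
  // Walk candidates from least to most recently used.
  for (const m of [...loaded].sort((a, b) => a.lastUsed - b.lastUsed)) {
    if (used + requestVramMb <= BUDGET_MB) break;
    if (m.streaming) continue;
    evict.push(m.id);
    used -= m.vramMb;
  }
  return used + requestVramMb <= BUDGET_MB ? evict : null;
}

// Example: loading the 9 GB coder model while SDXL Turbo (5 GB) and MusicGen
// (2 GB) sit idle would evict only the least recently used of the two.
```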
Base URL: http://localhost:3000
GET /api/health
Response:
{ "status": "ok" }POST /api/chat
Content-Type: application/json
Request body:
{
"model": "qwen2.5-coder:14b-instruct-q4_K_M",
"messages": [
{ "role": "user", "content": "Write a C++ hello world" }
],
"stream": true,
"temperature": 0.7,
"maxTokens": 2048
}
Streaming response (stream: true):
data: {"model":"qwen2.5-coder:14b-instruct-q4_K_M","message":{"role":"assistant","content":"```cpp"},"done":false}
data: {"model":"qwen2.5-coder:14b-instruct-q4_K_M","message":{"role":"assistant","content":"\n#include"},"done":false}
...
data: [DONE]
Non-streaming response (stream: false):
{
"model": "qwen2.5-coder:14b-instruct-q4_K_M",
"message": { "role": "assistant", "content": "```cpp\n#include <iostream>..." },
"done": true
}
GET /api/chat/ws
Send a JSON ChatRequest message. Receive streamed ChatStreamChunk messages, ending with { "done": true }.
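A browser-side sketch of that flow, assuming the messages follow the ChatRequest and ChatStreamChunk types from @kong/shared:

```ts
import type { ChatRequest, ChatStreamChunk } from "@kong/shared";

const ws = new WebSocket("ws://localhost:3000/api/chat/ws");

ws.onopen = () => {
  const request: ChatRequest = {
    model: "qwen2.5-coder:14b-instruct-q4_K_M",
    messages: [{ role: "user", content: "Write a C++ hello world" }],
    stream: true,
  };
  ws.send(JSON.stringify(request));
};

ws.onmessage = (event) => {
  const chunk: ChatStreamChunk = JSON.parse(event.data);
  if (chunk.done) {
    ws.close();                         // final message: { "done": true }
    return;
  }
  console.log(chunk.message.content);   // in the web UI this would append to state
};
```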
GET /api/models
Response:
{
"models": [
{
"id": "qwen2.5-coder:14b-instruct-q4_K_M",
"name": "qwen2.5-coder:14b-instruct-q4_K_M",
"category": "coding",
"backend": "ollama",
"vramMb": 8572,
"parameterSize": "14.8B",
"quantization": "Q4_K_M",
"loaded": false
}
]
}
GET /api/models/running
Response:
{ "running": ["qwen2.5-coder:14b-instruct-q4_K_M"] }POST /api/models/pull
Content-Type: application/json
Request body:
{ "name": "minicpm-v:8b-2.6-q4_K_M" }Response (SSE stream):
data: {"status":"pulling manifest"}
data: {"status":"downloading","completed":1048576,"total":5368709120}
...
data: {"status":"success"}
DELETE /api/models/:name
Response:
{ "success": true }GET /api/system
Response:
{
"gpu": {
"name": "NVIDIA GeForce RTX 4080",
"totalMemoryMb": 16376,
"usedMemoryMb": 343,
"temperature": 34,
"utilization": 7
},
"vram": {
"totalMb": 16376,
"usedMb": 343,
"freeMb": 16033,
"loadedModels": []
},
"backends": [
{
"name": "ollama",
"url": "http://localhost:11434",
"healthy": true,
"lastCheck": "2026-04-11T04:32:11.893Z"
}
],
"uptime": 42.5
}config/models.yaml — Central registry of all models across all backends.
Each model entry contains:
| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier (Ollama model tag or custom ID) |
| name | string | Human-readable display name |
| category | string | One of: coding, chat, vision, image-gen, stt, tts, music, sound-fx, 3d, electronics, cad |
| backend | string | One of: ollama, diffusers, whisper, piper, audiocraft, triposr |
| vramMb | number | VRAM consumption in megabytes |
| description | string | Short description of capabilities |
| baseModel | string | (Profiles only) Underlying Ollama model ID |
| systemPrompt | string | (Profiles only) System prompt for specialization |
config/backends.yaml — Backend service definitions.
Each backend entry contains:
| Field | Type | Description |
|---|---|---|
| url | string | Base URL of the service |
| healthCheck | string | Path to health check endpoint |
| description | string | Short description |
config/vram-profiles.yaml — Predefined model combinations for specific workflows.
Each profile contains:
| Field | Type | Description |
|---|---|---|
| description | string | What the profile is for |
| models | array | List of { id, vramMb } entries |
| totalVramMb | number | Sum of all model VRAM requirements |
| Script | Description |
|---|---|
| scripts/setup.sh | First-time setup — checks that Node.js, pnpm, and Ollama are installed; runs pnpm install; pulls the default coding model |
| scripts/dev.sh | Starts Ollama (if not already running), then runs pnpm run dev to launch all services via Turborepo |
# Install all dependencies
pnpm install
# Start all services in dev mode (gateway + web UI)
pnpm run dev
# Build all packages
pnpm run build
# Build a specific package
pnpm --filter @kong/shared build
pnpm --filter @kong/gateway build
# Run a specific app in dev mode
pnpm --filter @kong/gateway dev
pnpm --filter @kong/web dev
# Clean all build artifacts
pnpm run clean
- Pull the model via Ollama:
ollama pull <model-name>
- Add an entry to config/models.yaml with the model's ID, category, VRAM size, and description.
- The model will automatically appear in the Web UI model selector and in the CLI models list output.
- Create a new file in apps/gateway/src/routes/ (e.g., images.ts).
- Export an async function that takes a FastifyInstance and registers routes:
    import type { FastifyInstance } from "fastify";

    export async function imageRoutes(app: FastifyInstance) {
      app.post("/api/images/generate", async (request) => {
        // ...
      });
    }
- Register the route in apps/gateway/src/server.ts:
    import { imageRoutes } from "./routes/images.js";
    await app.register(imageRoutes);
- Create a new page component in apps/web/src/pages/ (e.g., ImageGen.tsx).
- Add a nav entry to the NAV_ITEMS array in apps/web/src/App.tsx.
- Add the page route in the <main> section of App.tsx.
- Create a custom hook in apps/web/src/hooks/ if the page needs API data.
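A hypothetical excerpt tying those steps together (the actual NAV_ITEMS shape and page switching in App.tsx may differ):

```tsx
// apps/web/src/App.tsx (illustrative excerpt, not the real file)
import { Image } from "lucide-react";
import ImageGen from "./pages/ImageGen";

const NAV_ITEMS = [
  // ...existing entries (Chat, Model Manager, System Monitor)
  { id: "image-gen", label: "Image Gen", icon: Image },
];

// Inside the <main> section:
// {activePage === "image-gen" && <ImageGen />}
```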
| Tool | Version | Purpose |
|---|---|---|
| Node.js | 24.x | JavaScript/TypeScript runtime |
| pnpm | 10.x | Package manager with workspace support |
| Turborepo | 2.x | Monorepo build orchestration with caching |
| TypeScript | 5.x | Type safety across all Node.js packages |
| tsx | 4.x | TypeScript execution for dev mode |
| Tool | Purpose |
|---|---|
| Fastify 5 | API gateway — high-performance HTTP + WebSocket |
| @fastify/cors | Cross-origin resource sharing |
| @fastify/websocket | WebSocket support for streaming |
| @fastify/static | Static file serving (production) |
| Ollama | LLM inference runtime (wraps llama.cpp) |
| Tool | Purpose |
|---|---|
| React 19 | UI component framework |
| Vite 6 | Frontend build tool and dev server |
| Tailwind CSS 4 | Utility-first CSS framework |
| Lucide React | Icon library |
| Tool | Purpose |
|---|---|
| Commander.js 13 | CLI argument parsing and command structure |
| Tool | Phase | Purpose |
|---|---|---|
| Express | 2 | Workflow orchestrator, job queue |
| Redis | 2 | VRAM state, job queue backend (BullMQ) |
| FastAPI (Python) | 3 | ML inference server |
| diffusers | 3 | Image generation (Stable Diffusion) |
| faster-whisper | 3 | Speech-to-text |
| piper-tts | 3 | Text-to-speech |
| audiocraft | 4 | Music and sound effect generation |
| JUCE 8 | 4 | Audio application framework |
| Tauri 2 | 5 | Desktop application shell |
| SDL2 | 5 | Window management, input, audio capture |
| BGFX | 5 | Cross-platform GPU rendering |
| Dear ImGui | 5 | Immediate-mode GUI |
| TripoSR | 6 | Image-to-3D mesh generation |
| Phase | Focus | Status |
|---|---|---|
| 1. Foundation | Ollama + Fastify gateway + React web UI + CLI | Complete |
| 2. Model Management + VRAM | Redis, VRAM manager, Express orchestrator, BullMQ | Planned |
| 3. Python ML Server | Image generation, speech (Whisper + Piper), VRAM coordination | Planned |
| 4. Audio & Music | MusicGen, AudioGen, JUCE 8 audio application | Planned |
| 5. Desktop Apps | Tauri desktop app, C++ SDL+BGFX+Dear ImGui native app | Planned |
| 6. 3D & Advanced | TripoSR, workflow engine, CAD/electronics profiles | Planned |
See docs/2026-04-10-STATUS-REPORT.md for detailed Phase 1 completion report.
| Document | Description |
|---|---|
| Status Report (2026-04-10) | Phase 1 completion report — what was built, tested, and verified |
| AI Research Possibilities | Comprehensive survey of 18 AI research domains with models, capabilities, and open frontiers |
MIT