Kong — Local AI Platform

A comprehensive local AI management platform for running LLMs, image generation, speech, music, 3D modeling, and more on consumer GPU hardware. All inference runs locally — no cloud APIs, no data leaves your machine.

Built for an NVIDIA RTX 4080 (16GB VRAM) / 32GB RAM system, but adaptable to other hardware configurations.


Architecture

 ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
 │  Web UI  │ │  Tauri   │ │ C++ App  │ │ JUCE App │ │   CLI    │
 │ (React)  │ │ (React)  │ │(SDL/ImGui)│ │ (Audio)  │ │ (Node)   │
 └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
      └──────────┬──┴─────────┬──┴─────────┬──┘             │
                 ▼             ▼             ▼                ▼
              ┌──────────────────────────────────────────────────┐
              │         Fastify API Gateway (:3000)              │
              │   Routes / WebSocket Hub / VRAM Budget           │
              └──────┬──────────────┬─────────────┬─────────────┘
                     ▼              ▼             ▼
              ┌────────────┐ ┌───────────┐ ┌─────────────┐
              │  Ollama    │ │  Express  │ │   FastAPI   │
              │  (:11434)  │ │ Orchestr. │ │  ML Server  │
              │ LLM/Vision │ │  (:3001)  │ │  (:8000)    │
              └────────────┘ │ Jobs/Queue│ │ SD/Whisper/ │
                             └─────┬─────┘ │ Music/3D    │
                              ┌────▼────┐  └─────────────┘
                              │  Redis  │
                              └─────────┘

All frontends communicate exclusively through the Fastify gateway on port 3000. No frontend ever contacts Ollama, Express, or FastAPI directly. This provides a single point for authentication, rate limiting, VRAM budget enforcement, and request routing.

| Service | Technology | Port | Role |
| --- | --- | --- | --- |
| Gateway | Fastify | 3000 | API gateway, WebSocket hub, VRAM budget enforcement, static file serving |
| Orchestrator | Express | 3001 | Model lifecycle management, BullMQ job queue, multi-step workflows |
| ML Server | FastAPI (Python) | 8000 | Non-Ollama ML inference: image generation, speech, music, 3D |
| Ollama | Ollama | 11434 | LLM and vision model inference |
| Redis | Redis | 6379 | VRAM state coordination, job queue backend |

Repository Structure

kong-local-llm-setup/
├── package.json                  # Root workspace config
├── pnpm-workspace.yaml           # pnpm workspace definition
├── turbo.json                    # Turborepo build orchestration
├── .gitignore
│
├── packages/
│   └── shared/                   # @kong/shared — TypeScript types and constants
│       ├── src/
│       │   ├── types/
│       │   │   ├── chat.ts       # ChatMessage, ChatRequest, ChatResponse, ChatStreamChunk
│       │   │   ├── models.ts     # ModelInfo, ModelCategory, VramStatus, ModelLoadRequest
│       │   │   └── system.ts     # SystemStatus, GpuInfo, BackendStatus
│       │   └── index.ts          # Re-exports + service URL constants
│       ├── package.json
│       └── tsconfig.json
│
├── apps/
│   ├── gateway/                  # @kong/gateway — Fastify API gateway
│   │   ├── src/
│   │   │   ├── server.ts         # Entry point — registers plugins and routes
│   │   │   ├── routes/
│   │   │   │   ├── chat.ts       # POST /api/chat, GET /api/chat/ws
│   │   │   │   ├── models.ts     # GET /api/models, POST /api/models/pull, DELETE /api/models/:name
│   │   │   │   └── system.ts     # GET /api/system
│   │   │   └── services/
│   │   │       └── ollama-client.ts  # Ollama HTTP client (chat, list, pull, delete)
│   │   ├── package.json
│   │   └── tsconfig.json
│   │
│   ├── web/                      # @kong/web — React web UI
│   │   ├── index.html
│   │   ├── vite.config.ts        # Vite config with API proxy to gateway
│   │   ├── src/
│   │   │   ├── main.tsx          # React entry point
│   │   │   ├── index.css         # Tailwind CSS imports + dark theme base
│   │   │   ├── App.tsx           # Root component — sidebar nav, model selector, page router
│   │   │   ├── pages/
│   │   │   │   ├── Chat.tsx      # Chat interface with streaming, stop, clear
│   │   │   │   ├── ModelManager.tsx  # Model list with status, size, category
│   │   │   │   └── SystemMonitor.tsx # GPU stats, VRAM bar, temperature, backend health
│   │   │   └── hooks/
│   │   │       ├── useChat.ts    # SSE streaming chat hook
│   │   │       ├── useModels.ts  # Model list fetching hook
│   │   │       └── useSystem.ts  # System status polling hook
│   │   ├── package.json
│   │   └── tsconfig.json
│   │
│   ├── cli/                      # @kong/cli — command-line interface
│   │   ├── bin/
│   │   │   └── kong.js           # Entry point
│   │   ├── src/commands/
│   │   │   ├── index.ts          # Commander.js program definition
│   │   │   ├── chat.ts           # kong chat [prompt] — interactive or single-shot
│   │   │   ├── models.ts         # kong models list | kong models pull <name>
│   │   │   ├── system.ts         # kong system status
│   │   │   └── serve.ts          # kong serve — start gateway + web UI
│   │   ├── package.json
│   │   └── tsconfig.json
│   │
│   ├── orchestrator/             # (Phase 2) Express workflow engine
│   ├── ml-server/                # (Phase 3) Python FastAPI ML backends
│   ├── desktop/                  # (Phase 5) Tauri desktop app
│   ├── native/                   # (Phase 5) C++ SDL+BGFX+Dear ImGui app
│   └── juce-audio/               # (Phase 4) JUCE 8 audio app
│
├── config/
│   ├── models.yaml               # Model registry — all models across all backends
│   ├── backends.yaml             # Backend service definitions and health checks
│   └── vram-profiles.yaml        # VRAM budget profiles for different workloads
│
├── scripts/
│   ├── setup.sh                  # First-time setup: check prerequisites, install deps, pull models
│   └── dev.sh                    # Start Ollama + all dev servers
│
└── docs/
    ├── 2026-04-10-STATUS-REPORT.md
    └── POSSIBILITIES-AND-AVENUES-OF-AI-RESEARCH.md

Prerequisites

| Requirement | Minimum | Recommended |
| --- | --- | --- |
| OS | Linux (Ubuntu 22.04+) | Ubuntu 24.04 |
| GPU | NVIDIA with 8GB VRAM | RTX 4080 16GB |
| RAM | 16GB | 32GB |
| NVIDIA Driver | 535+ | Latest stable |
| CUDA | 12.0+ | 13.0 |
| Node.js | 20.x | 24.x |
| pnpm | 9.x | 10.x |
| Ollama | 0.20+ | Latest |

Install Prerequisites

# Node.js (via nvm)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.0/install.sh | bash
nvm install 24

# pnpm
npm install -g pnpm

# Ollama
curl -fsSL https://ollama.com/install.sh | sh

Quick Start

# Clone
git clone https://github.com/datajango/kong-local-llm-setup.git
cd kong-local-llm-setup

# Setup (installs dependencies + pulls default model)
./scripts/setup.sh

# Start everything
./scripts/dev.sh

# Open in browser
# http://localhost:5173

Or manually:

# Install dependencies
pnpm install

# Ensure Ollama is running
ollama serve &

# Pull the coding model (9GB download)
ollama pull qwen2.5-coder:14b-instruct-q4_K_M

# Start gateway + web UI
pnpm run dev

# Open http://localhost:5173

Apps

Gateway (Fastify)

Package: @kong/gateway | Port: 3000 | Path: apps/gateway/

The central API gateway. Every frontend and CLI command communicates through this service. It proxies requests to Ollama and (in later phases) to the Express orchestrator and Python ML server.

Key files:

  • src/server.ts — Fastify app setup, plugin registration, route mounting
  • src/routes/chat.ts — Chat endpoint with SSE streaming and WebSocket support
  • src/routes/models.ts — Model CRUD operations proxied to Ollama
  • src/routes/system.ts — GPU monitoring via nvidia-smi, backend health checks
  • src/services/ollama-client.ts — Typed HTTP client for the Ollama REST API

Features:

  • CORS enabled for all origins (development)
  • WebSocket support via @fastify/websocket
  • Server-Sent Events (SSE) for streaming chat responses
  • Automatic model category inference from model names
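The last feature, category inference, keys off the model name. A rough sketch of the idea, assuming the ModelCategory values exported by @kong/shared (the real rules live in the gateway and may differ):

import type { ModelCategory } from "@kong/shared";

// Sketch only: guess a model's category from its Ollama tag.
// The gateway's actual heuristics may use different rules or the models.yaml registry.
export function inferCategory(name: string): ModelCategory {
  const n = name.toLowerCase();
  if (n.includes("coder")) return "coding";
  if (n.includes("minicpm-v") || n.includes("llava") || n.includes("vision")) return "vision";
  return "chat";
}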

Run independently:

pnpm --filter @kong/gateway dev

Web UI (React)

Package: @kong/web | Port: 5173 | Path: apps/web/

Dark-themed single-page application with three pages accessible from the sidebar.

Pages:

| Page | Description |
| --- | --- |
| Chat | Send messages, see streaming responses with a typing cursor, stop generation, clear history |
| Model Manager | View all available models with category, parameter count, quantization, VRAM size, and load status |
| System Monitor | Live GPU stats refreshed every 3 seconds: color-coded VRAM usage bar, temperature, utilization percentage, backend health indicators |

Custom Hooks:

| Hook | Purpose |
| --- | --- |
| useChat(model) | Manages message state, SSE streaming, abort control |
| useModels() | Fetches the model list from the gateway, provides refresh |
| useSystem(pollMs) | Polls /api/system at a configurable interval |
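As a rough illustration, a page might consume useChat like the sketch below. The return shape { messages, streaming, send, stop } is assumed for the example; check apps/web/src/hooks/useChat.ts for the actual API:

import { useState } from "react";
import { useChat } from "../hooks/useChat";

// Sketch only: assumes useChat returns { messages, streaming, send, stop }.
export function MiniChat() {
  const [input, setInput] = useState("");
  const { messages, streaming, send, stop } = useChat("qwen2.5-coder:14b-instruct-q4_K_M");

  return (
    <div>
      {messages.map((m, i) => (
        <p key={i}><b>{m.role}:</b> {m.content}</p>
      ))}
      <input value={input} onChange={(e) => setInput(e.target.value)} />
      <button onClick={() => { send(input); setInput(""); }}>Send</button>
      {streaming && <button onClick={stop}>Stop</button>}
    </div>
  );
}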

Tech:

  • React 19 with TypeScript
  • Tailwind CSS v4 (via @tailwindcss/vite plugin)
  • Lucide React for icons
  • Vite 6 with dev proxy to gateway (/api -> localhost:3000)

Run independently:

pnpm --filter @kong/web dev

CLI

Package: @kong/cli | Path: apps/cli/

Command-line interface powered by Commander.js. All commands communicate with the gateway API.

Commands:

kong chat [prompt]              Chat with a model (interactive REPL or single prompt)
  -m, --model <model>           Model to use (default: qwen2.5-coder:14b-instruct-q4_K_M)

kong models list                List all available models with status and size
kong models pull <name>         Pull a model with streaming progress display

kong system status              Show GPU info, VRAM usage, backend health

kong serve                      Start gateway + web UI dev servers
  -p, --port <port>             Gateway port (default: 3000)

Examples:

# Single prompt
pnpm --filter @kong/cli dev -- chat "Write a C++ hello world"

# Interactive mode
pnpm --filter @kong/cli dev -- chat

# Check system
pnpm --filter @kong/cli dev -- system status

# List models
pnpm --filter @kong/cli dev -- models list

Environment variables:

  • KONG_GATEWAY — Gateway URL (default: http://localhost:3000)

Orchestrator (Express)

Package: @kong/orchestrator | Port: 3001 | Path: apps/orchestrator/

Status: Phase 2 — not yet implemented.

Planned responsibilities:

  • Model lifecycle management (load, unload, swap)
  • BullMQ job queue for async tasks (image generation, audio processing)
  • Multi-step workflow engine (chain multiple models together)
  • Ollama process supervision (start, stop, restart, health monitoring)

ML Server (FastAPI)

Package: @kong/ml-server | Port: 8000 | Path: apps/ml-server/

Status: Phase 3 — not yet implemented.

Planned responsibilities:

  • Image generation via diffusers (SDXL Turbo, Stable Diffusion XL)
  • Speech-to-text via faster-whisper (Whisper Large v3 Turbo)
  • Text-to-speech via piper-tts
  • Music generation via audiocraft (MusicGen, AudioGen)
  • 3D mesh generation via TripoSR
  • VRAM coordination with gateway via Redis

Desktop (Tauri)

Package: @kong/desktop | Path: apps/desktop/

Status: Phase 5 — not yet implemented.

Tauri 2.0 desktop application wrapping the shared React frontend. Adds native capabilities:

  • System tray with quick actions
  • Native file dialogs for saving generated content
  • Local file system access for model management
  • Native menus and keyboard shortcuts
  • Auto-start and background operation

Native (C++ SDL+BGFX+ImGui)

Package: N/A (CMake) | Path: apps/native/

Status: Phase 5 — not yet implemented.

High-performance native application for real-time workloads:

  • SDL2 for window management and input
  • BGFX for cross-platform GPU rendering
  • Dear ImGui for immediate-mode UI panels
  • libcurl for HTTP communication with the gateway
  • Custom WebSocket client for streaming
  • 3D viewport for rendering generated meshes (BGFX shaders)
  • Real-time GPU monitoring panels
  • Audio capture via SDL for voice input

JUCE Audio

Package: N/A (CMake) | Path: apps/juce-audio/

Status: Phase 4 — not yet implemented.

Standalone audio application built with JUCE 8:

  • Audio engine with real-time processing pipeline
  • Speech panel — microphone capture, STT via Whisper, TTS playback
  • Music panel — MusicGen control, melody input, generation parameters
  • Sound effects panel — AudioGen text-to-SFX generation
  • HTTP client connecting to the gateway for all model requests
  • Audio file management (save/load/export generated audio)

Packages

Shared Types

Package: @kong/shared | Path: packages/shared/

TypeScript type definitions and constants shared across all Node.js apps.

Types:

// Chat
ChatMessage          // { role, content, images? }
ChatRequest          // { model, messages, stream?, temperature?, maxTokens? }
ChatResponse         // { model, message, done, totalDuration?, evalCount? }
ChatStreamChunk      // { model, message, done }

// Models
ModelInfo            // { id, name, category, backend, vramMb, loaded, ... }
ModelCategory        // "coding" | "chat" | "vision" | "image-gen" | "stt" | "tts" | ...
ModelLoadRequest     // { modelId, priority? }
ModelListResponse    // { models, vram }
VramStatus           // { totalMb, usedMb, freeMb, loadedModels }

// System
SystemStatus         // { gpu, vram, backends, uptime }
GpuInfo              // { name, totalMemoryMb, usedMemoryMb, temperature?, utilization? }
BackendStatus        // { name, url, healthy, lastCheck }

Constants:

OLLAMA_URL       = "http://localhost:11434"
GATEWAY_URL      = "http://localhost:3000"
ORCHESTRATOR_URL = "http://localhost:3001"
ML_SERVER_URL    = "http://localhost:8000"

Build:

pnpm --filter @kong/shared build

Models

Ollama Models

Models served by Ollama for text, code, and vision inference.

| Model | Category | Parameters | Quantization | VRAM | Description |
| --- | --- | --- | --- | --- | --- |
| qwen2.5-coder:14b-instruct-q4_K_M | coding | 14.8B | Q4_K_M | ~9 GB | Expert coding: C++, Python, TypeScript, Rust, Go |
| qwen2.5:14b-instruct-q4_K_M | chat | 14.8B | Q4_K_M | ~9 GB | General-purpose chat and reasoning |
| minicpm-v:8b-2.6-q4_K_M | vision | 8B | Q4_K_M | ~5 GB | Image understanding, technical diagrams, OCR |

Pull additional models:

ollama pull qwen2.5:14b-instruct-q4_K_M
ollama pull minicpm-v:8b-2.6-q4_K_M

Model Profiles

Profiles combine a base Ollama model with a specialized system prompt. They reuse the same model weights — no additional VRAM required.

| Profile | Base Model | Domain |
| --- | --- | --- |
| Electronics Expert | qwen2.5-coder:14b | ESP32, Raspberry Pi, Arduino, I2C/SPI/UART, circuit design |
| CAD Expert | qwen2.5-coder:14b | OpenSCAD, CadQuery, FreeCAD, mechanical engineering |

Profiles are defined in config/models.yaml and include the full system prompt.

Python ML Models (Planned)

Models that will be served by the FastAPI ML server in Phase 3+.

| Model | Category | VRAM | Backend Library |
| --- | --- | --- | --- |
| SDXL Turbo | Image generation | ~5 GB | diffusers |
| Whisper Large v3 Turbo | Speech-to-text | ~1.5 GB | faster-whisper |
| Piper TTS | Text-to-speech | ~100 MB (CPU) | piper-tts |
| MusicGen Small | Music generation | ~2 GB | audiocraft |
| AudioGen Medium | Sound effects | ~4 GB | audiocraft |
| TripoSR | Image-to-3D mesh | ~4 GB | triposr |

VRAM Management

The RTX 4080 reports 16,376 MB of VRAM. After system overhead (~400 MB) and a small safety margin, the effective budget is roughly 15,500 MB. That means only one large model (~9 GB) can be loaded at a time, or two to three smaller models concurrently.

VRAM Profiles

Predefined model combinations optimized for specific workflows:

| Profile | Models | Total VRAM | Use Case |
| --- | --- | --- | --- |
| coding | Qwen 2.5 Coder 14B + Whisper | 10.5 GB | Code generation with voice input |
| creative | SDXL Turbo + MusicGen + Piper | 7.1 GB | Image, music, and speech generation |
| vision-chat | MiniCPM-V 8B | 5.0 GB | Image analysis and visual Q&A |
| 3d-pipeline | TripoSR + MiniCPM-V 8B | 9.0 GB | Image-to-3D with vision analysis |

VRAM Budget

The VRAM manager (planned for Phase 2) will do the following (see the sketch after this list):

  1. Track what models are loaded and their VRAM consumption via Redis
  2. Enforce a VRAM budget — reject requests if insufficient VRAM
  3. Auto-evict least-recently-used models when new models are requested
  4. Never evict a model that is actively streaming a response
  5. Support preemptive loading hints from frontends
  6. Coordinate between Ollama (Node.js) and ML Server (Python) via shared Redis state
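A minimal in-memory sketch of points 1 to 4, with a Map standing in for the planned Redis state that Node.js and Python backends would share:

// Sketch only: VRAM admission with LRU eviction. The Phase 2 implementation
// is planned to persist this state in Redis rather than process memory.
const BUDGET_MB = 15_500;

interface LoadedModel { id: string; vramMb: number; lastUsed: number; streaming: boolean; }

const loaded = new Map<string, LoadedModel>();

function usedMb(): number {
  let sum = 0;
  for (const m of loaded.values()) sum += m.vramMb;
  return sum;
}

export function requestLoad(id: string, vramMb: number): boolean {
  // Evict least-recently-used idle models until the new model fits.
  while (usedMb() + vramMb > BUDGET_MB) {
    const victims = [...loaded.values()]
      .filter((m) => !m.streaming)                // never evict an active stream
      .sort((a, b) => a.lastUsed - b.lastUsed);   // oldest first
    if (victims.length === 0) return false;       // nothing evictable: reject the request
    loaded.delete(victims[0].id);
  }
  loaded.set(id, { id, vramMb, lastUsed: Date.now(), streaming: false });
  return true;
}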

API Reference

Base URL: http://localhost:3000

Health

GET /api/health

Response:

{ "status": "ok" }

Chat

REST (SSE Streaming)

POST /api/chat
Content-Type: application/json

Request body:

{
  "model": "qwen2.5-coder:14b-instruct-q4_K_M",
  "messages": [
    { "role": "user", "content": "Write a C++ hello world" }
  ],
  "stream": true,
  "temperature": 0.7,
  "maxTokens": 2048
}

Streaming response (stream: true):

data: {"model":"qwen2.5-coder:14b-instruct-q4_K_M","message":{"role":"assistant","content":"```cpp"},"done":false}
data: {"model":"qwen2.5-coder:14b-instruct-q4_K_M","message":{"role":"assistant","content":"\n#include"},"done":false}
...
data: [DONE]

Non-streaming response (stream: false):

{
  "model": "qwen2.5-coder:14b-instruct-q4_K_M",
  "message": { "role": "assistant", "content": "```cpp\n#include <iostream>..." },
  "done": true
}
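For reference, a minimal TypeScript consumer of the streaming variant might look like this sketch. It simply follows the data:/[DONE] format shown above; error handling is omitted:

// Sketch: stream a chat completion from the gateway and print tokens as they arrive.
async function streamChat(prompt: string): Promise<void> {
  const res = await fetch("http://localhost:3000/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen2.5-coder:14b-instruct-q4_K_M",
      messages: [{ role: "user", content: prompt }],
      stream: true,
    }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // SSE events arrive as newline-separated "data: ..." lines.
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";            // keep any partial line for the next chunk
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const payload = line.slice(6);
      if (payload === "[DONE]") return;
      const chunk = JSON.parse(payload);
      process.stdout.write(chunk.message.content);
    }
  }
}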

WebSocket

GET /api/chat/ws

Send a JSON ChatRequest message. Receive streamed ChatStreamChunk messages, ending with { "done": true }.
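A browser-side sketch of that exchange, using the message shapes documented above (the model name and prompt are just examples):

// Sketch: send one ChatRequest over the WebSocket and log streamed chunks.
const ws = new WebSocket("ws://localhost:3000/api/chat/ws");

ws.onopen = () => {
  ws.send(JSON.stringify({
    model: "qwen2.5-coder:14b-instruct-q4_K_M",
    messages: [{ role: "user", content: "Write a C++ hello world" }],
  }));
};

ws.onmessage = (event) => {
  const chunk = JSON.parse(event.data);
  if (chunk.done) {
    ws.close();
    return;
  }
  console.log(chunk.message.content);
};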


Models API

List Models

GET /api/models

Response:

{
  "models": [
    {
      "id": "qwen2.5-coder:14b-instruct-q4_K_M",
      "name": "qwen2.5-coder:14b-instruct-q4_K_M",
      "category": "coding",
      "backend": "ollama",
      "vramMb": 8572,
      "parameterSize": "14.8B",
      "quantization": "Q4_K_M",
      "loaded": false
    }
  ]
}

List Running Models

GET /api/models/running

Response:

{ "running": ["qwen2.5-coder:14b-instruct-q4_K_M"] }

Pull a Model

POST /api/models/pull
Content-Type: application/json

Request body:

{ "name": "minicpm-v:8b-2.6-q4_K_M" }

Response (SSE stream):

data: {"status":"pulling manifest"}
data: {"status":"downloading","completed":1048576,"total":5368709120}
...
data: {"status":"success"}

Delete a Model

DELETE /api/models/:name

Response:

{ "success": true }

System

GET /api/system

Response:

{
  "gpu": {
    "name": "NVIDIA GeForce RTX 4080",
    "totalMemoryMb": 16376,
    "usedMemoryMb": 343,
    "temperature": 34,
    "utilization": 7
  },
  "vram": {
    "totalMb": 16376,
    "usedMb": 343,
    "freeMb": 16033,
    "loadedModels": []
  },
  "backends": [
    {
      "name": "ollama",
      "url": "http://localhost:11434",
      "healthy": true,
      "lastCheck": "2026-04-11T04:32:11.893Z"
    }
  ],
  "uptime": 42.5
}
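These GPU fields are read via nvidia-smi. One plausible way for the system route to collect them, sketched with Node's child_process (the gateway's actual parsing may differ):

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Sketch: query the first GPU and map the CSV output to GpuInfo-like fields.
async function readGpu() {
  const { stdout } = await run("nvidia-smi", [
    "--query-gpu=name,memory.total,memory.used,temperature.gpu,utilization.gpu",
    "--format=csv,noheader,nounits",
  ]);
  const [name, total, used, temp, util] = stdout.trim().split(", ");
  return {
    name,
    totalMemoryMb: Number(total),
    usedMemoryMb: Number(used),
    temperature: Number(temp),
    utilization: Number(util),
  };
}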

Configuration

models.yaml

config/models.yaml — Central registry of all models across all backends.

Each model entry contains:

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique identifier (Ollama model tag or custom ID) |
| name | string | Human-readable display name |
| category | string | One of: coding, chat, vision, image-gen, stt, tts, music, sound-fx, 3d, electronics, cad |
| backend | string | One of: ollama, diffusers, whisper, piper, audiocraft, triposr |
| vramMb | number | VRAM consumption in megabytes |
| description | string | Short description of capabilities |
| baseModel | string | (Profiles only) Underlying Ollama model ID |
| systemPrompt | string | (Profiles only) System prompt for specialization |

backends.yaml

config/backends.yaml — Backend service definitions.

Each backend entry contains:

| Field | Type | Description |
| --- | --- | --- |
| url | string | Base URL of the service |
| healthCheck | string | Path to the health check endpoint |
| description | string | Short description |
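A minimal health checker over these entries could look like the sketch below. It assumes backends.yaml maps backend names to objects with these fields and uses the yaml npm package for parsing; neither assumption is guaranteed by this repo:

import { readFileSync } from "node:fs";
import { parse } from "yaml";

// Sketch only: assumed shape of config/backends.yaml entries.
interface BackendEntry { url: string; healthCheck: string; description?: string; }

async function checkBackends(): Promise<void> {
  const backends = parse(readFileSync("config/backends.yaml", "utf8")) as Record<string, BackendEntry>;
  for (const [name, b] of Object.entries(backends)) {
    try {
      const res = await fetch(new URL(b.healthCheck, b.url));
      console.log(`${name}: ${res.ok ? "healthy" : `unhealthy (${res.status})`}`);
    } catch {
      console.log(`${name}: unreachable`);
    }
  }
}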

vram-profiles.yaml

config/vram-profiles.yaml — Predefined model combinations for specific workflows.

Each profile contains:

| Field | Type | Description |
| --- | --- | --- |
| description | string | What the profile is for |
| models | array | List of { id, vramMb } entries |
| totalVramMb | number | Sum of all model VRAM requirements |
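Because totalVramMb duplicates information in models, a small consistency check is easy to write. A sketch using the fields above and the ~15,500 MB budget from the VRAM Management section:

// Sketch: verify a profile's declared total matches its parts and fits the budget.
interface VramProfile {
  description: string;
  models: { id: string; vramMb: number }[];
  totalVramMb: number;
}

function validateProfile(profile: VramProfile, budgetMb = 15_500): string[] {
  const issues: string[] = [];
  const sum = profile.models.reduce((acc, m) => acc + m.vramMb, 0);
  if (sum !== profile.totalVramMb) {
    issues.push(`totalVramMb is ${profile.totalVramMb} but models sum to ${sum}`);
  }
  if (sum > budgetMb) {
    issues.push(`profile needs ${sum} MB, over the ${budgetMb} MB budget`);
  }
  return issues;
}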

Scripts

| Script | Description |
| --- | --- |
| scripts/setup.sh | First-time setup: checks that Node.js, pnpm, and Ollama are installed, runs pnpm install, and pulls the default coding model |
| scripts/dev.sh | Starts Ollama (if not already running), then runs pnpm run dev to launch all services via Turborepo |

Development

Monorepo Commands

# Install all dependencies
pnpm install

# Start all services in dev mode (gateway + web UI)
pnpm run dev

# Build all packages
pnpm run build

# Build a specific package
pnpm --filter @kong/shared build
pnpm --filter @kong/gateway build

# Run a specific app in dev mode
pnpm --filter @kong/gateway dev
pnpm --filter @kong/web dev

# Clean all build artifacts
pnpm run clean

Adding a New Model

  1. Pull the model via Ollama:
    ollama pull <model-name>
  2. Add an entry to config/models.yaml with the model's ID, category, VRAM size, and description.
  3. The model will automatically appear in the Web UI model selector and CLI models list output.

Adding a New API Route

  1. Create a new file in apps/gateway/src/routes/ (e.g., images.ts).
  2. Export an async function that takes a FastifyInstance and registers routes:
    import type { FastifyInstance } from "fastify";
    
    export async function imageRoutes(app: FastifyInstance) {
      app.post("/api/images/generate", async (request) => {
        // ...
      });
    }
  3. Register the route in apps/gateway/src/server.ts:
    import { imageRoutes } from "./routes/images.js";
    await app.register(imageRoutes);

Adding a New Web UI Page

  1. Create a new page component in apps/web/src/pages/ (e.g., ImageGen.tsx); a sketch follows this list.
  2. Add a nav entry to the NAV_ITEMS array in apps/web/src/App.tsx.
  3. Add the page route in the <main> section of App.tsx.
  4. Create a custom hook in apps/web/src/hooks/ if the page needs API data.
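To illustrate steps 1 and 4, here is a sketch of a hypothetical ImageGen page that calls the /api/images/generate route used as an example in the previous section. That route does not exist yet, and the existing pages may follow different conventions:

// apps/web/src/pages/ImageGen.tsx (hypothetical example)
import { useState } from "react";

export default function ImageGen() {
  const [prompt, setPrompt] = useState("");
  const [imageUrl, setImageUrl] = useState<string | null>(null);

  async function generate() {
    // Placeholder endpoint: /api/images/generate will not exist until the ML server lands.
    const res = await fetch("/api/images/generate", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt }),
    });
    const data = await res.json();
    setImageUrl(data.url);
  }

  return (
    <div>
      <input value={prompt} onChange={(e) => setPrompt(e.target.value)} />
      <button onClick={generate}>Generate</button>
      {imageUrl && <img src={imageUrl} alt={prompt} />}
    </div>
  );
}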

Tech Stack

Runtime & Build

| Tool | Version | Purpose |
| --- | --- | --- |
| Node.js | 24.x | JavaScript/TypeScript runtime |
| pnpm | 10.x | Package manager with workspace support |
| Turborepo | 2.x | Monorepo build orchestration with caching |
| TypeScript | 5.x | Type safety across all Node.js packages |
| tsx | 4.x | TypeScript execution for dev mode |

Backend

| Tool | Purpose |
| --- | --- |
| Fastify 5 | API gateway: high-performance HTTP + WebSocket |
| @fastify/cors | Cross-origin resource sharing |
| @fastify/websocket | WebSocket support for streaming |
| @fastify/static | Static file serving (production) |
| Ollama | LLM inference runtime (wraps llama.cpp) |

Frontend

| Tool | Purpose |
| --- | --- |
| React 19 | UI component framework |
| Vite 6 | Frontend build tool and dev server |
| Tailwind CSS 4 | Utility-first CSS framework |
| Lucide React | Icon library |

CLI

| Tool | Purpose |
| --- | --- |
| Commander.js 13 | CLI argument parsing and command structure |

Planned

| Tool | Phase | Purpose |
| --- | --- | --- |
| Express | 2 | Workflow orchestrator, job queue |
| Redis | 2 | VRAM state, job queue backend (BullMQ) |
| FastAPI (Python) | 3 | ML inference server |
| diffusers | 3 | Image generation (Stable Diffusion) |
| faster-whisper | 3 | Speech-to-text |
| piper-tts | 3 | Text-to-speech |
| audiocraft | 4 | Music and sound effect generation |
| JUCE 8 | 4 | Audio application framework |
| Tauri 2 | 5 | Desktop application shell |
| SDL2 | 5 | Window management, input, audio capture |
| BGFX | 5 | Cross-platform GPU rendering |
| Dear ImGui | 5 | Immediate-mode GUI |
| TripoSR | 6 | Image-to-3D mesh generation |

Roadmap

| Phase | Focus | Status |
| --- | --- | --- |
| 1. Foundation | Ollama + Fastify gateway + React web UI + CLI | Complete |
| 2. Model Management + VRAM | Redis, VRAM manager, Express orchestrator, BullMQ | Planned |
| 3. Python ML Server | Image generation, speech (Whisper + Piper), VRAM coordination | Planned |
| 4. Audio & Music | MusicGen, AudioGen, JUCE 8 audio application | Planned |
| 5. Desktop Apps | Tauri desktop app, C++ SDL+BGFX+Dear ImGui native app | Planned |
| 6. 3D & Advanced | TripoSR, workflow engine, CAD/electronics profiles | Planned |

See docs/2026-04-10-STATUS-REPORT.md for the detailed Phase 1 completion report.


Documentation

| Document | Description |
| --- | --- |
| Status Report (2026-04-10) | Phase 1 completion report: what was built, tested, and verified |
| AI Research Possibilities | Comprehensive survey of 18 AI research domains with models, capabilities, and open frontiers |

License

MIT
