A lightweight wrapper, management, and automation toolkit for llama.cpp designed for Linux x86_64 systems.
This project uses Vulkan inference for hardware-accelerated LLM execution across a broad range of GPUs (AMD Radeon, NVIDIA GeForce/RTX, Intel Arc) without relying on vendor-specific SDKs (such as CUDA or ROCm). It is configured and tuned for GPUs with 12 GB or more of VRAM, using 8-bit KV caching, single-model VRAM management, and efficient layer offloading to run modern quantized models smoothly.
- Broad GPU Compatibility via Vulkan: Uses Vulkan compute shaders to provide hardware acceleration across AMD, NVIDIA, and Intel GPUs on Linux x86_64.
- Optimized for 12GB+ VRAM: Pre-configured defaults (such as 8-bit KV cache quantization `q8_0`, tuned batch sizes, and single-model concurrency limits) maximize model context and layer offloading within 12GB+ GPU VRAM budgets.
- Automated Installation & Updates: Automatically fetches and verifies the latest pre-compiled `vulkan-x64` release binaries of `llama.cpp`.
- MCP & WebUI Integration: Launches `llama serve` pre-configured with `--webui-mcp-proxy` for seamless integration with Model Context Protocol (MCP) clients and web interfaces.
- VRAM & API Utilities: Includes dedicated utilities to test model throughput (`llama-ping`) and unload inactive models from GPU VRAM (`llama-free`).
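
The automated installation feature above depends on finding the right release asset. The following is a minimal sketch of that selection step, not the installer's actual code. It assumes the release metadata follows the GitHub REST API "latest release" response shape (an `assets` array with `name` and `browser_download_url` fields) and that the wanted asset name contains the substring `vulkan-x64`. The function name `select_vulkan_asset` is hypothetical, and the verification step is not shown.

```bash
# Sketch only: print the download URL of the first release asset whose name
# contains "vulkan-x64". Reads release JSON on stdin.
# Assumption: input has the GitHub "latest release" shape (.assets[].name and
# .assets[].browser_download_url). Prints nothing if no asset matches.
select_vulkan_asset() {
  jq -r '[.assets[]
          | select(.name | contains("vulkan-x64"))
          | .browser_download_url]
         | first // empty'
}

# Hypothetical usage (performs a network request, so it is left commented out):
# curl -fsSL https://api.github.com/repos/ggml-org/llama.cpp/releases/latest \
#   | select_vulkan_asset
```

Matching on a substring rather than a full file name keeps the sketch independent of the exact asset naming scheme, which may change between releases.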
- `llama-installer`: Automates the installation and updating of `llama.cpp` binaries using official pre-compiled `vulkan-x64` builds from GitHub.
- `llama-launcher`: Launches the `llama serve` daemon using model presets configured in `models.ini`, enforcing a single loaded model (`--models-max 1`) to preserve VRAM.
- `models.ini`: The central configuration file for model presets, defining generation parameters, GPU layer offloading, Flash Attention, and memory parameters.
- `llama-ping`: CLI utility to query the OpenAI-compatible `/v1/chat/completions` API and output response metrics (tokens/second).
- `llama-free`: Utility script to query active models and issue unload requests to free VRAM.
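
To illustrate what a throughput check like `llama-ping` involves, here is a rough sketch that is not the tool's actual source. It assumes the server is reachable at `http://localhost:8080` and that the response carries the OpenAI-style `usage.completion_tokens` field. The function names `tokens_per_second` and `ping_model` are hypothetical, and the timing relies on GNU `date` (`%N`).

```bash
# tokens_per_second TOKENS SECONDS
# Prints throughput with two decimals; prints 0.00 for a non-positive duration.
tokens_per_second() {
  awk -v t="$1" -v s="$2" \
    'BEGIN { if (s <= 0) { print "0.00" } else { printf "%.2f\n", t / s } }'
}

# ping_model ALIAS PROMPT [BASE_URL]
# Sends one chat completion request and reports wall-clock tokens/second.
# Assumptions: BASE_URL defaults to http://localhost:8080 and the reply
# contains usage.completion_tokens (OpenAI-style response).
ping_model() {
  base_url="${3:-http://localhost:8080}"
  body=$(jq -n --arg m "$1" --arg p "$2" \
    '{model: $m, messages: [{role: "user", content: $p}]}')
  start=$(date +%s.%N)
  reply=$(curl -fsS "$base_url/v1/chat/completions" \
    -H 'Content-Type: application/json' -d "$body") || return 1
  end=$(date +%s.%N)
  tokens=$(printf '%s' "$reply" | jq -r '.usage.completion_tokens // 0')
  elapsed=$(awk -v a="$start" -v b="$end" 'BEGIN { print b - a }')
  echo "$(tokens_per_second "$tokens" "$elapsed") tokens/second"
}
```

A call such as `ping_model gemma-4-26B-A4B "Explain quantum computing in simple terms"` would print a single metrics line. Because the sketch measures wall-clock time for the whole request, the figure includes prompt processing and network overhead, so it understates pure generation speed.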
- Operating System: Linux x86_64 (AMD64)
- GPU: Any graphics card with Vulkan 1.2+ support (AMD Radeon, NVIDIA GeForce/RTX, Intel Arc)
- VRAM: 12 GB or more recommended
- Software Dependencies: `bash`, `curl`, `jq`, `tar`, and working Vulkan drivers (`vulkan-tools` / Mesa RADV / NVIDIA proprietary drivers)
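
The requirements above can be checked before installing. The sketch below is one way to do that and is not part of the toolkit. The function names `missing_commands` and `check_requirements` are hypothetical, and `vulkaninfo` (shipped in the `vulkan-tools` package) is assumed to be a reasonable proxy for a usable Vulkan setup.

```bash
# missing_commands CMD...
# Prints each command that is not found on PATH, one per line.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Hypothetical pre-flight check for the dependencies listed above.
# Returns 1 and reports on stderr if anything is missing.
check_requirements() {
  missing=$(missing_commands bash curl jq tar vulkaninfo)
  if [ -n "$missing" ]; then
    printf 'Missing required commands:\n%s\n' "$missing" >&2
    return 1
  fi
  echo "All required commands found."
}
```

Running `check_requirements` only confirms that the commands exist. It does not prove that a GPU is visible to Vulkan, so inspecting the output of `vulkaninfo` for the expected device is still worthwhile.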
Install or update llama.cpp using the automated Vulkan installer:

    ./llama-installer

After installation, update your shell environment if `~/.local/bin` was newly added to your `PATH`:

    source ~/.bashrc

Launch the llama.cpp server with Vulkan acceleration and MCP proxy support:

    ./llama-launcher

The server listens on http://0.0.0.0:8080.

Send a test prompt to an alias configured in `models.ini`:

    ./llama-ping gemma-4-26B-A4B "Explain quantum computing in simple terms"

To free GPU memory without stopping the server daemon:

    ./llama-free

Model aliases, Hugging Face repositories, offload layers, and VRAM-saving options are configured in `models.ini`.
Global defaults ([*]) set baseline VRAM-friendly values:
- `cache-type-k = q8_0` & `cache-type-v = q8_0` (quantized KV cache for memory savings)
- `ctx-size = 32768`
- `ubatch-size = 32` & `batch-size = 256`
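
To give a sense of the memory savings from the quantized KV cache, the sketch below estimates cache size for a standard attention layout. It is illustrative arithmetic only. The function name `kv_cache_bytes` and the model dimensions in the usage note are hypothetical, and models with sliding-window or other non-standard attention will differ. The per-value sizes are 2 bytes for `f16` and 34 bytes per block of 32 values for `q8_0`.

```bash
# kv_cache_bytes LAYERS CTX KV_HEADS HEAD_DIM TYPE
# Estimates the combined K and V cache size in bytes.
#   f16  : 2 bytes per value
#   q8_0 : 34 bytes per block of 32 values (32 int8 values + one 16-bit scale)
kv_cache_bytes() {
  # Factor 2: one K tensor and one V tensor per layer.
  elements=$(( 2 * $1 * $2 * $3 * $4 ))
  case "$5" in
    f16)  echo $(( elements * 2 )) ;;
    q8_0) echo $(( elements * 34 / 32 )) ;;
    *)    echo "unknown cache type: $5" >&2; return 1 ;;
  esac
}
```

For a hypothetical model with 32 layers, 8 KV heads, and a head dimension of 128 at `ctx-size = 32768`, `kv_cache_bytes 32 32768 8 128 f16` gives 4294967296 bytes (4 GiB), while `kv_cache_bytes 32 32768 8 128 q8_0` gives 2281701376 bytes (2.125 GiB), a little over half the `f16` footprint.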