llama.cpp-utils

A lightweight toolkit for wrapping, managing, and automating llama.cpp on Linux x86_64 systems.

This project uses Vulkan inference for hardware-accelerated LLM execution across a broad range of GPUs (AMD Radeon, NVIDIA GeForce/RTX, Intel Arc) without relying on vendor-specific SDKs such as CUDA or ROCm. It is configured for GPUs with 12 GB or more of VRAM and uses 8-bit KV caching, single-model VRAM management, and layer offloading to run modern quantized models within that budget.

Features

Broad GPU Compatibility via Vulkan: Uses Vulkan compute shaders to provide hardware acceleration across AMD, NVIDIA, and Intel GPUs on Linux x86_64.
Optimized for 12GB+ VRAM: Pre-configured defaults (such as 8-bit KV cache quantization q8_0, tuned batch sizes, and single-model concurrency limits) maximize model context and layer offloading within 12GB+ GPU VRAM budgets.
Automated Installation & Updates: Automatically fetches and verifies the latest pre-compiled vulkan-x64 release binaries of llama.cpp.
MCP & WebUI Integration: Launches llama serve pre-configured with --webui-mcp-proxy for seamless integration with Model Context Protocol (MCP) clients and web interfaces.
VRAM & API Utilities: Includes dedicated utilities to test model throughput (llama-ping) and unload inactive models from GPU VRAM (llama-free).

Components

llama-installer: Automates the installation and updating of llama.cpp binaries using official pre-compiled vulkan-x64 builds from GitHub.
llama-launcher: Launches the llama serve daemon using model presets configured in models.ini, enforcing a single loaded model (--models-max 1) to preserve VRAM.
models.ini: The central configuration file for model presets, defining generation parameters, GPU layer offloading, Flash Attention, and memory parameters.
llama-ping: CLI utility to query the OpenAI-compatible /v1/chat/completions API and output response metrics (tokens/second).
llama-free: Utility script to query active models and issue unload requests to free VRAM.

Hardware & System Requirements

Operating System: Linux x86_64 (AMD64)
GPU: Any graphics card with Vulkan 1.2+ support (AMD Radeon, NVIDIA GeForce/RTX, Intel Arc)
VRAM: 12 GB or more of VRAM recommended
Software Dependencies: bash, curl, jq, tar, and working Vulkan drivers (vulkan-tools / Mesa RADV / NVIDIA proprietary drivers)
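The dependency list above can be checked before running the installer. The following is a minimal sketch, not part of the toolkit: missing_deps is a hypothetical helper, and vulkaninfo is assumed to be the binary that vulkan-tools provides on your distribution.

```shell
#!/usr/bin/env bash
# Hypothetical preflight check; not shipped with llama.cpp-utils.

# Print each command from the argument list that is not found on PATH,
# one per line. Prints nothing when every command is available.
missing_deps() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# vulkaninfo is assumed to come from the vulkan-tools package.
missing=$(missing_deps bash curl jq tar vulkaninfo)
if [ -n "$missing" ]; then
  printf 'Missing dependencies:\n%s\n' "$missing" >&2
else
  echo "All dependencies found."
fi
```

Finding vulkaninfo on PATH only shows that the tools are installed; it does not confirm that a working Vulkan driver is present for your GPU.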

Quick Start

1. Installation

Install or update llama.cpp using the automated Vulkan installer:

./llama-installer

After installation, update your shell environment if ~/.local/bin was newly added to your PATH:

source ~/.bashrc
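If you are unsure whether reloading the shell is needed, you can test whether ~/.local/bin is already an entry in PATH. This is a small sketch; path_contains is a hypothetical helper name, not something the installer provides.

```shell
# Succeed if directory $1 is an exact entry in the colon-separated list $2.
# Wrapping both sides in colons avoids matching a prefix such as /a in /a/b.
path_contains() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if path_contains "$HOME/.local/bin" "$PATH"; then
  echo "~/.local/bin is on PATH"
else
  echo "~/.local/bin is not on PATH; run: source ~/.bashrc"
fi
```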

2. Start the Server

Launch the llama.cpp server with Vulkan acceleration and MCP proxy support:

./llama-launcher

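The server listens on http://0.0.0.0:8080, that is, on port 8080 of every network interface. From the same machine it is reachable at http://localhost:8080.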
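Scripts that start the server and then talk to it may want to wait until it accepts requests. The sketch below polls a URL with curl; wait_for_server is a hypothetical helper, and the /health path in the usage note is an assumption based on the health endpoint of the upstream llama.cpp server.

```shell
# Poll a URL until it answers with a successful HTTP status,
# or give up after a number of attempts.
# Usage: wait_for_server URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_server() {
  local url=$1 attempts=${2:-30} delay=${3:-1} i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error statuses (for example 503).
    if curl -fsS --max-time 2 -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_server http://localhost:8080/health 60 1 && echo ready` would wait up to roughly a minute before reporting failure.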

3. Test a Model

Send a test prompt to an alias configured in models.ini:

./llama-ping gemma-4-26B-A4B "Explain quantum computing in simple terms"
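For readers who want to see roughly what such a request involves, here is a hedged sketch of a throughput probe against the /v1/chat/completions endpoint. It is not the llama-ping source: the function names and the LLAMA_URL variable are hypothetical, and it assumes the response carries the OpenAI-style usage.completion_tokens field and that GNU date is available.

```shell
# Build an OpenAI-style chat request body; jq handles JSON escaping.
chat_payload() {
  jq -n --arg model "$1" --arg prompt "$2" \
    '{model: $model, messages: [{role: "user", content: $prompt}]}'
}

# Tokens per second from a token count and elapsed seconds.
tokens_per_second() {
  awk -v n="$1" -v s="$2" \
    'BEGIN { if (s > 0) printf "%.2f\n", n / s; else print "n/a" }'
}

# Hypothetical end-to-end probe: send one prompt and report the rate.
# Usage: ping_model ALIAS PROMPT
ping_model() {
  local url=${LLAMA_URL:-http://localhost:8080} start end tokens
  start=$(date +%s.%N)   # %N (nanoseconds) is a GNU date extension
  tokens=$(chat_payload "$1" "$2" |
    curl -fsS "$url/v1/chat/completions" \
      -H 'Content-Type: application/json' -d @- |
    jq -r '.usage.completion_tokens')
  end=$(date +%s.%N)
  tokens_per_second "$tokens" \
    "$(awk -v a="$start" -v b="$end" 'BEGIN { print b - a }')"
}
```

Because the elapsed time is measured on the client, the figure includes prompt processing and any model load time, so it is an end-to-end rate rather than pure generation speed.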

4. Unload Models from VRAM

To free GPU memory without stopping the server daemon:

./llama-free

Configuration

Model aliases, Hugging Face repositories, offload layers, and VRAM-saving options are configured in models.ini.

Global defaults ([*]) set baseline VRAM-friendly values:

cache-type-k = q8_0 and cache-type-v = q8_0 (quantized KV cache for memory savings)
ctx-size = 32768
ubatch-size = 32 and batch-size = 256
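To illustrate the layout, the sketch below writes a sample preset file and reads values back from it. Only the [*] values are taken from this README; the [example-model] section, its override, and the ini_get helper are hypothetical, and the sketch makes no claim about how llama-launcher itself resolves per-model settings.

```shell
# Write an illustrative preset file to a temporary location.
# The [*] values match the defaults listed above; [example-model] is made up.
ini=$(mktemp)
cat > "$ini" <<'EOF'
[*]
cache-type-k = q8_0
cache-type-v = q8_0
ctx-size = 32768
ubatch-size = 32
batch-size = 256

[example-model]
ctx-size = 16384
EOF

# Print the value of KEY in SECTION of an INI file (nothing if absent).
# Usage: ini_get FILE SECTION KEY
ini_get() {
  awk -v sec="[$2]" -v key="$3" '
    /^\[/ { in_sec = ($0 == sec); next }   # track the current section header
    in_sec {
      split($0, kv, "=")
      k = kv[1]; gsub(/^[ \t]+|[ \t]+$/, "", k)
      if (k == key) {
        v = substr($0, index($0, "=") + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", v)
        print v
        exit
      }
    }
  ' "$1"
}

ini_get "$ini" '*' ctx-size             # prints 32768
ini_get "$ini" example-model ctx-size   # prints 16384
```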
