8 changes: 8 additions & 0 deletions README.md
@@ -145,6 +145,14 @@ cd ai-stack

The installer auto-detects your GPU (NVIDIA, Intel Arc, or CPU-only), creates `.env`, generates your API key, and starts the stack. Open `http://localhost:40117/` (Shepherd dashboard) when it completes.

**Older Intel iGPU?** (Iris Pro / Iris / UHD / Gen 9 etc., pre-Arc.) `install.sh` falls back to CPU-only for these — ipex-llm doesn't support pre-11th-Gen iGPUs. Use the Vulkan path instead:

```bash
./scripts/install-vulkan-ollama.sh # native Ollama + Vulkan/Mesa ANV
```

See [`docs/hardware/intel-igpu-vulkan.md`](docs/hardware/intel-igpu-vulkan.md) for the procedure and supported hardware list. The rest of ai-stack (Olla, LiteLLM, Router, Shepherd) still runs via `docker compose` — it just connects to the native Ollama on `localhost:11434`.

To add cloud models (Claude, Gemini, OpenCode Zen) after install:
```bash
echo 'ANTHROPIC_API_KEY=sk-ant-...' >> .env
213 changes: 213 additions & 0 deletions docs/hardware/intel-igpu-vulkan.md
@@ -0,0 +1,213 @@
# Older Intel iGPU — Vulkan via Mesa ANV (native Ollama)

For Intel iGPUs that **predate the Arc Alchemist generation** (Iris Pro 580 / Iris / UHD / Skylake-era Gen 9 etc.), the
`docker-compose.arc.yml` overlay (ipex-llm + SYCL) does **not** work — the ipex-llm runtime requires 11th-Gen Core+
class hardware. Use the **Vulkan path via Mesa ANV** with a **native (non-Docker) Ollama install**.

This was validated 2026-05-19 on:

- **nuk1** — Intel NUC6i7KYB ("Skull Canyon"), Iris Pro 580 (Gen 9 GT4e, 72 EUs, 128 MB eDRAM), 32 GB RAM
- **lab1, lab2, lab3** — similar Intel NUC hardware class

After this procedure: `ollama ps` shows `100% GPU`, default context jumps from 4096 → 32768, and the GPU is visible
as `library=Vulkan name=Vulkan0 description="Intel(R) Iris(R) Pro Graphics 580" total="23.4 GiB"` (or similar).

## Why this path (not arc.yml)

| Hardware class | Path | Why |
|---|---|---|
| Intel Arc (Alchemist / Battlemage, 12th-Gen+) | `docker-compose.arc.yml` | ipex-llm/SYCL stack tested and supported on this hardware |
| Intel Iris Xe (Gen 12, 11th-Gen Core+) | `docker-compose.arc.yml` *or* this path | Either stack works; Vulkan is broader-supported, ipex-llm is more optimized when it works |
| Intel Iris Pro / Iris / UHD / Gen 9 or older | **this path** (Vulkan) | ipex-llm/SYCL doesn't support pre-11th-Gen iGPUs; Vulkan via Mesa ANV does |
| NVIDIA | `docker-compose.nvidia.yml` | n/a |
| CPU-only | base `docker-compose.yml` | fallback |

The rest of ai-stack (Olla, LiteLLM, Smart Router, Shepherd) runs in Docker as usual. Only the Ollama service runs natively,
listening on `localhost:11434`, where the other services connect to it.
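
The table above can be read as a dispatch on the GPU description. A minimal sketch, assuming `lspci` description strings carry the vendor and product names; `suggest_path` and its substring matches are hypothetical illustrations, not ai-stack's actual detection logic:

```bash
# Hypothetical helper: map an lspci GPU description to the path suggested by the table.
# The substring patterns are assumptions for illustration only.
suggest_path() {
    local desc=${1,,}   # lowercase for case-insensitive matching (bash 4+)
    case "$desc" in
        *nvidia*)           echo "docker-compose.nvidia.yml" ;;
        *intel*arc*)        echo "docker-compose.arc.yml" ;;
        *intel*"iris xe"*)  echo "docker-compose.arc.yml or Vulkan" ;;
        *intel*)            echo "Vulkan (native Ollama)" ;;
        *)                  echo "docker-compose.yml (CPU-only)" ;;
    esac
}

# Usage on the target host:
#   suggest_path "$(lspci | grep -iE 'vga|3d|display' | head -1)"
```

Name matching is only a first guess. The journal check in step 7 remains the authoritative signal.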

## Prerequisites

- Ubuntu 24.04 LTS (Noble) or compatible
- `i915` kernel driver active (`lspci -k -s 00:02.0 | grep "Kernel driver in use: i915"`)
- `/dev/dri/renderD128` present (`ls /dev/dri/`)
- User in the `render` and `video` groups (the Ollama installer creates the `ollama` service user and adds it to both groups; step 3 adds your own login user)
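
The render-node prerequisite can be checked without assuming the node is numbered `128`. A minimal sketch; `render_nodes` is a hypothetical helper, and the directory argument exists only so it can be pointed somewhere other than `/dev/dri`:

```bash
# List DRM render nodes under a directory (default /dev/dri).
# Returns non-zero when no renderD* entry exists.
render_nodes() {
    local dir=${1:-/dev/dri} node found=1
    for node in "$dir"/renderD*; do
        [ -e "$node" ] || continue   # an unmatched glob stays literal, so skip it
        echo "$node"
        found=0
    done
    return "$found"
}

# Usage: render_nodes || echo "no render node; is i915 loaded?"
```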

## Procedure

The procedure is automated by `scripts/install-vulkan-ollama.sh` — see that script for the full sequence. Manual steps below
are for operators who want to understand or adapt each step.

### 1. Verify GPU + driver

```bash
lspci -nn | grep -i vga
# Expected: Intel ... Iris ... [8086:...]

sudo lspci -v -s 00:02.0 | grep "Kernel driver"
# Expected: Kernel driver in use: i915

ls /dev/dri/
# Expected: card0 or card1, renderD128
```
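
For scripted checks, the driver name can be pulled out of the `lspci -k` output instead of read by eye. A small sketch; `driver_in_use` is a hypothetical helper name:

```bash
# Print the kernel driver bound to a device, given `lspci -k` output on stdin.
driver_in_use() {
    awk -F': ' '/Kernel driver in use/ { print $2; exit }'
}

# Usage on the target host:
#   [ "$(lspci -k -s 00:02.0 | driver_in_use)" = "i915" ] || echo "i915 is not bound"
```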

### 2. Stop any conflicting Ollama (Docker or older native)

If you previously ran Ollama via `docker-compose.arc.yml` or any other Docker container, it holds port 11434 and
must be stopped:

```bash
# If running via ai-stack:
sudo systemctl stop ai-stack.service 2>/dev/null || true

# Or stop the Docker Ollama container directly:
docker ps -q --filter "publish=11434" | xargs -r docker stop
```
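
To confirm the port is actually free before installing, the listening sockets can be inspected. A minimal sketch, assuming `ss` from iproute2 is available; `port_in_use` is a hypothetical helper that reads `ss -H -ltn` output on stdin:

```bash
# Succeed if any listening TCP socket in `ss -H -ltn` output (stdin) uses the given port.
port_in_use() {
    local port=$1
    awk -v p=":$port" '
        # Column 4 is "Local Address:Port"; compare its tail with ":<port>".
        length($4) >= length(p) && substr($4, length($4) - length(p) + 1) == p { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Usage: ss -H -ltn | port_in_use 11434 && echo "port 11434 is still held"
```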

### 3. Add yourself to render/video groups

```bash
sudo usermod -aG render,video "$USER"
# Log out and back in, OR start a subshell that already has the new group:
newgrp render
```
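
Group changes only show up in sessions started after `usermod`, so it is worth checking what the current shell actually has. A minimal sketch; `missing_groups` is a hypothetical helper that compares a group list (as printed by `id -nG`) against the required names:

```bash
# Print each required group that is absent from a space-separated group list.
missing_groups() {
    local have=" $1 " group
    shift
    for group in "$@"; do
        [[ "$have" == *" $group "* ]] || echo "$group"
    done
}

# Usage: missing_groups "$(id -nG)" render video
```

No output means the current session already has both groups.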

### 4. Install Ollama (native, via official installer)

```bash
# Optional: clean previous broken install
sudo rm -f /usr/local/bin/ollama
sudo systemctl disable ollama 2>/dev/null || true
sudo rm -f /etc/systemd/system/ollama.service

# Fresh install
curl -fsSL https://ollama.com/install.sh | sh
```

The installer creates the `ollama` user, adds it to `render` + `video` groups, and starts the systemd service.

### 5. Add the Vulkan systemd drop-in

By default the installer warns "No NVIDIA/AMD GPU detected" and falls back to CPU-only. Override with a drop-in:

```bash
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_VULKAN=1"
Environment="RUSTICL_ENABLE=iris"
Environment="GPU_MAX_ALLOC_PERCENT=100"
Environment="OLLAMA_GPU_OVERHEAD=0"
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama
```

### 6. Fix model directory permissions

If models were previously downloaded by a different user (e.g. a Docker-based setup with a `root`-owned bind-mount),
fix ownership so the `ollama` user can read them:

```bash
sudo chown -R ollama:ollama /usr/share/ollama/.ollama/
```
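
To see whether the `chown` is needed at all, look for anything under the model directory with a different owner. A minimal sketch, assuming GNU `find` (for `-quit`); `first_foreign_path` is a hypothetical helper:

```bash
# Print the first path under DIR that is not owned by OWNER.
# No output means ownership is already consistent.
first_foreign_path() {
    local owner=$1 dir=$2
    find "$dir" ! -user "$owner" -print -quit
}

# Usage (may need sudo to traverse the directory):
#   first_foreign_path ollama /usr/share/ollama/.ollama
```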

### 7. Verify GPU is engaged

```bash
sudo journalctl -u ollama --no-pager -n 50 | grep -i "inference compute"
```

You want to see a line like:

```
inference compute id=... library=Vulkan name=Vulkan0
description="Intel(R) Iris(R) Pro Graphics 580 (SKL GT4)"
type=iGPU total="23.4 GiB" available="13.8 GiB"
```

And the default context should have jumped to 32768 (vs 4096 in CPU mode):

```
vram-based default context total_vram="23.4 GiB" default_num_ctx=32768
```
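
Both journal entries are `key=value` pairs, so individual fields can be extracted for scripted checks. A minimal sketch, assuming each entry arrives as a single line (the samples above are wrapped for display); `journal_field` is a hypothetical helper:

```bash
# Print the value of KEY from the first key=value (or key="quoted value") pair on stdin.
journal_field() {
    local key=$1
    grep -oE "(^|[[:space:]])${key}=(\"[^\"]*\"|[^[:space:]]+)" \
        | head -n1 \
        | sed -E "s/^[[:space:]]*${key}=//; s/^\"//; s/\"\$//"
}

# Usage:
#   sudo journalctl -u ollama --no-pager -n 50 | grep -i "inference compute" | journal_field library
```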

### 8. Test inference

```bash
ollama pull qwen2.5:1.5b
ollama run qwen2.5:1.5b "Hello"
ollama ps
```

Expected output of `ollama ps`:

```
NAME            ID     SIZE      PROCESSOR    CONTEXT
qwen2.5:1.5b    ...    2.8 GB    100% GPU     32768
```

`100% GPU` is the success indicator. Anything less means the model partially fell back to CPU — usually because
the model is larger than available VRAM.
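
The same success check can be scripted. A minimal sketch, assuming the model name is the first column of `ollama ps` and that a fully offloaded model shows the literal text `100% GPU`; `fully_on_gpu` is a hypothetical helper:

```bash
# Succeed if MODEL appears in `ollama ps` output (stdin) with "100% GPU" in its row.
fully_on_gpu() {
    awk -v model="$1" '
        $1 == model && /100% GPU/ { ok = 1 }
        END { exit ok ? 0 : 1 }
    '
}

# Usage: ollama ps | fully_on_gpu qwen2.5:1.5b || echo "model is not fully on the GPU"
```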

## ai-stack integration

The Olla / LiteLLM / Smart Router / Shepherd services in `docker-compose.yml` connect to Ollama at `localhost:11434` —
whether Ollama runs in Docker or natively, they don't care. After this procedure:

```bash
# Start the rest of ai-stack (Ollama already running natively):
docker compose up -d olla litellm router shepherd
```

Alternatively, run ai-stack's `start.sh` with `GPU_TYPE=vulkan` set in `.env`. `GPU_TYPE=cpu` works too, since the compose
file doesn't need to know about the native Vulkan Ollama.
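
Because the native Ollama may take a few seconds to start, it can help to wait for its API before bringing up the dependent containers. A minimal sketch; `wait_for` is a hypothetical retry helper, and the attempt count and delay are arbitrary choices:

```bash
# Retry a command until it succeeds: up to ATTEMPTS tries, DELAY seconds apart.
wait_for() {
    local attempts=$1 delay=$2 i
    shift 2
    for ((i = 1; i <= attempts; i++)); do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        if (( i < attempts )); then
            sleep "$delay"
        fi
    done
    return 1
}

# Usage:
#   wait_for 10 1 curl -fsS http://127.0.0.1:11434/api/version \
#       && docker compose up -d olla litellm router shepherd
```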

## Troubleshooting

### `journalctl -u ollama` shows CPU-only / "No GPU detected"

The drop-in didn't take effect. Verify:

```bash
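sudo systemctl show ollama -p Environment | tr ' ' '\n' | grep -E 'OLLAMA_|RUSTICL_|GPU_MAX_'
# Expected, one per line: OLLAMA_VULKAN=1, RUSTICL_ENABLE=iris, GPU_MAX_ALLOC_PERCENT=100, OLLAMA_GPU_OVERHEAD=0
# (whichever entry comes first in the unit also carries an "Environment=" prefix)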
```

If missing, the drop-in file isn't being read. Confirm `/etc/systemd/system/ollama.service.d/override.conf` exists and
re-run `sudo systemctl daemon-reload && sudo systemctl restart ollama`.
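
The check can also report exactly which variables are absent. A minimal sketch, assuming the single-line `Environment=...` format printed by `systemctl show -p Environment` and values that contain no spaces; `missing_env_vars` is a hypothetical helper:

```bash
# Print every expected KEY=VALUE that is absent from a `systemctl show -p Environment` line.
missing_env_vars() {
    local shown=$1 want
    shift
    for want in "$@"; do
        # Drop the leading "Environment=", split on spaces, then require an exact match.
        if ! printf '%s\n' "${shown#Environment=}" | tr ' ' '\n' | grep -qxF -- "$want"; then
            printf '%s\n' "$want"
        fi
    done
}

# Usage:
#   missing_env_vars "$(systemctl show ollama -p Environment)" \
#       OLLAMA_VULKAN=1 RUSTICL_ENABLE=iris GPU_MAX_ALLOC_PERCENT=100 OLLAMA_GPU_OVERHEAD=0
```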

### `ollama ps` shows `100% CPU` even though Vulkan is engaged

Model size exceeds VRAM. Either:
- Use a smaller quantization
- Use a smaller model
- Check `available` VRAM in the journal line: with 14 GiB free, a model fits fully only if its weights plus KV cache stay under 14 GiB
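
A rough fit check before pulling a large model, as a sketch only: `fits_in_vram` is a hypothetical helper, and the 1 GiB default headroom for KV cache and runtime buffers is an arbitrary assumption, not a measured figure. Real usage grows with context length.

```bash
# Succeed if MODEL_GIB plus HEADROOM_GIB (default 1) fits within AVAILABLE_GIB.
fits_in_vram() {
    local model_gib=$1 available_gib=$2 headroom_gib=${3:-1}
    awk -v m="$model_gib" -v a="$available_gib" -v h="$headroom_gib" \
        'BEGIN { exit (m + h <= a) ? 0 : 1 }'
}

# Usage: fits_in_vram 2.8 13.8 && echo "should fit fully on the GPU"
```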

### `vulkaninfo` hangs

Common on headless systems (no display). Either skip the check or run from a second SSH session.
The `journalctl` line is the authoritative signal — if it says `library=Vulkan`, Vulkan is working.

### Permission denied on `/dev/dri/renderD128`

User isn't in the `render` group. Re-check step 3 and log out / back in (or `newgrp render` in the current shell).

## Hardware-class boundary

This path is for Intel iGPUs **without** ipex-llm/SYCL support — primarily Gen 9 (Skylake era) and older.
For 11th-Gen Core+ / Iris Xe / Arc you can also use this path, but `docker-compose.arc.yml` may give better
performance on supported hardware. Run benchmarks on both paths if you're not sure.
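
For a like-for-like comparison between the two paths, generation throughput can be computed from the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama's `/api/generate` returns in its final response. A minimal sketch; `tokens_per_second` is a hypothetical helper, and the `jq` usage line assumes `jq` is installed:

```bash
# Tokens per second from eval_count (tokens) and eval_duration (nanoseconds).
tokens_per_second() {
    awk -v n="$1" -v ns="$2" 'BEGIN {
        if (ns <= 0) exit 1          # guard against division by zero
        printf "%.2f\n", n / (ns / 1e9)
    }'
}

# Usage (assumes jq):
#   curl -s http://localhost:11434/api/generate \
#        -d '{"model":"qwen2.5:1.5b","prompt":"Hello","stream":false}' \
#     | jq -r '"\(.eval_count) \(.eval_duration)"' \
#     | { read -r count dur; tokens_per_second "$count" "$dur"; }
```

Run the same model, prompt, and context size on both paths, and discard the first run so model load time doesn't skew the comparison.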

Intel documents the ipex-llm hardware floor as follows:
**iGPUs of 11th Gen Core and newer have been tested; older iGPU works but with poor performance**
([source](https://github.com/intel/ipex-llm); note: project archived 2026-01-28). For older iGPUs, this Vulkan path
is the community-supported direction.

## Background

Empirically validated 2026-05-19 by NetYeti on Intel NUC6i7KYB (Iris Pro 580) using free-tier Claude.ai in under
one hour, after the ai-stack pod spent ~10 hours on a parallel SYCL/ipex-llm investigation that concluded the
hardware was "GPU-incapable" (an over-generalization — the hardware is fine; ipex-llm just doesn't support this
generation). Lesson banked: single-stack failure ≠ hardware-class failure. See `docs/hardware/arc.md` for the
Arc path; this file documents the Iris/older-iGPU path.
163 changes: 163 additions & 0 deletions scripts/install-vulkan-ollama.sh
@@ -0,0 +1,163 @@
#!/usr/bin/env bash
# install-vulkan-ollama.sh — install native Ollama with Vulkan/Mesa ANV GPU support
#
# For Intel iGPUs that predate Arc Alchemist (Iris Pro / Iris / UHD / Gen 9 etc.)
# where docker-compose.arc.yml (ipex-llm/SYCL) does NOT work.
#
# See docs/hardware/intel-igpu-vulkan.md for background and manual procedure.
# Validated 2026-05-19 on Intel NUC6i7KYB (Iris Pro 580) + lab1/2/3 NUC hardware.

set -euo pipefail


info() { echo -e "\033[1;34m[INFO]\033[0m $*"; }
success() { echo -e "\033[1;32m[OK]\033[0m $*"; }
warn() { echo -e "\033[1;33m[WARN]\033[0m $*"; }
error() { echo -e "\033[1;31m[ERROR]\033[0m $*" >&2; }

# ─── Prerequisites check ─────────────────────────────────────────────────────

info "Checking prerequisites..."

if ! command -v lspci &>/dev/null; then
    error "lspci not found. Install pciutils first: sudo apt install pciutils"
    exit 1
fi

if ! ls /dev/dri/renderD* &>/dev/null; then
    error "No /dev/dri/renderD* device found. Is the i915 kernel driver loaded?"
    error "Run: sudo lspci -v -s 00:02.0 | grep 'Kernel driver in use'"
    exit 1
fi

# Detect Intel iGPU
INTEL_GPU=$(lspci 2>/dev/null | grep -i "vga\|3d\|display" | grep -i "intel" | head -1 || true)
if [[ -z "$INTEL_GPU" ]]; then
    error "No Intel GPU detected via lspci. This script targets Intel iGPUs."
    exit 1
fi

info "Detected: $INTEL_GPU"

# Warn if this looks like an Arc GPU — arc.yml is likely the better path
if echo "$INTEL_GPU" | grep -iq "arc"; then
    warn "This looks like an Intel Arc GPU. docker-compose.arc.yml (ipex-llm/SYCL)"
    warn "may give better performance on Arc hardware. Continue anyway? [y/N]"
    read -r reply
    if [[ ! "$reply" =~ ^[Yy]$ ]]; then
        info "Aborted. See docs/hardware/arc.md for the Arc path."
        exit 0
    fi
fi

# ─── Stop conflicting Ollama (Docker or older native) ────────────────────────

info "Stopping any conflicting Ollama instance on port 11434..."

# Stop the ai-stack systemd service first (if present) so it cannot restart the container.
# `systemctl cat` avoids piping into `grep -q`, which can trip `pipefail` via SIGPIPE.
if systemctl cat ai-stack.service &>/dev/null; then
    info "Stopping ai-stack.service so it doesn't restart the Docker Ollama..."
    sudo systemctl stop ai-stack.service || true
fi

if command -v docker &>/dev/null; then
    CONTAINERS=$(docker ps -q --filter "publish=11434" 2>/dev/null || true)
    if [[ -n "$CONTAINERS" ]]; then
        info "Stopping Docker container(s) holding port 11434..."
        echo "$CONTAINERS" | xargs -r docker stop
    fi
fi

# ─── Group memberships ───────────────────────────────────────────────────────

info "Adding $USER to render and video groups (required for /dev/dri access)..."
sudo usermod -aG render,video "$USER"

# ─── Install Ollama ──────────────────────────────────────────────────────────

if [[ -x /usr/local/bin/ollama ]]; then
    info "Existing Ollama install detected at /usr/local/bin/ollama."
    info "Will reuse the existing binary; only the systemd drop-in will change."
else
    info "Installing Ollama via official installer..."
    curl -fsSL https://ollama.com/install.sh | sh
fi

# ─── Vulkan systemd drop-in ──────────────────────────────────────────────────

info "Writing Vulkan systemd drop-in at /etc/systemd/system/ollama.service.d/override.conf..."

sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
Environment="OLLAMA_VULKAN=1"
Environment="RUSTICL_ENABLE=iris"
Environment="GPU_MAX_ALLOC_PERCENT=100"
Environment="OLLAMA_GPU_OVERHEAD=0"
EOF

# ─── Fix model directory ownership ───────────────────────────────────────────

if [[ -d /usr/share/ollama/.ollama ]]; then
    info "Ensuring /usr/share/ollama/.ollama is owned by ollama user..."
    sudo chown -R ollama:ollama /usr/share/ollama/.ollama/
fi

# ─── Reload + restart ────────────────────────────────────────────────────────

info "Reloading systemd and restarting Ollama..."
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl restart ollama

# ─── Verify ──────────────────────────────────────────────────────────────────

info "Waiting 5s for Ollama to come up..."
sleep 5

info "Checking journal for Vulkan inference-compute line..."
# Capture the journal once; grepping a variable avoids SIGPIPE failures under `pipefail`.
JOURNAL=$(sudo journalctl -u ollama --no-pager -n 100 2>/dev/null || true)
if grep -q "library=Vulkan" <<<"$JOURNAL"; then
    success "Vulkan GPU engaged. Journal confirms library=Vulkan."
    grep "library=Vulkan" <<<"$JOURNAL" | head -2 || true
else
    warn "No 'library=Vulkan' line found in recent journal."
    warn "This may mean the drop-in didn't take effect or the GPU isn't being used."
    warn "Check: sudo journalctl -u ollama --no-pager -n 100"
fi

# Check API
if curl -fsS http://127.0.0.1:11434/api/version &>/dev/null; then
    success "Ollama API responding at http://127.0.0.1:11434"
else
    error "Ollama API not responding. Check: sudo systemctl status ollama"
    exit 1
fi

# ─── Final guidance ──────────────────────────────────────────────────────────

cat <<'EOF'

────────────────────────────────────────────────────────────────────────────
Vulkan Ollama install complete.

Next steps:
  1. Pull a model:
       ollama pull qwen2.5:1.5b

  2. Verify GPU is used (run inference, then check):
       ollama run qwen2.5:1.5b "Hello"
       ollama ps
       # Expect PROCESSOR column to show "100% GPU"

  3. Start the rest of ai-stack (Olla, LiteLLM, Smart Router, Shepherd):
       docker compose up -d olla litellm router shepherd
       # (Native Ollama on port 11434 will be discovered as the local backend.)

If you switched from a Docker-based Ollama:
  - Models that were inside the container may need re-pulling, unless you
    pre-staged them under /usr/share/ollama/.ollama/models/.

Troubleshooting + background: docs/hardware/intel-igpu-vulkan.md
────────────────────────────────────────────────────────────────────────────
EOF