Summary
Add the option to plug in a local OpenAI-compatible HTTP endpoint as a suggestion backend, so users can point Cotabby at a model already running in Ollama, LM Studio, llama-server, vLLM, or any other server that speaks the OpenAI /v1/completions or /v1/chat/completions API.
Problem
Today the OSS path runs models in-process through the bundled llama.cpp runtime. That's great for zero-config users, but it means:
- Users who already have a tuned local server (Ollama / LM Studio / llama-server / vLLM) have to download and host the model a second time inside Cotabby.
- They can't reuse hardware-specific server flags (quant, ctx size, GPU layers, draft model, speculative decode, etc.) that they've already dialed in.
- Power users can't try models or runtimes that Cotabby doesn't ship support for.
A configurable OpenAI-compatible endpoint sidesteps all of that.
Proposed direction
- New engine option alongside Apple Intelligence and the bundled llama.cpp runtime: "Local OpenAI-compatible endpoint".
- Settings fields: base URL (e.g.
http://localhost:11434/v1), model name, optional API key, optional completion vs chat-completion mode.
- Stay on
localhost / loopback by default and surface a clear warning if a non-loopback host is entered (this is a privacy-sensitive app).
- Route through the existing
SuggestionEngineRouter as a sibling of the llama path; reuse the base-model prompt rendering and cancellation plumbing.
- Stream tokens via SSE so cancellation on focus change still works.
Additional context
- Common targets that already expose this API: Ollama (
/v1), LM Studio, llama-server from llama.cpp, vLLM, LocalAI, text-generation-webui.
- Keep it strictly local-endpoint framing in the UI; this issue is not asking for hosted OpenAI / Anthropic / etc.
Summary
Add the option to plug in a local OpenAI-compatible HTTP endpoint as a suggestion backend, so users can point Cotabby at a model already running in Ollama, LM Studio, llama-server, vLLM, or any other server that speaks the OpenAI
/v1/completionsor/v1/chat/completionsAPI.Problem
Today the OSS path runs models in-process through the bundled llama.cpp runtime. That's great for zero-config users, but it means:
A configurable OpenAI-compatible endpoint sidesteps all of that.
Proposed direction
http://localhost:11434/v1), model name, optional API key, optional completion vs chat-completion mode.localhost/ loopback by default and surface a clear warning if a non-loopback host is entered (this is a privacy-sensitive app).SuggestionEngineRouteras a sibling of the llama path; reuse the base-model prompt rendering and cancellation plumbing.Additional context
/v1), LM Studio,llama-serverfrom llama.cpp, vLLM, LocalAI, text-generation-webui.