rili-live · rililive · Jun 9, 2026 · Jun 9, 2026 · Jun 9, 2026 · Jun 9, 2026
diff --git a/.changeset/local-models-cost-routing.md b/.changeset/local-models-cost-routing.md
@@ -0,0 +1,26 @@
+---
+"@therr/tiny-code": minor
+---
+
+Add local models and cost-aware, local-first routing.
+
+- **Local (Ollama) provider.** Talk to a local Ollama server over its
+  OpenAI-compatible API (`--provider ollama`), with an idle timeout so a hung
+  model can't freeze the REPL, best-effort token-usage reporting, and configurable
+  `maxTokens`.
+- **Local-first routing.** Set `routing: "local-first"` with an `escalateTo`
+  target to run a cheap/local model by default and escalate heavy turns (or a
+  stuck local model, via the new `escalate` tool) to a frontier model — with full
+  conversation context preserved. Escalation is sticky across follow-up turns.
+- **Model-selection policy** is now owned by a pluggable `ModelDecisionEngine`
+  (`LocalFirstModelEngine`), keeping the agent loop pure mechanism.
+- **Compute awareness.** On startup with a local model, tiny-code estimates RAM
+  need vs. machine capacity and warns when a model likely won't fit or is too
+  small (≤3B) to tool-call reliably; an over-RAM local model is routed to the
+  frontier up front.
+- **Priority-driven model selection.** `priority` (`performance` / `cost` /
+  `balanced`, or `TINY_CODE_PRIORITY`) auto-picks a catalog model when none is
+  pinned.
+- The `/costs` view reports session usage, estimated spend, and routing, and the
+  usage line distinguishes an unpriced *cloud* turn ("cost unknown") from a
+  *local* turn ("no API cost").
diff --git a/.env.example b/.env.example
@@ -1,10 +1,14 @@
-# Provide at least one. If both are present, Anthropic is used by default.
+# Provide at least one for cloud providers. If both are present, Anthropic is
+# the default. Ollama runs locally and needs no key.
 ANTHROPIC_API_KEY=
 GEMINI_API_KEY=
 
 # Optional overrides (also settable via config file / CLI flags)
-# TINY_CODE_PROVIDER=anthropic   # anthropic | gemini
+# TINY_CODE_PROVIDER=anthropic   # anthropic | gemini | ollama
 # TINY_CODE_MODEL=claude-opus-4-8
+# TINY_CODE_OLLAMA_URL=http://localhost:11434/v1   # Ollama OpenAI-compatible endpoint
+# TINY_CODE_PRIORITY=performance # performance | cost | balanced — auto-picks a model when none is pinned
+# TINY_CODE_EFFORT=high          # low | medium | high | xhigh | max — Anthropic thinking budget
 
 # Self-improvement: reflect on sessions and propose markdown-only improvement PRs.
 # On by default; set to 0 to disable. Requires the `gh` CLI installed + authed.

diff --git a/README.md b/README.md
@@ -2,10 +2,13 @@
 
 A small, extensible CLI coding agent built around one constraint: **keep token
 usage low**. As coding-agent costs climb, tiny-code automates the savings so
-you don't have to. Interactive terminal REPL, interchangeable **Anthropic** and
-**Gemini** models, and just the core features you actually use: read/write/edit
-files, run shell commands, search code, and a custom commands/skills system.
-No business logic baked in.
+you don't have to. Interactive terminal REPL, interchangeable **Anthropic**,
+**Gemini**, and **local (Ollama)** models, and just the core features you
+actually use: read/write/edit files, run shell commands, search code, and a
+custom commands/skills system. No business logic baked in.
+
+Run cheap, open-weight models locally and **escalate heavy work to a frontier
+model only when needed** — see [Local models & cost-aware routing](#local-models--cost-aware-routing).
 
 > Status: early (v0.x). Published as `@therr/tiny-code`; the binary is
 > `tiny-code`. Names may change before the first npm publish.
@@ -39,18 +42,61 @@ export GEMINI_API_KEY=...
 tiny-code                       # start the REPL (uses an available key)
 tiny-code --provider gemini     # force a provider
 tiny-code --model claude-opus-4-8
+tiny-code --provider ollama --model gemma3:12b   # run a local model (no API cost)
 ```
 
 In the REPL: type a request, watch it work. Mutating actions (writes, edits,
 shell commands) prompt for approval unless pre-approved in config.
 
 - `/help` — list commands
+- `/costs` — session token usage, estimated $ cost, and cost-saving tips
 - `/clear` — clear the conversation history and start fresh
 - `/models` — show known models, pricing, and the active one (see below)
 - `/improve` — reflect on the session and propose an improvement PR (see below)
 - `/<name> [args]` — run a custom command (see below)
 - `/exit` — quit
 
+## Local models & cost-aware routing
+
+tiny-code talks to a local [Ollama](https://ollama.com) server over its
+OpenAI-compatible API, so any model you've pulled is available — including
+**Google Gemma 3** (`gemma3:4b`, `gemma3:12b`, `gemma3:27b`) and
+`qwen2.5-coder` (the default, which tool-calls reliably).
+
+```bash
+ollama serve
+ollama pull qwen2.5-coder:7b
+tiny-code --provider ollama --model qwen2.5-coder:7b
+```
+
+**Mind the compute cost.** Local models are free of API charges but use your
+machine's RAM/VRAM. On startup with an Ollama model, tiny-code prints how much
+memory the model needs versus what's free, and warns if it likely won't fit or
+if the model is too small (≤3B) to tool-call reliably. Rough guide (≈Q4):
+
+| Model        | ~RAM needed | Good for                          |
+| ------------ | ----------- | --------------------------------- |
+| `gemma3:1b`  | ~1 GB       | trivial text (poor at tool calls) |
+| `gemma3:4b`  | ~3 GB       | lightweight edits, search         |
+| `gemma3:12b` | ~7 GB       | most coding tasks                 |
+| `gemma3:27b` | ~16 GB      | stronger reasoning                |
+
+**Local-first routing.** Set a `routing` of `local-first` with an `escalateTo`
+target: every turn starts on the cheap/local model, and tiny-code escalates to
+the frontier model when a turn looks heavy (refactors, debugging, multi-file
+work) or when the local model gets stuck and calls the built-in `escalate` tool.
+You get local speed and zero cost for the bulk of the work, and frontier power
+only for the hard parts. Run `/costs` any time for usage, spend, and tips.
+
+```json
+{
+  "provider": "ollama",
+  "model": "qwen2.5-coder:7b",
+  "routing": "local-first",
+  "escalateTo": { "provider": "anthropic", "model": "claude-opus-4-8" }
+}
+```
+
 ## Project context
 
 On start, the agent walks up from the working directory looking for `AGENTS.md`
@@ -86,11 +132,14 @@ CLI flags.
 {
   "provider": "anthropic",
   "model": "claude-opus-4-8",
+  "ollamaBaseUrl": "http://localhost:11434/v1",
   "priority": "performance",
   "maxTokens": 16000,
   "thinking": true,
   "effort": "high",
   "maxIterations": 50,
+  "routing": "off",
+  "escalateTo": { "provider": "anthropic", "model": "claude-opus-4-8" },
   "allow": {
     "tools": [],
     "bash": ["npm test", "git status", "git diff"],
@@ -102,6 +151,14 @@ CLI flags.
 `allow` pre-approves mutating actions so they skip the confirmation prompt:
 `bash` matches command prefixes, `write` matches path globs for write/edit.
 
+`routing: "local-first"` plus `escalateTo` enables cost-aware routing (see
+[above](#local-models--cost-aware-routing)); it defaults to `local-first`
+automatically whenever `escalateTo` is present. `ollamaBaseUrl` points at your
+Ollama server's OpenAI-compatible endpoint.
+
+Approximate cloud pricing used for the `/costs` estimate lives in the model
+catalog (`src/models/catalog.ts`) — edit it to match current vendor rates.
+
 ## Token efficiency
 
 Minimizing token usage is a first-class goal — coding-agent bills grow fast,

diff --git a/TODO.md b/TODO.md
@@ -23,6 +23,19 @@ Explore/Plan agent). **Approach:** a `spawn_agent` tool whose `execute` construc
 a child `AgentLoop` with its own message history and a read-only tool subset,
 returning the child's final text. Keep depth at 1 to start.
 
+> Note: the cheap/expensive model split is now handled by **local-first
+> routing** (`routing: "local-first"` + `escalateTo`): turns start on the
+> local/cheap model and escalate to a frontier model when heavy or stuck (see
+> `src/agent/router.ts`, `src/tools/escalate.ts`, and the loop's escalation
+> logic). Sub-agents remain useful for *parallel* isolated runs.
+
+## More local-model interoperability
+Ollama is wired in via its OpenAI-compatible endpoint (`src/providers/ollama.ts`),
+which already covers LM Studio and vLLM (same wire format) by pointing
+`ollamaBaseUrl`/`TINY_CODE_OLLAMA_URL` at them. **Next:** an optional
+`/api/tags` probe to list locally-installed models and surface tokens/sec in the
+usage line; per-model context-window awareness for the RAM advisory.
+
 ## Web search / fetch
 Let the agent look up docs during a task. **Approach:** add `web_search` and
 `web_fetch` tools. For Anthropic, optionally delegate to the server-side

diff --git a/package-lock.json b/package-lock.json
diff --git a/package.json b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "@therr/tiny-code",
-  "version": "0.1.0",
+  "version": "0.2.0",
   "description": "A small, extensible CLI coding agent with interchangeable Anthropic and Gemini models.",
   "type": "module",
   "bin": {

diff --git a/src/agent/decision/index.ts b/src/agent/decision/index.ts
@@ -0,0 +1,3 @@
+export type { ModelDecisionEngine, RouteDecision, TurnSignals } from './types.js';
+export { LocalFirstModelEngine } from './localFirst.js';
+export type { LocalFirstOptions } from './localFirst.js';
diff --git a/src/agent/decision/localFirst.ts b/src/agent/decision/localFirst.ts
@@ -0,0 +1,87 @@
+import type { ModelProvider } from '../../providers/types.js';
+import { classifyTurn, type TaskWeight } from '../router.js';
+import { checkLocalModel } from '../../system/resources.js';
+import { estimateCost } from '../../models/catalog.js';
+import type { ModelDecisionEngine, RouteDecision, TurnSignals } from './types.js';
+
+export interface LocalFirstOptions {
+  /** The cheap/local model that handles turns by default. */
+  primary: ModelProvider;
+  /** The frontier model heavy/stuck turns escalate to. */
+  escalation: ModelProvider;
+  /** Task-weight classifier. Defaults to {@link classifyTurn}. */
+  classify?: (input: string) => TaskWeight;
+  /** Consecutive tool-error iterations before auto-escalating. Defaults to 3. */
+  stuckThreshold?: number;
+  /** RAM-fit check (injectable for tests). Defaults to {@link checkLocalModel}. */
+  ramCheck?: (model: string) => { warn: boolean };
+}
+
+/** Per-MTok probe used to compare relative model cost for the cost-aware note. */
+const COST_PROBE = { inputTokens: 1_000_000, outputTokens: 1_000_000 };
+
+/**
+ * The default local-first policy: handle each turn on the cheap/local model and
+ * escalate to the frontier model only when capability demands it — the task
+ * looks heavy up front, the model explicitly hands off via the `escalate` tool,
+ * it gets stuck on repeated tool errors, or (compute awareness) a local model
+ * won't fit in available RAM.
+ *
+ * Cost awareness is wired in but used defensively: the engine can *report* the
+ * cost implication of an escalation (see {@link costNote}) but never initiates
+ * a cost-driven route change — escalation is always a capability decision.
+ */
+export class LocalFirstModelEngine implements ModelDecisionEngine {
+  private readonly primary: ModelProvider;
+  private readonly escalation: ModelProvider;
+  private readonly classify: (input: string) => TaskWeight;
+  private readonly stuckThreshold: number;
+  private readonly ramCheck: (model: string) => { warn: boolean };
+
+  constructor(opts: LocalFirstOptions) {
+    this.primary = opts.primary;
+    this.escalation = opts.escalation;
+    this.classify = opts.classify ?? classifyTurn;
+    this.stuckThreshold = opts.stuckThreshold ?? 3;
+    this.ramCheck = opts.ramCheck ?? checkLocalModel;
+  }
+
+  selectInitial(userInput: string): RouteDecision {
+    // Compute awareness: a local primary that won't fit in RAM runs slowly or
+    // fails outright, so route it to the frontier up front — even for light work.
+    if (this.primary.name === 'ollama' && this.ramCheck(this.primary.model).warn) {
+      return { provider: this.escalation, reason: 'compute: local model exceeds RAM' };
+    }
+    if (this.classify(userInput) === 'heavy') {
+      return { provider: this.escalation, reason: 'heavy task' };
+    }
+    return { provider: this.primary };
+  }
+
+  considerEscalation(signals: TurnSignals): RouteDecision | undefined {
+    if (signals.alreadyEscalated) return undefined;
+    if (signals.escalateRequested) {
+      return { provider: this.escalation, reason: 'requested by model' };
+    }
+    if (signals.consecutiveErrors >= this.stuckThreshold) {
+      return { provider: this.escalation, reason: 'stuck — repeated tool errors' };
+    }
+    return undefined;
+  }
+
+  /**
+   * A short, human-readable summary of what escalating costs, relative to the
+   * primary. Pure reporting — does not influence routing. Local models report
+   * as having no API cost.
+   */
+  costNote(): string {
+    const from = estimateCost(this.primary.model, COST_PROBE);
+    const to = estimateCost(this.escalation.model, COST_PROBE);
+    if (to === null) return 'escalates to a local model (no API cost)';
+    if (from === null || from === 0) {
+      return 'escalates from a free/local model to a paid model (adds API cost)';
+    }
+    const multiplier = to / from;
+    return `escalation costs ~${multiplier.toFixed(1)}× the primary per token`;
+  }
+}
diff --git a/src/agent/decision/types.ts b/src/agent/decision/types.ts
@@ -0,0 +1,47 @@
+import type { ModelProvider } from '../../providers/types.js';
+
+/**
+ * A model-selection decision: the provider to use and an optional human-readable
+ * reason. An empty/absent `reason` means "no route note" — the loop only fires
+ * {@link AgentUI.onRoute} when a decision both changes the active provider and
+ * carries a reason.
+ */
+export interface RouteDecision {
+  provider: ModelProvider;
+  reason?: string;
+}
+
+/**
+ * Runtime signals the loop feeds the engine once per iteration so it can decide
+ * whether to switch providers mid-turn. These are mechanical bookkeeping the
+ * loop already tracks; the engine owns the policy that interprets them.
+ */
+export interface TurnSignals {
+  /** The model called the `escalate` tool during this iteration. */
+  escalateRequested: boolean;
+  /** Consecutive iterations that ended with at least one tool error. */
+  consecutiveErrors: number;
+  /** Whether the turn has already been escalated (keeps the engine stateless). */
+  alreadyEscalated: boolean;
+  /** The provider currently handling the turn. */
+  current: ModelProvider;
+  /** Iteration index (0-based). */
+  iteration: number;
+}
+
+/**
+ * Owns all model-selection policy. {@link AgentLoop} depends on this interface
+ * instead of inlining classification + escalation rules, so the loop keeps only
+ * mechanism (send → stream → run tools → repeat) and the policy lives in one
+ * cohesive, testable place. Implementations may reason about task weight, cost,
+ * and compute (RAM) fit; the loop never sees that reasoning.
+ */
+export interface ModelDecisionEngine {
+  /** Pick the provider that starts a turn, given the user's input. */
+  selectInitial(userInput: string): RouteDecision;
+  /**
+   * Decide whether to switch providers mid-turn. Returns `undefined` to stay on
+   * the current provider. Called once per iteration after tool results.
+   */
+  considerEscalation(signals: TurnSignals): RouteDecision | undefined;
+}