Add local models, cost-aware routing, and compute/cost awareness by rililive · Pull Request #8 · rili-live/tiny-code

rililive · 2026-06-09T13:28:28Z

Make tiny-code more cost-effective by running cheap open-weight models
locally and escalating to frontier models only for heavy work.

Ollama provider over the OpenAI-compatible API (raw fetch + SSE), no new
deps. Google Gemma and other local models work as model ids.
Local-first routing: a heuristic classifier starts each turn on the
cheap/local model and escalates to a configured frontier model when the
task is heavy or the local model calls the new escalate tool / gets stuck.
Cost awareness: per-turn + cumulative token usage with estimated $ for
cloud turns ("no API cost" for local), plus a startup RAM advisory that
warns when a local model won't fit or is too small to tool-call reliably.
/costs command surfaces usage, spend, and workflow cost-cutting tips.
Config: provider 'ollama', ollamaBaseUrl, routing, escalateTo; docs updated.

All tests pass (87), coverage >80%, typecheck and lint clean.

Make tiny-code more cost-effective by running cheap open-weight models locally and escalating to frontier models only for heavy work. - Ollama provider over the OpenAI-compatible API (raw fetch + SSE), no new deps. Google Gemma and other local models work as model ids. - Local-first routing: a heuristic classifier starts each turn on the cheap/local model and escalates to a configured frontier model when the task is heavy or the local model calls the new `escalate` tool / gets stuck. - Cost awareness: per-turn + cumulative token usage with estimated $ for cloud turns ("no API cost" for local), plus a startup RAM advisory that warns when a local model won't fit or is too small to tool-call reliably. - `/costs` command surfaces usage, spend, and workflow cost-cutting tips. - Config: provider 'ollama', ollamaBaseUrl, routing, escalateTo; docs updated. All tests pass (87), coverage >80%, typecheck and lint clean.

Pull the scattered routing logic (classifyTurn, escalate-tool handoff, stuck-detection) out of AgentLoop into a cohesive, injectable ModelDecisionEngine interface. AgentLoop now keeps only mechanism and delegates all "which model" decisions to the engine. - Add src/agent/decision/{types,localFirst,index}.ts: the engine interface plus LocalFirstModelEngine, the default policy. It reproduces prior behavior and adds compute awareness (routes a local model that won't fit RAM to the frontier) and cost awareness (costNote reporting). - AgentLoop takes an optional `engine` instead of `escalationProvider` + `router`; routing: 'off' injects no engine (single-provider path unchanged). - Wire the engine in repl.ts; export the new types from the public API. - Adapt loop tests and add unit tests for the engine. https://claude.ai/code/session_01T4UTQD35m11g4ChB8Cjd1w

Older Ollama builds reject unknown body fields with a 400. Sending stream_options unconditionally meant every local turn could break over a token-reporting nicety. Retry once without it on a 400. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The raw fetch had no AbortSignal, so a stuck or RAM-starved local model left the prompt frozen with no recovery. Add an AbortController with a 120s idle timeout (reset on every received chunk, so slow-but-progressing generations still complete) surfaced as a clear error. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The Ollama body set no max_tokens, so a user who lowered maxTokens to control length/cost saw no effect on local turns. Plumb it through as max_tokens (omitted from the request when unset, matching prior behavior). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The translation assumes text and tool_results never interleave within a single user turn. True given how the loop builds messages today, but it's an implicit coupling worth flagging for future changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

parseSse only yielded complete newline-terminated lines, so a closing usage frame sent without a trailing newline was silently dropped, hurting token-count accuracy. Flush the remaining buffer at stream end. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

implement/debug/optimize/design appear in everyday one-line requests ('implement a getter', 'debug this typo'), so flagging them as heavy sent many routine turns straight to the frontier model and undercut the local-first cost goal. Keep only strong, unambiguous signals as always-heavy; the ambiguous verbs now escalate only alongside a scope/complexity cue. The local model can still self-escalate via the escalate tool when it struggles. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Linux keeps most RAM in reclaimable cache, so freemem() reads low and the free-based check spuriously warned that a model 'may exceed available memory' on machines that run it fine. Warn off total capacity (with 20% headroom for OS/other apps) instead; keep the free-memory comparison as an advisory freeTight hint only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ost' The pricing catalog is exact-match, so a future cloud model id (e.g. claude-opus-5) returns no pricing and the usage line printed 'local (no API cost)' for a paid frontier turn — actively misleading. Thread the provider through onUsage and, when a cloud provider's pricing is unknown, show 'cost unknown' instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A heavy turn that started on the frontier model printed '↑ escalated to …' even though nothing was escalated. Pass an 'initial' flag through onRoute and word up-front routing as '▸ routed to …', reserving 'escalated' for mid-turn hand-offs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Escalation state was run()-scoped, so a multi-turn hard task that escalated mid-turn restarted on the local model each follow-up and could ping-pong. Persist the escalated state for the session once a turn hands off mid-flight (model request or stuck); clearHistory() resets it so a fresh conversation re-routes from scratch. Up-front heavy routing stays per-turn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude and others added 16 commits June 9, 2026 13:06

fix: merge conflicts

629783a

fix: merge conflicts

07992ba

fix: resolve broken import and regressed UI labeling

785b832

release: v0.2.0

1c7ee02

rililive merged commit 1c7ee02 into main Jun 9, 2026
3 checks passed

rililive deleted the claude/local-model-cost-optimization-nzo2q6 branch June 9, 2026 17:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add local models, cost-aware routing, and compute/cost awareness#8

Add local models, cost-aware routing, and compute/cost awareness#8
rililive merged 16 commits into
mainfrom
claude/local-model-cost-optimization-nzo2q6

rililive commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

rililive commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants