Skip to content

Add local models, cost-aware routing, and compute/cost awareness#8

Merged
rililive merged 16 commits into
mainfrom
claude/local-model-cost-optimization-nzo2q6
Jun 9, 2026
Merged

Add local models, cost-aware routing, and compute/cost awareness#8
rililive merged 16 commits into
mainfrom
claude/local-model-cost-optimization-nzo2q6

Conversation

@rililive

@rililive rililive commented Jun 9, 2026

Copy link
Copy Markdown

Make tiny-code more cost-effective by running cheap open-weight models
locally and escalating to frontier models only for heavy work.

  • Ollama provider over the OpenAI-compatible API (raw fetch + SSE), no new
    deps. Google Gemma and other local models work as model ids.
  • Local-first routing: a heuristic classifier starts each turn on the
    cheap/local model and escalates to a configured frontier model when the
    task is heavy or the local model calls the new escalate tool / gets stuck.
  • Cost awareness: per-turn + cumulative token usage with estimated $ for
    cloud turns ("no API cost" for local), plus a startup RAM advisory that
    warns when a local model won't fit or is too small to tool-call reliably.
  • /costs command surfaces usage, spend, and workflow cost-cutting tips.
  • Config: provider 'ollama', ollamaBaseUrl, routing, escalateTo; docs updated.

All tests pass (87), coverage >80%, typecheck and lint clean.

claude and others added 16 commits June 9, 2026 13:06
Make tiny-code more cost-effective by running cheap open-weight models
locally and escalating to frontier models only for heavy work.

- Ollama provider over the OpenAI-compatible API (raw fetch + SSE), no new
  deps. Google Gemma and other local models work as model ids.
- Local-first routing: a heuristic classifier starts each turn on the
  cheap/local model and escalates to a configured frontier model when the
  task is heavy or the local model calls the new `escalate` tool / gets stuck.
- Cost awareness: per-turn + cumulative token usage with estimated $ for
  cloud turns ("no API cost" for local), plus a startup RAM advisory that
  warns when a local model won't fit or is too small to tool-call reliably.
- `/costs` command surfaces usage, spend, and workflow cost-cutting tips.
- Config: provider 'ollama', ollamaBaseUrl, routing, escalateTo; docs updated.

All tests pass (87), coverage >80%, typecheck and lint clean.
Pull the scattered routing logic (classifyTurn, escalate-tool handoff,
stuck-detection) out of AgentLoop into a cohesive, injectable
ModelDecisionEngine interface. AgentLoop now keeps only mechanism and
delegates all "which model" decisions to the engine.

- Add src/agent/decision/{types,localFirst,index}.ts: the engine
  interface plus LocalFirstModelEngine, the default policy. It reproduces
  prior behavior and adds compute awareness (routes a local model that
  won't fit RAM to the frontier) and cost awareness (costNote reporting).
- AgentLoop takes an optional `engine` instead of `escalationProvider` +
  `router`; routing: 'off' injects no engine (single-provider path
  unchanged).
- Wire the engine in repl.ts; export the new types from the public API.
- Adapt loop tests and add unit tests for the engine.

https://claude.ai/code/session_01T4UTQD35m11g4ChB8Cjd1w
Older Ollama builds reject unknown body fields with a 400. Sending
stream_options unconditionally meant every local turn could break over
a token-reporting nicety. Retry once without it on a 400.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The raw fetch had no AbortSignal, so a stuck or RAM-starved local model
left the prompt frozen with no recovery. Add an AbortController with a
120s idle timeout (reset on every received chunk, so slow-but-progressing
generations still complete) surfaced as a clear error.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Ollama body set no max_tokens, so a user who lowered maxTokens to
control length/cost saw no effect on local turns. Plumb it through as
max_tokens (omitted from the request when unset, matching prior behavior).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The translation assumes text and tool_results never interleave within a
single user turn. True given how the loop builds messages today, but it's
an implicit coupling worth flagging for future changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parseSse only yielded complete newline-terminated lines, so a closing
usage frame sent without a trailing newline was silently dropped,
hurting token-count accuracy. Flush the remaining buffer at stream end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
implement/debug/optimize/design appear in everyday one-line requests
('implement a getter', 'debug this typo'), so flagging them as heavy sent
many routine turns straight to the frontier model and undercut the
local-first cost goal. Keep only strong, unambiguous signals as always-heavy;
the ambiguous verbs now escalate only alongside a scope/complexity cue. The
local model can still self-escalate via the escalate tool when it struggles.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Linux keeps most RAM in reclaimable cache, so freemem() reads low and the
free-based check spuriously warned that a model 'may exceed available
memory' on machines that run it fine. Warn off total capacity (with 20%
headroom for OS/other apps) instead; keep the free-memory comparison as an
advisory freeTight hint only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ost'

The pricing catalog is exact-match, so a future cloud model id (e.g.
claude-opus-5) returns no pricing and the usage line printed
'local (no API cost)' for a paid frontier turn — actively misleading. Thread
the provider through onUsage and, when a cloud provider's pricing is unknown,
show 'cost unknown' instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A heavy turn that started on the frontier model printed '↑ escalated to …'
even though nothing was escalated. Pass an 'initial' flag through onRoute and
word up-front routing as '▸ routed to …', reserving 'escalated' for mid-turn
hand-offs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Escalation state was run()-scoped, so a multi-turn hard task that escalated
mid-turn restarted on the local model each follow-up and could ping-pong.
Persist the escalated state for the session once a turn hands off mid-flight
(model request or stuck); clearHistory() resets it so a fresh conversation
re-routes from scratch. Up-front heavy routing stays per-turn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rililive rililive merged commit 1c7ee02 into main Jun 9, 2026
3 checks passed
@rililive rililive deleted the claude/local-model-cost-optimization-nzo2q6 branch June 9, 2026 17:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants