Add local models, cost-aware routing, and compute/cost awareness#8
Merged
Conversation
Make tiny-code more cost-effective by running cheap open-weight models
locally and escalating to frontier models only for heavy work.
- Ollama provider over the OpenAI-compatible API (raw fetch + SSE), no new
deps. Google Gemma and other local models work as model ids.
- Local-first routing: a heuristic classifier starts each turn on the
cheap/local model and escalates to a configured frontier model when the
task is heavy or the local model calls the new `escalate` tool / gets stuck.
- Cost awareness: per-turn + cumulative token usage with estimated $ for
cloud turns ("no API cost" for local), plus a startup RAM advisory that
warns when a local model won't fit or is too small to tool-call reliably.
- `/costs` command surfaces usage, spend, and workflow cost-cutting tips.
- Config: provider 'ollama', ollamaBaseUrl, routing, escalateTo; docs updated.
All tests pass (87), coverage >80%, typecheck and lint clean.
Pull the scattered routing logic (classifyTurn, escalate-tool handoff,
stuck-detection) out of AgentLoop into a cohesive, injectable
ModelDecisionEngine interface. AgentLoop now keeps only mechanism and
delegates all "which model" decisions to the engine.
- Add src/agent/decision/{types,localFirst,index}.ts: the engine
interface plus LocalFirstModelEngine, the default policy. It reproduces
prior behavior and adds compute awareness (routes a local model that
won't fit RAM to the frontier) and cost awareness (costNote reporting).
- AgentLoop takes an optional `engine` instead of `escalationProvider` +
`router`; routing: 'off' injects no engine (single-provider path
unchanged).
- Wire the engine in repl.ts; export the new types from the public API.
- Adapt loop tests and add unit tests for the engine.
https://claude.ai/code/session_01T4UTQD35m11g4ChB8Cjd1w
Older Ollama builds reject unknown body fields with a 400. Sending stream_options unconditionally meant every local turn could break over a token-reporting nicety. Retry once without it on a 400. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The raw fetch had no AbortSignal, so a stuck or RAM-starved local model left the prompt frozen with no recovery. Add an AbortController with a 120s idle timeout (reset on every received chunk, so slow-but-progressing generations still complete) surfaced as a clear error. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Ollama body set no max_tokens, so a user who lowered maxTokens to control length/cost saw no effect on local turns. Plumb it through as max_tokens (omitted from the request when unset, matching prior behavior). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The translation assumes text and tool_results never interleave within a single user turn. True given how the loop builds messages today, but it's an implicit coupling worth flagging for future changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parseSse only yielded complete newline-terminated lines, so a closing usage frame sent without a trailing newline was silently dropped, hurting token-count accuracy. Flush the remaining buffer at stream end. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
implement/debug/optimize/design appear in everyday one-line requests
('implement a getter', 'debug this typo'), so flagging them as heavy sent
many routine turns straight to the frontier model and undercut the
local-first cost goal. Keep only strong, unambiguous signals as always-heavy;
the ambiguous verbs now escalate only alongside a scope/complexity cue. The
local model can still self-escalate via the escalate tool when it struggles.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Linux keeps most RAM in reclaimable cache, so freemem() reads low and the free-based check spuriously warned that a model 'may exceed available memory' on machines that run it fine. Warn off total capacity (with 20% headroom for OS/other apps) instead; keep the free-memory comparison as an advisory freeTight hint only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ost' The pricing catalog is exact-match, so a future cloud model id (e.g. claude-opus-5) returns no pricing and the usage line printed 'local (no API cost)' for a paid frontier turn — actively misleading. Thread the provider through onUsage and, when a cloud provider's pricing is unknown, show 'cost unknown' instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A heavy turn that started on the frontier model printed '↑ escalated to …' even though nothing was escalated. Pass an 'initial' flag through onRoute and word up-front routing as '▸ routed to …', reserving 'escalated' for mid-turn hand-offs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Escalation state was run()-scoped, so a multi-turn hard task that escalated mid-turn restarted on the local model each follow-up and could ping-pong. Persist the escalated state for the session once a turn hands off mid-flight (model request or stuck); clearHistory() resets it so a fresh conversation re-routes from scratch. Up-front heavy routing stays per-turn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Make tiny-code more cost-effective by running cheap open-weight models
locally and escalating to frontier models only for heavy work.
deps. Google Gemma and other local models work as model ids.
cheap/local model and escalates to a configured frontier model when the
task is heavy or the local model calls the new
escalatetool / gets stuck.cloud turns ("no API cost" for local), plus a startup RAM advisory that
warns when a local model won't fit or is too small to tool-call reliably.
/costscommand surfaces usage, spend, and workflow cost-cutting tips.All tests pass (87), coverage >80%, typecheck and lint clean.