Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 26 additions & 0 deletions .changeset/local-models-cost-routing.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
---
"@therr/tiny-code": minor
---

Add local models and cost-aware, local-first routing.

- **Local (Ollama) provider.** Talk to a local Ollama server over its
OpenAI-compatible API (`--provider ollama`), with an idle timeout so a hung
model can't freeze the REPL, best-effort token-usage reporting, and configurable
`maxTokens`.
- **Local-first routing.** Set `routing: "local-first"` with an `escalateTo`
target to run a cheap/local model by default and escalate heavy turns (or a
stuck local model, via the new `escalate` tool) to a frontier model — with full
conversation context preserved. Escalation is sticky across follow-up turns.
- **Model-selection policy** is now owned by a pluggable `ModelDecisionEngine`
(`LocalFirstModelEngine`), keeping the agent loop pure mechanism.
- **Compute awareness.** On startup with a local model, tiny-code estimates RAM
need vs. machine capacity and warns when a model likely won't fit or is too
small (≤3B) to tool-call reliably; an over-RAM local model is routed to the
frontier up front.
- **Priority-driven model selection.** `priority` (`performance` / `cost` /
`balanced`, or `TINY_CODE_PRIORITY`) auto-picks a catalog model when none is
pinned.
- The `/costs` view reports session usage, estimated spend, and routing, and the
usage line distinguishes an unpriced *cloud* turn ("cost unknown") from a
*local* turn ("no API cost").
8 changes: 6 additions & 2 deletions .env.example
Original file line number Diff line number Diff line change
@@ -1,10 +1,14 @@
# Provide at least one. If both are present, Anthropic is used by default.
# Provide at least one for cloud providers. If both are present, Anthropic is
# the default. Ollama runs locally and needs no key.
ANTHROPIC_API_KEY=
GEMINI_API_KEY=

# Optional overrides (also settable via config file / CLI flags)
# TINY_CODE_PROVIDER=anthropic # anthropic | gemini
# TINY_CODE_PROVIDER=anthropic # anthropic | gemini | ollama
# TINY_CODE_MODEL=claude-opus-4-8
# TINY_CODE_OLLAMA_URL=http://localhost:11434/v1 # Ollama OpenAI-compatible endpoint
# TINY_CODE_PRIORITY=performance # performance | cost | balanced — auto-picks a model when none is pinned
# TINY_CODE_EFFORT=high # low | medium | high | xhigh | max — Anthropic thinking budget

# Self-improvement: reflect on sessions and propose markdown-only improvement PRs.
# On by default; set to 0 to disable. Requires the `gh` CLI installed + authed.
Expand Down
65 changes: 61 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,10 +2,13 @@

A small, extensible CLI coding agent built around one constraint: **keep token
usage low**. As coding-agent costs climb, tiny-code automates the savings so
you don't have to. Interactive terminal REPL, interchangeable **Anthropic** and
**Gemini** models, and just the core features you actually use: read/write/edit
files, run shell commands, search code, and a custom commands/skills system.
No business logic baked in.
you don't have to. Interactive terminal REPL, interchangeable **Anthropic**,
**Gemini**, and **local (Ollama)** models, and just the core features you
actually use: read/write/edit files, run shell commands, search code, and a
custom commands/skills system. No business logic baked in.

Run cheap, open-weight models locally and **escalate heavy work to a frontier
model only when needed** — see [Local models & cost-aware routing](#local-models--cost-aware-routing).

> Status: early (v0.x). Published as `@therr/tiny-code`; the binary is
> `tiny-code`. Names may change before the first npm publish.
Expand Down Expand Up @@ -39,18 +42,61 @@ export GEMINI_API_KEY=...
tiny-code # start the REPL (uses an available key)
tiny-code --provider gemini # force a provider
tiny-code --model claude-opus-4-8
tiny-code --provider ollama --model gemma3:12b # run a local model (no API cost)
```

In the REPL: type a request, watch it work. Mutating actions (writes, edits,
shell commands) prompt for approval unless pre-approved in config.

- `/help` — list commands
- `/costs` — session token usage, estimated $ cost, and cost-saving tips
- `/clear` — clear the conversation history and start fresh
- `/models` — show known models, pricing, and the active one (see below)
- `/improve` — reflect on the session and propose an improvement PR (see below)
- `/<name> [args]` — run a custom command (see below)
- `/exit` — quit

## Local models & cost-aware routing

tiny-code talks to a local [Ollama](https://ollama.com) server over its
OpenAI-compatible API, so any model you've pulled is available — including
**Google Gemma 3** (`gemma3:4b`, `gemma3:12b`, `gemma3:27b`) and
`qwen2.5-coder` (the default, which tool-calls reliably).

```bash
ollama serve
ollama pull qwen2.5-coder:7b
tiny-code --provider ollama --model qwen2.5-coder:7b
```

**Mind the compute cost.** Local models are free of API charges but use your
machine's RAM/VRAM. On startup with an Ollama model, tiny-code prints how much
memory the model needs versus what's free, and warns if it likely won't fit or
if the model is too small (≤3B) to tool-call reliably. Rough guide (≈Q4):

| Model | ~RAM needed | Good for |
| ------------ | ----------- | --------------------------------- |
| `gemma3:1b` | ~1 GB | trivial text (poor at tool calls) |
| `gemma3:4b` | ~3 GB | lightweight edits, search |
| `gemma3:12b` | ~7 GB | most coding tasks |
| `gemma3:27b` | ~16 GB | stronger reasoning |

**Local-first routing.** Set a `routing` of `local-first` with an `escalateTo`
target: every turn starts on the cheap/local model, and tiny-code escalates to
the frontier model when a turn looks heavy (refactors, debugging, multi-file
work) or when the local model gets stuck and calls the built-in `escalate` tool.
You get local speed and zero cost for the bulk of the work, and frontier power
only for the hard parts. Run `/costs` any time for usage, spend, and tips.

```json
{
"provider": "ollama",
"model": "qwen2.5-coder:7b",
"routing": "local-first",
"escalateTo": { "provider": "anthropic", "model": "claude-opus-4-8" }
}
```

## Project context

On start, the agent walks up from the working directory looking for `AGENTS.md`
Expand Down Expand Up @@ -86,11 +132,14 @@ CLI flags.
{
"provider": "anthropic",
"model": "claude-opus-4-8",
"ollamaBaseUrl": "http://localhost:11434/v1",
"priority": "performance",
"maxTokens": 16000,
"thinking": true,
"effort": "high",
"maxIterations": 50,
"routing": "off",
"escalateTo": { "provider": "anthropic", "model": "claude-opus-4-8" },
"allow": {
"tools": [],
"bash": ["npm test", "git status", "git diff"],
Expand All @@ -102,6 +151,14 @@ CLI flags.
`allow` pre-approves mutating actions so they skip the confirmation prompt:
`bash` matches command prefixes, `write` matches path globs for write/edit.

`routing: "local-first"` plus `escalateTo` enables cost-aware routing (see
[above](#local-models--cost-aware-routing)); it defaults to `local-first`
automatically whenever `escalateTo` is present. `ollamaBaseUrl` points at your
Ollama server's OpenAI-compatible endpoint.

Approximate cloud pricing used for the `/costs` estimate lives in the model
catalog (`src/models/catalog.ts`) — edit it to match current vendor rates.

## Token efficiency

Minimizing token usage is a first-class goal — coding-agent bills grow fast,
Expand Down
13 changes: 13 additions & 0 deletions TODO.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,19 @@ Explore/Plan agent). **Approach:** a `spawn_agent` tool whose `execute` construc
a child `AgentLoop` with its own message history and a read-only tool subset,
returning the child's final text. Keep depth at 1 to start.

> Note: the cheap/expensive model split is now handled by **local-first
> routing** (`routing: "local-first"` + `escalateTo`): turns start on the
> local/cheap model and escalate to a frontier model when heavy or stuck (see
> `src/agent/router.ts`, `src/tools/escalate.ts`, and the loop's escalation
> logic). Sub-agents remain useful for *parallel* isolated runs.

## More local-model interoperability
Ollama is wired in via its OpenAI-compatible endpoint (`src/providers/ollama.ts`),
which already covers LM Studio and vLLM (same wire format) by pointing
`ollamaBaseUrl`/`TINY_CODE_OLLAMA_URL` at them. **Next:** an optional
`/api/tags` probe to list locally-installed models and surface tokens/sec in the
usage line; per-model context-window awareness for the RAM advisory.

## Web search / fetch
Let the agent look up docs during a task. **Approach:** add `web_search` and
`web_fetch` tools. For Anthropic, optionally delegate to the server-side
Expand Down
4 changes: 2 additions & 2 deletions package-lock.json

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion package.json
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
{
"name": "@therr/tiny-code",
"version": "0.1.0",
"version": "0.2.0",
"description": "A small, extensible CLI coding agent with interchangeable Anthropic and Gemini models.",
"type": "module",
"bin": {
Expand Down
3 changes: 3 additions & 0 deletions src/agent/decision/index.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
export type { ModelDecisionEngine, RouteDecision, TurnSignals } from './types.js';
export { LocalFirstModelEngine } from './localFirst.js';
export type { LocalFirstOptions } from './localFirst.js';
87 changes: 87 additions & 0 deletions src/agent/decision/localFirst.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
import type { ModelProvider } from '../../providers/types.js';
import { classifyTurn, type TaskWeight } from '../router.js';
import { checkLocalModel } from '../../system/resources.js';
import { estimateCost } from '../../models/catalog.js';
import type { ModelDecisionEngine, RouteDecision, TurnSignals } from './types.js';

export interface LocalFirstOptions {
/** The cheap/local model that handles turns by default. */
primary: ModelProvider;
/** The frontier model heavy/stuck turns escalate to. */
escalation: ModelProvider;
/** Task-weight classifier. Defaults to {@link classifyTurn}. */
classify?: (input: string) => TaskWeight;
/** Consecutive tool-error iterations before auto-escalating. Defaults to 3. */
stuckThreshold?: number;
/** RAM-fit check (injectable for tests). Defaults to {@link checkLocalModel}. */
ramCheck?: (model: string) => { warn: boolean };
}

/** Per-MTok probe used to compare relative model cost for the cost-aware note. */
const COST_PROBE = { inputTokens: 1_000_000, outputTokens: 1_000_000 };

/**
* The default local-first policy: handle each turn on the cheap/local model and
* escalate to the frontier model only when capability demands it — the task
* looks heavy up front, the model explicitly hands off via the `escalate` tool,
* it gets stuck on repeated tool errors, or (compute awareness) a local model
* won't fit in available RAM.
*
* Cost awareness is wired in but used defensively: the engine can *report* the
* cost implication of an escalation (see {@link costNote}) but never initiates
* a cost-driven route change — escalation is always a capability decision.
*/
export class LocalFirstModelEngine implements ModelDecisionEngine {
private readonly primary: ModelProvider;
private readonly escalation: ModelProvider;
private readonly classify: (input: string) => TaskWeight;
private readonly stuckThreshold: number;
private readonly ramCheck: (model: string) => { warn: boolean };

constructor(opts: LocalFirstOptions) {
this.primary = opts.primary;
this.escalation = opts.escalation;
this.classify = opts.classify ?? classifyTurn;
this.stuckThreshold = opts.stuckThreshold ?? 3;
this.ramCheck = opts.ramCheck ?? checkLocalModel;
}

selectInitial(userInput: string): RouteDecision {
// Compute awareness: a local primary that won't fit in RAM runs slowly or
// fails outright, so route it to the frontier up front — even for light work.
if (this.primary.name === 'ollama' && this.ramCheck(this.primary.model).warn) {
return { provider: this.escalation, reason: 'compute: local model exceeds RAM' };
}
if (this.classify(userInput) === 'heavy') {
return { provider: this.escalation, reason: 'heavy task' };
}
return { provider: this.primary };
}

considerEscalation(signals: TurnSignals): RouteDecision | undefined {
if (signals.alreadyEscalated) return undefined;
if (signals.escalateRequested) {
return { provider: this.escalation, reason: 'requested by model' };
}
if (signals.consecutiveErrors >= this.stuckThreshold) {
return { provider: this.escalation, reason: 'stuck — repeated tool errors' };
}
return undefined;
}

/**
* A short, human-readable summary of what escalating costs, relative to the
* primary. Pure reporting — does not influence routing. Local models report
* as having no API cost.
*/
costNote(): string {
const from = estimateCost(this.primary.model, COST_PROBE);
const to = estimateCost(this.escalation.model, COST_PROBE);
if (to === null) return 'escalates to a local model (no API cost)';
if (from === null || from === 0) {
return 'escalates from a free/local model to a paid model (adds API cost)';
}
const multiplier = to / from;
return `escalation costs ~${multiplier.toFixed(1)}× the primary per token`;
}
}
47 changes: 47 additions & 0 deletions src/agent/decision/types.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
import type { ModelProvider } from '../../providers/types.js';

/**
* A model-selection decision: the provider to use and an optional human-readable
* reason. An empty/absent `reason` means "no route note" — the loop only fires
* {@link AgentUI.onRoute} when a decision both changes the active provider and
* carries a reason.
*/
export interface RouteDecision {
provider: ModelProvider;
reason?: string;
}

/**
* Runtime signals the loop feeds the engine once per iteration so it can decide
* whether to switch providers mid-turn. These are mechanical bookkeeping the
* loop already tracks; the engine owns the policy that interprets them.
*/
export interface TurnSignals {
/** The model called the `escalate` tool during this iteration. */
escalateRequested: boolean;
/** Consecutive iterations that ended with at least one tool error. */
consecutiveErrors: number;
/** Whether the turn has already been escalated (keeps the engine stateless). */
alreadyEscalated: boolean;
/** The provider currently handling the turn. */
current: ModelProvider;
/** Iteration index (0-based). */
iteration: number;
}

/**
* Owns all model-selection policy. {@link AgentLoop} depends on this interface
* instead of inlining classification + escalation rules, so the loop keeps only
* mechanism (send → stream → run tools → repeat) and the policy lives in one
* cohesive, testable place. Implementations may reason about task weight, cost,
* and compute (RAM) fit; the loop never sees that reasoning.
*/
export interface ModelDecisionEngine {
/** Pick the provider that starts a turn, given the user's input. */
selectInitial(userInput: string): RouteDecision;
/**
* Decide whether to switch providers mid-turn. Returns `undefined` to stay on
* the current provider. Called once per iteration after tool results.
*/
considerEscalation(signals: TurnSignals): RouteDecision | undefined;
}
Loading
Loading