rili-live · zizzle6717 · Jun 10, 2026 · Jun 10, 2026 · Jun 10, 2026 · Jun 10, 2026
diff --git a/.changeset/local-models-cost-routing.md b/.changeset/local-models-cost-routing.md
diff --git a/.env.example b/.env.example
@@ -1,13 +1,18 @@
-# Provide at least one for cloud providers. If both are present, Anthropic is
-# the default. Ollama runs locally and needs no key.
+# Provide at least one for cloud providers. If several are present, the default
+# is the first available in this order: Anthropic, Gemini, DeepSeek, Qwen.
+# Ollama runs locally and needs no key.
 ANTHROPIC_API_KEY=
 GEMINI_API_KEY=
+DEEPSEEK_API_KEY=
+QWEN_API_KEY=          # Alibaba DashScope key (DASHSCOPE_API_KEY also accepted)
 
 # Optional overrides (also settable via config file / CLI flags)
-# TINY_CODE_PROVIDER=anthropic   # anthropic | gemini | ollama
+# TINY_CODE_PROVIDER=anthropic   # anthropic | gemini | ollama | deepseek | qwen
 # TINY_CODE_MODEL=claude-opus-4-8
 # TINY_CODE_OLLAMA_URL=http://localhost:11434/v1   # Ollama OpenAI-compatible endpoint
-# TINY_CODE_PRIORITY=performance # performance | cost | balanced — auto-picks a model when none is pinned
+# TINY_CODE_DEEPSEEK_URL=https://api.deepseek.com/v1
+# TINY_CODE_QWEN_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
+# TINY_CODE_PRIORITY=balanced   # performance | cost | balanced (default) — auto-picks a model when none is pinned
 # TINY_CODE_EFFORT=high          # low | medium | high | xhigh | max — Anthropic thinking budget
 
 # Self-improvement: reflect on sessions and propose markdown-only improvement PRs.

diff --git a/AGENTS.md b/AGENTS.md
@@ -29,9 +29,11 @@ runaway costs.
 - Keep it current: when adding/repricing a model, update its entry **and**
   `CATALOG_AS_OF`. Anthropic pricing comes from the bundled claude-api reference;
   verify Gemini pricing against Google's published rates. Don't guess prices.
-- `priority` defaults to `performance`, which preserves the historical default
-  models (Opus for Anthropic, Gemini 2.5 Pro for Gemini). Don't change the
-  default without updating the config tests that assert those ids.
+- `priority` defaults to `balanced` (best capability-per-dollar behind a quality
+  floor), so the auto-picked model is cost-aware by default — e.g. Sonnet rather
+  than Opus for Anthropic. `performance` restores the historical most-capable
+  picks. Don't change the default without updating the config/catalog tests that
+  assert those ids.
 
 ## Boundaries
 - No business logic. This is a general-purpose tool.

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,63 @@
+# @therr/tiny-code
+
+## 0.3.0
+
+### Minor Changes
+
+- 118faa0: Default model selection to `balanced` priority.
+
+  When no `model` is pinned, tiny-code now defaults to `priority: "balanced"`
+  instead of `performance`, picking the best capability-per-dollar model
+  (`codingScore / blendedCostPerMTok`, behind a quality floor) rather than the
+  most capable regardless of price. In line with the project's token-minimalism
+  goal, this makes the out-of-the-box pick cost-aware — e.g. Claude Sonnet rather
+  than Opus for Anthropic. Set `priority: "performance"` (or
+  `TINY_CODE_PRIORITY=performance`) to restore the previous most-capable defaults;
+  pinning a `model` still overrides everything.
+
+- 785b832: Add local models and cost-aware, local-first routing.
+  - **Local (Ollama) provider.** Talk to a local Ollama server over its
+    OpenAI-compatible API (`--provider ollama`), with an idle timeout so a hung
+    model can't freeze the REPL, best-effort token-usage reporting, and configurable
+    `maxTokens`.
+  - **Local-first routing.** Set `routing: "local-first"` with an `escalateTo`
+    target to run a cheap/local model by default and escalate heavy turns (or a
+    stuck local model, via the new `escalate` tool) to a frontier model — with full
+    conversation context preserved. Escalation is sticky across follow-up turns.
+  - **Model-selection policy** is now owned by a pluggable `ModelDecisionEngine`
+    (`LocalFirstModelEngine`), keeping the agent loop pure mechanism.
+  - **Compute awareness.** On startup with a local model, tiny-code estimates RAM
+    need vs. machine capacity and warns when a model likely won't fit or is too
+    small (≤3B) to tool-call reliably; an over-RAM local model is routed to the
+    frontier up front.
+  - **Priority-driven model selection.** `priority` (`performance` / `cost` /
+    `balanced`, or `TINY_CODE_PRIORITY`) auto-picks a catalog model when none is
+    pinned.
+  - The `/costs` view reports session usage, estimated spend, and routing, and the
+    usage line distinguishes an unpriced _cloud_ turn ("cost unknown") from a
+    _local_ turn ("no API cost").
+
+- f5c3832: Add a `/priority` command to switch cost/performance bias mid-session.
+
+  `/priority` (no args) shows the current priority and the active model;
+  `/priority performance | balanced | cost` switches it and re-picks the
+  auto-selected model on the fly — e.g. jump to the most capable model when a task
+  gets hard, then drop back to `balanced`. Pinned models and local-first routing
+  keep governing the model themselves, so there the command just records the new
+  priority. Backed by a new `AgentLoop.setProvider` for swapping the active
+  provider mid-session, and a `modelPinned` flag on the resolved config.
+
+- 52b179d: Add DeepSeek and Qwen Coder model support.
+  - **DeepSeek and Qwen providers.** Two new hosted, OpenAI-compatible providers
+    (`--provider deepseek` / `--provider qwen`), keyed by `DEEPSEEK_API_KEY` and
+    `QWEN_API_KEY` (or `DASHSCOPE_API_KEY`). Endpoints are overridable via
+    `TINY_CODE_DEEPSEEK_URL` / `TINY_CODE_QWEN_URL` or `deepseekBaseUrl` /
+    `qwenBaseUrl` in config — e.g. to target the international DashScope host.
+  - **Shared OpenAI-compatible core.** The streaming/tool-call adapter that backed
+    the Ollama provider is now a reusable `OpenAiCompatibleProvider` base; Ollama,
+    DeepSeek, and Qwen all extend it, differing only in endpoint, auth, and error
+    wording.
+  - **Catalog entries** for `deepseek-v4-pro`, `deepseek-v4-flash`,
+    `qwen3-coder-plus`, and `qwen3-coder-flash`, so `/costs` estimates and
+    priority-based model selection work for the new providers. `/costs` treats both
+    as paid cloud providers.
diff --git a/README.md b/README.md
@@ -3,9 +3,9 @@
 A small, extensible CLI coding agent built around one constraint: **keep token
 usage low**. As coding-agent costs climb, tiny-code automates the savings so
 you don't have to. Interactive terminal REPL, interchangeable **Anthropic**,
-**Gemini**, and **local (Ollama)** models, and just the core features you
-actually use: read/write/edit files, run shell commands, search code, and a
-custom commands/skills system. No business logic baked in.
+**Gemini**, **DeepSeek**, **Qwen Coder**, and **local (Ollama)** models, and just
+the core features you actually use: read/write/edit files, run shell commands,
+search code, and a custom commands/skills system. No business logic baked in.
 
 Run cheap, open-weight models locally and **escalate heavy work to a frontier
 model only when needed** — see [Local models & cost-aware routing](#local-models--cost-aware-routing).
@@ -29,19 +29,28 @@ node dist/cli.js
 
 ## Setup
 
-Provide at least one API key. If both are set, Anthropic is used by default.
+Provide at least one API key. If several are set, the default is the first
+available in this order: Anthropic, Gemini, DeepSeek, Qwen.
 
 ```bash
 export ANTHROPIC_API_KEY=sk-ant-...
 export GEMINI_API_KEY=...
+export DEEPSEEK_API_KEY=sk-...
+export QWEN_API_KEY=sk-...        # Alibaba DashScope key (DASHSCOPE_API_KEY also works)
 ```
 
+DeepSeek and Qwen are hosted, OpenAI-compatible coding models. Override their
+endpoints with `TINY_CODE_DEEPSEEK_URL` / `TINY_CODE_QWEN_URL` (or `deepseekBaseUrl`
+/ `qwenBaseUrl` in config) — e.g. to point Qwen at the international DashScope host.
+
 ## Usage
 
 ```bash
 tiny-code                       # start the REPL (uses an available key)
 tiny-code --provider gemini     # force a provider
 tiny-code --model claude-opus-4-8
+tiny-code --provider deepseek --model deepseek-v4-pro     # DeepSeek's coding model
+tiny-code --provider qwen --model qwen3-coder-plus        # Qwen Coder
 tiny-code --provider ollama --model gemma3:12b   # run a local model (no API cost)
 ```
 
@@ -52,6 +61,7 @@ shell commands) prompt for approval unless pre-approved in config.
 - `/costs` — session token usage, estimated $ cost, and cost-saving tips
 - `/clear` — clear the conversation history and start fresh
 - `/models` — show known models, pricing, and the active one (see below)
+- `/priority [performance|balanced|cost]` — show or switch the cost/performance priority mid-session; re-picks the auto-selected model unless one is pinned (see below)
 - `/improve` — reflect on the session and propose an improvement PR (see below)
 - `/<name> [args]` — run a custom command (see below)
 - `/exit` — quit
@@ -133,7 +143,7 @@ CLI flags.
   "provider": "anthropic",
   "model": "claude-opus-4-8",
   "ollamaBaseUrl": "http://localhost:11434/v1",
-  "priority": "performance",
+  "priority": "balanced",
   "maxTokens": 16000,
   "thinking": true,
   "effort": "high",
@@ -154,7 +164,8 @@ CLI flags.
 `routing: "local-first"` plus `escalateTo` enables cost-aware routing (see
 [above](#local-models--cost-aware-routing)); it defaults to `local-first`
 automatically whenever `escalateTo` is present. `ollamaBaseUrl` points at your
-Ollama server's OpenAI-compatible endpoint.
+Ollama server's OpenAI-compatible endpoint; `deepseekBaseUrl` / `qwenBaseUrl`
+override the DeepSeek and Qwen (DashScope) endpoints.
 
 Approximate cloud pricing used for the `/costs` estimate lives in the model
 catalog (`src/models/catalog.ts`) — edit it to match current vendor rates.
@@ -203,18 +214,25 @@ money and to pick a model that fits your cost/performance preference.
 - **Priority-driven selection.** When you don't pin a `model`, tiny-code picks
   one for you based on `priority`:
 
-  | `priority`      | Picks                                                        |
-  | --------------- | ----------------------------------------------------------- |
-  | `performance`   | The most capable model (the default — current behavior).    |
-  | `cost`          | The cheapest still-capable model.                           |
-  | `balanced`      | The best capability-per-dollar among capable models.        |
+  | `priority`      | Picks                                                            |
+  | --------------- | --------------------------------------------------------------- |
+  | `balanced`      | The best capability-per-dollar among capable models (default).  |
+  | `performance`   | The most capable model, ignoring price.                         |
+  | `cost`          | The cheapest still-capable model.                               |
+
+  `balanced` is the default: it ranks capable models by
+  `codingScore / blendedCostPerMTok` (a model's coding aptitude per blended
+  dollar, weighting input 80% / output 20%) behind a quality floor, so you get
+  strong-but-sensibly-priced models without opting in.
 
   ```json
-  { "priority": "balanced" }
+  { "priority": "performance" }
   ```
 
-  Or per-session with `TINY_CODE_PRIORITY=cost`. Pinning `model` (config, env,
-  or `--model`) always overrides the recommendation.
+  Or per-session with `TINY_CODE_PRIORITY=cost`, or on the fly with the
+  `/priority` command (e.g. `/priority performance` to jump to the most capable
+  model when a task gets hard, then `/priority balanced` to drop back). Pinning
+  `model` (config, env, or `--model`) always overrides the recommendation.
 
 The catalog is curated and offline (tiny-code has no live model-discovery yet —
 see `TODO.md`), so its prices carry an "as of" date; keep it current as vendors

diff --git a/TODO.md b/TODO.md
@@ -17,6 +17,20 @@ a single condensed block. For Anthropic use the compaction beta; for Gemini
 summarize via a lightweight call to a cheap model. Pair with conversation
 persistence so compacted sessions can be resumed.
 
+## On-the-fly provider switching
+The `/priority` command already swaps the active *model* within the current
+provider mid-session (`AgentLoop.setProvider`). Extend this to switch the
+*provider* too, so a session can move between Anthropic, Gemini, DeepSeek, Qwen,
+and Ollama without restarting. **Approach:** a `/provider <name> [model]`
+command that validates the target's API key (reuse `createProvider`'s checks),
+re-resolves the model (honoring `priority` and any pin), rebuilds the provider,
+and calls `agent.setProvider`. Decide how it interacts with local-first routing
+(switching the primary vs. the `escalateTo` target) and keep `/costs` accurate
+across providers — usage is already priced per-turn from the active model, so the
+running total stays correct; just refresh the session-end summary's model.
+Consider a single `/model <id>` shortcut that infers the provider from the
+catalog entry.
+
 ## Sub-agents
 Spawn isolated agent runs for parallel exploration/research (like a lightweight
 Explore/Plan agent). **Approach:** a `spawn_agent` tool whose `execute` constructs

diff --git a/package.json b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "@therr/tiny-code",
-  "version": "0.2.3",
+  "version": "0.3.0",
   "description": "A small, extensible CLI coding agent with interchangeable Anthropic and Gemini models.",
   "type": "module",
   "bin": {

diff --git a/src/agent/loop.ts b/src/agent/loop.ts
@@ -51,7 +51,7 @@ export interface AgentLoopOptions {
  * iteration guard trips). Conversation state persists across `run` calls.
  */
 export class AgentLoop {
-  private readonly provider: ModelProvider;
+  private provider: ModelProvider;
   private readonly registry: ToolRegistry;
   private readonly gate: PermissionGate;
   private readonly system: string;
@@ -86,6 +86,16 @@ export class AgentLoop {
     return this.messages;
   }
 
+  /**
+   * Swap the base provider mid-session — e.g. when the user changes the active
+   * model via `/priority`. Only affects un-escalated turns; if the session has
+   * stuck to an escalated frontier provider, that takes precedence until
+   * `clearHistory()` resets routing.
+   */
+  setProvider(provider: ModelProvider): void {
+    this.provider = provider;
+  }
+
   /** Drop the conversation history so the next turn starts fresh. Cumulative
    *  token usage is preserved, since it reflects the whole session's cost.
    *  Also clears sticky escalation: a fresh conversation re-routes from scratch. */

diff --git a/src/cli.ts b/src/cli.ts
@@ -12,18 +12,21 @@ Usage:
   tiny-code [options]
 
 Options:
-  --provider <name>   anthropic | gemini | ollama (default: inferred from API keys)
-  --model <id>        Model id override (e.g. claude-opus-4-8, gemma3:12b)
+  --provider <name>   anthropic | gemini | ollama | deepseek | qwen
+                      (default: inferred from API keys)
+  --model <id>        Model id override (e.g. claude-opus-4-8, qwen3-coder-plus)
   --config <path>     Path to a config JSON file
   -v, --version       Print version
   -h, --help          Show this help
 
 Environment:
   ANTHROPIC_API_KEY    Required for the Anthropic provider
   GEMINI_API_KEY       Required for the Gemini provider
+  DEEPSEEK_API_KEY     Required for the DeepSeek provider
+  QWEN_API_KEY         Required for the Qwen provider (or DASHSCOPE_API_KEY)
   TINY_CODE_OLLAMA_URL Ollama OpenAI-compatible base URL (default http://localhost:11434/v1)
   TINY_CODE_PRIORITY   performance | cost | balanced — auto-picks a model when
-                       none is pinned (default: performance)
+                       none is pinned (default: balanced)
 
 Cost-saving: set "routing": "local-first" with an "escalateTo" target in your
 config to run cheap/local models by default and escalate heavy tasks. Run /costs