
Add optional cost & token-efficiency columns to the leaderboard #41

Open
lakshvantb wants to merge 1 commit into main from cost-efficiency-columns


What

Adds an optional cost / token-efficiency view to the leaderboard table:

  • "Show Cost & Tokens" toggle → appends two columns for every model:
    • Output Tokens — avg output tokens per question (includes reasoning tokens)
    • Cost / Question — estimated USD per question
  • "Only Models With Cost Data" filter → narrows the table to models that have published cost data (disabled unless the cost toggle is on).
  • Both are URL-persisted (?cost=true, ?costonly=true) and reset by Clear Filters, matching the existing toggle pattern (showProvider, showReasoners, …).
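A minimal sketch of how the two URL-persisted toggles could be read and written with the standard URLSearchParams API. Only the query keys `cost` and `costonly` come from this PR; the `CostFilters` type and the `readCostFilters`, `writeCostFilters`, and `clearCostFilters` helpers are hypothetical. Treating `costonly` as off whenever `cost` is off is an assumption based on the filter being disabled in that state.

```typescript
// Hypothetical shape of the two new toggles; the PR only names the URL keys.
interface CostFilters {
  showCost: boolean; // ?cost=true
  costOnly: boolean; // ?costonly=true
}

// Read both toggles from a query string such as "?cost=true&costonly=true".
function readCostFilters(search: string): CostFilters {
  const params = new URLSearchParams(search);
  const showCost = params.get("cost") === "true";
  // Assumption: the cost-only filter has no effect unless the cost columns are shown.
  const costOnly = showCost && params.get("costonly") === "true";
  return { showCost, costOnly };
}

// Write the toggles back, dropping keys that are off so default URLs stay clean.
function writeCostFilters(search: string, f: CostFilters): string {
  const params = new URLSearchParams(search);
  if (f.showCost) params.set("cost", "true");
  else params.delete("cost");
  if (f.showCost && f.costOnly) params.set("costonly", "true");
  else params.delete("costonly");
  const qs = params.toString();
  return qs ? `?${qs}` : "";
}

// "Clear Filters" resets both toggles (alongside the existing ones, not shown here).
function clearCostFilters(search: string): string {
  return writeCostFilters(search, { showCost: false, costOnly: false });
}
```

Dropping keys that are off, rather than writing `cost=false`, keeps default URLs unchanged; the project's existing toggles may encode this differently.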

How partial coverage is handled (top-models-only)

Cost/token metrics are published only for the top set of models. The table presents the gaps as expected coverage, not as breakage:

  • Models without data render a muted "—" with a tooltip ("published for the top models only").
  • "—" cells are stored as null, so they sort to the bottom regardless of direction (existing SortTable null-handling).
  • Want a clean list? The "Only Models With Cost Data" filter hides the rest.
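The sorting and rendering rules above can be illustrated with a comparator that places null cells last in both directions. This is a sketch of the described behavior, not SortTable's actual code (which the PR reuses but does not show); the `Row` shape and the `compareNullsLast`, `sortByCost`, and `renderCostCell` names are hypothetical, and the three-decimal dollar format is an assumption.

```typescript
type SortDir = "asc" | "desc";

// Hypothetical row shape for illustration.
interface Row {
  model: string;
  costPerQuestion: number | null; // null = no published cost data
}

// Nulls always sort last, whichever direction the column is sorted in.
function compareNullsLast(a: number | null, b: number | null, dir: SortDir): number {
  if (a === null && b === null) return 0;
  if (a === null) return 1; // a goes after b
  if (b === null) return -1; // b goes after a
  return dir === "asc" ? a - b : b - a;
}

function sortByCost(rows: Row[], dir: SortDir): Row[] {
  return [...rows].sort((x, y) => compareNullsLast(x.costPerQuestion, y.costPerQuestion, dir));
}

// A null cell renders as a muted "—" with the coverage tooltip.
function renderCostCell(v: number | null): { text: string; title?: string } {
  return v === null
    ? { text: "—", title: "published for the top models only" }
    : { text: `$${v.toFixed(3)}` };
}
```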

Data

  • New optional artifact public/cost_<date>.csv (model, avg_input_tokens, avg_output_tokens, cost_per_question), merged into each row by model id so columns sort with the existing machinery.
  • If the file is absent, nothing changes, so other dates are unaffected.
  • cost_2026_01_08.csv covers 14 top models. Values combine billed output_tokens with reconstructed input_tokens (tiktoken o200k_base, the Gemini count_tokens API, or provider tokenizers for Qwen, DeepSeek, and Kimi), priced at each model's official per-token API rates.
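The per-row merge can be sketched as below. The CSV column names come from this PR; the type and function names, the merged field names, and the simple comma splitting (no quoted fields) are assumptions. The `costPerQuestion` helper only illustrates one plausible reading of the pricing step, assuming input and output tokens are priced separately in USD per million tokens; the PR does not spell out the exact arithmetic.

```typescript
// Columns follow public/cost_<date>.csv as described in the PR.
interface CostRecord {
  avgInputTokens: number;
  avgOutputTokens: number;
  costPerQuestion: number;
}

// Hypothetical leaderboard row: a model id plus arbitrary metric columns.
interface LeaderboardRow {
  model: string;
  [metric: string]: unknown;
}

// Assumption: separate input/output prices in USD per million tokens.
function costPerQuestion(inTok: number, outTok: number, inPricePerM: number, outPricePerM: number): number {
  return (inTok * inPricePerM + outTok * outPricePerM) / 1_000_000;
}

// Parse a simple CSV (header + rows, no quoted commas) into a map keyed by model id.
function parseCostCsv(text: string): Map<string, CostRecord> {
  const [header, ...lines] = text.trim().split(/\r?\n/);
  const cols = header.split(",").map((c) => c.trim());
  const col = (name: string) => cols.indexOf(name);
  const out = new Map<string, CostRecord>();
  for (const line of lines) {
    if (!line.trim()) continue;
    const cells = line.split(",").map((c) => c.trim());
    out.set(cells[col("model")], {
      avgInputTokens: Number(cells[col("avg_input_tokens")]),
      avgOutputTokens: Number(cells[col("avg_output_tokens")]),
      costPerQuestion: Number(cells[col("cost_per_question")]),
    });
  }
  return out;
}

// Merge cost data into each row by model id; models without an entry get null.
// A missing file (costCsv === null) returns the rows untouched.
function mergeCost(rows: LeaderboardRow[], costCsv: string | null): LeaderboardRow[] {
  if (costCsv === null) return rows;
  const costs = parseCostCsv(costCsv);
  return rows.map((row) => {
    const c = costs.get(row.model);
    return {
      ...row,
      output_tokens: c ? c.avgOutputTokens : null,
      cost_per_question: c ? c.costPerQuestion : null,
    };
  });
}
```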

Testing

  • npm run build compiles cleanly (no new warnings; feature present in the bundle).
  • Data-layer test against the real CSVs: 6/6 pass, covering merge correctness, missing → "—" (null), the cost-only filter (14 rows), and null-to-bottom sorting (cheapest first = deepseek-v4-pro, $0.029/q).
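The 14-row cost-only check also works as a join-key check: if a model id in the CSV were misspelled, that model would silently fall back to "—" and the filtered count would drop. A sketch of such a check on toy data, with hypothetical names (`costOnlyRows`, `unmatchedCostIds`, `checkCoverage`) that are not taken from the PR's test:

```typescript
interface Row {
  model: string;
  cost_per_question: number | null;
}

// Rows kept by the "Only Models With Cost Data" filter.
function costOnlyRows(rows: Row[]): Row[] {
  return rows.filter((r) => r.cost_per_question !== null);
}

// CSV model ids that match no leaderboard row (e.g. a typo or a renamed id).
function unmatchedCostIds(rows: Row[], csvModelIds: string[]): string[] {
  const tableIds = new Set(rows.map((r) => r.model));
  return csvModelIds.filter((id) => !tableIds.has(id));
}

// Passes only if the filtered row count equals the CSV entry count
// and every CSV id matched some row.
function checkCoverage(rows: Row[], csvModelIds: string[]) {
  const kept = costOnlyRows(rows).length;
  const unmatched = unmatchedCostIds(rows, csvModelIds);
  return { ok: kept === csvModelIds.length && unmatched.length === 0, kept, unmatched };
}
```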

Notes / follow-ups

  • Anthropic thinking models are intentionally excluded from the cost CSV (their stored output tokens predate the billed-usage fix and undercount reasoning — they need a rerun, not a backfill).
  • Natural follow-ups: a cost-vs-quality scatter (Pareto), and extending cost_<date>.csv coverage as more models are validated.
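For the Pareto follow-up, the frontier of a cost-vs-quality scatter can be computed with one sort and a single pass. This is a sketch under assumptions: `Point`, `score` (any higher-is-better quality metric), and `paretoFrontier` are hypothetical, and only models with cost data would be passed in.

```typescript
interface Point {
  model: string;
  cost: number; // USD per question
  score: number; // higher is better
}

// Keep models that no other model beats on both axes
// (at least as cheap and at least as good, strictly better on one).
function paretoFrontier(points: Point[]): Point[] {
  // Cheapest first; for equal cost, the higher score comes first.
  const sorted = [...points].sort((a, b) => a.cost - b.cost || b.score - a.score);
  const frontier: Point[] = [];
  let bestScore = -Infinity;
  for (const p of sorted) {
    // Every earlier point is at least as cheap, so p survives only if it scores higher.
    if (p.score > bestScore) {
      frontier.push(p);
      bestScore = p.score;
    }
  }
  return frontier;
}
```

Sorting by cost and tracking the best score seen so far makes this O(n log n); exact duplicates (same cost and score) keep only the first.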

🤖 Generated with Claude Code

Adds a "Show Cost & Tokens" toggle that appends two columns — Output Tokens
(avg output incl. reasoning, per question) and Cost / Question ($) — to the
existing table for all models, plus an "Only Models With Cost Data" filter.

- Cost data is published per-date as an optional public/cost_<date>.csv and
  merged into each row, so the columns sort via the existing SortTable logic.
- Coverage is intentionally partial (top models): models without an entry render
  a muted "—" (tooltip explains coverage) and, being null, sort to the bottom.
- The cost dataset is a no-op when absent, so other dates are unaffected.
- cost_2026_01_08.csv covers 14 top models; values combine billed output_tokens
  and reconstructed input_tokens (tiktoken o200k / count_tokens API / provider
  tokenizers), priced at per-model official API rates.

Tested: production build compiles; a data-layer test against the real CSVs
verifies merge, missing->"—", the cost-only filter (14 rows), and null-to-bottom
sorting (cheapest first = deepseek-v4-pro).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>