Add optional cost & token-efficiency columns to the leaderboard by lakshvantb · Pull Request #41 · LiveBench/livebench.github.io

lakshvantb · 2026-06-17T17:12:05Z

What

Adds an optional cost / token-efficiency view to the leaderboard table:

"Show Cost & Tokens" toggle → appends two columns for every model:
- Output Tokens — avg output tokens per question (includes reasoning tokens)
- Cost / Question — estimated USD per question
"Only Models With Cost Data" filter → narrows the table to models that have published cost data (disabled unless the cost toggle is on).
Both are URL-persisted (?cost=true, ?costonly=true) and reset by Clear Filters, matching the existing toggle pattern (showProvider, showReasoners, …).
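The URL persistence described above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the `CostViewState` shape and the `readCostState` / `writeCostState` helpers are hypothetical names, and only the `cost` and `costonly` parameter names come from the description.

```typescript
// Hypothetical state shape for the two toggles (names are assumptions).
interface CostViewState {
  showCost: boolean;
  costOnly: boolean;
}

// Read toggle state from a query string such as "?cost=true&costonly=true".
function readCostState(search: string): CostViewState {
  const params = new URLSearchParams(search);
  const showCost = params.get("cost") === "true";
  // The filter is disabled unless the cost toggle is on, so ignore it otherwise.
  const costOnly = showCost && params.get("costonly") === "true";
  return { showCost, costOnly };
}

// Write toggle state back, omitting parameters that are off (the default).
// Unrelated parameters already in the query string are preserved.
function writeCostState(search: string, state: CostViewState): string {
  const params = new URLSearchParams(search);
  params.delete("cost");
  params.delete("costonly");
  if (state.showCost) params.set("cost", "true");
  if (state.showCost && state.costOnly) params.set("costonly", "true");
  const query = params.toString();
  return query ? `?${query}` : "";
}
```

Under this sketch, Clear Filters would amount to calling `writeCostState` with both flags set to `false`, which removes both parameters from the URL.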

How partial coverage is handled (top-models-only)

Cost/token metrics are published only for the top set of models. The UI presents this as intentional partial coverage rather than as broken or missing data:

Models without data render a muted "—" with a tooltip ("published for the top models only").
"—" cells are stored as null, so they sort to the bottom regardless of direction (existing SortTable null-handling).
Want a clean list? The Only Models With Cost Data filter hides the rest.
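The null-to-bottom behavior can be illustrated with a small comparator. The existing SortTable implementation is not shown in this description, so the function below is only an assumed equivalent (`compareNullsLast` and `SortDirection` are hypothetical names), and the cost values are made up for illustration.

```typescript
type SortDirection = "asc" | "desc";

// Compare two nullable numbers so that null always sorts last,
// whichever direction the non-null values are ordered in.
function compareNullsLast(
  a: number | null,
  b: number | null,
  direction: SortDirection
): number {
  if (a === null && b === null) return 0;
  if (a === null) return 1; // a goes after b
  if (b === null) return -1; // b goes after a
  return direction === "asc" ? a - b : b - a;
}

// Illustrative values only; null stands for a "—" cell.
const costs: (number | null)[] = [0.12, null, 0.03, null, 0.5];
const ascending = [...costs].sort((a, b) => compareNullsLast(a, b, "asc"));
const descending = [...costs].sort((a, b) => compareNullsLast(a, b, "desc"));
// ascending  → [0.03, 0.12, 0.5, null, null]
// descending → [0.5, 0.12, 0.03, null, null]
```

The key point is that the null checks run before the direction is consulted, so flipping the sort order never moves uncovered models above covered ones.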

Data

New optional artifact public/cost_<date>.csv (model, avg_input_tokens, avg_output_tokens, cost_per_question), merged into each row by model id so columns sort with the existing machinery.
Absent file = no-op — other dates are unaffected.
cost_2026_01_08.csv covers 14 top models. Values are computed from billed output_tokens plus reconstructed input_tokens (tiktoken o200k_base / Gemini count_tokens API / provider tokenizers for Qwen·DeepSeek·Kimi), multiplied by each model's official API prices.
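As a rough illustration of the merge step, the sketch below parses a cost CSV with the four columns listed above and attaches the values to rows by model id. The `CostEntry`, `LeaderboardRow`, `parseCostCsv`, and `mergeCost` names are hypothetical, the parser assumes no quoted fields, and passing `null` stands in for an absent file.

```typescript
interface CostEntry {
  avgInputTokens: number;
  avgOutputTokens: number;
  costPerQuestion: number;
}

interface LeaderboardRow {
  model: string;
  outputTokens: number | null;
  costPerQuestion: number | null;
}

// Parse a file with the header
// model,avg_input_tokens,avg_output_tokens,cost_per_question.
// Assumes plain comma-separated values without quoting.
function parseCostCsv(text: string): Map<string, CostEntry> {
  const entries = new Map<string, CostEntry>();
  const lines = text.trim().split(/\r?\n/).slice(1); // skip the header row
  for (const line of lines) {
    if (!line.trim()) continue;
    const [model, input, output, cost] = line.split(",").map((s) => s.trim());
    entries.set(model, {
      avgInputTokens: Number(input),
      avgOutputTokens: Number(output),
      costPerQuestion: Number(cost),
    });
  }
  return entries;
}

// Merge cost data into rows by model id. A null csvText (file absent)
// yields rows whose cost fields are all null, i.e. a no-op for that date.
function mergeCost(models: string[], csvText: string | null): LeaderboardRow[] {
  const entries =
    csvText === null ? new Map<string, CostEntry>() : parseCostCsv(csvText);
  return models.map((model) => {
    const entry = entries.get(model);
    return {
      model,
      outputTokens: entry ? entry.avgOutputTokens : null,
      costPerQuestion: entry ? entry.costPerQuestion : null,
    };
  });
}
```

With rows in this shape, the "Only Models With Cost Data" filter reduces to `rows.filter((r) => r.costPerQuestion !== null)`.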

Testing

npm run build compiles cleanly (no new warnings; feature present in the bundle).
Data-layer test against the real CSVs, 6/6 pass: merge correctness, missing values rendered as "—" (null), cost-only filter (14 rows), and null-to-bottom sorting (cheapest first = deepseek-v4-pro, $0.029/q).

Notes / follow-ups

Anthropic thinking models are intentionally excluded from the cost CSV (their stored output tokens predate the billed-usage fix and undercount reasoning — they need a rerun, not a backfill).
Natural follow-ups: a cost-vs-quality scatter (Pareto), and extending cost_<date>.csv coverage as more models are validated.

🤖 Generated with Claude Code

Adds a "Show Cost & Tokens" toggle that appends two columns, Output Tokens (avg output incl. reasoning, per question) and Cost / Question ($), to the existing table for all models, plus an "Only Models With Cost Data" filter.

- Cost data is published per-date as an optional public/cost_<date>.csv and merged into each row, so the columns sort via the existing SortTable logic.
- Coverage is intentionally partial (top models): models without an entry render a muted "—" (tooltip explains coverage) and, being null, sort to the bottom.
- The cost dataset is a no-op when absent, so other dates are unaffected.
- cost_2026_01_08.csv covers 14 top models; values = billed output_tokens + reconstructed input_tokens (tiktoken o200k / count_tokens API / provider tokenizers) x per-model official API prices.

Tested: production build compiles; a data-layer test against the real CSVs verifies merge, missing->"—", the cost-only filter (14 rows), and null-to-bottom sorting (cheapest first = deepseek-v4-pro).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lakshvantb wants to merge 1 commit into main from cost-efficiency-columns
