probe(pz2): tANS entropy accounting — per-block tANS passes (−0.25pp), global tables dead#153

Merged
ChrisLundquist merged 1 commit into master from claude/tans-entropy-probe
Jun 10, 2026
@ChrisLundquist

What

Task #13: the entropy-accounting probe answering "would tANS or another entropy coder beat the 8-lane Huffman if we had global state tables?" Pure histogram math — no codec or wire changes. examples/pz2_entropy_probe.rs prices four scenarios bit-exactly on the exact lane streams the encoder feeds its Huffman lanes (via two #[doc(hidden)] hooks: pz2::probe_lane_streams, pz2::probe_huffman_lengths), mirroring shipped headers and CONST/RAW fallbacks.
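To make "pure histogram math" concrete, here is a minimal sketch of pricing one lane's histogram two ways: the Shannon ideal that an ANS-style coder approaches, and a prefix-code cost. This is not the probe's code. The function names are hypothetical, headers and CONST/RAW fallbacks are ignored, and the Huffman cost comes from a plain unbounded heap merge rather than the shipped package-merge lengths.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Shannon-ideal cost in bits of a stream with this histogram, priced with
/// the stream's own statistics: sum over symbols of n * log2(total / n).
fn shannon_bits(hist: &[u64]) -> f64 {
    let total: u64 = hist.iter().sum();
    hist.iter()
        .filter(|&&n| n > 0)
        .map(|&n| n as f64 * (total as f64 / n as f64).log2())
        .sum()
}

/// Cost in bits under an unbounded Huffman code (no length limit).
/// The total cost equals the sum of the weights of every internal node
/// created while merging the two lightest nodes repeatedly.
fn huffman_bits(hist: &[u64]) -> u64 {
    let mut heap: BinaryHeap<Reverse<u64>> = hist
        .iter()
        .filter(|&&n| n > 0)
        .map(|&n| Reverse(n))
        .collect();
    // Assumption: a one-symbol alphabet still spends one bit per symbol.
    if heap.len() == 1 {
        return heap.pop().unwrap().0;
    }
    let mut bits = 0;
    while heap.len() > 1 {
        let a = heap.pop().unwrap().0;
        let b = heap.pop().unwrap().0;
        bits += a + b;
        heap.push(Reverse(a + b));
    }
    bits
}
```

For a skewed histogram such as `[90, 5, 5]`, `huffman_bits` returns 110 while `shannon_bits` is just under 57: a prefix code cannot spend less than one bit on any symbol, however dominant that symbol is.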

Results (Δpp of input vs shipped; 2 MiB blocks, greedy, 32 MiB segments)

| input | shipped total | B: tANS/block | C: global tables | C': per-block choice | D: order-1/seg |
|---|---|---|---|---|---|
| blob | 20.749pp | −0.255 | +0.894 | −0.066 | −0.055 |
| dickens | 13.580pp | −0.548 | −0.546 | −0.546 | −0.508 |
| webster | 11.853pp | −0.307 | −0.310 | −0.310 | −0.299 |
| mozilla | 28.129pp | −0.206 | +0.200 | −0.116 | −0.406 |
| sao | 66.365pp | −0.432 | −0.386 | −0.386 | −0.467 |
| xml | 5.100pp | −0.125 | ~0.000 | −0.083 | −0.098 |

Verdicts

  1. Per-block tANS passes the 0.15–0.2pp gate (−0.255pp blob) — and it needs no global tables at all. Lane split: ll −0.114 / lit −0.059 / ml −0.054 / of −0.027. The win is the Huffman 1-bit floor on the small-alphabet sequence-code lanes (greedy parses push lit_run = 0 far past p = 0.5); the 8-lane Huffman literals are near-optimal (huff0's classic result reconfirmed).
  2. Global tables are DEAD — the hypothesis inverted. +0.89pp on the blob: 2 MiB per-block histograms are already statistically saturated, and heterogeneous segments poison shared tables (the global-head-dict lesson again).
  3. Order-1 segment-global is DEAD at blob scope: −0.055pp, well short of the 0.15–0.2pp gate. Honest outlier recorded: mozilla −0.41pp.
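The "heterogeneous segments poison shared tables" effect in verdict 2 can be reproduced with a toy cross-entropy calculation. The sketch below is illustrative only: `bits_under` and `pooled` are hypothetical names, the coder is assumed ideal, and header bytes are left out, so it shows just the payload penalty that any header savings from a shared table would have to overcome.

```rust
/// Bits to code `hist` with an ideal entropy coder whose probabilities
/// are taken from `model`. Returns None if `model` gives zero weight to
/// a symbol that actually occurs in `hist`.
fn bits_under(hist: &[u64], model: &[u64]) -> Option<f64> {
    let model_total: u64 = model.iter().sum();
    let mut bits = 0.0;
    for (i, &n) in hist.iter().enumerate() {
        if n == 0 {
            continue;
        }
        let m = *model.get(i)?;
        if m == 0 {
            return None;
        }
        bits += n as f64 * (model_total as f64 / m as f64).log2();
    }
    Some(bits)
}

/// Sum per-block histograms into one shared ("global") histogram.
fn pooled(blocks: &[Vec<u64>]) -> Vec<u64> {
    let len = blocks.iter().map(|b| b.len()).max().unwrap_or(0);
    let mut out = vec![0u64; len];
    for block in blocks {
        for (i, &n) in block.iter().enumerate() {
            out[i] += n;
        }
    }
    out
}
```

With two blocks of opposite skew, `[90, 10]` and `[10, 90]`, the pooled table is `[100, 100]`. Coding each block under its own statistics costs about 94 bits in total, while coding both under the pooled table costs 200. Before headers, a shared table can never beat per-block tables under this model (Gibbs' inequality), so global tables can only win when the header bytes saved exceed the mismatch penalty.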

Follow-up surfaced (not in this PR)

Per-block tANS/FSE on the three seq-code lanes, ~−0.20pp realized after table quantization — gated on a decode-speed-neutral prototype of the fused splice with interleaved FSE states (zstd's exact design is the existence proof, but pz2's 12 GiB/s all-cores / 1.4 GB/s ST must hold).
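The gap between the Shannon-ideal figure and the "realized after table quantization" figure comes from rounding symbol probabilities to slot counts in a power-of-two state table. A rough sketch of that loss follows, assuming a naive normalizer: this is not zstd's FSE normalization and not anything in pz2, and `normalize`, `bits_under`, and `table_log` are names chosen for illustration.

```rust
/// Scale a histogram to slot counts summing to exactly 2^table_log,
/// giving every present symbol at least one slot. Naive fix-up: any
/// rounding surplus or deficit is absorbed by the largest entry.
fn normalize(hist: &[u64], table_log: u32) -> Option<Vec<u64>> {
    let size = 1u64 << table_log;
    let total: u64 = hist.iter().sum();
    let present = hist.iter().filter(|&&n| n > 0).count() as u64;
    if total == 0 || present > size {
        return None;
    }
    let mut norm: Vec<u64> = hist
        .iter()
        .map(|&n| {
            if n == 0 {
                0
            } else {
                // u128 intermediate avoids overflow on large counts.
                ((n as u128 * size as u128 / total as u128) as u64).max(1)
            }
        })
        .collect();
    let mut sum: u64 = norm.iter().sum();
    while sum != size {
        let i = (0..norm.len()).max_by_key(|&i| norm[i]).unwrap();
        if sum < size {
            norm[i] += size - sum;
            sum = size;
        } else {
            // present <= size guarantees the largest entry is above 1 here.
            norm[i] -= 1;
            sum -= 1;
        }
    }
    Some(norm)
}

/// Ideal bits to code `hist` using probabilities proportional to `model`.
/// Assumes every symbol present in `hist` has a nonzero `model` entry.
fn bits_under(hist: &[u64], model: &[u64]) -> f64 {
    let model_total: u64 = model.iter().sum();
    hist.iter()
        .zip(model)
        .filter(|(&n, _)| n > 0)
        .map(|(&n, &m)| n as f64 * (model_total as f64 / m as f64).log2())
        .sum()
}
```

For `[90, 5, 5]` with `table_log = 4`, `normalize` yields `[14, 1, 1]`; pricing the stream under that table costs roughly 57.3 bits against a Shannon ideal of roughly 56.9. Larger tables shrink the loss but cost more header bytes and cache footprint, which is the trade-off the prototype would have to settle.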

Ledger: clean-slate-codec.md §12; CLAUDE.md dead-end entry added. Suite: 745 + 598 green, fmt/clippy clean.

🤖 Generated with Claude Code

…, global tables and order-1 are DEAD

Answers the dict-tier-era question "would tANS beat the 8-lane Huffman
if we had global state tables?" with bit-exact histogram pricing
(examples/pz2_entropy_probe.rs + two doc(hidden) hooks exposing the
exact lane streams and the shipped package-merge lengths). Scenarios:
shipped per-block Huffman (A), per-block tANS Shannon ideal with A's
own headers (B), segment-global tables (C, C' = per-block choice), and
order-1 segment-global on the seq-code lanes (D).

Blob verdict: B -0.255pp (gate-passing), C +0.894pp (WORSE — 2 MiB
per-block histograms are already saturated and heterogeneous segments
poison shared tables, the global-head-dict lesson again), C' -0.066pp,
D -0.055pp. Per file B ranges -0.13 (xml) to -0.55 (dickens). Lane
split: ll -0.114 / lit -0.059 / ml -0.054 / of -0.027 — the win is the
Huffman 1-bit floor on the small-alphabet sequence-code lanes (greedy
parses make lit_run=0 dominate far past p=0.5); the 8-lane Huffman
literals are near-optimal.

So the hypothesis inverted: tANS yes, global tables no. The live
follow-up is per-block tANS on the three seq-code lanes (~-0.20pp
realized after table quantization), gated on a decode-speed-neutral
prototype of the fused splice with FSE states. Ledger:
clean-slate-codec.md §12; CLAUDE.md dead-end entry added.

Suite: 745 + 598 green, fmt/clippy clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>