v4.1.0 — audio quality + read cache#20

Draft
MKS-01 wants to merge 13 commits into main from optimisation

MKS-01 (Owner) commented Jun 23, 2026

Summary

v4.1.0 — audio quality, synthesis performance, and developer experience.

Audio quality

  • Degenerate-chunk guard — all-silence chunks retry synthesis once before being dropped
  • Crossfade joins — 100 ms linear fade-out at chunk tails smooths the voiced→silence transition
  • Peak normalization already shipped; these complete the audio post-processing chain
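
The two audio-quality steps above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it works on plain Python lists of float samples, and the sample rate, the silence threshold, and every function name (`is_silent`, `fade_out_tail`, `synthesize_with_guard`) are assumptions made for the example. Only the 100 ms linear fade-out and the retry-once-then-drop behaviour come from the description.

```python
from typing import Callable, List, Optional

SAMPLE_RATE = 24_000       # assumed sample rate in Hz (hypothetical)
FADE_MS = 100              # fade-out length stated in the summary
SILENCE_THRESHOLD = 1e-4   # assumed peak level below which audio counts as silent


def is_silent(samples: List[float], threshold: float = SILENCE_THRESHOLD) -> bool:
    """True when no sample exceeds the threshold (empty audio counts as silent)."""
    return all(abs(s) <= threshold for s in samples)


def fade_out_tail(samples: List[float], sample_rate: int = SAMPLE_RATE,
                  fade_ms: int = FADE_MS) -> List[float]:
    """Return a copy whose last fade_ms ramps linearly from near-full gain to zero."""
    n = min(len(samples), sample_rate * fade_ms // 1000)
    out = list(samples)
    start = len(out) - n
    for i in range(n):
        gain = 1.0 - (i + 1) / n   # the final sample ends at gain 0.0
        out[start + i] *= gain
    return out


def synthesize_with_guard(text: str,
                          synth: Callable[[str], List[float]]) -> Optional[List[float]]:
    """Synthesize one chunk; retry once on all-silence, return None to drop it."""
    for _attempt in range(2):      # first try plus a single retry
        audio = synth(text)
        if not is_silent(audio):
            return fade_out_tail(audio)
    return None
```

Returning `None` for a chunk that is still silent after the retry lets the caller skip it when joining chunks, which matches the "dropped" wording above.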

Performance

  • Read cache — skip the entire pipeline (fetch → summarize → synthesize) on re-reads. Cache key = (url, mode, voice, llm_model) with composite index; only hits when the WAV still exists on disk
  • Faster synthesis — default precision fp32 → bf16 (~6% faster), chunk cap 280 → 400 chars (~30% fewer CSM prefills), sampler cached per (temperature, top_k). Speed/quality preset guide in config.yaml
  • New llm_model column on the reads table (auto-migrated for existing DBs)
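
A rough sketch of how the cache lookup and migration described above might fit together, assuming the library is a SQLite database accessed through Python's `sqlite3` module. The `wav_path` column, the index name, and the `migrate` helper are hypothetical; `find_cached` and its four key fields are taken from the commit message, but the body shown here is only an illustration.

```python
import os
import sqlite3
from typing import Optional


def migrate(conn: sqlite3.Connection) -> None:
    """Create or upgrade the reads table, then build the cache index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reads ("
        " id INTEGER PRIMARY KEY, source_url TEXT, mode TEXT,"
        " voice TEXT, wav_path TEXT)"
    )
    columns = {row[1] for row in conn.execute("PRAGMA table_info(reads)")}
    # The column must exist before the index that references it is created.
    if "llm_model" not in columns:
        conn.execute("ALTER TABLE reads ADD COLUMN llm_model TEXT")
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_reads_cache"
        " ON reads (source_url, mode, voice, llm_model)"
    )
    conn.commit()


def find_cached(conn: sqlite3.Connection, source_url: str, mode: str,
                voice: str, llm_model: str) -> Optional[str]:
    """Return the cached WAV path, or None on a miss or a missing file."""
    row = conn.execute(
        "SELECT wav_path FROM reads"
        " WHERE source_url = ? AND mode = ? AND voice = ? AND llm_model = ?"
        " ORDER BY id DESC LIMIT 1",
        (source_url, mode, voice, llm_model),
    ).fetchone()
    if row is None or row[0] is None or not os.path.exists(row[0]):
        return None   # treat a deleted WAV as a miss so the pipeline reruns
    return row[0]
```

Because `llm_model` is part of the key, switching models produces a miss even for the same URL, mode, and voice, which is the behaviour the test plan checks.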

CLI

  • Generation timer — player metadata line shows "Xs to generate" for live reads
  • Library UI revamp — every row shows mode · duration · words · date inline (accent blue on active); space previews audio inline without leaving the library; enter opens the full player
  • Venv auto-detect — server spawn uses .venv/bin/python3 -m readback directly (no activation needed); stderr captured so startup crashes show the actual error
  • Fixed library migration ordering (ALTER TABLE before index creation)
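
The venv auto-detect and stderr capture could look something like the sketch below. It is written in Python for illustration only; the real `readbackBin()` lives in the CLI and may be in a different language. The fallback to a `readback` executable on PATH, the function names, and the wait timeout are all assumptions.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def server_command(project_root: Path) -> List[str]:
    """Prefer the project's venv interpreter so no activation is needed."""
    venv_python = project_root / ".venv" / "bin" / "python3"
    if venv_python.is_file():
        return [str(venv_python), "-m", "readback"]
    return ["readback"]   # assumed fallback: an executable found on PATH


def spawn_server(cmd: List[str]) -> subprocess.Popen:
    """Start the server with stderr piped so a crash message can be shown."""
    return subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True
    )


def startup_error(proc: subprocess.Popen, wait_seconds: float = 2.0) -> Optional[str]:
    """Return captured stderr if the process exited with an error early on."""
    try:
        proc.wait(timeout=wait_seconds)
    except subprocess.TimeoutExpired:
        return None       # still running, so startup is assumed to have succeeded
    if proc.returncode == 0:
        return None
    return proc.stderr.read().strip()
```

A caller would run `spawn_server(server_command(root))` and surface whatever `startup_error` returns instead of a generic "server failed to start" message.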

Tests & CI

  • Trimmed test suite: 59 → 38 (removed redundant/defensive tests)
  • New docs/TESTS.md — full catalogue of every test case by module
  • CI publishes JUnit test summary on PRs; Python matrix trimmed to 3.10 + 3.12

Docs

  • All doc surfaces synced (CLAUDE.md, README, ARCHITECTURE, ROADMAP, PLAN, SETUP, TESTS)
  • README updated: MLX in-process references throughout, bf16 default
  • JOURNEY.md fully rewritten — complete prose devlog, no scaffold prompts
  • Finetune README rewritten with clone-vs-LoRA table and quick-start
  • Version bumped across all 4 anchors

Before merging

  • Update screenshots — CLI player (generation timer), library view (inline metadata + preview indicator) have changed. Recapture docs/media/cli-player.png and docs/media/cli-home.png, then run the refresh-screenshots skill.

Test plan

  • All 38 pytest tests pass
  • Smoke test: read an article, quit, re-read — second read instant (cache hit)
  • /model switch → re-read should NOT cache-hit
  • Listen for smoother chunk joins (crossfade)
  • readback-cli from cold — server spawns via venv auto-detect
  • Generation timer visible on player screen
  • /lib → space previews inline, enter opens full player

🤖 Generated with Claude Code

MKS-01 and others added 13 commits June 24, 2026 00:26
Three audio-quality and performance improvements:

- Degenerate-chunk guard: when CSM produces all-silence audio,
  retry synthesis once before dropping the chunk
- Light crossfade: 100ms linear fade-out at chunk tails smooths
  the transition into inter-chunk silence gaps
- Read cache: skip the entire pipeline (fetch/summarize/synthesize)
  when re-reading the same (url, mode, voice, llm_model) — the
  library lookup returns the existing WAV instantly. Adds llm_model
  column to the reads table (auto-migrated for existing DBs).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Covers find_cached(source_url, mode, voice, llm_model) so the
cache check is an index seek instead of a table scan.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Update project tree, server pipeline, chunk+synth, and library
sections in CLAUDE.md to document the new features. Add PLAN.md
entry for the 2026-06-24 optimisation work.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Audio quality + performance release: read cache, degenerate-chunk
guard, crossfade joins. Version anchors and docs updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove redundant/defensive tests (59 → 38). Add docs/TESTS.md
cataloguing every test case by module. CI now publishes a JUnit
test summary on PRs and runs only Python 3.10 + 3.12.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Update Python version refs from 3.10–3.12 to 3.10 + 3.12 in
CLAUDE.md, ROADMAP.md, SETUP.md. Add TESTS.md to project tree.
Update test count and CI description (JUnit summary).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Synthesis speed tuning (3 knobs, documented in config.yaml):
- Default precision fp32 → bf16 (~6% faster, no audible quality loss)
- Chunk cap 280 → 400 chars (fewer CSM prefills, ~30% fewer chunks)
- Sampler cached per (temperature, top_k) — avoids recreation per chunk
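
The sampler caching in the last item can be illustrated with `functools.lru_cache`. This is a sketch under assumptions: the project's real sampler comes from its MLX stack, whereas the toy top-k temperature sampler below operates on plain Python lists, and `get_sampler` is a hypothetical name. It assumes `temperature > 0`.

```python
import math
import random
from functools import lru_cache
from typing import Callable, List


@lru_cache(maxsize=None)
def get_sampler(temperature: float, top_k: int) -> Callable[[List[float]], int]:
    """Build a sampler once per (temperature, top_k) and reuse it afterwards."""

    def sample(logits: List[float]) -> int:
        # Keep the indices of the top_k largest logits.
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = order[:top_k]
        # Temperature-scaled softmax weights, shifted by the max for stability.
        peak = max(logits[i] for i in top)
        weights = [math.exp((logits[i] - peak) / temperature) for i in top]
        return random.choices(top, weights=weights, k=1)[0]

    return sample
```

Every chunk that asks for the same `(temperature, top_k)` pair receives the same function object, so the construction cost is paid once per setting rather than once per chunk.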

CLI: player metadata line now shows "Xs to generate" for live reads
(timings field added to DoneMsg; hidden for library replays).

Server spawn: readbackBin() now prefers `.venv/bin/python3 -m readback`
(works without venv activation); stderr captured so startup crashes
show the actual error. Fixed library migration ordering — ALTER TABLE
for llm_model now runs before the cache index that references it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Update CLAUDE.md (chunk cap 400, bf16 default, sampler cache,
venv spawn, generation timer in PlayerView), README.md (bf16
default in stack table + config table), ARCHITECTURE.md (400-char
chunks, fade-out, retry), ROADMAP.md (faster synthesis checked
off), PLAN.md (new entry). Fix all stale fp32/280 references.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Clarify that all three models (summary LLM, vision OCR, TTS) run
in-process on Metal via mlx-lm/mlx-vlm/csm-mlx — no external
daemons, no API keys. Added cache-hit note to the pipeline steps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add v3.2.0 (Pi), v4.0.0 (MLX in-process), v4.1.0 (cache + speed)
to the timeline. Fix stack snapshot: Ollama → mlx-lm (in-process),
fp32 → bf16, add vision OCR and read cache. Update afplay/uvicorn
gotchas to match current behavior.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace the TODO-scaffold devlog with complete prose. Timeline
through v4.1.0, narrative arc with the two pivots explained,
honest workflow section, six technical decisions with reasoning,
clean gotcha list, and a closing takeaway. Stack snapshot tagged
v4.1.0 with Pi layer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Rewrote src/finetune/README.md with a why-LoRA-vs-clone table,
quick-start section, detailed steps, tuning knobs, and revert
instructions. Added data/.gitkeep so the training data directory
is tracked (clips are gitignored). Pipeline is end-to-end ready —
just needs audio clips.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Library view now shows mode · duration · words · date on every row
(active row in accent blue). Space toggles inline audio preview
(plays without leaving the library; ♫ + elapsed shown on the row).
Enter still opens the full player. Back/esc stops any preview.

Descoped paste-raw-text and local-docs to Later in the roadmap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>