Sylph integration: graded absence weight, full-reconciliation mode, and strain-level brainstorm by wwood · Pull Request #300 · wwood/singlem

wwood · 2026-06-11T17:18:39Z

Builds on the two existing sylph-integration paths in condense (Regime-3 additive injection and the --joint NNLS deconvolution) with two correctness fixes, a new integration mode, and a design note.

1. Fix: graded sylph-absence weight in the joint deconvolution

The joint method's absence constraint and the min_markers identifiability floor overlapped. Empirically, the floor alone does all the suppression and shared-coverage routing, so the uniform absence weight (default 100) only acted on species that passed the floor — i.e. species with ≥ min_markers uniquely-assigned markers and so genuine SingleM evidence — crushing them toward zero. A species SingleM saw on 3 unique markers at coverage 5.0, but which sylph missed, collapsed to 0.14; even 10 markers reached only 0.43. These are exactly the divergent or low-abundance genomes (below sylph's ANI cutoff or coverage floor) that SingleM exists to detect.

The per-species absence weight now decays with unique-marker support (w_abs/(1+u)³): strong only for shared-window riders with no unique support (so routing still works when the floor is disabled) and negligible once several markers resolve to a species uniquely. The same species now recovers 3.13 (3 markers) → 4.91 (10 markers); floor-driven routing is unchanged.
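To make the decay concrete, here is a minimal Python sketch of the graded weight, assuming only the formula w_abs/(1+u)³ stated above. The function name `graded_absence_weight` is hypothetical and is not taken from the SingleM code base.

```python
def graded_absence_weight(w_abs: float, unique_markers: int) -> float:
    """Illustrative helper: per-species absence weight w_abs / (1 + u)^3.

    `unique_markers` is u, the number of markers assigned uniquely to the
    species. The name and signature are assumptions made for this sketch.
    """
    if unique_markers < 0:
        raise ValueError("unique_markers must be non-negative")
    return w_abs / (1 + unique_markers) ** 3


# Show how quickly the penalty falls off as unique-marker support grows,
# using the default absence weight of 100 mentioned in the description.
for u in (0, 1, 3, 10):
    print(u, graded_absence_weight(100.0, u))
```

With w_abs = 100 the weight is 100 at u = 0, 100/64 ≈ 1.56 at u = 3, and 100/1331 ≈ 0.075 at u = 10, which is why a species with several uniquely resolved markers is barely penalised.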

2. Fix: waive the marker-count padding for sylph-detected species

The marker-count padding shrinks a species seen on few markers toward zero (reproducing the trimmed mean's zero-padding). That is an anti-false-positive prior for when SingleM markers are the only evidence. When sylph independently detects the species, the unobserved markers are recovery dropout, not evidence of absence, and the penalty only opposes the sylph row, systematically under-estimating a sylph-detected species seen on few markers (and inflating the fitted α to compensate). Such species columns are now exempt, consistent with their exemption from the floor and the absence weight. On the benchmark below this lifts the under-observed species from 2.5 to 7.9, drops the joint L1 error from 6.2 to 0.8, and brings the end-to-end α estimate from 9.5 to a realistic 3.0.
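The effect of zero-padding can be illustrated with a toy calculation. This sketch shows only the averaging arithmetic (unobserved markers counted as zero versus skipped); it is not the NNLS penalty used by the joint method, a plain mean stands in for the trimmed mean, and the function name and inputs are hypothetical, so it will not reproduce the benchmark figures.

```python
def mean_marker_coverage(observed, n_markers, sylph_detected):
    """Toy average of per-marker coverage for one species.

    observed       -- coverages on the markers where the species was seen
    n_markers      -- total number of markers in the (hypothetical) set
    sylph_detected -- if True, unobserved markers are treated as dropout
                      and skipped; otherwise they are padded with zeros
    """
    if len(observed) > n_markers:
        raise ValueError("more observations than markers")
    if not observed:
        return 0.0
    denominator = len(observed) if sylph_detected else n_markers
    return sum(observed) / denominator


# A species seen at coverage 6 on 3 of 6 markers (made-up numbers).
padded = mean_marker_coverage([6.0, 6.0, 6.0], 6, sylph_detected=False)
exempt = mean_marker_coverage([6.0, 6.0, 6.0], 6, sylph_detected=True)
print(padded, exempt)
```

In this toy case padding halves the estimate (3.0 versus 6.0), which is the kind of shrinkage the exemption removes when sylph supplies independent evidence.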

3. New mode: `--sylph-reconcile` (full reconciliation)

Additive injection captures sylph's sensitivity (species SingleM missed) but discards its precision for species both tools detect, leaving them at SingleM's estimate even when SingleM under-resolved them. The new mode generalises injection with one signed-update rule: each sylph species' leaf is raised to max(SingleM, α·eff_cov), the increase drawn from the same clade novel budget. It is one-sided (a species is never reduced) and order-independent, so SingleM's per-clade totals stay a conserved floor while sylph sets the within-clade partition wherever it sees more. Injection becomes the include_shared=False special case of one shared engine. Wired through condense, pipe, and renew (--sylph-reconcile).
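The update rule can be sketched for a single clade as follows. This is an illustration under stated assumptions rather than the code in this PR: the function name `reconcile_clade`, the dict-based inputs, and the proportional scaling applied when the novel budget cannot cover every lift are all hypothetical choices made for the example.

```python
def reconcile_clade(singlem, sylph_eff_cov, alpha, novel_budget,
                    include_shared=True):
    """One-sided reconciliation within one clade (illustrative sketch).

    singlem        -- {species: SingleM coverage} for species in the clade
    sylph_eff_cov  -- {species: sylph effective coverage}
    alpha          -- scale factor converting sylph coverage to SingleM units
    novel_budget   -- clade coverage not yet assigned to any species
    include_shared -- False reproduces plain injection (sylph-only species)
    Returns (updated coverages, remaining novel budget).
    """
    lifts = {}
    for species, eff_cov in sylph_eff_cov.items():
        current = singlem.get(species, 0.0)
        if not include_shared and current > 0:
            continue  # injection mode leaves species SingleM already saw
        lift = max(current, alpha * eff_cov) - current  # never negative
        if lift > 0:
            lifts[species] = lift

    # Assumption: if the budget is too small, scale every lift by the same
    # factor, which keeps the result independent of species order.
    total = sum(lifts.values())
    scale = 1.0 if total <= novel_budget else novel_budget / total

    updated = dict(singlem)
    for species, lift in lifts.items():
        updated[species] = updated.get(species, 0.0) + lift * scale
    remaining = max(0.0, novel_budget - total * scale)
    return updated, remaining


profile, left = reconcile_clade(
    {"a": 2.0, "b": 5.0}, {"a": 4.0, "b": 3.0, "c": 1.0},
    alpha=1.0, novel_budget=10.0)
print(profile, left)
```

Because every lift is computed from the original SingleM values and paid for out of the novel budget, the clade total (species plus remaining novelty) is unchanged, and no species ends below its SingleM estimate.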

4. Strain-level brainstorm (`strain_level_brainstorm.md`)

A design note on extending the profile below species to strain level under a "compress once, assign many" discipline — compress pipe's reads reference-free, then assign strain taxonomy from the compressed form alone against swappable genome databases. It observes that SingleM's archive OTU table (HMM-defined window haplotypes, re-assignable by renew) and sylph's FracMinHash sketch already separate compression from assignment at species level, and proposes five strain-level extensions, all entering the profile via the rank-generalised form of the reconciliation in this PR.

Validation

`pixi run -e dev pytest test/test_condense.py test/test_condense_viral.py` → 41 passed, 2 skipped (the two need the full GTDB metapackage). All sylph CLI modes smoke-tested end-to-end on a small metapackage, with unit tests for the absence grading, the padding exemption, and the reconciliation lift / keep-SingleM / inject behaviours.

On a synthetic genus run end-to-end through the pipeline (one species under-recovered by SingleM on part of its markers, an indistinguishable pair dissolving to genus-novelty, and a species detected by neither tool), species-coverage L1 error is 0.8 (joint), 3.4 (reconcile), 5.3 (injection), 25.6 (SingleM alone). With the padding fix the joint method is the most accurate here (its sylph constraint resolves the indistinguishable pair and the partially-observed species, and it avoids the genus push-down that inflates the abundant species under the tree methods); the ranking is scenario-dependent, and reconciliation remains the most robust by construction — never reducing a SingleM call, never discarding marker evidence, and retaining the validated EM where sylph is silent.
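For reference, a species-coverage L1 error of the kind quoted above can be computed as the summed absolute difference between estimated and true coverage over the union of species. The sketch below is an assumption about that metric's form, with a hypothetical function name and made-up numbers; it does not use the benchmark data.

```python
def species_l1_error(estimate, truth):
    """Sum of |estimated - true| coverage over all species in either profile.

    Species missing from one profile count as coverage 0 there, so both
    false positives and missed species contribute to the error.
    """
    species = set(estimate) | set(truth)
    return sum(abs(estimate.get(s, 0.0) - truth.get(s, 0.0)) for s in species)


# Made-up example: one under-estimate, one missed species, one false positive.
truth = {"sp_a": 8.0, "sp_b": 2.0}
estimate = {"sp_a": 7.5, "sp_c": 1.0}
print(species_l1_error(estimate, truth))
```

Here the error is 0.5 (sp_a) + 2.0 (sp_b missed) + 1.0 (sp_c spurious) = 3.5.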

The two joint fixes change --joint's default behaviour (the --joint-absence-weight flag and value are unchanged but reinterpreted through the decay; the padding exemption is unconditional). Both soften structural penalties that were overriding sylph/SingleM evidence; either is easy to gate behind a flag if you'd prefer it to remain authoritative for some runs.

https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS

Two improvements to condense's sylph integration.

1. Joint deconvolution: make the sylph-absence penalty respect SingleM evidence. The identifiability floor already suppresses weakly-supported sylph-absent species and routes their shared coverage to novelty, so the uniform absence weight (default 100) only acted on species that *passed* the floor -- those with >= min_markers uniquely-assigned markers and so genuine SingleM support -- crushing them toward zero (a 3-unique-marker species at coverage 5 collapsed to 0.14). These are exactly the divergent or low-abundance genomes SingleM detects but sylph cannot. The per-species absence weight now decays with unique-marker support, w_abs/(1+u)^3, so it is strong only for shared-window riders (when the floor is disabled) and negligible once several markers resolve to a species uniquely. That species now recovers 3.13 (3 markers), rising toward truth as support grows, while floor-driven routing is unchanged.

2. Add --sylph-reconcile, a third integration mode generalising the additive injection: besides adding sylph-only species, it lifts species both tools detected up to sylph's coverage where sylph credits more, drawing the increase from the same clade novel budget. The update is one-sided (a species is never reduced) and order-independent, so SingleM's per-clade totals stay a conserved floor while sylph sets the within-clade partition. Injection becomes the include_shared=False special case of one shared engine. On a synthetic genus where SingleM under-resolves two species (coverage diluted to the genus node) and misses a third, reconciliation cut the species-coverage L1 error to 0.5 vs 11.0 for injection and 14.0 for SingleM alone. Wired through condense, pipe and renew (--sylph-reconcile), with unit tests for the lift/keep/inject behaviours and the methods document updated.

https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS

…sion Design note exploring how to extend the condensed community profile below species to strain level, under a "compress once, assign many" discipline: compress pipe's reads without reference to any genome database, then assign strain-level taxonomy from the compressed form alone against swappable reference databases. Notes that SingleM's archive OTU table (HMM-defined marker-window haplotypes, re-assignable by renew) and sylph's FracMinHash sketch (--output-sylph-sketch / --input-sylph-sketch) already separate compression from assignment, and proposes concrete strain-level extensions: marker-window haplotype strains (reference-free, swappable allele DB; de novo clustering), genome-sketch strains via strain-panel databases, a unified sample-sketch artifact, an SNV fingerprint, and denser minimizer/unitig sketches. Explains how strain coverage enters the profile via the rank-generalised clade-budget reconciliation now in condense. https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS

Replace the earlier illustrative condensed-tree L1 figures with numbers from a synthetic genus run through the real condense pipeline, so the joint deconvolution (which consumes raw marker windows and so could not be evaluated on the tree-level scenario) is included. Reconciliation 3.4, injection 5.3, joint 6.2, SingleM alone 25.6; notes that the ordering is scenario-dependent (joint is strongest under window sharing) and that joint under-recovers the partially-observed species because its marker-count and robustness penalties down-weight that species' marker rows. https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c1a41a5887


…ecies The marker-count padding shrinks a column observed on few markers toward zero (reproducing the trimmed mean's zero-padding). That is an anti-false-positive prior for when SingleM markers are the only evidence; when sylph independently detects the species, the unobserved markers are recovery dropout, not evidence of absence, and the penalty only opposes the sylph row -- systematically under-estimating a sylph-detected species seen on few markers and inflating the fitted alpha to compensate. Such species columns are now exempt (padding 0), consistent with their exemption from the identifiability floor and the strong absence weight. On the synthetic genus benchmark this lifts the under-observed species (truth 8, seen on 4/6 markers) from 2.5 to 7.9 and drops the joint L1 error from 6.2 to 0.8 (now the most accurate of the four paths); the end-to-end alpha estimate falls from 9.5 to a realistic 3.0. Adds a unit test (sylph-detected few-marker species recovers its coverage; sylph-absent one stays shrunk) and updates the methods document. https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS

Sylph integration is a non-viral (GTDB-genome) feature that never runs in Lyrebird's viral pipe/renew workflows, and Lyrebird forwards none of the sylph arguments. Because the sylph flags were added unconditionally in the shared add_common_pipe_arguments, `lyrebird pipe`/`lyrebird renew` accepted --sylph-injection / --sylph-reconcile / --no-sylph / --output-sylph-sketch and silently ignored them. Gate the sylph flags behind a new include_sylph parameter (default True) and pass include_sylph=False from Lyrebird, so Lyrebird now rejects these flags rather than accepting them as no-ops. SingleM pipe/renew/condense are unaffected. https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS
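The gating described in this commit can be sketched with `argparse`. The function name `add_common_pipe_arguments`, the `include_sylph` parameter, and the four flag names come from the commit message; everything else here (the `build_parser` helper, the `--threads` stand-in, and each flag's action or type) is an assumption made for illustration and may differ from the real parser.

```python
import argparse


def add_common_pipe_arguments(parser, include_sylph=True):
    """Register shared pipe arguments; sylph flags only when requested."""
    # Stand-in for the arguments shared by every pipe-style command.
    parser.add_argument("--threads", type=int, default=1)
    if include_sylph:
        group = parser.add_argument_group("sylph integration")
        # Actions and types below are guesses, not the real definitions.
        group.add_argument("--sylph-injection", action="store_true")
        group.add_argument("--sylph-reconcile", action="store_true")
        group.add_argument("--no-sylph", action="store_true")
        group.add_argument("--output-sylph-sketch", metavar="PATH")


def build_parser(prog, include_sylph):
    """Hypothetical helper that builds one command's parser."""
    parser = argparse.ArgumentParser(prog=prog)
    add_common_pipe_arguments(parser, include_sylph=include_sylph)
    return parser


singlem_parser = build_parser("singlem pipe", include_sylph=True)
lyrebird_parser = build_parser("lyrebird pipe", include_sylph=False)

# SingleM still understands the flag.
args = singlem_parser.parse_args(["--sylph-reconcile"])
print(args.sylph_reconcile)
```

Because the Lyrebird parser never registers the sylph flags, `lyrebird_parser.parse_args(["--sylph-reconcile"])` exits with an "unrecognized arguments" error instead of silently accepting a no-op.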

Add condense_sylph_methods.docx, a paper-ready Methods section describing the three sylph-integration strategies (additive injection, full reconciliation, joint NNLS deconvolution), generated from joint_condense_methods.md via pandoc with native Word equations. Tidy the markdown source for paper form: fix the strategy ordinal now that reconciliation is described second, drop a development-history aside, and move the synthetic-genus L1 comparison from the reconciliation algorithm description into the in-silico validation section. https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS

claude added 3 commits June 11, 2026 12:11

chatgpt-codex-connector Bot reviewed Jun 11, 2026

Comment thread singlem/main.py Outdated

claude added 3 commits June 11, 2026 18:07

wwood wants to merge 6 commits into sylph-condense-regime3 from claude/festive-darwin-tfqmoq
