Sylph integration: graded absence weight, full-reconciliation mode, and strain-level brainstorm #300
Open
wwood wants to merge 6 commits into
Two improvements to condense's sylph integration.

1. Joint deconvolution: make the sylph-absence penalty respect SingleM evidence. The identifiability floor already suppresses weakly-supported sylph-absent species and routes their shared coverage to novelty, so the uniform absence weight (default 100) only acted on species that *passed* the floor -- those with >= min_markers uniquely-assigned markers and so genuine SingleM support -- crushing them toward zero (a 3-unique-marker species at coverage 5 collapsed to 0.14). These are exactly the divergent or low-abundance genomes SingleM detects but sylph cannot. The per-species absence weight now decays with unique-marker support, w_abs/(1+u)^3, so it is strong only for shared-window riders (when the floor is disabled) and negligible once several markers resolve to a species uniquely. That species now recovers 3.13 (3 markers), rising toward truth as support grows, while floor-driven routing is unchanged.

2. Add --sylph-reconcile, a third integration mode generalising the additive injection: besides adding sylph-only species, it lifts species both tools detected up to sylph's coverage where sylph credits more, drawing the increase from the same clade novel budget. The update is one-sided (a species is never reduced) and order-independent, so SingleM's per-clade totals stay a conserved floor while sylph sets the within-clade partition. Injection becomes the include_shared=False special case of one shared engine. On a synthetic genus where SingleM under-resolves two species (coverage diluted to the genus node) and misses a third, reconciliation cut the species-coverage L1 error to 0.5 vs 11.0 for injection and 14.0 for SingleM alone. Wired through condense, pipe and renew (--sylph-reconcile), with unit tests for the lift/keep/inject behaviours and the methods document updated.

https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS
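The one-sided update behind --sylph-reconcile can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the function name `reconcile_clade`, its arguments, and the dict-based data layout are hypothetical, and the proportional scaling used when the requested lifts exceed the novel budget is an assumption chosen only to keep the sketch order-independent.

```python
def reconcile_clade(singlem, novel_budget, sylph, alpha=1.0, include_shared=True):
    """Raise species toward alpha * sylph coverage, paying from the clade's novel budget.

    singlem:      species -> SingleM coverage within one clade (hypothetical layout)
    novel_budget: coverage SingleM left unassigned at the clade node
    sylph:        species -> sylph effective coverage
    """
    # Collect the requested lifts first so the result does not depend on iteration order.
    lifts = {}
    for name, eff_cov in sylph.items():
        if name in singlem and not include_shared:
            continue  # injection mode: only species SingleM missed are touched
        current = singlem.get(name, 0.0)
        lift = max(current, alpha * eff_cov) - current  # one-sided: never negative
        if lift > 0:
            lifts[name] = lift

    # Assumption: if the budget cannot cover every lift, scale all lifts equally.
    total = sum(lifts.values())
    scale = 1.0 if total <= novel_budget else novel_budget / total

    updated = dict(singlem)
    for name, lift in lifts.items():
        updated[name] = updated.get(name, 0.0) + scale * lift
    return updated, novel_budget - scale * total


# Made-up clade: SingleM under-resolves sp_A and misses sp_C entirely.
singlem = {"sp_A": 2.0, "sp_B": 1.0}
sylph = {"sp_A": 5.0, "sp_B": 0.5, "sp_C": 3.0}

reconciled, novel_left = reconcile_clade(singlem, 10.0, sylph)
injected, novel_left_inj = reconcile_clade(singlem, 10.0, sylph, include_shared=False)
```

In this toy clade, reconciliation lifts sp_A to sylph's estimate and adds sp_C, while sp_B stays at SingleM's higher value; with `include_shared=False` only sp_C is added. In both cases the species coverages plus the remaining novel budget sum to the same clade total.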
…sion Design note exploring how to extend the condensed community profile below species to strain level, under a "compress once, assign many" discipline: compress pipe's reads without reference to any genome database, then assign strain-level taxonomy from the compressed form alone against swappable reference databases. Notes that SingleM's archive OTU table (HMM-defined marker-window haplotypes, re-assignable by renew) and sylph's FracMinHash sketch (--output-sylph-sketch / --input-sylph-sketch) already separate compression from assignment, and proposes concrete strain-level extensions: marker-window haplotype strains (reference-free, swappable allele DB; de novo clustering), genome-sketch strains via strain-panel databases, a unified sample-sketch artifact, an SNV fingerprint, and denser minimizer/unitig sketches. Explains how strain coverage enters the profile via the rank-generalised clade-budget reconciliation now in condense. https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS
Replace the earlier illustrative condensed-tree L1 figures with numbers from a synthetic genus run through the real condense pipeline, so the joint deconvolution (which consumes raw marker windows and so could not be evaluated on the tree-level scenario) is included. Reconciliation 3.4, injection 5.3, joint 6.2, SingleM alone 25.6; notes that the ordering is scenario-dependent (joint is strongest under window sharing) and that joint under-recovers the partially-observed species because its marker-count and robustness penalties down-weight that species' marker rows. https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c1a41a5887
…ecies The marker-count padding shrinks a column observed on few markers toward zero (reproducing the trimmed mean's zero-padding). That is an anti-false-positive prior for when SingleM markers are the only evidence; when sylph independently detects the species, the unobserved markers are recovery dropout, not evidence of absence, and the penalty only opposes the sylph row -- systematically under-estimating a sylph-detected species seen on few markers and inflating the fitted alpha to compensate. Such species columns are now exempt (padding 0), consistent with their exemption from the identifiability floor and the strong absence weight. On the synthetic genus benchmark this lifts the under-observed species (truth 8, seen on 4/6 markers) from 2.5 to 7.9 and drops the joint L1 error from 6.2 to 0.8 (now the most accurate of the four paths); the end-to-end alpha estimate falls from 9.5 to a realistic 3.0. Adds a unit test (sylph-detected few-marker species recovers its coverage; sylph-absent one stays shrunk) and updates the methods document. https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS
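A toy illustration of why zero-padding shrinks a species seen on few markers. It uses a plain mean rather than the joint NNLS fit or SingleM's trimmed mean, so it does not reproduce the 2.5 and 7.9 figures above; the helper names and the uniform per-marker coverage of 8.0 are assumptions made for illustration only.

```python
def zero_padded_mean(observed, n_markers):
    """Mean over all markers, counting every unobserved marker as zero coverage."""
    return sum(observed) / n_markers


def observed_mean(observed):
    """Mean over only the markers on which the species was actually seen."""
    return sum(observed) / len(observed)


def toy_estimate(observed, n_markers, sylph_detected):
    # Assumption: treat the padding exemption as "skip the zero-padding
    # when sylph independently detects the species".
    if sylph_detected:
        return observed_mean(observed)
    return zero_padded_mean(observed, n_markers)


# Species seen on 4 of 6 markers, each at coverage 8.0 (made-up values).
observed = [8.0, 8.0, 8.0, 8.0]
padded = toy_estimate(observed, 6, sylph_detected=False)
exempt = toy_estimate(observed, 6, sylph_detected=True)
```

Here the padded estimate is 32/6, about 5.33, while the exempt estimate is 8.0: the two unobserved markers are treated as evidence of absence in the first case and as recovery dropout in the second.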
Sylph integration is a non-viral (GTDB-genome) feature that never runs in Lyrebird's viral pipe/renew workflows, and Lyrebird forwards none of the sylph arguments. Because the sylph flags were added unconditionally in the shared add_common_pipe_arguments, `lyrebird pipe`/`lyrebird renew` accepted --sylph-injection / --sylph-reconcile / --no-sylph / --output-sylph-sketch and silently ignored them. Gate the sylph flags behind a new include_sylph parameter (default True) and pass include_sylph=False from Lyrebird, so Lyrebird now rejects these flags rather than accepting them as no-ops. SingleM pipe/renew/condense are unaffected. https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS
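The gating can be sketched with plain argparse as below. Only the function name `add_common_pipe_arguments`, the `include_sylph` parameter, and the four flag names come from this commit; the rest of the signature, the argument types (`store_true`, a path), the `--threads` placeholder, and the `build_parser` helper are assumptions for illustration.

```python
import argparse


def add_common_pipe_arguments(parser, include_sylph=True):
    # Placeholder standing in for the options shared by every pipe/renew parser.
    parser.add_argument("--threads", type=int, default=1)

    if include_sylph:
        # Registered only for callers that actually run sylph integration.
        group = parser.add_argument_group("sylph integration")
        group.add_argument("--sylph-injection", action="store_true")
        group.add_argument("--sylph-reconcile", action="store_true")
        group.add_argument("--no-sylph", action="store_true")
        group.add_argument("--output-sylph-sketch", metavar="PATH")


def build_parser(prog, include_sylph):
    """Hypothetical helper: build one subcommand's parser."""
    parser = argparse.ArgumentParser(prog=prog)
    add_common_pipe_arguments(parser, include_sylph=include_sylph)
    return parser


singlem_parser = build_parser("singlem pipe", include_sylph=True)
lyrebird_parser = build_parser("lyrebird pipe", include_sylph=False)

# SingleM still accepts the flag.
singlem_args = singlem_parser.parse_args(["--sylph-reconcile"])
```

Because the flag is never registered on the Lyrebird parser, `lyrebird_parser.parse_args(["--sylph-reconcile"])` ends with argparse's usual "unrecognized arguments" error and exit status 2 instead of silently accepting a no-op.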
Add condense_sylph_methods.docx, a paper-ready Methods section describing the three sylph-integration strategies (additive injection, full reconciliation, joint NNLS deconvolution), generated from joint_condense_methods.md via pandoc with native Word equations. Tidy the markdown source for paper form: fix the strategy ordinal now that reconciliation is described second, drop a development-history aside, and move the synthetic-genus L1 comparison from the reconciliation algorithm description into the in-silico validation section. https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS
Builds on the two existing sylph-integration paths in `condense` (Regime-3 additive injection and the `--joint` NNLS deconvolution) with two correctness fixes, a new integration mode, and a design note.

1. Fix: graded sylph-absence weight in the joint deconvolution
The joint method's absence constraint and the `min_markers` identifiability floor overlapped. Empirically, the floor alone does all the suppression and shared-coverage routing, so the uniform absence weight (default 100) only acted on species that passed the floor (i.e. species with ≥ `min_markers` uniquely-assigned markers, and so genuine SingleM evidence), crushing them toward zero. A species SingleM saw on 3 unique markers at coverage 5.0, but which sylph missed, collapsed to 0.14; even 10 markers reached only 0.43. These are exactly the divergent or low-abundance genomes (below sylph's ANI cutoff or coverage floor) that SingleM exists to detect.

The per-species absence weight now decays with unique-marker support (`w_abs/(1+u)³`): strong only for shared-window riders with no unique support (so routing still works when the floor is disabled) and negligible once several markers resolve to a species uniquely. The same species now recovers 3.13 (3 markers) → 4.91 (10 markers); floor-driven routing is unchanged.

2. Fix: waive the marker-count padding for sylph-detected species
The marker-count padding shrinks a species seen on few markers toward zero (reproducing the trimmed mean's zero-padding). That is an anti-false-positive prior for when SingleM markers are the only evidence; when sylph independently detects the species, the unobserved markers are recovery dropout, and the penalty only opposes the sylph row, systematically under-estimating a sylph-detected species seen on few markers (and inflating the fitted α to compensate). Such species columns are now exempt, consistent with their exemption from the floor and the absence weight. On the benchmark below this lifts the under-observed species from 2.5 to 7.9, drops the joint L1 error from 6.2 to 0.8, and brings the end-to-end α estimate from 9.5 down to a realistic 3.0.
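For reference, the graded absence weight from fix 1 follows directly from the stated formula `w_abs/(1+u)³`. The sketch below only evaluates that formula; the function name, its arguments, and the treatment of sylph-detected species as weight 0 are assumptions, and how the weight enters the NNLS system is not shown.

```python
def absence_weight(w_abs, unique_markers, sylph_detected=False):
    """Per-species sylph-absence weight, decaying with unique-marker support u."""
    if sylph_detected:
        # Assumption: the absence penalty only applies to species sylph did not report.
        return 0.0
    return w_abs / (1 + unique_markers) ** 3


# Default weight of 100 at increasing unique-marker support.
weights = {u: absence_weight(100.0, u) for u in (0, 1, 3, 10)}
for u, w in weights.items():
    print(f"u={u:2d}  weight={w:.4f}")
```

With the default of 100 this gives 100 at u = 0, 12.5 at u = 1, 1.5625 at u = 3, and roughly 0.075 at u = 10, which is why the penalty stays strong for shared-window riders but becomes negligible once several markers resolve uniquely.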
3. New mode: `--sylph-reconcile` (full reconciliation)

Additive injection captures sylph's sensitivity (species SingleM missed) but discards its precision for species both tools detect, leaving them at SingleM's estimate even when SingleM under-resolved them. The new mode generalises injection with one signed-update rule: each sylph species' leaf is raised to `max(SingleM, α·eff_cov)`, the increase drawn from the same clade novel budget. It is one-sided (a species is never reduced) and order-independent, so SingleM's per-clade totals stay a conserved floor while sylph sets the within-clade partition wherever it sees more. Injection becomes the `include_shared=False` special case of one shared engine. Wired through `condense`, `pipe`, and `renew` (`--sylph-reconcile`).

4. Strain-level brainstorm (`strain_level_brainstorm.md`)

A design note on extending the profile below species to strain level under a "compress once, assign many" discipline: compress `pipe`'s reads reference-free, then assign strain taxonomy from the compressed form alone against swappable genome databases. It observes that SingleM's archive OTU table (HMM-defined window haplotypes, re-assignable by `renew`) and sylph's FracMinHash sketch already separate compression from assignment at species level, and proposes five strain-level extensions, all entering the profile via the rank-generalised form of the reconciliation in this PR.

Validation
`pixi run -e dev pytest test/test_condense.py test/test_condense_viral.py` → 41 passed, 2 skipped (the two need the full GTDB metapackage). All sylph CLI modes smoke-tested end-to-end on a small metapackage, with unit tests for the absence grading, the padding exemption, and the reconciliation lift / keep-SingleM / inject behaviours.

On a synthetic genus run end-to-end through the pipeline (one species under-recovered by SingleM on part of its markers, an indistinguishable pair dissolving to genus-novelty, and a species detected by neither tool), species-coverage L1 error is 0.8 (joint), 3.4 (reconcile), 5.3 (injection), 25.6 (SingleM alone). With the padding fix the joint method is the most accurate here (its sylph constraint resolves the indistinguishable pair and the partially-observed species, and it avoids the genus push-down that inflates the abundant species under the tree methods); the ranking is scenario-dependent, and reconciliation remains the most robust by construction: it never reduces a SingleM call, never discards marker evidence, and retains the validated EM where sylph is silent.
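For readers unfamiliar with the metric, a species-coverage L1 error can be computed as in the sketch below. The function name and the choice to sum over the union of species (so that missed and spurious species both count) are assumptions about how the benchmark is scored, and the coverages are made-up values, not the synthetic-genus data.

```python
def l1_error(estimate, truth):
    """Sum of absolute coverage differences over every species in either profile."""
    names = set(estimate) | set(truth)
    return sum(abs(estimate.get(n, 0.0) - truth.get(n, 0.0)) for n in names)


# Made-up profiles: one species slightly low, one missed, one spurious.
truth = {"sp_a": 8.0, "sp_b": 4.0, "sp_c": 2.0}
estimate = {"sp_a": 7.5, "sp_b": 4.0, "sp_d": 1.0}

error = l1_error(estimate, truth)
print(error)  # 0.5 (sp_a) + 0.0 (sp_b) + 2.0 (missed sp_c) + 1.0 (spurious sp_d) = 3.5
```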
The two joint fixes change `--joint`'s default behaviour (the `--joint-absence-weight` flag and value are unchanged but reinterpreted through the decay; the padding exemption is unconditional). Both soften structural penalties that were over-riding sylph/SingleM evidence; both are easy to gate behind flags if you'd prefer either to remain authoritative for some runs.

https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS