
Sylph integration: graded absence weight, full-reconciliation mode, and strain-level brainstorm #300

Open
wwood wants to merge 6 commits into sylph-condense-regime3 from claude/festive-darwin-tfqmoq
@wwood wwood commented Jun 11, 2026

Owner

Builds on the two existing sylph-integration paths in condense (Regime-3 additive injection and the --joint NNLS deconvolution) with two correctness fixes, a new integration mode, and a design note.

1. Fix: graded sylph-absence weight in the joint deconvolution

The joint method's absence constraint and the min_markers identifiability floor overlapped. Empirically, the floor alone does all the suppression and shared-coverage routing, so the uniform absence weight (default 100) only acted on species that passed the floor, i.e. species with ≥ min_markers uniquely assigned markers and therefore genuine SingleM evidence, crushing them toward zero. A species that SingleM saw on 3 unique markers at coverage 5.0, but that sylph missed, collapsed to 0.14; even with 10 markers it reached only 0.43. These are exactly the divergent or low-abundance genomes (below sylph's ANI cutoff or coverage floor) that SingleM exists to detect.

The per-species absence weight now decays with unique-marker support u as w_abs/(1+u)³: it is strong only for shared-window riders with no unique support (so routing still works when the floor is disabled) and negligible once several markers resolve to a species uniquely. The same species now recovers to 3.13 with 3 markers and 4.91 with 10 markers; floor-driven routing is unchanged.
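To make the decay concrete, here is a minimal sketch of the weight schedule alone, not of the NNLS fit it feeds into. The function name and argument layout are hypothetical; only the formula w_abs/(1+u)³, the default of 100, and the exemption of sylph-detected species come from this description.

```python
def graded_absence_weights(unique_markers, sylph_detected, w_abs=100.0):
    """Per-species absence weight, w_abs / (1 + u)**3, for sylph-absent species.

    unique_markers: count of uniquely assigned markers (u) per species.
    sylph_detected: True where sylph reported the species (no absence penalty).
    Hypothetical helper for illustration; not the condense API.
    """
    weights = []
    for u, detected in zip(unique_markers, sylph_detected):
        if detected:
            # Sylph-detected species are exempt from the absence weight.
            weights.append(0.0)
        else:
            weights.append(w_abs / (1 + u) ** 3)
    return weights


# A shared-window rider (u=0), a 3-marker species and a 10-marker species,
# all missed by sylph.
weights = graded_absence_weights([0, 3, 10], [False, False, False])
print(weights[0], weights[1])  # 100.0 1.5625
```

With the default weight, three unique markers already cut the penalty by a factor of 64, and ten markers by a factor of 1331.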

2. Fix: waive the marker-count padding for sylph-detected species

The marker-count padding shrinks a species seen on few markers toward zero (reproducing the trimmed mean's zero-padding). That is an anti-false-positive prior for when SingleM markers are the only evidence. When sylph independently detects the species, the unobserved markers are recovery dropout rather than evidence of absence, and the penalty only opposes the sylph row, systematically under-estimating a sylph-detected species seen on few markers (and inflating the fitted α to compensate). Such species columns are now exempt, consistent with their exemption from the floor and the absence weight. On the benchmark below this lifts the under-observed species from 2.5 to 7.9, drops the joint L1 error from 6.2 to 0.8, and brings the end-to-end α estimate from 9.5 to a realistic 3.0.
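The direction of the effect can be seen in a deliberately simplified sketch that uses a plain mean over markers in place of the trimmed mean and the NNLS penalty. The function name is hypothetical, and the outputs are not expected to match the benchmark figures above, which come from the full model.

```python
def marker_mean(observed_coverages, n_markers, sylph_detected):
    """Mean coverage over markers, zero-padding unobserved markers unless
    sylph independently detected the species.

    Simplified illustration only: condense uses a trimmed mean and an NNLS
    penalty rather than this plain mean.
    """
    n_missing = n_markers - len(observed_coverages)
    if n_missing < 0:
        raise ValueError("more observations than markers")
    # Exemption: a sylph-detected species gets no zero padding.
    padding = 0 if sylph_detected else n_missing
    denominator = len(observed_coverages) + padding
    if denominator == 0:
        return 0.0
    return sum(observed_coverages) / denominator


# A species at coverage 8 seen on 4 of 6 markers.
padded = marker_mean([8.0, 8.0, 8.0, 8.0], 6, sylph_detected=False)
exempt = marker_mean([8.0, 8.0, 8.0, 8.0], 6, sylph_detected=True)
print(padded, exempt)
```

The padded estimate is pulled down by the two missing markers, while the exempt estimate stays at the observed coverage.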

3. New mode: --sylph-reconcile (full reconciliation)

Additive injection captures sylph's sensitivity (species SingleM missed) but discards its precision for species both tools detect, leaving them at SingleM's estimate even when SingleM under-resolved them. The new mode generalises injection with one signed-update rule: each sylph species' leaf is raised to max(SingleM, α·eff_cov), the increase drawn from the same clade novel budget. It is one-sided (a species is never reduced) and order-independent, so SingleM's per-clade totals stay a conserved floor while sylph sets the within-clade partition wherever it sees more. Injection becomes the include_shared=False special case of one shared engine. Wired through condense, pipe, and renew (--sylph-reconcile).
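A minimal sketch of the update rule for a single clade follows. The function, its arguments and the dictionary layout are hypothetical rather than the condense implementation. How an over-subscribed budget is handled is not specified above, so the proportional scaling here is an assumption, chosen because it keeps the result order-independent.

```python
def reconcile_clade(singlem, sylph_eff_cov, novel_budget, alpha=1.0,
                    include_shared=True):
    """Raise each sylph species to max(SingleM, alpha * eff_cov), drawing the
    increase from the clade's novel budget.

    Returns (updated species coverages, remaining novel budget).
    Hypothetical sketch; names and signature are not from condense.
    """
    lifts = {}
    for species, eff_cov in sylph_eff_cov.items():
        current = singlem.get(species, 0.0)
        if current > 0 and not include_shared:
            continue  # injection mode: only add species SingleM missed
        target = max(current, alpha * eff_cov)
        if target > current:
            lifts[species] = target - current  # one-sided: never a reduction

    requested = sum(lifts.values())
    # Assumption: if the lifts exceed the budget, scale them proportionally.
    scale = 1.0 if requested <= novel_budget else novel_budget / requested

    updated = dict(singlem)
    for species, lift in lifts.items():
        updated[species] = updated.get(species, 0.0) + lift * scale
    return updated, novel_budget - requested * scale


singlem = {"sp_A": 2.0, "sp_B": 5.0}
sylph = {"sp_A": 4.0, "sp_B": 3.0, "sp_C": 1.0}
reconciled, left = reconcile_clade(singlem, sylph, novel_budget=6.0)
injected, left_inj = reconcile_clade(singlem, sylph, novel_budget=6.0,
                                     include_shared=False)
print(reconciled, left)
print(injected, left_inj)
```

In both modes the clade total (species plus remaining novel budget) is unchanged at 13.0; reconciliation additionally lifts sp_A to sylph's estimate, while sp_B stays at the higher SingleM value.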

4. Strain-level brainstorm (strain_level_brainstorm.md)

A design note on extending the profile below species to strain level under a "compress once, assign many" discipline — compress pipe's reads reference-free, then assign strain taxonomy from the compressed form alone against swappable genome databases. It observes that SingleM's archive OTU table (HMM-defined window haplotypes, re-assignable by renew) and sylph's FracMinHash sketch already separate compression from assignment at species level, and proposes five strain-level extensions, all entering the profile via the rank-generalised form of the reconciliation in this PR.

Validation

pixi run -e dev pytest test/test_condense.py test/test_condense_viral.py → 41 passed, 2 skipped (the two need the full GTDB metapackage). All sylph CLI modes smoke-tested end-to-end on a small metapackage, with unit tests for the absence grading, the padding exemption, and the reconciliation lift / keep-SingleM / inject behaviours.

On a synthetic genus run end-to-end through the pipeline (one species under-recovered by SingleM on part of its markers, an indistinguishable pair dissolving to genus-novelty, and a species detected by neither tool), species-coverage L1 error is 0.8 (joint), 3.4 (reconcile), 5.3 (injection), 25.6 (SingleM alone). With the padding fix the joint method is the most accurate here (its sylph constraint resolves the indistinguishable pair and the partially-observed species, and it avoids the genus push-down that inflates the abundant species under the tree methods); the ranking is scenario-dependent, and reconciliation remains the most robust by construction — never reducing a SingleM call, never discarding marker evidence, and retaining the validated EM where sylph is silent.
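For reference, the species-coverage L1 error is taken here to be the sum of absolute differences between estimated and true coverage over the union of species, with a missing species counted as zero. That reading is an assumption, and the numbers below are a toy example, not the benchmark data.

```python
def species_l1_error(estimated, truth):
    """Sum of |estimated - true| coverage over every species in either profile.

    A species absent from one profile is treated as coverage 0 there.
    """
    species = set(estimated) | set(truth)
    return sum(abs(estimated.get(s, 0.0) - truth.get(s, 0.0)) for s in species)


# Toy profiles (hypothetical values): one under-estimate, one miss and
# one false positive.
truth = {"sp_A": 8.0, "sp_B": 4.0}
estimate = {"sp_A": 7.5, "sp_C": 1.0}
error = species_l1_error(estimate, truth)
print(error)  # 5.5
```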

The two joint fixes change --joint's default behaviour (the --joint-absence-weight flag and value are unchanged but reinterpreted through the decay; the padding exemption is unconditional). Both soften structural penalties that were overriding sylph/SingleM evidence; either is easy to gate behind a flag if you'd prefer it to remain authoritative for some runs.

https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS

claude added 3 commits June 11, 2026 12:11
Two improvements to condense's sylph integration.

1. Joint deconvolution: make the sylph-absence penalty respect SingleM
   evidence. The identifiability floor already suppresses weakly-supported
   sylph-absent species and routes their shared coverage to novelty, so the
   uniform absence weight (default 100) only acted on species that *passed*
   the floor -- those with >= min_markers uniquely-assigned markers and so
   genuine SingleM support -- crushing them toward zero (a 3-unique-marker
   species at coverage 5 collapsed to 0.14). These are exactly the divergent
   or low-abundance genomes SingleM detects but sylph cannot. The per-species
   absence weight now decays with unique-marker support, w_abs/(1+u)^3, so it
   is strong only for shared-window riders (when the floor is disabled) and
   negligible once several markers resolve to a species uniquely. That species
   now recovers 3.13 (3 markers), rising toward truth as support grows, while
   floor-driven routing is unchanged.

2. Add --sylph-reconcile, a third integration mode generalising the additive
   injection: besides adding sylph-only species, it lifts species both tools
   detected up to sylph's coverage where sylph credits more, drawing the
   increase from the same clade novel budget. The update is one-sided (a
   species is never reduced) and order-independent, so SingleM's per-clade
   totals stay a conserved floor while sylph sets the within-clade partition.
   Injection becomes the include_shared=False special case of one shared
   engine. On a synthetic genus where SingleM under-resolves two species
   (coverage diluted to the genus node) and misses a third, reconciliation cut
   the species-coverage L1 error to 0.5 vs 11.0 for injection and 14.0 for
   SingleM alone.

Wired through condense, pipe and renew (--sylph-reconcile), with unit tests
for the lift/keep/inject behaviours and the methods document updated.

https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS
…sion

Design note exploring how to extend the condensed community profile below
species to strain level, under a "compress once, assign many" discipline:
compress pipe's reads without reference to any genome database, then assign
strain-level taxonomy from the compressed form alone against swappable
reference databases.

Notes that SingleM's archive OTU table (HMM-defined marker-window haplotypes,
re-assignable by renew) and sylph's FracMinHash sketch (--output-sylph-sketch
/ --input-sylph-sketch) already separate compression from assignment, and
proposes concrete strain-level extensions: marker-window haplotype strains
(reference-free, swappable allele DB; de novo clustering), genome-sketch
strains via strain-panel databases, a unified sample-sketch artifact, an SNV
fingerprint, and denser minimizer/unitig sketches. Explains how strain
coverage enters the profile via the rank-generalised clade-budget
reconciliation now in condense.

https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS

Replace the earlier illustrative condensed-tree L1 figures with numbers from a
synthetic genus run through the real condense pipeline, so the joint
deconvolution (which consumes raw marker windows and so could not be evaluated
on the tree-level scenario) is included. Reconciliation 3.4, injection 5.3,
joint 6.2, SingleM alone 25.6; notes that the ordering is scenario-dependent
(joint is strongest under window sharing) and that joint under-recovers the
partially-observed species because its marker-count and robustness penalties
down-weight that species' marker rows.

https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c1a41a5887

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread singlem/main.py Outdated
claude added 3 commits June 11, 2026 18:07
…ecies

The marker-count padding shrinks a column observed on few markers toward zero
(reproducing the trimmed mean's zero-padding). That is an anti-false-positive
prior for when SingleM markers are the only evidence; when sylph independently
detects the species, the unobserved markers are recovery dropout, not evidence
of absence, and the penalty only opposes the sylph row -- systematically
under-estimating a sylph-detected species seen on few markers and inflating the
fitted alpha to compensate. Such species columns are now exempt (padding 0),
consistent with their exemption from the identifiability floor and the strong
absence weight.

On the synthetic genus benchmark this lifts the under-observed species (truth
8, seen on 4/6 markers) from 2.5 to 7.9 and drops the joint L1 error from 6.2
to 0.8 (now the most accurate of the four paths); the end-to-end alpha estimate
falls from 9.5 to a realistic 3.0. Adds a unit test (sylph-detected few-marker
species recovers its coverage; sylph-absent one stays shrunk) and updates the
methods document.

https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS

Sylph integration is a non-viral (GTDB-genome) feature that never runs in
Lyrebird's viral pipe/renew workflows, and Lyrebird forwards none of the sylph
arguments. Because the sylph flags were added unconditionally in the shared
add_common_pipe_arguments, `lyrebird pipe`/`lyrebird renew` accepted
--sylph-injection / --sylph-reconcile / --no-sylph / --output-sylph-sketch and
silently ignored them.

Gate the sylph flags behind a new include_sylph parameter (default True) and
pass include_sylph=False from Lyrebird, so Lyrebird now rejects these flags
rather than accepting them as no-ops. SingleM pipe/renew/condense are
unaffected.
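
The gating can be sketched with argparse as below. The names add_common_pipe_arguments and include_sylph and the four flag names come from this commit; the build_parser helper, the --threads placeholder and the flag types (store_true versus a path argument) are assumptions for illustration.

```python
import argparse


def add_common_pipe_arguments(parser, include_sylph=True):
    """Add arguments shared by pipe/renew; sylph flags only when requested."""
    # Placeholder for the arguments every caller shares.
    parser.add_argument("--threads", type=int, default=1)
    if include_sylph:
        group = parser.add_argument_group("sylph integration")
        # Flag types below are assumed, not taken from the real CLI.
        group.add_argument("--sylph-injection", action="store_true")
        group.add_argument("--sylph-reconcile", action="store_true")
        group.add_argument("--no-sylph", action="store_true")
        group.add_argument("--output-sylph-sketch", metavar="PATH")


def build_parser(prog, include_sylph):
    """Hypothetical helper that builds one subcommand-style parser."""
    parser = argparse.ArgumentParser(prog=prog)
    add_common_pipe_arguments(parser, include_sylph=include_sylph)
    return parser


singlem_parser = build_parser("singlem pipe", include_sylph=True)
lyrebird_parser = build_parser("lyrebird pipe", include_sylph=False)

args = singlem_parser.parse_args(["--sylph-reconcile"])
print(args.sylph_reconcile)  # True
```

Calling lyrebird_parser.parse_args(["--sylph-reconcile"]) now exits with an "unrecognized arguments" error instead of silently accepting the flag.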

https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS

Add condense_sylph_methods.docx, a paper-ready Methods section describing the
three sylph-integration strategies (additive injection, full reconciliation,
joint NNLS deconvolution), generated from joint_condense_methods.md via pandoc
with native Word equations. Tidy the markdown source for paper form: fix the
strategy ordinal now that reconciliation is described second, drop a
development-history aside, and move the synthetic-genus L1 comparison from the
reconciliation algorithm description into the in-silico validation section.

https://claude.ai/code/session_01XoyGoGa1pkW5urH47iJfoS