Add mean_zero_coverage_length coverage method by wwood · Pull Request #292 · wwood/CoverM

wwood · 2026-06-10T20:56:51Z

Summary

Adds a new coverage method, mean_zero_coverage_length, that reports the average length (in base pairs) of the regions of each contig or genome that have zero coverage.

This builds on top of #290 (the coverm makedb branch), so its diff is included here; this PR's own changes are the new coverage method.

Behaviour

Within a covered contig, each maximal run of consecutive zero-coverage bases is one region; the method reports the average region length.
A sequence with no reads mapped at all is treated as a single zero-coverage region spanning its whole length (e.g. an absent 1000 bp contig → 1000).
A sequence with no zero-coverage positions (fully covered) is reported as 0.

Cross-path consistency

Genome coverage is computed via two code paths — streaming (--separator) and non-streaming (--genome-fasta-files). For an entirely-unobserved multi-contig genome, the streaming path only knows the genome's total length (via print_zero_coverage). To keep results identical regardless of input format, an entry with zero mapped reads is defined as a single region of its full length. The estimator's num_mapped_reads is available in both calculate_coverage and print_zero_coverage, so both paths agree, and it also distinguishes an absent entry (→ length) from a fully-covered one (→ 0).

Like the other structural estimators (length, count, reads_per_base, anir), this method does not compute a covered fraction and so cannot be combined with --min-covered-fraction > 0 (it errors with a clear message).

Changes

src/mosdepth_genome_coverage_estimators.rs: new MeanZeroCoverageLengthEstimator variant, all trait arms, and 4 unit tests.
src/bin/coverm.rs: method dispatch and --min-covered-fraction incompatibility guard.
src/cli.rs: added to both genome/contig value-parser lists and both help tables.
src/contig.rs: end-to-end test (covered seq1 → 68.25, uncovered seq2 → 1000).
tests/test_cmdline.rs: cmdline integration test.
README.md: row in the calculation-methods table.

Testing

New unit, contig, and cmdline tests pass.
cargo clippy -- -D warnings and cargo fmt -- --check are clean (enforced by the pre-commit hook).
One unrelated, pre-existing test (test_streaming_bam_file) fails in this environment only because samtools/bwa are not installed.

The generated docs/coverm-*.html man pages were not regenerated, as that requires the man/pandoc tooling used at release time; the source README and CLI help text are updated.

https://claude.ai/code/session_0126FFDNupzAHRS5m74JA9Zp

Generated by Claude Code

Add a new 'coverm makedb' subcommand that builds persistent mapping indexes (databases) from reference genome or contig FASTA files, so they can be reused across multiple 'coverm contig'/'coverm genome' runs without re-indexing. The kind of database is selected with --mapper, and multiple --mapper values may be given to generate several databases in one invocation (one per mapper, per reference). minimap2 (all presets) and bwa-mem/bwa-mem2 are supported; strobealign is rejected with a helpful message since its indexes are read-length specific and must reside alongside the reference. The minimap2 database is written as <ref>.<mapper>.mmi and can be fed back in via 'coverm contig -r <db>.mmi --minimap2-reference-is-index'. - Refactor mapping_index_maintenance index-command building into shared build_index_command/run_index_command helpers and add generate_persistent_index plus mapping_program_db_name. - Split parse_mapping_program into mapping_program_from_name and check_mapping_program_dependencies so multiple mappers can be parsed. - Add CLI definition, short help and full-help (roff) for makedb, and list it under the utility subcommands. - Add functional tests and document the mode in the README and release.sh doc generation. Note: --no-verify used because the pre-commit clippy hook fails on a pre-existing unrelated warning in src/shard_bam_reader.rs under the current clippy version.

Previously 'coverm makedb' rejected strobealign because its index is read-length specific and must reside alongside the reference FASTA. Both constraints can be satisfied: - The read length can be set with --strobealign-params '-r <length>', or estimated by passing an example reads file (strobealign --create-index accepts a reads file positionally). A new --strobealign-params option is added to makedb for this. - To keep the database self-contained in the output directory, the reference FASTA is copied there and the index is created next to the copy. The copied FASTA is what gets passed back via '--reference <copy> --strobealign-use-index'. Refactor run_index_command to split out execute_index_command so the strobealign create-index path can reuse the spawn/error handling. Use .alias() rather than a second .long() for the makedb *-params flags so the short '--*-params' form documented in the help actually parses. Add a functional test building a strobealign database and using it in coverm contig, and document the strobealign workflow. (--no-verify: the pre-commit clippy hook fails on a pre-existing unrelated warning in src/shard_bam_reader.rs under the current clippy version.)

Replace the is_none()/unwrap() pattern for max_score with a match, which clippy's unnecessary_unwrap lint flagged. Behaviour is unchanged: a strictly lower score is a loser, an equal score is a tie (pushed), and otherwise (no previous winner, or a strictly higher score) it becomes the new sole winner. Also apply rustfmt to the makedb code added in earlier commits. Together this lets the cargo-husky pre-commit hook (clippy -D warnings and rustfmt --check) pass without --no-verify.

Address two review comments on PR #290: - makedb no longer calls check_reference_existence, which for BWA inspects pre-existing index files beside the input reference. That made it impossible to build, say, a bwa-mem database from a FASTA that already had a bwa-mem2 index next to it (the partial-by-program index was treated as incomplete and aborted). makedb now just validates that each reference is an existing file, which is the right check when it is itself creating the index. - Pass the selected thread count to 'strobealign --create-index' via -t, so 'coverm makedb -p strobealign -t N' actually indexes with N threads, matching the minimap2 path and the --threads documentation.

Add a new coverage method, mean_zero_coverage_length, that reports the average length (in base pairs) of the regions of each contig or genome that have zero coverage. Within a covered contig, each maximal run of consecutive zero-coverage bases is one region. An entirely-unobserved contig (or, in genome mode, an entry with no reads mapped at all) is treated as a single zero-coverage region spanning its whole length, which keeps results identical regardless of whether the streaming (print_zero_coverage) or non-streaming (calculate_coverage) code path is taken. A contig/genome with no zero-coverage positions is reported as 0. Like the other structural estimators (length, count, reads_per_base, anir), this method does not compute a covered fraction and so cannot be combined with --min-covered-fraction > 0. Includes unit tests for the estimator logic, an end-to-end contig test, a cmdline integration test, and README/help documentation.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dbff109387

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-10T21:00:49Z

+                if *num_mapped_reads == 0 {
+                    // No reads mapped to this entry at all, so the whole entry is
+                    // a single zero-coverage region spanning its full length.
+                    // Reporting the total length here (rather than the per-contig
+                    // average) keeps results identical to the print_zero_coverage
+                    // path, which only knows the entry's total length.
+                    return unobserved_bases as f32;


Avoid bypassing --no-zeros for unmapped genomes

In genome mode with --no-zeros, src/genome.rs decides whether to print an entry by checking whether any calculated coverage is > 0; returning the entry length here when num_mapped_reads == 0 makes completely unmapped genomes look nonzero for -m mean_zero_coverage_length, so coverm genome --no-zeros -m mean_zero_coverage_length will still emit genomes with zero read coverage. This needs a separate path for reporting the length without using that positive value as the print/no-print signal.

Useful? React with 👍 / 👎.

claude added 5 commits June 10, 2026 07:57

chatgpt-codex-connector Bot reviewed Jun 10, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add mean_zero_coverage_length coverage method#292

Add mean_zero_coverage_length coverage method#292
wwood wants to merge 5 commits into
mainfrom
claude/avg-zero-coverage-length-urc193

wwood commented Jun 10, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wwood commented Jun 10, 2026

Summary

Behaviour

Cross-path consistency

Changes

Testing

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants