new module: custom/bed12codonpositions #11733

Open
pinin4fjords wants to merge 9 commits into nf-core:master from pinin4fjords:custom-bed12codonpositions


pinin4fjords commented May 21, 2026

Why

A pipeline that needs codon-level genomic positions along a spliced transcript (ribo-seq P-site counts per codon, frame / periodicity QC, novel-ORF tiling) has nothing off-the-shelf to reach for. bedtools makewindows takes BED3 and ignores BED12 blocks; chaining bedtools bed12tobed6 | bedtools makewindows re-anchors each exon at offset 0, so codons crossing an intron land in the wrong frame. No existing nf-core module emits per-codon BED from a BED12.
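The frame drift described above can be illustrated with plain arithmetic. The sketch below is not bedtools; it only mimics the effect of restarting a 3-nt tiling at offset 0 in every exon and compares it with tiling the spliced mRNA once. The exon lengths (10 and 10) and the function names are hypothetical, chosen for illustration.

```python
def per_exon_tiling(exon_lengths, step=3):
    """mRNA positions hit when the tiling restarts at offset 0 in every exon."""
    positions = []
    mrna_offset = 0  # mRNA position of the first nt of the current exon
    for length in exon_lengths:
        positions.extend(mrna_offset + i for i in range(0, length, step))
        mrna_offset += length
    return positions


def spliced_tiling(exon_lengths, step=3):
    """mRNA positions hit when the spliced transcript is tiled once."""
    return list(range(0, sum(exon_lengths), step))


exon_lengths = [10, 10]  # hypothetical two-exon transcript, exons in mRNA order
naive = per_exon_tiling(exon_lengths)
correct = spliced_tiling(exon_lengths)
naive_frames = [p % 3 for p in naive]  # frame of each naive window start

print("per-exon tiling:", naive)
print("spliced tiling: ", correct)
print("frames of per-exon starts:", naive_frames)
```

Because the first exon length is not a multiple of 3, every window start in the second exon lands in frame 1 rather than frame 0, which is the failure mode the module is meant to avoid.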

What it does

For each BED12 record, walks the blocks in mRNA order (5'→3'), emits every --step-th mRNA position starting at --frame, and projects each back to genomic coordinates. With --width N > 1 it emits an N-nt span per position, splitting at block boundaries so a codon that crosses an intron becomes two rows whose union still maps to a contiguous mRNA region.
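As a rough sketch of the first two steps (ordering the blocks 5'→3' and picking the in-frame mRNA positions), assuming standard BED12 semantics where blockStarts are relative to chromStart. The function names and the example record are hypothetical and are not taken from the module's script.

```python
def blocks_in_mrna_order(chrom_start, block_sizes, block_starts, strand):
    """Half-open genomic (start, end) intervals of the blocks, 5'->3'."""
    blocks = [
        (chrom_start + offset, chrom_start + offset + size)
        for size, offset in zip(block_sizes, block_starts)
    ]
    # BED12 lists blocks in ascending genomic order; on '-' the mRNA runs
    # the other way, so the traversal order is reversed.
    return blocks[::-1] if strand == "-" else blocks


def in_frame_positions(mrna_length, frame=0, step=3):
    """Every step-th 0-based mRNA position, starting at `frame`."""
    return list(range(frame, mrna_length, step))


# Hypothetical record: chromStart 1000, blocks of 4 and 5 nt, 16-nt intron.
plus = blocks_in_mrna_order(1000, [4, 5], [0, 20], "+")
minus = blocks_in_mrna_order(1000, [4, 5], [0, 20], "-")
positions = in_frame_positions(4 + 5, frame=1)

print(plus)
print(minus)
print(positions)
```

Keeping the traversal order separate from the position selection means the strand only matters once, when the blocks are ordered.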

Worked example

Two-exon - strand record with one intron (spliced mRNA length 20):

chr1    100    210    geneX    500    -    100    210    0    2    10,10    0,100

Default args (--step 3 --width 1, one row per codon at its 5'-most nt) — i.e. the BED you'd intersect with an offset-corrected ribo-seq BAM to count P-sites per codon:

chr1    209    210    geneX    500    -
chr1    206    207    geneX    500    -
chr1    203    204    geneX    500    -
chr1    200    201    geneX    500    -
chr1    107    108    geneX    500    -
chr1    104    105    geneX    500    -
chr1    101    102    geneX    500    -

Rows are in 5'→3' mRNA order, so on the - strand the genomic coordinates count down. The jump from 200 back to 107 is the spliced intron.
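The default output above can be reproduced with a small projection helper. This is a minimal sketch under standard BED12 semantics, not the module's actual implementation; `project_position` is a hypothetical name.

```python
def project_position(chrom_start, block_sizes, block_starts, strand, mrna_pos):
    """Map a 0-based mRNA position to a 0-based genomic coordinate."""
    blocks = [(chrom_start + off, size) for size, off in zip(block_sizes, block_starts)]
    if strand == "-":
        blocks.reverse()  # walk the blocks 5'->3' along the mRNA
    remaining = mrna_pos
    for g_start, size in blocks:
        if remaining < size:
            if strand == "-":
                # On '-' the first mRNA base of a block is its last genomic base.
                return g_start + size - 1 - remaining
            return g_start + remaining
        remaining -= size
    raise ValueError("mRNA position lies beyond the spliced length")


# The record from the worked example: chr1 100 210 geneX 500 - ... 2 10,10 0,100
sizes, starts = [10, 10], [0, 100]
rows = []
for pos in range(0, sum(sizes), 3):  # --frame 0 --step 3 --width 1
    g = project_position(100, sizes, starts, "-", pos)
    rows.append(("chr1", g, g + 1, "geneX", 500, "-"))

for row in rows:
    print(*row, sep="\t")
```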

With --width 3 (full 3-nt codon spans), the codon at mRNA position 9 straddles the intron — its 5' nucleotide is in the upstream exon and the other two are in the downstream exon, so it's split into two BED rows (4th and 5th below):

chr1    207    210    geneX    500    -
chr1    204    207    geneX    500    -
chr1    201    204    geneX    500    -
chr1    200    201    geneX    500    -
chr1    108    110    geneX    500    -
chr1    105    108    geneX    500    -
chr1    102    105    geneX    500    -
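The splitting at block boundaries can be sketched as an interval intersection in mRNA space. This is an illustrative sketch, not the module's code: `split_span` is a hypothetical helper, and skipping a final span that would run past the 3' end is an assumption read off the example output above rather than something the PR states.

```python
def split_span(blocks_mrna_order, strand, mrna_start, width):
    """Project mRNA span [mrna_start, mrna_start + width) onto the genome.

    blocks_mrna_order: half-open genomic (start, end) blocks, ordered 5'->3'.
    Returns one half-open genomic piece per overlapped block, in mRNA order.
    """
    pieces = []
    span_end = mrna_start + width
    block_mrna_start = 0
    for g_start, g_end in blocks_mrna_order:
        block_mrna_end = block_mrna_start + (g_end - g_start)
        lo = max(mrna_start, block_mrna_start)
        hi = min(span_end, block_mrna_end)
        if lo < hi:  # the span overlaps this block
            off_lo = lo - block_mrna_start
            off_hi = hi - block_mrna_start
            if strand == "-":
                pieces.append((g_end - off_hi, g_end - off_lo))
            else:
                pieces.append((g_start + off_lo, g_start + off_hi))
        block_mrna_start = block_mrna_end
    return pieces


# Blocks of the worked example in mRNA order for the '-' strand.
blocks = [(200, 210), (100, 110)]
mrna_length = 20
rows = []
for pos in range(0, mrna_length, 3):
    if pos + 3 <= mrna_length:  # assumption: incomplete trailing span skipped
        rows.extend(split_span(blocks, "-", pos, 3))

straddling = split_span(blocks, "-", 9, 3)
print(rows)
print(straddling)
```

The codon at mRNA position 9 yields two pieces, one per block, which correspond to the 4th and 5th rows of the `--width 3` output.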

Score (column 5) is preserved. Frame is counted from the 5' end of the record's own spliced sequence (mRNA position 0), so the module works regardless of any GTF phase annotation. Running it three times with --frame 0/1/2 gives the three per-frame BEDs needed for ribo-seq periodicity QC.

I/O

  • Input: tuple val(meta), path(bed12)
  • Output: tuple val(meta), path("${prefix}.bed") (BED6) + versions topic
  • ext.args: --frame INT (0), --step INT (3), --width INT (1), --keep-duplicates

Container

Wave-built community.wave.seqera.io/library/python_pandas_pyyaml:75514f9f977be607 (with matching community-cr-prod.seqera.io singularity blob).

Test plan

  • nf-test (default, width 3, frame 1, intron fixture at width 1 and 3, keep-duplicates, stub)
  • nf-core modules lint clean (2 Wave-container false positives)
  • pre-commit clean
  • CI green

pinin4fjords and others added 8 commits May 21, 2026 13:09
…pander

Generic helper that walks BED12 block (exon) structure in mRNA order and
emits one BED6 row per in-frame mRNA position. Frame, step and span
width are configurable via `ext.args`; spans crossing a block boundary
are split into one BED row per block so each codon maps back to a
contiguous mRNA region. No upstream module covers this transformation:
bedtools/makewindows only tiles flat genomic spans, and bedtools/getfasta
emits sequence rather than coordinates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Report python as MAJOR.MINOR (matching the env pin) instead of the full
micro version, so the same hash is produced under conda (3.11.15) and
the python:3.11 biocontainer (3.11.10). Drop the recursive
format_yaml_like helper in favour of a direct two-line write now that
the payload is a fixed shape. Tidy the script and meta.yml prose to
drop residual Ribo-seq / ORF framing.

The stub block was emitting the full python micro version via `python
--version`, which drifts between the biocontainer (3.11.10) and conda
(3.11.15) and broke the snapshot under the conda CI shard. Switch to
the same MAJOR.MINOR string the script writes so the hash is stable
across both runtimes.

Pin python (3.12.11), pandas (2.3.0) and pyyaml (6.0.2) in
environment.yml and point the container directive at a Wave-built image
holding the same trio (community.wave.seqera.io/library/python_pandas_pyyaml,
Singularity blob URL via community-cr-prod.seqera.io). Conda and the
container now resolve to identical patch versions, which removes the
need to report only MAJOR.MINOR in versions.yml and lets the script
emit the real platform.python_version() string in a hash-stable way.

Switch the BED12 reader/writer to pandas (read_csv with the UCSC BED12
column names, explode the block fields, project mRNA positions back to
genomic coords via the existing helper) and write versions.yml with
yaml.safe_dump from both the script and the stub so they produce
identical YAML. Drop the bespoke yaml string-writer.

…warn on bad BED12

- Drop runs.sort(); per-codon rows that cross a block boundary now stay
  in mRNA-traversal order on '-' strand records, matching meta.yml.
- Preserve the input BED12 score column instead of hard-coding 0.
- Emit a stderr warning when blockCount disagrees with the parsed
  block fields instead of silently dropping the record.
- Add coverage:
    * frame 1 (non-zero --frame)
    * intron-bearing fixture (real intron gap, both strands), width 1 and 3
    * keep-duplicates (3 single-nt blocks demonstrating dedup vs not)
- Update docstring + meta.yml output description accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…pt docstring

Trim the meta.yml module description and the template script docstring
to lead with the use case (codon-level work on spliced features:
ribo-seq P-site counts per codon, periodicity QC, ORF tiling) instead
of recapping the BED12 spec and default args.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nf-core-modules into custom-bed12codonpositions
pinin4fjords marked this pull request as ready for review May 21, 2026 14:38