Skip to content

Clarification on --quantMode TranscriptomeSAM behavior when using a custom GTF during mapping that differs from the index GTF #2679

@Ci-TJ

Description

@Ci-TJ

@alexdobin Thanks for developing STAR. I am performing RNA-seq quantification for Intron Retention (IR) analysis.

  • Index Stage: The STAR index was built using the standard GENCODE v49 basic GTF.

  • Mapping Stage: During the alignment of my samples, I am providing a custom merged.gtf via --sjdbGTFfile. This custom GTF contains "pseudo-transcripts" where intronic regions are explicitly defined as exon features.

STAR --genomeDir ./standard_index \
     --readFilesIn R1.fq.gz R2.fq.gz \
     --sjdbGTFfile ./custom_merged.gtf \
     --quantMode TranscriptomeSAM \
     --outSAMtype BAM SortedByCoordinate \
     ...

Observations:

I noticed that the resulting Aligned.toTranscriptome.out.bam contains the custom pseudo-transcript IDs in the header (confirmed via samtools view -H), and these transcripts receive non-zero expression values in downstream tools like Salmon and eXpress.

(kmapy) [user1@logini04 SRR8518132]$ samtools view -H SRR8518132_Aligned.toTranscriptome.out.bam | grep 'chr1:' | head
@SQ     SN:chr1:30014-30291     LN:278
@SQ     SN:chr1:30642-31000     LN:359
@SQ     SN:chr1:35456-35745     LN:290
@SQ     SN:chr1:53347-57622     LN:4276
@SQ     SN:chr1:57628-58724     LN:1097
@SQ     SN:chr1:58831-62940     LN:4110
@SQ     SN:chr1:65548-69061     LN:3514
@SQ     SN:chr1:90025-90311     LN:287
@SQ     SN:chr1:91604-92115     LN:512
@SQ     SN:chr1:108937-112724   LN:3788
grep: write error: Broken pipe
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1
(kmapy) [user1@logini04 SRR8518132]$ grep -n 'sjdbGTFfile' /share/home/user1/seager/resource/genome/human/v49/STAR_index/Log.out
4:/share/home/user1/.conda/envs/kmapy/bin/STAR-avx2 --runThreadN 24 --runMode genomeGenerate --genomeDir STAR_index --genomeFastaFiles GRCh38.primary_assembly.genome.fa --sjdbGTFfile gencode.v49.primary_assembly.basic.annotation.gtf --sjdbOverhang 100
11:sjdbGTFfile                   gencode.v49.primary_assembly.basic.annotation.gtf     ~RE-DEFINED
20:sjdbGTFfile                       gencode.v49.primary_assembly.basic.annotation.gtf
25:/share/home/user1/.conda/envs/kmapy/bin/STAR-avx2   --runMode genomeGenerate      --runThreadN 24   --genomeDir STAR_index   --genomeFastaFiles GRCh38.primary_assembly.genome.fa      --sjdbGTFfile gencode.v49.primary_assembly.basic.annotation.gtf   --sjdbOverhang 100
425:Processing pGe.sjdbGTFfile=gencode.v49.primary_assembly.basic.annotation.gtf, found:
(kmapy) [user1@logini04 SRR8518132]$ 

(some pse-transcript expression from Salmon, similar in eXpress)
ENST00000831087.1       1142    790.045 0.351636        4.959
chr1:108937-112724      3788    2688.629        0.254859        12.232
chr1:112779-120237      7459    5031.095        0.077971        7.003
ENST00000831089.1       1471    1034.048        0.097285        1.796
ENST00000831092.1       1414    1104.210        0.075621        1.491
chr1:120907-129079      8173    5623.566        0.262852        26.387
ENST00000442987.3       3812    3625.000        0.000000        0.000

Questions:

  • Does STAR 2.7.11b fully override the transcript dictionary from the index with the new --sjdbGTFfile provided during the mapping stage when generating TranscriptomeSAM output?

  • If the custom GTF labels intronic regions as exon features, will STAR correctly project genomic alignments in these regions into the toTranscriptome.out.bam even if these regions were marked as introns in the original index GTF?

  • Is it technically equivalent to use --sjdbGTFfile at the mapping stage vs. rebuilding the index, specifically regarding the accuracy of transcript coordinates in the toTranscriptome.out.bam?

Environment:

  • STAR version: 2.7.11b

  • OS: Linux (CentOS/Ubuntu)

  • Downstream tools: Salmon / eXpress

Best,
Seager

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions