@alexdobin Thanks for developing STAR. I am performing RNA-seq quantification for Intron Retention (IR) analysis.
-
Index Stage: The STAR index was built using the standard GENCODE v49 basic GTF.
-
Mapping Stage: During the alignment of my samples, I am providing a custom merged.gtf via --sjdbGTFfile. This custom GTF contains "pseudo-transcripts" where intronic regions are explicitly defined as exon features.
STAR --genomeDir ./standard_index \
--readFilesIn R1.fq.gz R2.fq.gz \
--sjdbGTFfile ./custom_merged.gtf \
--quantMode TranscriptomeSAM \
--outSAMtype BAM SortedByCoordinate \
...
Observations:
I noticed that the resulting Aligned.toTranscriptome.out.bam contains the custom pseudo-transcript IDs in the header (confirmed via samtools view -H), and these transcripts receive non-zero expression values in downstream tools like Salmon and eXpress.
(kmapy) [user1@logini04 SRR8518132]$ samtools view -H SRR8518132_Aligned.toTranscriptome.out.bam | grep 'chr1:' | head
@SQ SN:chr1:30014-30291 LN:278
@SQ SN:chr1:30642-31000 LN:359
@SQ SN:chr1:35456-35745 LN:290
@SQ SN:chr1:53347-57622 LN:4276
@SQ SN:chr1:57628-58724 LN:1097
@SQ SN:chr1:58831-62940 LN:4110
@SQ SN:chr1:65548-69061 LN:3514
@SQ SN:chr1:90025-90311 LN:287
@SQ SN:chr1:91604-92115 LN:512
@SQ SN:chr1:108937-112724 LN:3788
grep: write error: Broken pipe
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1
(kmapy) [user1@logini04 SRR8518132]$ grep -n 'sjdbGTFfile' /share/home/user1/seager/resource/genome/human/v49/STAR_index/Log.out
4:/share/home/user1/.conda/envs/kmapy/bin/STAR-avx2 --runThreadN 24 --runMode genomeGenerate --genomeDir STAR_index --genomeFastaFiles GRCh38.primary_assembly.genome.fa --sjdbGTFfile gencode.v49.primary_assembly.basic.annotation.gtf --sjdbOverhang 100
11:sjdbGTFfile gencode.v49.primary_assembly.basic.annotation.gtf ~RE-DEFINED
20:sjdbGTFfile gencode.v49.primary_assembly.basic.annotation.gtf
25:/share/home/user1/.conda/envs/kmapy/bin/STAR-avx2 --runMode genomeGenerate --runThreadN 24 --genomeDir STAR_index --genomeFastaFiles GRCh38.primary_assembly.genome.fa --sjdbGTFfile gencode.v49.primary_assembly.basic.annotation.gtf --sjdbOverhang 100
425:Processing pGe.sjdbGTFfile=gencode.v49.primary_assembly.basic.annotation.gtf, found:
(kmapy) [user1@logini04 SRR8518132]$
(some pse-transcript expression from Salmon, similar in eXpress)
ENST00000831087.1 1142 790.045 0.351636 4.959
chr1:108937-112724 3788 2688.629 0.254859 12.232
chr1:112779-120237 7459 5031.095 0.077971 7.003
ENST00000831089.1 1471 1034.048 0.097285 1.796
ENST00000831092.1 1414 1104.210 0.075621 1.491
chr1:120907-129079 8173 5623.566 0.262852 26.387
ENST00000442987.3 3812 3625.000 0.000000 0.000
Questions:
-
Does STAR 2.7.11b fully override the transcript dictionary from the index with the new --sjdbGTFfile provided during the mapping stage when generating TranscriptomeSAM output?
-
If the custom GTF labels intronic regions as exon features, will STAR correctly project genomic alignments in these regions into the toTranscriptome.out.bam even if these regions were marked as introns in the original index GTF?
-
Is it technically equivalent to use --sjdbGTFfile at the mapping stage vs. rebuilding the index, specifically regarding the accuracy of transcript coordinates in the toTranscriptome.out.bam?
Environment:
Best,
Seager
@alexdobin Thanks for developing STAR. I am performing RNA-seq quantification for Intron Retention (IR) analysis.
Index Stage: The STAR index was built using the standard GENCODE v49 basic GTF.
Mapping Stage: During the alignment of my samples, I am providing a custom merged.gtf via --sjdbGTFfile. This custom GTF contains "pseudo-transcripts" where intronic regions are explicitly defined as exon features.
Observations:
I noticed that the resulting Aligned.toTranscriptome.out.bam contains the custom pseudo-transcript IDs in the header (confirmed via samtools view -H), and these transcripts receive non-zero expression values in downstream tools like Salmon and eXpress.
Questions:
Does STAR 2.7.11b fully override the transcript dictionary from the index with the new --sjdbGTFfile provided during the mapping stage when generating TranscriptomeSAM output?
If the custom GTF labels intronic regions as exon features, will STAR correctly project genomic alignments in these regions into the toTranscriptome.out.bam even if these regions were marked as introns in the original index GTF?
Is it technically equivalent to use --sjdbGTFfile at the mapping stage vs. rebuilding the index, specifically regarding the accuracy of transcript coordinates in the toTranscriptome.out.bam?
Environment:
STAR version: 2.7.11b
OS: Linux (CentOS/Ubuntu)
Downstream tools: Salmon / eXpress
Best,
Seager