Ines-gpm/rna-seq-transcriptomics-pipeline

RNA-Seq Transcriptomics Pipeline

Overview

This repository contains a complete RNA-Seq analysis pipeline designed to study differential gene expression in enterobacteria under two experimental conditions: control vs treatment.

The workflow processes raw sequencing data through quality control, preprocessing, transcriptome assembly, annotation, and differential expression analysis.


Experimental Design

  • Organism: Enterobacteria
  • Conditions: Control vs Treatment
  • Replicates: 3 biological replicates per condition
  • Sequencing: Paired-end reads (300 bp)

Pipeline Workflow

The analysis consists of the following steps:

1. Quality Control — FastQC

  • Assessment of raw read quality

  • Metrics evaluated:

    • Per-base sequence quality
    • GC content
    • Sequence duplication
    • Adapter contamination
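To make two of the metrics above concrete, here is a minimal Python sketch of how per-base quality and GC content can be computed from FASTQ records. It is an illustration only, not FastQC's implementation, and it assumes Phred+33 quality encoding; the function names are hypothetical.

```python
from statistics import mean


def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def gc_content(sequence):
    """Fraction of G/C among unambiguous bases (A, C, G, T); Ns are ignored."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def per_base_mean_quality(quality_lines):
    """Mean Phred score at each read position, allowing reads of unequal length."""
    by_position = {}
    for line in quality_lines:
        for position, score in enumerate(phred_scores(line)):
            by_position.setdefault(position, []).append(score)
    return [mean(by_position[pos]) for pos in sorted(by_position)]
```

A drop in the per-position means toward the 3' end of the reads is the usual signal that quality filtering or trimming is needed before assembly.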

2. Preprocessing — PRINSEQ

  • Removal of low-quality reads:

    • Mean quality < 25
    • Length < 100 bp
  • Trimming of ambiguous bases (Ns)
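
The filtering criteria above can be sketched in Python as follows. This is not PRINSEQ's code: it assumes Phred+33 qualities, trims Ns only at the read ends, and applies trimming before filtering, all of which are assumptions made for illustration. The function names are hypothetical.

```python
def trim_ns(seq, qual):
    """Trim leading and trailing N bases, keeping the quality string in sync."""
    start = len(seq) - len(seq.lstrip("Nn"))
    end = len(seq.rstrip("Nn"))
    if end <= start:  # the read consisted only of Ns
        return "", ""
    return seq[start:end], qual[start:end]


def passes_filters(seq, qual, min_mean_quality=25, min_length=100, offset=33):
    """Keep a read only if it is long enough and its mean Phred score is high enough."""
    if not seq or len(seq) < min_length:
        return False
    scores = [ord(ch) - offset for ch in qual]
    return sum(scores) / len(scores) >= min_mean_quality


def preprocess(seq, qual):
    """Trim Ns, then apply the thresholds; returns the trimmed read or None."""
    seq, qual = trim_ns(seq, qual)
    return (seq, qual) if passes_filters(seq, qual) else None
```

For paired-end data, a pair is normally kept only when both mates survive, so `preprocess` would be called on each mate and the pair dropped if either call returns `None`.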


3. De Novo Transcriptome Assembly — SPAdes

  • Assembly using multiple k-mer sizes:

    • 21, 33, 55, 77, 99, 127
  • Combined reads from all samples
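
SPAdes documents that k-mer sizes must be odd, smaller than 128, and listed in ascending order. The small Python sketch below checks a list against those rules and formats it as the comma-separated value that the `-k` option takes. The helper names are hypothetical and are not part of this repository's scripts.

```python
def validate_kmer_sizes(kmers, max_k=127):
    """Return a list of problems with a k-mer size list (empty if it is valid)."""
    problems = []
    for k in kmers:
        if k % 2 == 0:
            problems.append(f"{k} is even")
        if k > max_k:
            problems.append(f"{k} exceeds {max_k}")
    if list(kmers) != sorted(set(kmers)):
        problems.append("sizes are not in strictly ascending order")
    return problems


def spades_k_argument(kmers):
    """Format k-mer sizes as the comma-separated value passed to SPAdes -k."""
    problems = validate_kmer_sizes(kmers)
    if problems:
        raise ValueError("; ".join(problems))
    return ",".join(str(k) for k in kmers)
```

With the sizes listed above, `spades_k_argument([21, 33, 55, 77, 99, 127])` yields `21,33,55,77,99,127`. The largest value, 127, is the upper limit SPAdes allows and is only usable because the reads are long (300 bp).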


4. Annotation — BLAST

  • Alignment against Enterobacteria gene database

  • Filtering criteria:

    • ≥ 90% identity
  • Functional assignment of transcripts
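
The identity filter and functional assignment above can be sketched in Python. The sketch assumes BLAST was run with the standard 12-column tabular output (`-outfmt 6`); that choice, the best-hit-by-bit-score rule, and the function name are assumptions, since the README does not state how hits were selected.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(tabular_text, min_identity=90.0):
    """Keep hits with identity >= min_identity; return the top-scoring hit per query."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST6_COLUMNS, row))
        identity = float(hit["pident"])
        bitscore = float(hit["bitscore"])
        if identity < min_identity:
            continue
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[2]:
            best[hit["qseqid"]] = (hit["sseqid"], identity, bitscore)
    return best
```

The result maps each transcript ID to one `(subject, identity, bit score)` tuple. Transcripts with no hit at or above 90% identity are absent from the dictionary and remain unannotated.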


5. Mapping — Bowtie2 + SAMtools

  • Alignment of reads to assembled transcriptome

  • Conversion and processing:

    • SAM → BAM → sorted BAM → indexed BAM
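
The conversion chain above corresponds to three SAMtools subcommands: `view`, `sort`, and `index`. The Python sketch below builds those commands for one sample. The output naming scheme and function names are hypothetical, and the options actually used in `05_mapping.sh` may differ.

```python
import subprocess
from pathlib import Path


def samtools_commands(sam_path):
    """Build the SAM -> BAM -> sorted BAM -> index commands for one alignment file."""
    sam = Path(sam_path)
    bam = sam.with_suffix(".bam")
    sorted_bam = sam.with_suffix(".sorted.bam")
    return [
        ["samtools", "view", "-b", "-o", str(bam), str(sam)],   # SAM to BAM
        ["samtools", "sort", "-o", str(sorted_bam), str(bam)],  # coordinate sort
        ["samtools", "index", str(sorted_bam)],                 # writes the .bai index
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling `run_all(samtools_commands("control_1.sam"))` requires SAMtools on the `PATH`; `control_1.sam` is a placeholder file name. Sorting must precede indexing because `samtools index` expects a coordinate-sorted BAM file.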

6. Quantification — Corset

  • Clustering of transcripts
  • Generation of gene-level count matrix
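
To illustrate what a gene-level count matrix is, the Python sketch below sums transcript-level counts into cluster-level counts. It does not reproduce Corset, which derives its clusters from shared multi-mapped reads and expression patterns and assigns reads to clusters itself. Here the transcript-to-cluster mapping is taken as given, and all names are hypothetical.

```python
from collections import defaultdict


def cluster_counts(transcript_to_cluster, transcript_counts):
    """Sum per-transcript read counts into per-cluster counts, sample by sample."""
    totals = defaultdict(lambda: defaultdict(int))
    for transcript, per_sample in transcript_counts.items():
        cluster = transcript_to_cluster.get(transcript)
        if cluster is None:
            continue  # transcripts without a cluster assignment are skipped
        for sample, count in per_sample.items():
            totals[cluster][sample] += count
    return {cluster: dict(samples) for cluster, samples in totals.items()}
```

Each key of the returned dictionary is one row of the count matrix and each sample is one column, which is the shape edgeR expects as input.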

7. Differential Expression — edgeR

  • Statistical analysis in R

  • Outputs:

    • log2 Fold Change (logFC)
    • p-values and FDR
  • Significance threshold: p-value < 0.05
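
The relationship between the two outputs, p-values and FDR, can be shown with a short Python sketch of the Benjamini-Hochberg adjustment, which is the default adjustment method in edgeR's `topTags`. This is an illustration with invented p-values, not the code in `07_differential_expression.R`.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (FDR) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Invented p-values for four hypothetical genes.
example_p = [0.01, 0.04, 0.03, 0.20]
example_fdr = benjamini_hochberg(example_p)
raw_significant = sum(p < 0.05 for p in example_p)
fdr_significant = sum(q < 0.05 for q in example_fdr)
```

In this toy example three genes pass a raw p-value cutoff of 0.05 but only one passes the same cutoff on FDR. The number of genes reported as significant therefore depends on which of the two columns the threshold is applied to.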


Results Summary

  • ~98% mapping rate across samples
  • 3385 assembled transcripts
  • 6675 expression clusters
  • 1,192 significantly differentially expressed genes
  • Clear separation between conditions in MDS analysis

Repository Structure

.
├── README.md
├── docs/
│   ├── project_report.pdf
│   └── diagram_pipeline.jpeg
├── scripts/
│   ├── 01_fastqc.sh
│   ├── 02_prinseq.sh
│   ├── 03_spades.sh
│   ├── 04_annotation.sh
│   ├── 05_mapping.sh
│   ├── 06_quantification.sh
│   └── 07_differential_expression.R
└── data/
    └── README.md

Full Report

A detailed explanation of methods, commands, and results is available in:

docs/project_report.pdf


Tools Used

  • FastQC
  • PRINSEQ
  • SPAdes
  • BLAST
  • Bowtie2
  • SAMtools
  • Corset
  • edgeR (R / Bioconductor)

References

See full bibliography in docs/project_report.pdf


Author

Inés García de la Peña Marco
Computational Omics Analysis

About

End-to-end pipeline from raw FASTQ files to differentially expressed genes, built entirely from command-line tools on Linux. ~98% mapping rate across all samples and 1,192 significantly differentially express
