Pipeline for identifying and evaluating eLncRNA-mediated regulatory axes using TCGA, GTEx, eQTL, clinical, and genetic evidence.
This repository is organized as a stepwise Snakemake pipeline. Each numbered directory contains one analysis stage and its own README.md.

| Step | Directory | Purpose |
|---|---|---|
| 01 | 01_prepare_TCGA_data/ | Prepare TCGA expression, genotype, clinical, and annotation files. |
| 02 | 02_prepare_GTEx_data/ | Prepare GTEx tissue expression, genotype, covariate, and annotation files. |
| 03 | 03_eQTLs_prediction/ | Run cis- and trans-eQTL prediction. |
| 04 | 04_get_trios/ | Build candidate lncRNA-SNP-gene regulatory trios. |
| 05 | 05_clinical_evidences/ | Evaluate clinical and survival evidence. |
| 06 | 06_genetic_evidences/ | Evaluate MR, coloc, SMR, PredictDB, and SPrediXcan evidence. |
| 07 | 07_GWAS_enrichment_analyses/ | Run GWAS enrichment analyses using eQTL-derived annotations. |
| 08 | 08_add_druggability_and_STRING_IAS/ | Annotate mediation eGenes with druggability and STRING IAS evidence. |

Selected processed results are available on Zenodo: 10.5281/zenodo.17605304.
Run each step from its own directory. The exact Snakefile names and aggregate targets are documented in each numbered directory's README.md.

```bash
cd 01_prepare_TCGA_data
snakemake -s TCGA_SNPs_data.smk --use-apptainer --cores 8
snakemake -s TCGA_gene_expression_data.smk --use-apptainer --cores 8
```

For steps with a prepare_input.py script, run it before Snakemake:

```bash
python prepare_input.py
snakemake -s workflow.smk --use-apptainer --cores 8
```

The prepare_input.py scripts use hard links to avoid duplicating large files.
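
The hard-linking idea can be illustrated with a minimal Python sketch. It is not taken from the repository's prepare_input.py scripts: the helper name `link_or_copy`, the file names, and the copy fallback are all assumptions made for illustration. It relies only on the standard-library calls `os.link` and `shutil.copy2`.

```python
import os
import shutil
import tempfile
from pathlib import Path


def link_or_copy(src: Path, dst: Path) -> str:
    """Hard-link src to dst; fall back to a regular copy if linking fails.

    Hypothetical helper, not the repository's implementation.
    Returns "linked" or "copied" so the caller can log what happened.
    """
    dst.parent.mkdir(parents=True, exist_ok=True)
    if dst.exists():
        dst.unlink()  # replace a stale link or copy left by an earlier run
    try:
        os.link(src, dst)  # both names now point at the same data on disk
        return "linked"
    except OSError:
        # A common cause is that src and dst sit on different filesystems.
        shutil.copy2(src, dst)
        return "copied"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "source" / "expression.tsv"  # hypothetical file name
        src.parent.mkdir()
        src.write_text("gene\tvalue\n")
        dst = Path(tmp) / "input" / "expression.tsv"
        print(link_or_copy(src, dst))
```

Whether a silent copy fallback is desirable depends on the data size; for very large genotype files it may be preferable to fail loudly instead of duplicating them.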
Manual inputs are mainly reference files, TCGA/GTEx source data, and GWAS summary statistics. Check the README in each numbered directory for the required paths before running that step.
Some steps also require large reference resources to be prepared with step-specific download scripts before running Snakemake.
Each step has its own config.yaml. Update sample lists, tissue lists, TCGA cancer types, and GWAS IDs there before running the workflow.
- The workflows are designed to run with Apptainer/Singularity containers through Snakemake.
- Hard links require source and destination files to be on the same filesystem.
- Run the pipeline in numerical order unless you already have the required intermediate inputs.
- Large protected datasets such as TCGA/GTEx genotype data may need to be downloaded manually.
- Large public reference files should be downloaded or linked according to the step-specific README before running the corresponding workflow.
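
As a small aid for the ordering note above, the following sketch lists the numbered step directories in run order and reports any that lack a README.md. The function names `ordered_steps` and `missing_readmes` and the two-digit prefix pattern are assumptions based on the directory names in the table, not part of the pipeline itself.

```python
import re
from pathlib import Path
from typing import List

# Assumed naming scheme: two digits, an underscore, then a description.
STEP_DIR = re.compile(r"^(\d{2})_.+")


def ordered_steps(repo_root: Path) -> List[Path]:
    """Return numbered step directories sorted by their numeric prefix."""
    steps = []
    for entry in repo_root.iterdir():
        match = STEP_DIR.match(entry.name)
        if entry.is_dir() and match:
            steps.append((int(match.group(1)), entry))
    return [path for _, path in sorted(steps, key=lambda item: item[0])]


def missing_readmes(repo_root: Path) -> List[str]:
    """Return names of step directories that have no README.md."""
    return [
        step.name
        for step in ordered_steps(repo_root)
        if not (step / "README.md").is_file()
    ]


if __name__ == "__main__":
    root = Path(".")  # run from the repository root
    for step in ordered_steps(root):
        print(step.name)
    for name in missing_readmes(root):
        print(f"README.md not found in {name}")
```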