ayyucedemirbas/scAnalyzer


scAnalyzer: A Single-Cell Analysis Toolkit

A Python toolkit for single-cell RNA sequencing (scRNA-seq) analysis.

🚧 Warning: this project is under heavy development and is not ready for production. API changes may happen frequently until a stable version is released. 🚧

scAnalyzer is an integrated toolkit designed for scalable and memory-efficient single-cell RNA sequencing (scRNA-seq) data analysis. Built around a custom, highly optimized SingleCellDataset core, it seamlessly bridges foundational preprocessing with advanced downstream analyses, including dropout imputation, trajectory inference, batch correction, and interactive 3D visualizations.

✨ Key Features

  • 📦 Memory-Efficient Core: Custom SingleCellDataset supporting sparse matrices (CSR/CSC) and HDF5 (.h5ad) I/O operations natively.
  • 🧹 Robust Preprocessing & Normalization: Automated QC, MAD-based outlier detection, doublet prediction (via Scrublet), and cell-cycle scoring. Includes advanced normalization techniques such as scran-like pooling and sctransform-like regression.
  • 🩹 Dropout Imputation: Dedicated module to handle missing data and technical dropouts using Weighted Neighborhood Imputation with Dropout Detection (WNID), kNN-smoothing, and diffusion-based algorithms.
  • 🔄 Batch Correction: Built-in support for multiple integration algorithms including Harmony, ComBat, and Mutual Nearest Neighbors (MNN).
  • 🗺️ Dimensionality Reduction & Clustering: PCA, UMAP, t-SNE, PHATE, and Diffusion Maps. Supports graph-based (Leiden, Louvain) and distance-based clustering (K-Means, DBSCAN, Hierarchical, Spectral).
  • 📊 Differential Expression & Enrichment: Highly vectorized, ultra-fast marker gene identification (t-test, Wilcoxon) and Gene Set Enrichment Analysis (Hypergeometric, Fisher's Exact, GSEA).
  • 🛤️ Trajectory Inference: Dynamic cellular lineage tracking using Diffusion Pseudotime (DPT) with automated root selection and branch detection.
  • 🎨 Interactive Visualizations: Publication-ready static plots (Matplotlib/Seaborn) and dynamic, browser-based Plotly visualizations (Interactive UMAP/PCA, 3D embeddings, violins, and heatmaps).
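
To illustrate why CSR storage suits sparse count matrices, here is a minimal sketch in plain Python. It is not the SingleCellDataset implementation; the helper names `dense_to_csr` and `csr_row` are hypothetical and exist only for this example.

```python
def dense_to_csr(dense):
    """Convert a list-of-rows matrix into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # non-zero value
                indices.append(col)   # its column index
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr


def csr_row(data, indices, indptr, i, n_cols):
    """Rebuild row i as a dense list from the CSR arrays."""
    row = [0] * n_cols
    for k in range(indptr[i], indptr[i + 1]):
        row[indices[k]] = data[k]
    return row


# Toy cells x genes count matrix: most entries are zero, as in scRNA-seq data.
counts = [
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 2, 0, 5],
]
data, indices, indptr = dense_to_csr(counts)
```

Only the four non-zero values are stored, plus their column indices and one row pointer per cell, which is where the memory savings come from when most genes are undetected in most cells.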

🚀 Installation

Install the package directly from PyPI:

pip install scAnalysis

For interactive visualizations, ensure plotly is installed. For graph-based clustering, leidenalg, louvain, and igraph are required. For advanced embeddings, umap-learn and phate are optionally supported.
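
One way to see which of these optional dependencies are available in the current environment is sketched below. It uses only the standard library; the import names (`umap` for umap-learn, `igraph` for python-igraph) are the names those packages are normally imported under, and the helper `missing_optional_modules` is hypothetical, not part of scAnalysis.

```python
from importlib.util import find_spec

# Feature groups mapped to the import names of their optional dependencies.
OPTIONAL_MODULES = {
    "interactive visualizations": ["plotly"],
    "graph-based clustering": ["leidenalg", "louvain", "igraph"],
    "advanced embeddings": ["umap", "phate"],
}


def missing_optional_modules(groups=OPTIONAL_MODULES):
    """Return {feature: [modules that cannot be found]} for the given groups."""
    missing = {}
    for feature, modules in groups.items():
        absent = [name for name in modules if find_spec(name) is None]
        if absent:
            missing[feature] = absent
    return missing


for feature, modules in missing_optional_modules().items():
    print(f"{feature}: missing {', '.join(modules)}")
```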

💡 Quick Start

Here is a minimal example of an end-to-end scRNA-seq workflow, from data loading through imputation to visualization:

import scAnalysis as sca

1. Load Data

adata = sca.sc_io.read_10x_mtx('data/filtered_gene_bc_matrices/hg19')
adata.var.index = sca.sc_io._make_unique(adata.var.index.values)

2. QC, Filtering & Doublet Detection

sca.preprocessing.calculate_qc_metrics(adata, qc_vars=['MT-'])
sca.quality_control.scrublet(adata)

# Filter out predicted doublets and low-quality cells
mask_singlets = ~adata.obs['predicted_doublet'].astype(bool)
adata = adata[mask_singlets, :]
adata = sca.preprocessing.filter_cells(adata, min_genes=200, max_pct_mito=5.0)
adata = sca.preprocessing.filter_genes(adata, min_cells=3)
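
The filters above use fixed thresholds. As an illustration of the MAD-based alternative mentioned under Key Features, here is a standalone sketch of median absolute deviation (MAD) outlier flagging. It does not use or reproduce the scAnalysis API; the function name `mad_outliers` and the default of 3 MADs are assumptions made for the example, and the MAD is left unscaled.

```python
from statistics import median


def mad_outliers(values, n_mads=3.0):
    """Flag values lying more than n_mads median absolute deviations from the median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All values (or at least half of them) are identical: flag nothing.
        return [False] * len(values)
    return [abs(v - med) > n_mads * mad for v in values]


# Hypothetical number of detected genes per cell.
genes_per_cell = [1200, 1350, 1100, 1280, 1420, 90, 1310, 5200]
flags = mad_outliers(genes_per_cell)
kept = [g for g, is_outlier in zip(genes_per_cell, flags) if not is_outlier]
```

Here the median is 1295 and the MAD is 110, so cells further than 330 genes from the median (the cells with 90 and 5200 genes) are flagged. Unlike a fixed cutoff, the threshold adapts to each dataset's own distribution.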

3. Normalization, Imputation & Feature Selection

# Choose normalization: normalize_total, normalize_scran_pooling, or normalize_sctransform
sca.preprocessing.normalize_total(adata, target_sum=1e4)
sca.preprocessing.log1p(adata)

# Recover technical dropouts via WNID imputation
sca.imputation.impute_wnid(adata, k=3, dropout_thresh=0.9, n_pcs=30)

sca.cell_cycle.score_cell_cycle(adata, organism="human")
sca.preprocessing.highly_variable_genes(adata, n_top_genes=2000)
adata.raw = adata.copy()
sca.preprocessing.scale(adata, max_value=10)
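
For readers new to these steps, the arithmetic behind library-size normalization followed by a log1p transform can be sketched in plain Python. This is an illustration of the general technique, not the scAnalysis implementation; `scale_to_target` and `log1p_rows` are hypothetical names.

```python
import math


def scale_to_target(counts, target_sum=1e4):
    """Rescale each cell (row) so that its counts sum to target_sum."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        factor = target_sum / total if total > 0 else 0.0  # leave empty cells at zero
        normalized.append([c * factor for c in cell])
    return normalized


def log1p_rows(matrix):
    """Apply log(1 + x) element-wise; zeros stay zero, so sparsity is preserved."""
    return [[math.log1p(v) for v in row] for row in matrix]


# Two toy cells with different sequencing depths (4 vs 20 total counts).
counts = [[1, 3, 0], [10, 0, 10]]
norm = scale_to_target(counts, target_sum=100)
logged = log1p_rows(norm)
```

After scaling, both cells sum to the same total, so differences in sequencing depth no longer dominate comparisons between them.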

4. Dimensionality Reduction & Batch Correction

sca.dimensionality.run_pca(adata, n_components=50)

# Optional: Correct batch effects (e.g., using Harmony)
# sca.batch_correction.harmony_integrate(adata, batch_key='batch_col', basis='X_pca')

sca.dimensionality.neighbors(adata, n_neighbors=10, n_pcs=40)
sca.dimensionality.run_umap(adata, min_dist=0.3)
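
Conceptually, a neighbor graph links each cell to its closest cells in PCA space. The brute-force sketch below shows the idea on toy 2D coordinates. It is an assumption-laden illustration (Euclidean distance, exact search, hypothetical function name `knn_indices`), not how `sca.dimensionality.neighbors` is implemented.

```python
import math


def knn_indices(points, k):
    """For each point, return the indices of its k nearest other points (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, paired with that point's index.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


# Toy "PCA coordinates": three cells near the origin, two cells near (5, 5).
pcs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]
neighbors = knn_indices(pcs, k=2)
```

Exact search like this scales quadratically with the number of cells, which is why practical tools typically rely on approximate nearest-neighbor methods for large datasets.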

5. Clustering, Trajectory & Differential Expression

sca.clustering.cluster_leiden(adata, resolution=0.5, key_added='leiden')

# Infer Cellular Trajectory
root_idx = sca.trajectory.select_root_cell(adata, cluster_key='leiden', root_cluster='0', strategy='extreme')
sca.trajectory.diffusion_pseudotime(adata, root_cell=root_idx)

# Find Markers
sca.differential.rank_genes_groups(adata, groupby='leiden', method='t-test')
cluster0_markers = sca.differential.get_marker_genes(adata, group='0', pval_cutoff=0.05, lfc_cutoff=0.5)
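
To make the `method='t-test'` and `lfc_cutoff` options more concrete, here is a standalone sketch of a Welch t-statistic and a log2 fold change for one gene, comparing one cluster against the remaining cells. It illustrates the general statistics only; the function names, the pseudocount, and the exact fold-change definition are assumptions and may differ from what scAnalysis computes.

```python
import math
from statistics import mean, variance


def welch_t(group, rest):
    """Welch's t-statistic (unequal variances) for group vs. rest."""
    se = math.sqrt(variance(group) / len(group) + variance(rest) / len(rest))
    return (mean(group) - mean(rest)) / se if se > 0 else 0.0


def log2_fold_change(group, rest, pseudocount=1e-9):
    """log2 ratio of mean expression; the pseudocount avoids division by zero."""
    return math.log2((mean(group) + pseudocount) / (mean(rest) + pseudocount))


# Hypothetical normalized expression of one gene in cluster '0' and in all other cells.
in_cluster = [2.0, 2.5, 3.0, 2.5]
other_cells = [1.0, 0.5, 1.0, 1.5]

t_stat = welch_t(in_cluster, other_cells)
lfc = log2_fold_change(in_cluster, other_cells)
passes_lfc = lfc >= 0.5  # analogous to lfc_cutoff=0.5 in the call above
```

Turning the t-statistic into a p-value requires the t-distribution, which is outside the standard library (SciPy's `scipy.stats.ttest_ind` with `equal_var=False` covers that step).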

6. Visualization

# Static Plots
sca.visualization.plot_umap(adata, color='leiden', save='umap_clusters.png')
sca.visualization.plot_dotplot(adata, var_names=['CD3E', 'MS4A1', 'CD14'], groupby='leiden')

# Interactive Browser-based Plots
sca.interactive_viz.interactive_embedding(adata, basis='X_umap', color='leiden', hover_data=['dpt_pseudotime', 'phase'])
sca.interactive_viz.interactive_3d_embedding(adata, basis='X_pca', color='leiden')

🏗️ Architecture & Modules

The framework is highly modular, allowing you to use only the components you need:

  • scAnalysis.core: Base SingleCellDataset data structure supporting dense and sparse memory-efficient representations.
  • scAnalysis.preprocessing: QC metrics, normalization (scran, sctransform, standard scaling), and HVG selection.
  • scAnalysis.quality_control: Scrublet doublet detection and MAD-based outlier filtering.
  • scAnalysis.imputation: WNID, kNN-smooth, and Diffusion imputation for dropout recovery.
  • scAnalysis.batch_correction: Integration methods via Harmony, ComBat, and MNN.
  • scAnalysis.cell_cycle: S and G2M phase scoring and phase regression.
  • scAnalysis.dimensionality: PCA, UMAP, t-SNE, DiffMap, PHATE, and nearest-neighbor graphs.
  • scAnalysis.clustering: K-Means, Leiden, Louvain, Spectral, DBSCAN, and Hierarchical clustering.
  • scAnalysis.differential: Highly vectorized statistics for marker discovery.
  • scAnalysis.enrichment: Gene set scoring, MSigDB integration, hypergeometric/Fisher enrichment, and GSEA.
  • scAnalysis.trajectory: Root cell selection, Diffusion Pseudotime (DPT), branching, and gene trend modeling.
  • scAnalysis.visualization: Static, publication-ready plotting (Violin, Dotplot, Heatmap, Volcano, etc.).
  • scAnalysis.interactive_viz: Plotly-powered interactive 2D/3D embeddings, violins, and heatmaps.
  • scAnalysis.sc_io: Native read/write support for 10x MTX, CSV, TSV, and .h5ad formats.
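
As an illustration of the hypergeometric enrichment test listed for the enrichment module, the sketch below computes an over-representation p-value with the standard library only. It shows the general test rather than the scAnalysis.enrichment code; the function name `hypergeom_pvalue` and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_pvalue(overlap, n_background, n_in_set, n_selected):
    """P(X >= overlap) when drawing n_selected genes without replacement from
    n_background genes, of which n_in_set belong to the gene set."""
    upper = min(n_in_set, n_selected)
    total = comb(n_background, n_selected)
    favorable = sum(
        comb(n_in_set, i) * comb(n_background - n_in_set, n_selected - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / total


# Toy example: 20 background genes, a 5-gene set, 4 marker genes, 2 of them in the set.
p_value = hypergeom_pvalue(overlap=2, n_background=20, n_in_set=5, n_selected=4)
```

A small p-value means the marker list overlaps the gene set more than expected by chance. When many gene sets are tested, the p-values are usually adjusted for multiple testing.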

🧪 Testing

The package includes a comprehensive suite of unit tests checking matrix sparsity integrity, statistical functions, and algorithmic accuracy. To run the tests locally:

python -m unittest discover scAnalysis/ -p "test_*.py"
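
The discovery command above picks up any file matching test_*.py. A minimal sketch of such a test file is shown below. The file name (for example test_sparsity.py), the helper `filter_rows`, and the test case are all hypothetical and are not taken from the project's actual test suite.

```python
import unittest


def filter_rows(matrix, min_nonzero):
    """Keep rows with at least min_nonzero non-zero entries (stand-in for a cell filter)."""
    return [row for row in matrix if sum(1 for v in row if v != 0) >= min_nonzero]


class TestSparsityIntegrity(unittest.TestCase):
    def setUp(self):
        self.matrix = [
            [0, 0, 3, 0],
            [1, 0, 0, 2],
            [0, 0, 0, 0],
        ]

    def test_low_quality_rows_are_dropped(self):
        self.assertEqual(len(filter_rows(self.matrix, min_nonzero=2)), 1)

    def test_kept_rows_are_unchanged(self):
        # Filtering must not alter values or the zero pattern of surviving rows.
        kept = filter_rows(self.matrix, min_nonzero=1)
        self.assertEqual(kept, [[0, 0, 3, 0], [1, 0, 0, 2]])
```

Placed under scAnalysis/ with a name starting with test_, a file like this would be collected and run by the unittest discovery command.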

🤝 Contributing

Contributions are welcome! If you find a bug or want to suggest a new feature, please open an issue or submit a pull request.

🤖 Future Enhancements / To-Do List

  • Implement Imputation Module (Dropout Handling) (completed)
    • Successfully integrated WNID, kNN-smoothing, and Diffusion algorithms.
  • Add Automated Cell Type Annotation & Projection
    • Context: Currently, cell type assignment relies on a manual, marker-based approach using gene set scoring (enrichment.py).
    • Task: Implement automated, classifier-based annotation tools that can predict cell types directly from reference datasets.
    • References: Consider integrating projection algorithms like scmap or regularized regression classifiers like Garnett.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
