HALRA (High-performance ALRA) is a Python implementation of the ALRA algorithm for imputing missing values in single-cell RNA-seq data [1]. It is designed to operate efficiently on sparse matrices and scale to large datasets by preserving sparsity throughout the pipeline wherever possible.
HALRA performs:
- Low-rank matrix reconstruction via randomized SVD
- Gene-wise thresholding of reconstructed values
- Per-gene rescaling to match observed statistics
- Restoration of observed (nonzero) values
The goal is to denoise and impute dropout values while preserving biological signal.
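The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not HALRA's actual code: the function name `alra_sketch` is hypothetical, the rank `k` is passed in rather than chosen automatically, an exact truncated SVD stands in for randomized SVD, and the threshold is simplified to the magnitude of each gene's most negative reconstructed value.

```python
import numpy as np

def alra_sketch(A, k):
    """Illustrative ALRA-style imputation on a dense (cells x genes) matrix.

    Hypothetical helper for explanation only; not part of the halra package.
    """
    # 1. Low-rank reconstruction (exact truncated SVD here, not randomized SVD).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k]

    # 2. Gene-wise thresholding: zero every value at or below the magnitude
    #    of that gene's most negative reconstructed value.
    thresh = np.abs(A_k.min(axis=0))
    A_thr = np.where(A_k > thresh, A_k, 0.0)

    # 3. Per-gene rescaling: match mean and standard deviation of the
    #    surviving values to those of the observed nonzero values.
    out = A_thr.copy()
    for j in range(A.shape[1]):
        obs = A[:, j][A[:, j] > 0]
        imp_mask = A_thr[:, j] > 0
        imp = A_thr[imp_mask, j]
        if obs.size < 2 or imp.size < 2 or imp.std() == 0:
            continue  # too little information to rescale this gene
        scaled = (imp - imp.mean()) * (obs.std() / imp.std()) + obs.mean()
        out[imp_mask, j] = np.maximum(scaled, 0.0)

    # 4. Restore observed nonzero values that were zeroed along the way.
    lost = (A > 0) & (out == 0)
    out[lost] = A[lost]
    return out

# Toy data: a rank-3 signal with roughly 40% of entries dropped to zero.
rng = np.random.default_rng(0)
signal = rng.gamma(2.0, 1.0, size=(50, 3)) @ rng.gamma(2.0, 1.0, size=(3, 20))
A = np.log1p(signal) * (rng.random((50, 20)) > 0.4)
imputed = alra_sketch(A, k=3)
```

The sketch works on a dense array for readability; it does not attempt the sparsity-preserving handling described above.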
HALRA supports two input types:
- An AnnData object:
  - .X should contain a dense NumPy array or a SciPy sparse matrix (CSR/CSC)
  - .obs_names and .var_names are used as cell and gene labels
- A matrix with separate cell and gene labels:
  - matrix: NumPy ndarray or SciPy sparse matrix (cell x gene)
  - cells: list/array of cell names (length = n_rows)
  - genes: list/array of gene names (length = n_cols)
HALRA requires log-normalized count data. Either log-normalize your data beforehand or pass normalize=True to the halra function when imputing.
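For readers who prefer to normalize beforehand, the following is a minimal sketch of one common convention: scale each cell to a fixed total, then apply log(1 + x). The helper name `log_normalize` and the target of 10,000 counts per cell are assumptions for illustration; whether normalize=True performs exactly this transformation is not stated here.

```python
import numpy as np
import scipy.sparse as sp

def log_normalize(counts, target_sum=1e4):
    """Library-size normalize and log1p a (cells x genes) count matrix.

    Hypothetical helper; target_sum=1e4 is an assumed convention.
    """
    counts = sp.csr_matrix(counts, dtype=np.float64)
    lib_size = np.asarray(counts.sum(axis=1)).ravel()
    lib_size[lib_size == 0] = 1.0  # leave empty cells as all zeros
    # Row-scale with a diagonal matrix so the result stays sparse.
    scaled = (sp.diags(target_sum / lib_size) @ counts).tocsr()
    # log1p maps 0 to 0, so only the stored nonzero entries need updating.
    scaled.data = np.log1p(scaled.data)
    return scaled

counts = sp.csr_matrix(np.array([[1, 0, 3],
                                 [0, 0, 2],
                                 [0, 0, 0]]))
normed = log_normalize(counts)
```

Because the logarithm is applied only to stored entries, zeros are never materialized and the output has the same sparsity pattern as the input.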
HALRA can be installed with pip and requires Python >= 3.10. For example, in a fresh conda environment:
conda create -n halra_env python=3.10
conda activate halra_env
pip install halra

Example usage with an AnnData object:
import anndata as ad
from halra import halra
# Load your AnnData object
adata = ad.read_h5ad("anndata.h5ad")
# Run HALRA (this assumes .X is not already normalized)
adata_imputed = halra(adata, normalize=True)
# Result:
# adata_imputed.X now contains imputed values
# All metadata (.obs, .var, etc.) is preserved (filtered if needed)

Example usage with 10x Genomics matrix files:
import os
import pandas as pd
from scipy.io import mmread
from halra import halra
# Load 10x files
mtx_dir = "/path/to/dir"
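# matrix.mtx is stored as genes x cells; transpose to cells x genes and convert to CSR
matrix = mmread(os.path.join(mtx_dir, "matrix.mtx")).T.tocsr()
# Read the first column of each label file as a plain array of names
features = pd.read_csv(os.path.join(mtx_dir, "features.tsv"), sep="\t", header=None, usecols=[0])[0].to_numpy()
barcodes = pd.read_csv(os.path.join(mtx_dir, "barcodes.tsv"), sep="\t", header=None)[0].to_numpy()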
# Run HALRA
imputed_matrix, cells, genes = halra(matrix, barcodes, features, normalize=True)
# Result:
# imputed_matrix contains imputed values
# cells and genes contain the filtered cell/gene labels

HALRA depends on:
- numpy
- scipy
- scikit-learn (for randomized SVD)
- anndata (>=0.10)

Limitations and ongoing work:
- The reconstruction step is dense (SVD-based), which may limit scalability for extremely large datasets (>1M cells)
- Distributed and HPC-oriented implementations of HALRA are under active development and can be found in the experimental/ directory. These are not yet part of the stable package API.
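
To give a sense of why a dense reconstruction becomes limiting at that scale, here is a back-of-the-envelope memory estimate. The helper name `dense_gib`, the 20,000-gene panel, and the 4-byte (float32) value size are assumptions for illustration, not HALRA settings.

```python
def dense_gib(n_cells, n_genes, bytes_per_value=4):
    """Memory needed to hold one dense (cells x genes) matrix, in GiB."""
    return n_cells * n_genes * bytes_per_value / 2**30

# One million cells by an assumed 20,000 genes, stored as float32.
print(f"{dense_gib(1_000_000, 20_000):.1f} GiB")  # prints "74.5 GiB"
```

A single dense copy at this size already exceeds the memory of most workstations, before counting any intermediate arrays.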
[1] Linderman, G. C. et al. Zero-preserving imputation of single-cell RNA-seq data. Nat Commun 13 (2022).
This project is licensed under the MIT License - see the LICENSE file for details.