Ultra-fast one-hot encoding powered by Rust + CUDA, with Python bindings.
One-hot encoding is a fundamental operation in machine learning pipelines, yet existing implementations in Python (scikit-learn, pandas, numpy) are surprisingly slow on large datasets. They suffer from Python overhead, single-threaded execution, and suboptimal memory access patterns.
ohe-rs solves this by implementing one-hot encoding in Rust with:
- Parallel category discovery using rayon + FxHashMap (lock-free, per-thread local maps with global merge)
- Zero-copy Python integration via PyO3 + numpy array protocol
- Sparse CSR output that uses ~13 bytes/row regardless of cardinality (vs N*K for dense)
- Optional CUDA acceleration for GPU-resident data pipelines
- Memory-safe operation with upfront estimation and chunked processing for large datasets
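As a rough sanity check, the ~13 bytes/row figure follows directly from the CSR layout; the sketch below assumes uint8 values, int32 column indices, and int64 row pointers (an assumption, but consistent with the memory numbers reported further down):

# Per-row cost of the sparse CSR output (exactly one non-zero per row).
# Assumed dtypes: uint8 value, int32 column index, int64 row pointer.
bytes_per_row = 1 + 4 + 8  # = 13 bytes, independent of cardinality K
for k in (10, 1_000, 100_000):
    print(f"K={k}: sparse ~{bytes_per_row} B/row vs dense {k} B/row (uint8)")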
Machine: 2x Intel Xeon Gold 6542Y (80 cores), 504 GB RAM, NVIDIA L4 (24 GB), Linux. Protocol: 10M rows, warm-up excluded, GC disabled, 7 repeats (median), uint8 output, 80 rayon threads.
| Cardinality (K) | ohe-rs CPU | scikit-learn | Speedup |
|---|---|---|---|
| K = 10 | 26 ms (387 M rows/s) | 381 ms | 15x |
| K = 1,000 | 21 ms (468 M rows/s) | 740 ms | 35x |
| K = 100,000 | 56 ms (179 M rows/s) | 1,310 ms | 23x |
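For reference, the protocol above can be reproduced with a few lines of Python. This is only a sketch of the measurement loop (the actual script is benchmark.py, mentioned at the end), shown here for K = 1,000:

import gc
import statistics
import time
import numpy as np
from ohe_rs import encode_sparse, set_threads

set_threads(80)
data = np.random.randint(0, 1_000, size=10_000_000, dtype=np.int64)  # K = 1,000

encode_sparse(data)  # warm-up run, excluded from timing
gc.disable()
timings = []
for _ in range(7):
    t0 = time.perf_counter()
    encode_sparse(data)
    timings.append(time.perf_counter() - t0)
gc.enable()
print(f"median: {statistics.median(timings) * 1e3:.1f} ms")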
Every combination of H2D (host-to-device) and D2H (device-to-host) transfer was benchmarked for both ohe-rs and PyTorch.
CPU:
| Method | K=10 | K=1,000 | K=100,000 |
|---|---|---|---|
| ohe-rs CPU | 18 ms | 18 ms | 21 ms |
| PyTorch sparse COO CPU | 29 ms | 29 ms | 28 ms |
| sklearn (prefitted) | 400 ms | 698 ms | 1,337 ms |
GPU with H2D + D2H (data on host, result on host):
| Method | K=10 | K=1,000 | K=100,000 |
|---|---|---|---|
| ohe-rs GPU | 48 ms | 44 ms | 75 ms |
| PyTorch GPU | 75 ms | 73 ms | 73 ms |
GPU pre-loaded input, D2H output (kernel + D2H):
| Method | K=10 | K=1,000 | K=100,000 |
|---|---|---|---|
| ohe-rs GPU | 25 ms | 25 ms | 25 ms |
| PyTorch GPU | 66 ms | 65 ms | 64 ms |
GPU all on device — kernel only (no transfer):
| Method | K=10 | K=1,000 | K=100,000 |
|---|---|---|---|
| ohe-rs GPU | 1.3 ms | 1.4 ms | 1.4 ms |
| PyTorch GPU | 1.5 ms | 1.5 ms | 1.5 ms |
ohe-rs wins in nearly every scenario. At K=100K with full H2D+D2H, PyTorch edges ahead (73 ms vs 75 ms) due to lower transfer overhead for COO metadata vs CSR arrays.
PyTorch F.one_hot limitation: it allocates a dense int64 tensor (8 bytes per element) before any cast. At K=1,000 with 10M rows that is 80 GB of RAM. ohe-rs sparse output uses ~13 bytes/row regardless of K.
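A quick check of that figure, using standard PyTorch (the call below is torch.nn.functional.one_hot; the row count and cardinality are the benchmark values from above):

import torch
import torch.nn.functional as F

x = torch.tensor([0, 2, 1])
print(F.one_hot(x, num_classes=3).dtype)  # torch.int64 -> 8 bytes per element

# Back-of-envelope for 10M rows at K = 1,000:
rows, k = 10_000_000, 1_000
print(f"dense int64 one-hot: {rows * k * 8 / 1e9:.0f} GB")          # 80 GB
print(f"ohe-rs sparse CSR (~13 B/row): {rows * 13 / 1e6:.0f} MB")   # 130 MB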
One-hot encoding is memory-bandwidth bound, not compute-bound. More threads help only up to the point where RAM bandwidth saturates. On our 80-core machine, the sweet spot is 8-16 threads:
| Threads | End-to-end (K=10) | End-to-end (K=100K) | Transform (K=10) |
|---|---|---|---|
| 1 | 58 ms | 273 ms | 24 ms |
| 2 | 38 ms | 148 ms | 20 ms |
| 4 | 30 ms | 101 ms | 22 ms |
| 8 | 20 ms | 70 ms | 16 ms |
| 16 | 20 ms | 62 ms | 16 ms |
| 32 | 20 ms | 55 ms | 20 ms |
| 80 | 28 ms | 64 ms | 29 ms |
Beyond 16 threads, performance degrades due to cache contention. On typical workstations (4-8 cores), all cores are useful. Use set_threads() to tune:
from ohe_rs import set_threads
set_threads(8)  # recommended for machines with >16 cores

# Clone
git clone https://github.com/genpat-it/ohe-rs.git
cd ohe-rs
# CPU-only build
pip install maturin
maturin develop --release
# With CUDA support (requires CUDA toolkit)
CUDA_ROOT=/usr/local/cuda maturin develop --release --features cuda

# CPU-only
docker pull ghcr.io/genpat-it/ohe-rs:latest
docker run --rm ghcr.io/genpat-it/ohe-rs -c "from ohe_rs import encode_sparse; print('OK')"
# With CUDA support
docker pull ghcr.io/genpat-it/ohe-rs:latest-cuda
docker run --rm --gpus all ghcr.io/genpat-it/ohe-rs:latest-cuda -c "from ohe_rs import gpu_available; print('GPU:', gpu_available())"

Images are automatically built and pushed on each release.
conda install -c bioconda ohe-rs

Available for Python 3.10, 3.11, 3.12, and 3.13 on linux-64 and linux-aarch64.
- Rust 1.70+
- Python 3.9+
- numpy >= 1.20
- scipy >= 1.7
- CUDA toolkit (optional, for GPU support)
import numpy as np
from scipy.sparse import csr_matrix
from ohe_rs import encode_sparse
data = np.array([0, 1, 2, 0, 1, 2, 3], dtype=np.int64)
values, indices, indptr, n_categories = encode_sparse(data)
# Build scipy sparse matrix
matrix = csr_matrix((values, indices, indptr), shape=(len(data), n_categories))
print(matrix.toarray())
# [[1 0 0 0]
# [0 1 0 0]
# [0 0 1 0]
# [1 0 0 0]
# [0 1 0 0]
# [0 0 1 0]
# [0 0 0 1]]

from ohe_rs import encode_dense
data = np.array([0, 1, 2, 0], dtype=np.int64)
matrix = encode_dense(data)  # np.ndarray, shape (4, 3), dtype uint8

from ohe_rs import encode_strings_sparse
strings = ["cat", "dog", "cat", "bird", "dog"]
values, indices, indptr, categories, n_cats = encode_strings_sparse(strings)
print(categories)  # ['cat', 'dog', 'bird']

For datasets with many categorical columns (e.g. cgMLST allele profiles), encode_multi_sparse encodes all columns in a single Rust call, avoiding Python loop overhead.
import numpy as np
from scipy.sparse import csr_matrix
from ohe_rs import encode_multi_sparse
# cgMLST-like matrix: 10K samples x 8K loci, each cell is an allele ID
profiles = np.random.randint(0, 300, size=(10_000, 8_000), dtype=np.int64)
# Single call — encodes all columns in parallel
values, indices, indptr, total_cols, per_col_sizes = encode_multi_sparse(profiles)
# Build scipy sparse matrix (rows=samples, cols=concatenated one-hot of all loci)
matrix = csr_matrix((values, indices, indptr), shape=(10_000, total_cols))
# matrix.shape = (10000, ~2.4M); each row has exactly 8000 non-zeros

Performance (10K samples x 8K loci, ~50-500 alleles per locus):
| Method | Time | Speedup |
|---|---|---|
| ohe-rs encode_multi_sparse | 724 ms | 12x |
| ohe-rs per-column Python loop | 2,491 ms | 3.5x |
| sklearn per-column | 8,618 ms | baseline |
Memory usage:
| Representation | Size |
|---|---|
| Input matrix (int64) | 640 MB |
| Sparse output (CSR) | 400 MB |
| Dense equivalent (uint8) | 21.8 GB |
Sparse uses 1.8% of the memory that dense would require. The output matrix has shape (10,000 x 2,178,687) with 80M non-zeros — each row has exactly 8,000 ones (one per locus).
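To verify the footprint on your own data, note that the CSR buffers are plain numpy arrays, so their sizes can be read directly off the scipy matrix. A small sketch (matrix is the csr_matrix built in the encode_multi_sparse example above):

# Actual bytes held by the three CSR buffers (all numpy arrays)
sparse_bytes = matrix.data.nbytes + matrix.indices.nbytes + matrix.indptr.nbytes

# Dense uint8 equivalent for comparison: 1 byte per cell
n_rows, n_cols = matrix.shape
dense_bytes = n_rows * n_cols

print(f"sparse: {sparse_bytes / 1e6:.0f} MB, dense: {dense_bytes / 1e9:.1f} GB "
      f"({100 * sparse_bytes / dense_bytes:.1f}% of dense)")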
For repeated encoding against the same schema (e.g. new samples arriving daily):
from ohe_rs import MultiEncoder
# Fit once on reference dataset (builds category maps)
encoder = MultiEncoder.fit(reference_profiles) # 251 ms
# Transform new samples instantly (skip discovery)
result = encoder.transform(new_profiles) # 27 ms for 500 samples
# Or fit + transform in one call
encoder, values, indices, indptr, total_k, col_sizes = MultiEncoder.fit_transform(profiles)
# Inspect
print(encoder.n_loci) # 8000
print(encoder.total_columns) # 2,178,687
print(encoder.categories_per_column)  # [124, 89, 201, ...]

| Operation | Time |
|---|---|
| fit (10K reference, one-time) | 251 ms |
| transform (500 new samples) | 27 ms |
| transform (10K samples) | 465 ms |
| encode_multi_sparse (no cache) | 665 ms |
from ohe_rs import estimate_memory
data = np.random.randint(0, 100_000, size=10_000_000, dtype=np.int64)
dense_bytes, sparse_bytes = estimate_memory(data)
print(f"Dense: {dense_bytes / 1e9:.1f} GB") # Dense: 1000.0 GB
print(f"Sparse: {sparse_bytes / 1e6:.1f} MB") # Sparse: 130.0 MB# Automatically processes in chunks if needed
matrix = encode_dense(data, max_memory_mb=512)from ohe_rs import gpu_available
if gpu_available():
    from ohe_rs import gpu_encode_sparse, gpu_encode_dense
    values, indices, indptr, n_cats = gpu_encode_sparse(data)
    dense_matrix = gpu_encode_dense(data)  # for small K

from ohe_rs import set_threads
set_threads(4)  # Limit to 4 threads

Input (Python numpy array)
|
v
+----------------------------+
| Rust Core (PyO3 bindings) |
| |
| 1. Category Discovery |
| rayon parallel chunks |
| FxHashMap per-thread |
| + sequential merge |
| |
| 2. Encoding |
| CPU: parallel write |
| GPU: CUDA kernel |
| |
| 3. Output |
| Sparse CSR (zero-copy) |
| Dense ndarray |
+----------------------------+
|
v
scipy.sparse.csr_matrix / np.ndarray
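For intuition, here is a pure-Python sketch of the same two-pass flow the diagram describes (category discovery, then one encoding pass that fills the CSR arrays). It is illustrative only and much slower than the Rust core, which runs the discovery in parallel with per-thread maps and a final merge:

import numpy as np

def one_hot_csr_sketch(data: np.ndarray):
    # Pass 1: category discovery (first-seen order -> column index)
    cat_to_col = {}
    for v in data:
        if v not in cat_to_col:
            cat_to_col[v] = len(cat_to_col)

    # Pass 2: encoding. Exactly one non-zero per row, so the CSR arrays
    # have fixed, predictable sizes.
    n = len(data)
    values = np.ones(n, dtype=np.uint8)
    indices = np.empty(n, dtype=np.int32)
    indptr = np.arange(n + 1, dtype=np.int64)  # row i spans [i, i+1)
    for i, v in enumerate(data):
        indices[i] = cat_to_col[v]
    return values, indices, indptr, len(cat_to_col)

values, indices, indptr, n_categories = one_hot_csr_sketch(
    np.array([0, 1, 2, 0, 1, 2, 3], dtype=np.int64))
print(n_categories)  # 4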
One-hot encoding is memory-bound, not compute-bound. Each element requires:
- 1 hash lookup (category mapping)
- 1 memory write (set the bit)
The GPU kernel itself runs in about a millisecond (see the kernel-only numbers above), but the host-to-device transfer of N int64 values (~80 MB for 10M rows) dominates the total time. GPU wins when:
- Data is already on the GPU (e.g., in a cuML/PyTorch pipeline)
- You combine OHE with other GPU operations, amortizing the transfer cost
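For a sense of the volumes being moved in the 10M-row benchmark (the output size assumes the same CSR dtypes as the CPU path: uint8 values, int32 indices, int64 indptr):

n = 10_000_000
h2d_bytes = n * 8                       # int64 input codes: ~80 MB host -> device
d2h_bytes = n * (1 + 4) + (n + 1) * 8   # CSR result arrays: ~130 MB device -> host
print(f"H2D ~{h2d_bytes / 1e6:.0f} MB, D2H ~{d2h_bytes / 1e6:.0f} MB")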
# Build
cargo build --release
# Tests
cargo test
# Build with CUDA
CUDA_ROOT=/usr/local/cuda cargo build --release --features cuda
# Python development install
maturin develop --release
# Run benchmarks
python benchmark.py

License: MIT