✨ feat: add Python deep learning API with PyTorch and Lightning integ… by cauliyang · Pull Request #85 · cauliyang/DeepBioP

cauliyang · 2025-11-08T05:37:58Z

Implement comprehensive Python API for deep learning workflows with biological sequence data, including streaming datasets, transforms, and ML framework integrations.

Core Features:

Stream datasets (FASTQ/FASTA/BAM) with PyTorch DataLoader compatibility
Map-style dataset access via __getitem__ for random access
Batch generation with collation and padding strategies
Zero-copy NumPy array integration for efficient memory usage
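
To make the collation and padding item concrete, here is a minimal pure-Python sketch of right-padding a batch of integer-encoded reads to a common length. The function name `pad_collate` and its `pad_value` parameter are hypothetical and are not taken from the DeepBioP API; a real collate function would typically return tensors or NumPy arrays rather than lists.

```python
from typing import List, Sequence, Tuple


def pad_collate(
    batch: Sequence[Sequence[int]], pad_value: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Right-pad every encoded sequence to the longest length in the batch.

    Returns the padded batch and the original lengths, so a model can
    mask out the padded positions later.
    """
    lengths = [len(seq) for seq in batch]
    max_len = max(lengths, default=0)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in batch]
    return padded, lengths


# Three reads of different lengths, already integer-encoded.
batch = [[1, 2, 3, 4], [2, 2], [3]]
padded, lengths = pad_collate(batch)
```

Returning the original lengths alongside the padded batch is one common design choice; another is to return a boolean mask instead.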

Dataset Implementations:

FastqStreamDataset, FastaStreamDataset, BamStreamDataset
Support for compressed files (gzip, bgzip)
Metadata tracking and dataset statistics
Pickle support for multiprocessing (num_workers > 0)
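
Pickle support matters because a data-loading worker process may receive a pickled copy of the dataset, and open file handles cannot be pickled. One common way to get there, sketched below under stated assumptions, is to store only the path and open the file lazily inside `__iter__`. The class `FastqStream` is a hypothetical stand-in, not the `FastqStreamDataset` from this PR, and it assumes strict four-line FASTQ records.

```python
import gzip
import os
import pickle
import tempfile
from typing import Iterator, Tuple


class FastqStream:
    """Hypothetical streaming reader; not DeepBioP's FastqStreamDataset."""

    def __init__(self, path: str) -> None:
        # Keep only the path so the object stays picklable.
        self.path = path

    def __iter__(self) -> Iterator[Tuple[str, str, str]]:
        # Open lazily, per iteration, so each process gets its own handle.
        opener = gzip.open if self.path.endswith(".gz") else open
        with opener(self.path, "rt") as handle:
            while True:
                header = handle.readline().rstrip("\n")
                if not header:
                    break  # end of file
                sequence = handle.readline().rstrip("\n")
                handle.readline()  # skip the '+' separator line
                quality = handle.readline().rstrip("\n")
                yield header[1:], sequence, quality  # drop the leading '@'


# Round-trip through pickle, then stream from a small gzip-compressed file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reads.fastq.gz")
    with gzip.open(path, "wt") as out:
        out.write("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n##\n")
    stream = pickle.loads(pickle.dumps(FastqStream(path)))
    records = list(stream)
```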

Transforms & Augmentation:

OneHotEncoder, IntegerEncoder, KmerEncoder for sequence encoding
ReverseComplement, Mutator, Sampler for data augmentation
QualityFilter, LengthFilter for data filtering
QualitySimulator for synthetic quality score generation
Transform composition with Compose and FilterCompose
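
As a rough illustration of what these transforms compute, the sketch below implements reverse complement, one-hot encoding, k-mer extraction, and a `Compose`-style chain in plain Python. All names and signatures here are illustrative assumptions; the actual `OneHotEncoder`, `KmerEncoder`, `ReverseComplement`, and `Compose` classes in this PR may use different arguments, base ordering, and output types.

```python
from typing import Any, Callable, Iterable, List

BASES = "ACGT"  # assumed column order for the one-hot matrix
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def one_hot(seq: str) -> List[List[int]]:
    """One row per position, one column per base; 'N' gives an all-zero row."""
    return [[int(base == b) for b in BASES] for base in seq]


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping substrings of length k, in order."""
    return [seq[i : i + k] for i in range(len(seq) - k + 1)]


class Compose:
    """Apply a list of transforms left to right (illustrative only)."""

    def __init__(self, transforms: Iterable[Callable[[Any], Any]]) -> None:
        self.transforms = list(transforms)

    def __call__(self, item: Any) -> Any:
        for transform in self.transforms:
            item = transform(item)
        return item


pipeline = Compose([reverse_complement, one_hot])
encoded = pipeline("AACG")  # reverse complement is "CGTT", then one-hot
tokens = kmers("ACGTA", 3)
```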

ML Framework Integration:

PyTorch DataLoader with multiprocessing support
PyTorch Lightning BiologicalDataModule
Hugging Face Transformers Trainer compatibility
DistributedSampler support for distributed training
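
With streaming (iterable-style) datasets, each DataLoader worker or distributed rank iterates its own copy of the stream, so records are duplicated unless the stream is split. The sketch below shows one simple splitting scheme, round-robin sharding with `itertools.islice`. It is an assumption about how sharding could be done, not a description of how this PR implements it; the helper name `shard` is hypothetical. In PyTorch, the shard index and count would typically come from `torch.utils.data.get_worker_info()` or from the process rank and world size.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard(records: Iterable[T], shard_id: int, num_shards: int) -> Iterator[T]:
    """Yield every num_shards-th record, starting at offset shard_id."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")
    return islice(records, shard_id, None, num_shards)


# Seven reads split across three workers: no overlap, nothing dropped.
reads = [f"r{i}" for i in range(7)]
shards = [list(shard(reads, worker, 3)) for worker in range(3)]
```

Round-robin sharding still makes every worker read the whole file and skip the records it does not own; it trades some redundant I/O for simplicity.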

Documentation & Examples:

Comprehensive API reference documentation
Quickstart guide with common use cases
Performance optimization guide
Troubleshooting documentation
Jupyter notebook examples for PyTorch and Lightning

Testing:

180 passing tests across all modules
Performance benchmarks for throughput and memory
Integration tests for PyTorch, Lightning, and Transformers
Transform and encoding correctness tests

All tests pass with pre-commit hooks verified.

Remove placeholder test files and enable all previously ignored dataset tests to achieve 100% test coverage with zero ignored tests.

Changes:

- Remove placeholder test files in deepbiop-fq that contained only commented code
  - Deleted test_compression.rs (234 lines of unused placeholder code)
  - Deleted test_dataset.rs (175 lines of unused placeholder code)
- Enable and fix 3 FASTA dataset tests in deepbiop-fa
  - Removed #[ignore] attributes from dataset creation, iteration, and multiple iteration tests
  - All tests now run successfully with existing test data
- Enable and fix 4 BAM dataset tests in deepbiop-bam
  - Removed #[ignore] attributes from all dataset tests
  - Updated file paths from test.bam to test_chimric_reads.bam
  - Tests validate dataset creation, iteration, multiple iterations, and multi-threaded operation
- Fix test imports and feature gates
  - Properly gate parquet-related test imports with #[cfg(feature = "cache")]
  - Move imports inside test functions to avoid unused import warnings
  - Ensure EncoderOptionBuilder is imported where needed

Test Results:

- Rust: 206 total tests passing, 0 ignored (was 7 ignored)
  - deepbiop-bam: 8 passed, 0 ignored (was 4 ignored)
  - deepbiop-core: 43 passed, 0 ignored
  - deepbiop-fa: 31 passed, 0 ignored (was 3 ignored)
  - deepbiop-fq: 100 passed, 0 ignored
  - deepbiop-utils: 24 passed, 0 ignored
- Python: 180 passed, 18 skipped (optional dependencies)

All pre-commit hooks passing.

Add test-python.yml workflow for Python CI with three jobs:

**test job** - Multi-platform and multi-version testing

- Platforms: Ubuntu, macOS, Windows
- Python versions: 3.10, 3.11, 3.12
- Uses uv for fast dependency management
- Builds Rust extension with maturin
- Runs full pytest suite
- Generates coverage report (Linux + Python 3.10 only)
- Uploads coverage to Codecov

**lint job** - Code quality checks

- Runs ruff for code formatting and linting
- Checks docstring coverage with interrogate
- Ensures code meets quality standards

**minimum-versions job** - Compatibility verification

- Tests with Python 3.10 (minimum supported version)
- Verifies version requirements match pyproject.toml

**Features:**

- Concurrent execution with automatic cancellation
- Caching for both Rust and Python dependencies
- Fail-fast disabled to see all failures
- Minimal debug symbols for faster builds

**Related changes:**

- Renamed test.yml to test-rust.yml for clarity
- Updated pyproject.toml with correct requires-python (>=3.10)

Refine test-python.yml workflow:

- Add -ls flag to pytest for better test output visibility
- Remove redundant pytest-cov installation step
- Remove minimum-versions job (redundant with test matrix)
- Fix YAML formatting (spacing in comment)

Update Python dependencies:

- Add pytest-cov>=7.0.0 to dev dependencies
- Update uv.lock with coverage 7.11.1 package

This streamlines the CI workflow while ensuring coverage reporting works correctly across all test runs.

cauliyang added 4 commits November 7, 2025 23:36

cauliyang closed this Nov 8, 2025

cauliyang wants to merge 4 commits into 001-biodata-dl-lib from 003-python-dl-optimization
