✨ feat: add Python deep learning API with PyTorch and Lightning integ…#85
Closed
cauliyang wants to merge 4 commits into
…ration

Implement comprehensive Python API for deep learning workflows with biological sequence data, including streaming datasets, transforms, and ML framework integrations.

Core Features:
- Stream datasets (FASTQ/FASTA/BAM) with PyTorch DataLoader compatibility
- Map-style dataset access via __getitem__ for random access
- Batch generation with collation and padding strategies
- Zero-copy NumPy array integration for efficient memory usage

Dataset Implementations:
- FastqStreamDataset, FastaStreamDataset, BamStreamDataset
- Support for compressed files (gzip, bgzip)
- Metadata tracking and dataset statistics
- Pickle support for multiprocessing (num_workers > 0)

Transforms & Augmentation:
- OneHotEncoder, IntegerEncoder, KmerEncoder for sequence encoding
- ReverseComplement, Mutator, Sampler for data augmentation
- QualityFilter, LengthFilter for data filtering
- QualitySimulator for synthetic quality score generation
- Transform composition with Compose and FilterCompose

ML Framework Integration:
- PyTorch DataLoader with multiprocessing support
- PyTorch Lightning BiologicalDataModule
- Hugging Face Transformers Trainer compatibility
- DistributedSampler support for distributed training

Documentation & Examples:
- Comprehensive API reference documentation
- Quickstart guide with common use cases
- Performance optimization guide
- Troubleshooting documentation
- Jupyter notebook examples for PyTorch and Lightning

Testing:
- 180 passing tests across all modules
- Performance benchmarks for throughput and memory
- Integration tests for PyTorch, Lightning, and Transformers
- Transform and encoding correctness tests

All tests pass with pre-commit hooks verified.
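To make the transform and collation ideas above concrete, here is a minimal pure-Python sketch of length filtering, reverse complementing, one-hot encoding, and padding a batch. It does not use the deepbiop API: the function names `length_filter`, `reverse_complement`, `one_hot`, and `pad_collate` are hypothetical stand-ins chosen for illustration, and the real classes listed above (LengthFilter, ReverseComplement, OneHotEncoder) may behave differently.

```python
# Illustrative stand-ins only; these are NOT the deepbiop implementations.
BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def length_filter(seqs: list[str], min_len: int) -> list[str]:
    """Keep only sequences with at least min_len bases."""
    return [s for s in seqs if len(s) >= min_len]


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def one_hot(seq: str) -> list[list[int]]:
    """Encode each base as a row over A, C, G, T; other symbols become all zeros."""
    return [[int(base == b) for b in BASES] for base in seq]


def pad_collate(encoded: list[list[list[int]]]):
    """Right-pad each encoded sequence with zero rows to the longest length."""
    lengths = [len(rows) for rows in encoded]
    max_len = max(lengths, default=0)
    padded = [
        rows + [[0] * len(BASES) for _ in range(max_len - len(rows))]
        for rows in encoded
    ]
    return padded, lengths


reads = ["ACGT", "GGA", "T"]
kept = length_filter(reads, min_len=3)  # drops the 1-base read
batch, lengths = pad_collate([one_hot(reverse_complement(s)) for s in kept])
```

Returning the original lengths alongside the padded batch lets a downstream model mask out the padding rows. In a PyTorch setting, a function shaped like `pad_collate` would typically be passed as `collate_fn` to a DataLoader after converting the nested lists to tensors.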
Remove placeholder test files and enable all previously ignored dataset tests
to achieve 100% test coverage with zero ignored tests.
Changes:
- Remove placeholder test files in deepbiop-fq that contained only commented code
  - Deleted test_compression.rs (234 lines of unused placeholder code)
  - Deleted test_dataset.rs (175 lines of unused placeholder code)
- Enable and fix 3 FASTA dataset tests in deepbiop-fa
  - Removed #[ignore] attributes from dataset creation, iteration, and multiple iteration tests
  - All tests now run successfully with existing test data
- Enable and fix 4 BAM dataset tests in deepbiop-bam
  - Removed #[ignore] attributes from all dataset tests
  - Updated file paths from test.bam to test_chimric_reads.bam
  - Tests validate dataset creation, iteration, multiple iterations, and multi-threaded operation
- Fix test imports and feature gates
  - Properly gate parquet-related test imports with #[cfg(feature = "cache")]
  - Move imports inside test functions to avoid unused import warnings
  - Ensure EncoderOptionBuilder is imported where needed
Test Results:
- Rust: 206 total tests passing, 0 ignored (was 7 ignored)
  - deepbiop-bam: 8 passed, 0 ignored (was 4 ignored)
  - deepbiop-core: 43 passed, 0 ignored
  - deepbiop-fa: 31 passed, 0 ignored (was 3 ignored)
  - deepbiop-fq: 100 passed, 0 ignored
  - deepbiop-utils: 24 passed, 0 ignored
- Python: 180 passed, 18 skipped (optional dependencies)

All pre-commit hooks passing.
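As a quick sanity check on the Rust totals reported above, the per-crate counts can be summed. This is only an arithmetic sketch using the numbers from the commit message; it assumes that crates listed without a "was N ignored" note had no ignored tests beforehand.

```python
# Per-crate counts copied from the test results: (passed, previously ignored).
# Assumption: crates with no "was N ignored" note previously had 0 ignored tests.
rust_results = {
    "deepbiop-bam": (8, 4),
    "deepbiop-core": (43, 0),
    "deepbiop-fa": (31, 3),
    "deepbiop-fq": (100, 0),
    "deepbiop-utils": (24, 0),
}

total_passed = sum(passed for passed, _ in rust_results.values())
total_previously_ignored = sum(ignored for _, ignored in rust_results.values())

print(total_passed, total_previously_ignored)  # prints: 206 7
```

Both sums agree with the reported totals of 206 passing tests and 7 previously ignored tests.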
Add test-python.yml workflow for Python CI with three jobs:

**test job** - Multi-platform and multi-version testing
- Platforms: Ubuntu, macOS, Windows
- Python versions: 3.10, 3.11, 3.12
- Uses uv for fast dependency management
- Builds Rust extension with maturin
- Runs full pytest suite
- Generates coverage report (Linux + Python 3.10 only)
- Uploads coverage to Codecov

**lint job** - Code quality checks
- Runs ruff for code formatting and linting
- Checks docstring coverage with interrogate
- Ensures code meets quality standards

**minimum-versions job** - Compatibility verification
- Tests with Python 3.10 (minimum supported version)
- Verifies version requirements match pyproject.toml

**Features:**
- Concurrent execution with automatic cancellation
- Caching for both Rust and Python dependencies
- Fail-fast disabled to see all failures
- Minimal debug symbols for faster builds

**Related changes:**
- Renamed test.yml to test-rust.yml for clarity
- Updated pyproject.toml with correct requires-python (>=3.10)
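The test job's matrix can be pictured as the cross product of platforms and Python versions, with coverage enabled for a single combination. The sketch below only enumerates that matrix in Python to show its size; it is not the workflow file itself, and the platform labels and dictionary keys are illustrative assumptions rather than the actual runner names or matrix fields.

```python
from itertools import product

# Illustrative labels; the actual runner names in test-python.yml may differ.
platforms = ["ubuntu", "macos", "windows"]
python_versions = ["3.10", "3.11", "3.12"]

matrix = [
    {
        "os": os_name,
        "python": version,
        # Coverage is generated only for the Linux + Python 3.10 combination.
        "coverage": os_name == "ubuntu" and version == "3.10",
    }
    for os_name, version in product(platforms, python_versions)
]

coverage_jobs = [job for job in matrix if job["coverage"]]
print(len(matrix), len(coverage_jobs))  # prints: 9 1
```

With fail-fast disabled, all nine combinations run to completion even if one fails, which is what lets every failure surface in a single CI run.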
Refine test-python.yml workflow:
- Add -ls flag to pytest for better test output visibility
- Remove redundant pytest-cov installation step
- Remove minimum-versions job (redundant with test matrix)
- Fix YAML formatting (spacing in comment)

Update Python dependencies:
- Add pytest-cov>=7.0.0 to dev dependencies
- Update uv.lock with coverage 7.11.1 package

This streamlines the CI workflow while ensuring coverage reporting works correctly across all test runs.