✨ feat: add Python deep learning API with PyTorch and Lightning integ…#85

Closed
cauliyang wants to merge 4 commits into 001-biodata-dl-lib from 003-python-dl-optimization

Conversation

@cauliyang

Owner

Implement comprehensive Python API for deep learning workflows with biological sequence data, including streaming datasets, transforms, and ML framework integrations.

Core Features:

  • Stream datasets (FASTQ/FASTA/BAM) with PyTorch DataLoader compatibility
  • Map-style dataset access via __getitem__ for random access
  • Batch generation with collation and padding strategies
  • Zero-copy NumPy array integration for efficient memory usage
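
To illustrate the padding strategy mentioned above, here is a minimal pure-Python sketch. `pad_batch` is a hypothetical helper written for this note, not the API added in this PR; it assumes sequences are already integer-encoded and that right-padding to the longest item in the batch is acceptable.

```python
from typing import List, Sequence, Tuple


def pad_batch(
    seqs: Sequence[Sequence[int]], pad_value: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Right-pad integer-encoded sequences to the longest length in the batch.

    Returns the padded rows plus the original lengths, which a model can use
    to build an attention or loss mask.
    """
    lengths = [len(s) for s in seqs]
    max_len = max(lengths, default=0)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]
    return padded, lengths


# Three reads of different lengths collated into one rectangular batch.
batch, lengths = pad_batch([[1, 2, 3], [4], [2, 2]])
```

A function of this shape could be passed as `collate_fn` to a DataLoader after converting the rows to tensors; whether the PR's collation works this way is not stated here.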

Dataset Implementations:

  • FastqStreamDataset, FastaStreamDataset, BamStreamDataset
  • Support for compressed files (gzip, bgzip)
  • Metadata tracking and dataset statistics
  • Pickle support for multiprocessing (num_workers > 0)
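
Why pickle support matters for `num_workers > 0` can be sketched as follows. `FastqStream` is a hypothetical stand-in, not the `FastqStreamDataset` from this PR: it assumes plain four-line FASTQ records and keeps only the file path as state, opening the file lazily on iteration so the object survives a pickle round trip.

```python
import gzip
import os
import pickle
import tempfile
from typing import Iterator, Tuple


class FastqStream:
    """Hypothetical streaming reader: stores only the path, opens lazily."""

    def __init__(self, path: str) -> None:
        self.path = path  # a plain string, so the object pickles cleanly

    def __iter__(self) -> Iterator[Tuple[str, str, str]]:
        opener = gzip.open if self.path.endswith(".gz") else open
        with opener(self.path, "rt") as handle:
            while True:
                header = handle.readline().rstrip("\n")
                if not header:  # end of file
                    break
                seq = handle.readline().rstrip("\n")
                handle.readline()  # skip the '+' separator line
                qual = handle.readline().rstrip("\n")
                yield header[1:], seq, qual  # drop the leading '@'


# Demo: write two records to a temporary gzip file, then stream them back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reads.fastq.gz")
    with gzip.open(path, "wt") as out:
        out.write("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n!!\n")
    dataset = FastqStream(path)
    records = list(dataset)
    # Round-trip through pickle, as happens when worker processes are spawned.
    clone_records = list(pickle.loads(pickle.dumps(dataset)))
```

Holding an open file handle as an attribute would break the pickle step, which is why the sketch reopens the file inside `__iter__`.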

Transforms & Augmentation:

  • OneHotEncoder, IntegerEncoder, KmerEncoder for sequence encoding
  • ReverseComplement, Mutator, Sampler for data augmentation
  • QualityFilter, LengthFilter for data filtering
  • QualitySimulator for synthetic quality score generation
  • Transform composition with Compose and FilterCompose
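
The ideas behind these transforms can be shown with a small pure-Python sketch. All names below (`reverse_complement`, `quality_filter`, `one_hot`, `compose`) are illustrative assumptions and do not reproduce the signatures of the classes listed above; quality strings are assumed to use the Phred+33 encoding, and a filter signals rejection by returning `None`.

```python
from typing import Callable, List, Optional, Tuple

Record = Tuple[str, str]  # (sequence, quality string)
Step = Callable[[Record], Optional[Record]]

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(record: Record) -> Record:
    """Complement each base and reverse; qualities are reversed to match."""
    seq, qual = record
    return seq.translate(_COMPLEMENT)[::-1], qual[::-1]


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual) if qual else 0.0


def quality_filter(min_mean: float) -> Step:
    """Keep a record only if its mean quality reaches the threshold."""
    def apply(record: Record) -> Optional[Record]:
        return record if mean_phred(record[1]) >= min_mean else None
    return apply


def one_hot(seq: str, alphabet: str = "ACGT") -> List[List[int]]:
    """One row per base; bases outside the alphabet become all-zero rows."""
    return [[int(base == a) for a in alphabet] for base in seq.upper()]


def compose(*steps: Step) -> Step:
    """Chain steps left to right, stopping as soon as a filter returns None."""
    def run(record: Optional[Record]) -> Optional[Record]:
        for step in steps:
            if record is None:
                return None
            record = step(record)
        return record
    return run


pipeline = compose(quality_filter(20.0), reverse_complement)
kept = pipeline(("AACG", "IIII"))     # 'I' encodes Phred 40, so this passes
dropped = pipeline(("AACG", "!!!!"))  # '!' encodes Phred 0, so this is filtered
encoded = one_hot("ACGN")
```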

ML Framework Integration:

  • PyTorch DataLoader with multiprocessing support
  • PyTorch Lightning BiologicalDataModule
  • Hugging Face Transformers Trainer compatibility
  • DistributedSampler support for distributed training
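
For streaming datasets, multiprocessing and distributed training both come down to giving each worker or rank a disjoint slice of the stream; otherwise every worker iterates its own copy of the dataset and records are duplicated. The sketch below shows one common way to do that with round-robin sharding. `shard` is a hypothetical helper, not code from this PR, and unlike PyTorch's `DistributedSampler` it does not pad the shards to equal length.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shard(stream: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Yield every `world_size`-th item of `stream`, starting at index `rank`."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return islice(stream, rank, None, world_size)


# Seven reads split across three workers; shards are disjoint and cover all reads.
reads: List[str] = [f"read{i}" for i in range(7)]
parts = [list(shard(reads, r, 3)) for r in range(3)]
```

Each worker still reads past the records it skips, so this trades some I/O for simplicity; splitting by file or by byte range avoids that cost.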

Documentation & Examples:

  • Comprehensive API reference documentation
  • Quickstart guide with common use cases
  • Performance optimization guide
  • Troubleshooting documentation
  • Jupyter notebook examples for PyTorch and Lightning

Testing:

  • 180 passing tests across all modules
  • Performance benchmarks for throughput and memory
  • Integration tests for PyTorch, Lightning, and Transformers
  • Transform and encoding correctness tests

All tests pass with pre-commit hooks verified.

Remove placeholder test files and enable all previously ignored dataset tests
so that the entire Rust test suite runs with zero ignored tests.

Changes:
- Remove placeholder test files in deepbiop-fq that contained only commented code
  - Deleted test_compression.rs (234 lines of unused placeholder code)
  - Deleted test_dataset.rs (175 lines of unused placeholder code)

- Enable and fix 3 FASTA dataset tests in deepbiop-fa
  - Removed #[ignore] attributes from dataset creation, iteration, and
    multiple iteration tests
  - All tests now run successfully with existing test data

- Enable and fix 4 BAM dataset tests in deepbiop-bam
  - Removed #[ignore] attributes from all dataset tests
  - Updated file paths from test.bam to test_chimric_reads.bam
  - Tests validate dataset creation, iteration, multiple iterations,
    and multi-threaded operation

- Fix test imports and feature gates
  - Properly gate parquet-related test imports with #[cfg(feature = "cache")]
  - Move imports inside test functions to avoid unused import warnings
  - Ensure EncoderOptionBuilder is imported where needed

Test Results:
- Rust: 206 total tests passing, 0 ignored (was 7 ignored)
  - deepbiop-bam: 8 passed, 0 ignored (was 4 ignored)
  - deepbiop-core: 43 passed, 0 ignored
  - deepbiop-fa: 31 passed, 0 ignored (was 3 ignored)
  - deepbiop-fq: 100 passed, 0 ignored
  - deepbiop-utils: 24 passed, 0 ignored
- Python: 180 passed, 18 skipped (optional dependencies)

All pre-commit hooks passing.

Add test-python.yml workflow for Python CI with three jobs:

**test job** - Multi-platform and multi-version testing
- Platforms: Ubuntu, macOS, Windows
- Python versions: 3.10, 3.11, 3.12
- Uses uv for fast dependency management
- Builds Rust extension with maturin
- Runs full pytest suite
- Generates coverage report (Linux + Python 3.10 only)
- Uploads coverage to Codecov

**lint job** - Code quality checks
- Runs ruff for code formatting and linting
- Checks docstring coverage with interrogate
- Ensures code meets quality standards

**minimum-versions job** - Compatibility verification
- Tests with Python 3.10 (minimum supported version)
- Verifies version requirements match pyproject.toml

**Features:**
- Concurrent execution with automatic cancellation
- Caching for both Rust and Python dependencies
- Fail-fast disabled to see all failures
- Minimal debug symbols for faster builds

**Related changes:**
- Renamed test.yml to test-rust.yml for clarity
- Updated pyproject.toml with correct requires-python (>=3.10)

Refine test-python.yml workflow:
- Add -ls flag to pytest for better test output visibility
- Remove redundant pytest-cov installation step
- Remove minimum-versions job (redundant with test matrix)
- Fix YAML formatting (spacing in comment)

Update Python dependencies:
- Add pytest-cov>=7.0.0 to dev dependencies
- Update uv.lock with coverage 7.11.1 package

This streamlines the CI workflow while ensuring coverage reporting
works correctly across all test runs.

cauliyang closed this Nov 8, 2025