HyperFlow port of the Pegasus 1000genome workflow for identifying mutational overlaps using data from the 1000 Genomes Project.
This workflow analyzes genetic variation data from the 1000 Genomes Project to identify mutational overlaps across populations. It processes VCF files for multiple chromosomes through a series of analysis steps including individual processing, sifting, mutation overlap detection, and frequency calculation.
flowchart LR
subgraph Input
VCF["VCF Files<br/>(variants)"]
ANN["Annotations<br/>(SIFT scores)"]
POP["Population<br/>files"]
end
subgraph Workflow["HyperFlow Workflow"]
IND["individuals<br/>×N parallel"]
MRG["merge"]
SFT["sifting"]
MUT["mutation_overlap<br/>×7 populations"]
FRQ["frequency<br/>×7 populations"]
end
subgraph Output
TAR["Analysis<br/>results"]
PLT["Plots"]
end
VCF --> IND --> MRG --> MUT & FRQ --> TAR & PLT
ANN --> SFT --> MUT
POP --> MUT & FRQ
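The stage dependencies in the diagram can also be written down as data, which makes it easy to see which stages may run concurrently. The following is a minimal sketch, not the HyperFlow scheduler: the `DEPENDENCIES` mapping and the `parallel_batches` helper are illustrative names, and the edges are read off the diagram only (input and output artifacts are omitted).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each stage maps to the set of stages it depends on.
# Assumption: edges are taken from the diagram, not from workflow.json.
DEPENDENCIES = {
    "individuals": set(),
    "sifting": set(),
    "merge": {"individuals"},
    "mutation_overlap": {"merge", "sifting"},
    "frequency": {"merge"},
}


def parallel_batches(deps):
    """Group stages into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable result
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(parallel_batches(DEPENDENCIES), start=1):
        print(number, ", ".join(batch))
```

In this sketch `individuals` and `sifting` land in the first batch because neither depends on another stage, while `mutation_overlap` and `frequency` both wait for `merge`.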
This project implements the workflow composer agent, which enables a 5-phase pipeline from natural language research questions to executed workflows. Note that the composer only provides a plan for phases 3 and 5; these phases must be executed by workflow execution agents on the target system.
flowchart LR
subgraph Step1["1. INTERPRET"]
A["Research Question"]
end
subgraph Step2["2. PLAN"]
B["Advisory Plan +<br/>Estimated Workflow"]
end
subgraph Step3["3. EXTRACT"]
C["Data via tabix"]
end
subgraph Step4["4. GENERATE"]
D["workflow.json"]
end
subgraph Step5["5. EXECUTE"]
E["HyperFlow"]
end
A --> B --> C --> D --> E
style A fill:#e1f5fe
style B fill:#fff9c4
style C fill:#fff3e0
style D fill:#e8f5e9
style E fill:#f3e5f5
| Phase | Description |
|---|---|
| INTERPRET | Parse natural language research question into structured intent |
| PLAN | Create advisory plan with estimated workflow for validation |
| EXTRACT | Acquire genomic data via tabix remote extraction |
| GENERATE | Create final workflow.json from actual data counts |
| EXECUTE | Run workflow with HyperFlow + Docker workers |
See workflow-composer/README.md for detailed phase documentation.
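The split of responsibilities across the five phases can be summarized in a few lines of Python. This is an illustrative sketch only: the `Phase` enum, the `DELEGATED` set, and the labels `"composer"` and `"execution-agent"` are hypothetical names and are not part of the workflow-composer API.

```python
from enum import Enum


class Phase(Enum):
    INTERPRET = 1
    PLAN = 2
    EXTRACT = 3
    GENERATE = 4
    EXECUTE = 5


# The composer only plans phases 3 and 5; a workflow execution agent
# runs them on the target system.
DELEGATED = {Phase.EXTRACT, Phase.EXECUTE}


def executor_for(phase):
    """Return a (hypothetical) label for who carries out the given phase."""
    return "execution-agent" if phase in DELEGATED else "composer"


def pipeline_plan():
    """List the phases in order together with who carries each one out."""
    return [(phase.name, executor_for(phase)) for phase in Phase]


if __name__ == "__main__":
    for name, executor in pipeline_plan():
        print(f"{name}: {executor}")
```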
1000genome-workflow/
├── workflow-composer/ # Native workflow generator + MCP server (recommended)
├── worker-base-image/ # Base Docker image with analysis scripts
├── worker-image/ # HyperFlow worker image (Kubernetes)
├── workflow-generator/ # Legacy DAG generation (Pegasus-based)
├── data-container/ # Input data + workflow.json (~1.7GB image)
├── tests/integration/ # Integration tests with Docker Compose
├── scripts/ # Utility scripts
└── fargate/ # AWS Fargate-specific components (legacy)
| Image | Description |
|---|---|
| hyperflowwms/1000genome-worker-base | Base image with Python analysis scripts |
| hyperflowwms/1000genome-worker | HyperFlow worker with job-executor |
| hyperflowwms/1000genome-generator | Workflow DAG generator |
| hyperflowwms/1000genome-mcp | MCP server for AI integration |
| hyperflowwms/1000genome-data | Input data (VCF + annotations, ~1.7GB) |
The workflow requires input data (~1.7GB) including VCF files, annotation files, and population sample lists.
Note: Annotation files (~1.2GB) are not stored in the git repository due to size. Use one of the following methods:
# Prepare input data in a local directory
docker run --rm -v $(pwd)/input-data:/mnt/data hyperflowwms/1000genome-data:1.0 sh /prepare_data.sh

# Alternatively, download the annotations and build the data image locally
cd data-container
./download_annotations.sh # Downloads ~1.2GB from 1000 Genomes FTP
make image # Build data image locally

See data-container/README.md for details.
# Build all images
make build-all
# Build individual images
make build-worker-base
make build-worker
make build-generator
make build-mcp
make build-data
# Push all images to Docker Hub
make push-all

The workflow-composer package generates HyperFlow workflows natively in Python:
# Install
pip install -e workflow-composer
# Generate workflow from data.csv
workflow-composer generate \
  --data-csv workflow-generator/data.csv \
  --populations-dir workflow-generator/data/populations \
  --parallelism medium \
  --output workflow.json

Parallelism presets:
| Preset | Jobs per chromosome | Use case |
|---|---|---|
| small | 10 | Testing, small regions |
| medium | 50 | Standard analysis |
| large | 250 | Full genome |
See workflow-composer/README.md for details.
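To get a feel for how a preset translates into job counts, the sketch below multiplies the jobs-per-chromosome value from the table by the number of chromosomes and splits a chromosome's rows into contiguous chunks. It assumes that each job processes one contiguous row range of a VCF file; the helper names `individuals_jobs` and `chunk_ranges` are hypothetical and are not taken from workflow-composer.

```python
# Jobs per chromosome for each preset, as listed in the table above.
PRESETS = {"small": 10, "medium": 50, "large": 250}


def individuals_jobs(preset, n_chromosomes):
    """Total number of parallel jobs across all chromosomes."""
    return PRESETS[preset] * n_chromosomes


def chunk_ranges(total_rows, n_jobs):
    """Split rows 1..total_rows into at most n_jobs contiguous (start, end) ranges.

    Assumption: every job handles one contiguous, 1-based, inclusive row range.
    """
    if n_jobs <= 0 or total_rows < 0:
        raise ValueError("n_jobs must be positive and total_rows non-negative")
    base, extra = divmod(total_rows, n_jobs)
    ranges = []
    start = 1
    for index in range(n_jobs):
        size = base + (1 if index < extra else 0)  # spread the remainder evenly
        if size == 0:
            break  # fewer rows than jobs: stop creating empty chunks
        ranges.append((start, start + size - 1))
        start += size
    return ranges


if __name__ == "__main__":
    print(individuals_jobs("medium", 2))
    print(chunk_ranges(25, PRESETS["small"]))
```

Spreading the remainder over the first chunks keeps chunk sizes within one row of each other, so no single job becomes a straggler because of uneven splitting.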
make generate

Or manually:
cd workflow-generator
docker build -t hyperflowwms/1000genome-generator .
docker run --rm -v $(pwd)/../data:/output hyperflowwms/1000genome-generator \
  sh -c "cd /1000genome-workflow && ./generate_workflow.sh && cp workflow.json /output/"

The MCP server enables AI-assisted workflow generation. See workflow-composer/README.md for details.
Add to Claude Desktop configuration:
{
  "mcpServers": {
    "1000genome": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "hyperflowwms/1000genome-mcp:2.0"]
    }
  }
}

Run integration tests to validate the complete pipeline:
cd tests/integration
# Test with micro dataset (fast, ~2-3 minutes)
./test-workflow-composer.sh --parallelism small --yes
# Test with real 1000 Genomes data via tabix
./test-hla-region.sh --quick --yes

See tests/integration/README.md for detailed documentation on the end-to-end workflow execution pipeline.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
This project is a HyperFlow port of the Pegasus 1000genome-workflow, originally developed by the University of Southern California. The workflow generator and Pegasus DAX libraries are used under the Apache License 2.0.