1000genome-workflow

HyperFlow port of the Pegasus 1000genome workflow for identifying mutational overlaps using data from the 1000 Genomes Project.

Overview

This workflow analyzes genetic variation data from the 1000 Genomes Project to identify mutational overlaps across populations. It processes VCF files for multiple chromosomes through a series of analysis steps including individual processing, sifting, mutation overlap detection, and frequency calculation.

flowchart LR
    subgraph Input
        VCF["VCF Files<br/>(variants)"]
        ANN["Annotations<br/>(SIFT scores)"]
        POP["Population<br/>files"]
    end

    subgraph Workflow["HyperFlow Workflow"]
        IND["individuals<br/>×N parallel"]
        MRG["merge"]
        SFT["sifting"]
        MUT["mutation_overlap<br/>×7 populations"]
        FRQ["frequency<br/>×7 populations"]
    end

    subgraph Output
        TAR["Analysis<br/>results"]
        PLT["Plots"]
    end

    VCF --> IND --> MRG --> MUT & FRQ --> TAR & PLT
    ANN --> SFT --> MUT
    POP --> MUT & FRQ
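
The stage ordering implied by the diagram can be sketched in a few lines of Python. This is only an illustration: the `DEPENDS_ON` table is transcribed by hand from the arrows above (input files are left out), and `execution_waves` is a hypothetical helper, not part of HyperFlow or workflow-composer.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Stage -> stages it waits for, copied from the diagram (assumption: inputs omitted).
DEPENDS_ON = {
    "individuals": set(),
    "merge": {"individuals"},
    "sifting": set(),
    "mutation_overlap": {"merge", "sifting"},
    "frequency": {"merge"},
}


def execution_waves(depends_on):
    """Group stages into waves; stages in the same wave can run concurrently."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


WAVES = execution_waves(DEPENDS_ON)
for number, stages in enumerate(WAVES, start=1):
    print(number, ", ".join(stages))
```

At this level of detail, `sifting` has no upstream stage and can start alongside `individuals`, while `mutation_overlap` and `frequency` both wait for `merge`.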

End-to-End Pipeline

This project implements the workflow composer agent, which enables a five-phase pipeline from natural-language research questions to executed workflows. Note that the composer only provides a plan for phases 3 (EXTRACT) and 5 (EXECUTE); those phases must be carried out by workflow execution agents on the target system.

flowchart LR
    subgraph Step1["1. INTERPRET"]
        A["Research Question"]
    end

    subgraph Step2["2. PLAN"]
        B["Advisory Plan +<br/>Estimated Workflow"]
    end

    subgraph Step3["3. EXTRACT"]
        C["Data via tabix"]
    end

    subgraph Step4["4. GENERATE"]
        D["workflow.json"]
    end

    subgraph Step5["5. EXECUTE"]
        E["HyperFlow"]
    end

    A --> B --> C --> D --> E

    style A fill:#e1f5fe
    style B fill:#fff9c4
    style C fill:#fff3e0
    style D fill:#e8f5e9
    style E fill:#f3e5f5

Phase	Description
INTERPRET	Parse natural language research question into structured intent
PLAN	Create advisory plan with estimated workflow for validation
EXTRACT	Acquire genomic data via tabix remote extraction
GENERATE	Create final workflow.json from actual data counts
EXECUTE	Run workflow with HyperFlow + Docker workers

See workflow-composer/README.md for detailed phase documentation.
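
As a rough illustration of the split in responsibilities described above, the phases can be modelled as an enumeration. The names `Phase`, `DELEGATED`, and `responsibility` are hypothetical and chosen for this sketch; they are not taken from the workflow-composer code.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The five pipeline phases, numbered as in the table above."""
    INTERPRET = 1
    PLAN = 2
    EXTRACT = 3
    GENERATE = 4
    EXECUTE = 5


# Phases the composer only plans; an execution agent on the target system runs them.
DELEGATED = frozenset({Phase.EXTRACT, Phase.EXECUTE})


def responsibility(phase):
    """Return who carries out the given phase (hypothetical labels)."""
    return "execution agent" if phase in DELEGATED else "composer"


SUMMARY = [(phase.value, phase.name, responsibility(phase)) for phase in Phase]
for number, name, owner in SUMMARY:
    print(f"{number}. {name}: {owner}")
```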

Repository Structure

1000genome-workflow/
├── workflow-composer/      # Native workflow generator + MCP server (recommended)
├── worker-base-image/      # Base Docker image with analysis scripts
├── worker-image/           # HyperFlow worker image (Kubernetes)
├── workflow-generator/     # Legacy DAG generation (Pegasus-based)
├── data-container/         # Input data + workflow.json (~1.7GB image)
├── tests/integration/      # Integration tests with Docker Compose
├── scripts/                # Utility scripts
└── fargate/                # AWS Fargate-specific components (legacy)

Docker Images

Image	Description
`hyperflowwms/1000genome-worker-base`	Base image with Python analysis scripts
`hyperflowwms/1000genome-worker`	HyperFlow worker with job-executor
`hyperflowwms/1000genome-generator`	Workflow DAG generator
`hyperflowwms/1000genome-mcp`	MCP server for AI integration
`hyperflowwms/1000genome-data`	Input data (VCF + annotations, ~1.7GB)

Input Data

The workflow requires input data (~1.7GB) including VCF files, annotation files, and population sample lists.

Note: Annotation files (~1.2GB) are not stored in the git repository due to size. Use one of the following methods:

Option 1: Use the data image (recommended)

# Prepare input data in a local directory
docker run --rm -v "$(pwd)/input-data:/mnt/data" hyperflowwms/1000genome-data:1.0 sh /prepare_data.sh

Option 2: Download annotation files manually

cd data-container
./download_annotations.sh    # Downloads ~1.2GB from 1000 Genomes FTP
make image                   # Build data image locally

See data-container/README.md for details.

Building

# Build all images
make build-all

# Build individual images
make build-worker-base
make build-worker
make build-generator
make build-mcp
make build-data

# Push all images to Docker Hub
make push-all

Generating Workflows

Using workflow-composer (recommended)

The workflow-composer package generates HyperFlow workflows natively in Python:

# Install
pip install -e workflow-composer

# Generate workflow from data.csv
workflow-composer generate \
    --data-csv workflow-generator/data.csv \
    --populations-dir workflow-generator/data/populations \
    --parallelism medium \
    --output workflow.json

Parallelism presets:

Preset	Jobs per chromosome	Use case
small	10	Testing, small regions
medium	50	Standard analysis
large	250	Full genome

See workflow-composer/README.md for details.
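
To give a feel for what "jobs per chromosome" means, the sketch below divides one chromosome's variant lines into contiguous ranges, one per job. How workflow-composer actually partitions the data is not described here, so the equal-sized split and the `chunk_ranges` helper are assumptions made for illustration; only the preset values come from the table above.

```python
# Jobs per chromosome for each preset, as listed in the table above.
PRESETS = {"small": 10, "medium": 50, "large": 250}


def chunk_ranges(total_lines, preset):
    """Split lines 1..total_lines into near-equal (start, stop) ranges.

    Assumption: one contiguous, inclusive range per job. Fewer ranges are
    returned when there are fewer lines than jobs.
    """
    jobs = PRESETS[preset]
    base, extra = divmod(total_lines, jobs)
    ranges = []
    start = 1
    for index in range(jobs):
        size = base + (1 if index < extra else 0)  # spread the remainder
        if size == 0:
            break
        ranges.append((start, start + size - 1))
        start += size
    return ranges


RANGES = chunk_ranges(1000, "small")
print(len(RANGES), RANGES[0], RANGES[-1])
```

With 1000 variant lines and the `small` preset, each of the 10 jobs would receive 100 lines; the `large` preset would split the same input into 250 ranges of 4 lines each.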

Using Docker (legacy)

make generate

Or manually:

cd workflow-generator
docker build -t hyperflowwms/1000genome-generator .
docker run --rm -v "$(pwd)/../data:/output" hyperflowwms/1000genome-generator \
    sh -c "cd /1000genome-workflow && ./generate_workflow.sh && cp workflow.json /output/"

MCP Server

The MCP server enables AI-assisted workflow generation. See workflow-composer/README.md for details.

Add to Claude Desktop configuration:

{
  "mcpServers": {
    "1000genome": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "hyperflowwms/1000genome-mcp:2.0"]
    }
  }
}
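
If a configuration file already defines other MCP servers, the entry has to be merged in rather than pasted over the file. A minimal sketch of that merge follows. The `add_mcp_server` helper is hypothetical, and it works on the file's text because the location of the configuration file depends on the installation and is not given here.

```python
import json

# Server entry copied from the configuration snippet above.
SERVER_ENTRY = {
    "command": "docker",
    "args": ["run", "-i", "--rm", "hyperflowwms/1000genome-mcp:2.0"],
}


def add_mcp_server(config_text, name="1000genome"):
    """Return config JSON with the server entry added, keeping existing servers."""
    config = json.loads(config_text) if config_text.strip() else {}
    config.setdefault("mcpServers", {})[name] = SERVER_ENTRY
    return json.dumps(config, indent=2)


# Example: an existing config that already declares one (made-up) server.
EXISTING = '{"mcpServers": {"other": {"command": "other-server"}}}'
MERGED = add_mcp_server(EXISTING)
print(MERGED)
```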

Integration Tests

Run integration tests to validate the complete pipeline:

cd tests/integration

# Test with micro dataset (fast, ~2-3 minutes)
./test-workflow-composer.sh --parallelism small --yes

# Test with real 1000 Genomes data via tabix
./test-hla-region.sh --quick --yes

See tests/integration/README.md for detailed documentation on the end-to-end workflow execution pipeline.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

This project is a HyperFlow port of the Pegasus 1000genome-workflow, originally developed by the University of Southern California. The workflow generator and Pegasus DAX libraries are used under the Apache License 2.0.
