Pangaea

Pangaea is designed for linked-read assembly or hybrid assembly using short and long reads. It includes (1) short-read binning with a variational autoencoder, (2) multi-thresholding reassembly, and (3) an ensemble of different subassemblies.

Installation

We provide an all-in-one installation script (build.sh) that installs all dependencies, creates the Conda environments, and builds the required tools. Installation typically takes between 20 minutes and 1 hour, depending on your system and internet speed.

Automated Installation

git clone https://github.com/ericcombiolab/Pangaea.git
cd Pangaea
./build.sh
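
Before running the script, it can help to confirm that the command-line tools this README relies on (git, conda, cmake, make) are on your PATH. The helper below is a minimal sketch; `missing_tools` is a hypothetical name and is not part of Pangaea, and build.sh may need further tools not listed here.

```shell
# Hypothetical helper (not part of Pangaea): print every command
# from the argument list that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running `missing_tools git conda cmake make` prints nothing when all four tools are available, and otherwise prints one missing tool per line.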

Step-by-step Installation Details

Set up Conda environments

The script uses mamba (a faster drop-in replacement for conda) to create environments.

Install the main pangaea environment:

# if mamba is not installed, install mamba first
conda install conda-forge::mamba
# this will create conda env pangaea
mamba env create -f environment.yaml

Install the athena-meta environment for Athena:

# this will create conda env athena-meta
mamba create -n athena-meta bioconda::athena_meta

Activate the pangaea environment:
```
conda activate pangaea
```
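
The remaining steps install packages into whichever environment is active, so it may be worth checking that the activation succeeded. A minimal sketch, relying on the CONDA_DEFAULT_ENV variable that Conda exports for the active environment; `in_conda_env` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the named Conda environment is active.
# Conda stores the name of the active environment in CONDA_DEFAULT_ENV.
in_conda_env() {
    [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]
}
```

For example, `in_conda_env pangaea || echo "please run: conda activate pangaea"`.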

Install PyTorch
- Please follow PyTorch's official instructions for your system and hardware. For CPU-only installation, you can use:
```
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
```

Install rph_kmeans

Install the rph_kmeans package from source:

cd third_parties/rph_kmeans
python setup.py install
cd ../../

Build C++ tools

Compile the C++ binaries required by Pangaea:

cd src/cpptools
rm -rf build
mkdir build && cd build
cmake ..
make
cd ../../../

Set up MetaPhlan4 database (optional)
- After running build.sh, you can choose to download the MetaPhlan4 database:
```
# Download the database to ./metaphlan4_DB
./build_db.sh
# Or specify a custom directory (ensure >25GB free space):
./build_db.sh -d [your_preferred_directory]
```
- MetaPhlan4 is used only for automatic selection of the cluster number. Results are not very sensitive to this parameter, so we recommend setting it manually: skip the MetaPhlan4 database step and pass -c [number] when running Pangaea.
- The cluster number is a trade-off: a larger value produces more, smaller read bins (lower complexity per bin), while a smaller value keeps more reads from the same microbe together. We suggest -c 30 for most datasets. For very complex datasets, try -c 35 or -c 40. For mock or simulated data, use a value below the true number of species, or simply a small value (e.g., -c 10).
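
The guidance above can be captured in a small helper for scripted runs. This is only a sketch: `suggest_clusters` and its category names (complex, mock, simulated) are hypothetical, and the returned values are simply the ones suggested in the text.

```shell
# Hypothetical helper: map a rough dataset description to a -c value.
# Values follow the suggestions above; the category names are made up here.
suggest_clusters() {
    case "$1" in
        complex)        echo 35 ;;  # very complex datasets: 35 (or try 40)
        mock|simulated) echo 10 ;;  # mock or simulated data: a small value
        *)              echo 30 ;;  # suggested value for most datasets
    esac
}
```

The result can be passed straight to the -c option, for example `-c "$(suggest_clusters mock)"`.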

Running Pangaea

The run_pangaea executable is located in the src directory.

Usage: ./src/run_pangaea [OPTIONS]
Required arguments:
  -s, --short_type <string>       Short reads type: short, stlfr, tellseq, 10x
                                    If 'short', hybrid assembly is performed and -l, -H, and -p are required.
                                    If 'stlfr', 'tellseq', or '10x', linked assembly is performed and -l, -H, and -p must NOT be set.
                                    For 'stlfr' or 'tellseq', original reads can be directly provided to Pangaea without preprocessing.
                                    For '10x', please ensure barcodes are in the BX:Z tag of the read headers (this can be done using 'longranger basic', followed by deinterleaving the reads).
  -r, --short_R1 <file>           Short reads R1 file
  -R, --short_R2 <file>           Short reads R2 file
  -I, --index <file>              Barcode index for Tell-Seq (required if -s is 'tellseq'; this file is provided with the reads)
  -o, --output_dir <dir>           Output directory (required)
Hybrid assembly (required if -s is 'short'):
  -m, --metaphlan_db <file>       Metaphlan database for species detection (required if -c is 'metaphlan')
  -l, --longreads <file>          Long reads file
  -H, --hybrid_asm <string>       Hybrid assembler: hybridspades, metaplatanus (default: hybridspades)
  -p, --longreads_type <string>   Long reads type: pacbio or nanopore
Optional arguments:
  -c, --cluster <int>             Number of clusters for read binning (default: 30; input metaphlan to detect species number by metaphlan)
  -t, --threads <int>             Number of threads to use (default: 50; applied to all tools that support it)
  -h, --help                      Show this help message and exit
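
The rules that tie -s to the long-read options are easy to get wrong in wrapper scripts, so a pre-flight check may be useful. The function below is a sketch that mirrors the usage text, not Pangaea's own validation; `check_mode` is a hypothetical name. The usage text lists -H as required for hybrid assembly, but it also has a default and the hybrid example omits it, so this sketch checks only -l and -p.

```shell
# Hypothetical pre-flight check (not part of Pangaea).
# Arguments: $1 = value of -s, $2 = value of -l (or empty),
#            $3 = value of -p (or empty), $4 = value of -I (or empty).
check_mode() {
    case "$1" in
        short)
            # Hybrid assembly: long reads and their type must be given.
            [ -n "$2" ] && [ -n "$3" ] || {
                echo "hybrid assembly needs -l and -p" >&2
                return 1
            }
            ;;
        stlfr|tellseq|10x)
            # Linked-read assembly: long-read options must not be set.
            [ -z "$2" ] && [ -z "$3" ] || {
                echo "linked-read assembly must not set -l or -p" >&2
                return 1
            }
            # Tell-Seq additionally needs the barcode index file.
            if [ "$1" = tellseq ] && [ -z "$4" ]; then
                echo "tellseq needs -I" >&2
                return 1
            fi
            ;;
        *)
            echo "unknown short reads type: $1" >&2
            return 1
            ;;
    esac
}
```

For example, `check_mode short long.fastq.gz pacbio ""` succeeds, while `check_mode 10x long.fastq.gz pacbio ""` fails with a message on stderr.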

The assembled contigs will be in output_dir/final_asm.fa.
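
A quick sanity check after a run is to count the contigs in that file. The sketch below relies only on the FASTA convention that each record begins with a `>` header line; `count_contigs` is a hypothetical helper name.

```shell
# Hypothetical helper: count FASTA records by counting '>' header lines.
count_contigs() {
    grep -c '^>' "$1"
}
```

With `-o pangaea`, as in the examples, the call would be `count_contigs pangaea/final_asm.fa`.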

Example of running linked-read assembly

cd example/linked_reads_example
../../src/run_pangaea -s 10x -r atcc_short_R1.fastq.gz -R atcc_short_R2.fastq.gz -o pangaea
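
When using your own 10x data, you may want to confirm that the barcodes are in the BX:Z tag of the read headers before starting a run. The check below is a sketch that assumes a gzip-compressed FASTQ whose first header line contains the literal text `BX:Z:`; `has_bx_tag` is a hypothetical helper name, and it inspects only the first read.

```shell
# Hypothetical helper: succeed if the first header line of a gzipped
# FASTQ file contains a BX:Z: barcode tag.
has_bx_tag() {
    gzip -dc < "$1" | head -n 1 | grep -q 'BX:Z:'
}
```

For example, `has_bx_tag atcc_short_R1.fastq.gz || echo "no BX:Z tag found in first read"`.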

Example of running hybrid assembly

cd example/hybrid_example
../../src/run_pangaea -s short -r ../linked_reads_example/atcc_short_R1.fastq.gz -R ../linked_reads_example/atcc_short_R2.fastq.gz -l atcc_longreads_small.fastq.gz -p pacbio -o pangaea
