-
Notifications
You must be signed in to change notification settings - Fork 216
Basic Operations
It is useful to think of vg as a toolbox for manipulating the internal variation graph representation. Almost any goal can be expressed in terms of operations on the graph; while developing this way of thinking initially requires some mental gymnastics, in the end it makes the user's life simpler by abstracting their problem out of the biological context and into a more general one. For example, variant calling can be expressed as the series of construct, index, giraffe (map), and call operations, with inputs and outputs along the way.
# Construct
vg autoindex -r small/x.fa -v small/x.vcf.gz -w sr-giraffe -p graph
# Map
vg giraffe -f small/x.fa_1.fastq -Z graph.giraffe.gbz > mapped.gam
# Convert (to a more flexible format)
vg convert graph.giraffe.gbz -p > graph.pg
# Note where reads land in graph
vg augment graph.pg mapped.gam -m 4 -q 5 -Q 5 -A mapped.aug.gam > graph.aug.pg
# Read pileup/depth
vg pack -x graph.aug.pg -g mapped.aug.gam -o graph.aug.pack
# Call
vg call graph.aug.pg -k graph.aug.pack > mapped.vcf
Let's walk through a basic vg pipeline like the one above step-by-step to help illustrate what all it can do.
You probably don't need to construct your own graph (though if you do, see our page with Construction Examples) if you're working with human data, since the HPRC already provides a high-quality human pangenome reference. However, if you do need to make a graph, you can cut through a lot of fiddly bits by using our all-in-one vg autoindex command.
vg autoindex will take a reference genome (-r), and optionally VCFs (-v, variants) and/or GFFs (-x/-H, transcription annotations), and produce everything you need for downstream use. Depending on your downstream use, choose --workflow to be:
-
mpmapfor short-read multipath (e.g. RNA-seq) mapping withvg mpmap -
sr-giraffefor normal short-read alignment withvg giraffe -
lr-giraffefor normal long-read alignment withvg giraffe -
samplingfor producing indexes for pangenome personalization ("haplotype sampling") withvg haplotypes
If you provide only a reference genome without any variation, you'll end up with a linear reference "stick graph". While vg will operate on this graph, it wouldn't be particularly interesting (bwa could do the same things).
Okay, we've now created a graph and generated our indices for mapping reads, so let's map some. For most use-cases we recommend vg giraffe. The exact syntax to use depends on whether your input reads are paired, and, if they are paired, how they are stored.
## Single end reads in a single FASTQ
vg giraffe -f small/x.fa_1.fastq -Z graph.giraffe.gbz > mapped.gam
## Paired end reads interleaved in a single FASTQ
vg giraffe -f small/x.fa_1.fastq -Z graph.giraffe.gbz -i > mapped.gam
## Paired end reads in paired FASTQs
vg giraffe -f small/x.fa_1.fastq -f small/x.fa_2.fastq -Z graph.giraffe.gbz > mapped.gam
The mappings are output in GAM format, an analogue of BAM for graphs. Statistics from the GAM can be pulled for QC by vg filter: Getting alignment statistics with vg filter.
We recommend using DeepVariant for variant calling. The vg call pipeline presented here is just a simple example pipeline.
There are two variant callers in vg. The vg call variant caller relies on a samtools pileup style structure, but is more well-tuned and highly tested, and is currently the recommended variant caller. There is also vg genotype, which uses a more FreeBayes-like approach to variant calling, but it is harder to get good results with.
vg call deserves its own wiki page, so we'll just show some quick commands here. Briefly the graph is first augmented with variants suggested by the read pileup using vg augment. vg call then uses this new graph and the statistics about the read pileup to generate variant calls.
vg augment graph.pg mapped.gam -m 4 -q 5 -Q 5 -A mapped.aug.gam > graph.aug.pg
vg pack -x graph.aug.pg -g mapped.aug.gam -o graph.aug.pack
vg call graph.aug.pg -r graph.aug.snarls -k graph.aug.pack > mapped.vcf
At this point, you've gone from a reference genome, some population variation and a set of reads to final variant calls. Next, you can read Index Construction for more detailed information on how indexes are created, or Working with a whole genome variation graph for a walkthrough of how to construct, index, and align to a genome-scale graph. We've also introduced a bunch of file formats; see File Types for more information on them.