R package to visualize TCR specificity profiles.
MixTCRviz can be used freely by academic groups for non-commercial purposes (see license). The product is provided free of charge, and, therefore, on an "as is" basis, without warranty of any kind.
FOR-PROFIT USERS: If you plan to use MixTCRviz or any data provided with the script in any for-profit application, you are required to obtain a separate license. To do so, please contact Nadette Bulgin (nbulgin@lcr.org) at the Ludwig Institute for Cancer Research Ltd.
For scientific questions, please contact David Gfeller (David.Gfeller@unil.ch).
Copyright (2024) David Gfeller
MixTCRviz is a tool to visualize important properties of a set of human or mouse TCRs (e.g., epitope-specific TCRs). It focuses on V usage, J usage, CDR3 length distribution and CDR3 motifs, for both the alpha and the beta chains. These define the so-called TCR specificity profiles.
By default, properties of the input TCRs are compared to those of "baseline" TCR repertoires. Alternatively, users can choose to compare with another set of TCRs ("input2").
TCR specificity profiles enable users to rapidly visualize and understand what are the key determinants of specificity within a set of TCRs, such as those binding to a given epitope.
- Download the ggseqlogoMOD package from https://github.com/GfellerLab/ggseqlogo and install it (this is needed even if you already have the standard ggseqlogo package).
- Download the MixTCRviz directory from the GitHub page and open RStudio with its working directory set to the MixTCRviz folder. You can then build and install the package:
```r
# Build the source tarball (written to the parent directory), then install it
devtools::build()
install.packages("../MixTCRviz_1.0.tar.gz", repos = NULL)
```
You may be prompted to install some packages (e.g., ggplot2).
In the MixTCRviz directory run:
Rscript test_MixTCRviz.R
Alternatively, you can run the code in test_MixTCRviz.R from RStudio (or any R interface).
The output in test/out should be the same as in test/out_compare.
The test_MixTCRviz.R script also contains examples of how to use MixTCRviz with a specific baseline or with V/J allele information (see below for more explanations).
MixTCRviz should primarily be run in R, by loading the MixTCRviz library and calling the MixTCRviz function (e.g., MixTCRviz(input1="test/test.csv", output.path=YOUR_OUTPUT_PATH)).
A python wrapper is also available.
- input1: one of the following:
  - A .csv, .txt or .tsv file with the input TCRs
  - A data.frame with the input TCRs
  - A list in the MixTCRviz format
If using a filename or a data.frame:
- Columns should ideally consist of "TRAV","TRAJ","cdr3_TRA","TRBV","TRBJ","cdr3_TRB". Alternatively, the option infer.VJ can be used to retrieve V and J genes from full TCR sequences (provided in "TCRa" and "TCRb" columns).
- If TCRs from multiple experiments/epitopes/classes/... are provided in the same file, the "model" column should indicate the models (e.g., one per epitope). "Model_default" is used if no "model" column is provided and model.default=NULL (see below). Multiple models are only supported in input1.
- If TCRs from multiple species are provided in the same file, the "species" column should indicate the species of each TCR ("HomoSapiens" or "MusMusculus", with "HomoSapiens" being the default if there is no "species" column and species.default=NULL).
- The "TRAV", "TRAJ", "TRBV", "TRBJ" entries should follow the IMGT nomenclature, with or without allele (see below for potential name correction).
- The "cdr3_TRA" and "cdr3_TRB" columns should provide CDR3A/CDR3B sequences, following the standard definition (e.g., CAVNSDGQKLLF). Sequences with non-amino acid characters, or with length < 7 or > 22, will not be considered (i.e., set to NA).
- Other formats are supported (see below)
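As a minimal sketch of the default layout described above, the following builds a small data.frame with the expected column names. All TCR values are made up for illustration, and `clean_cdr3` is a hypothetical helper (not part of MixTCRviz) that only previews which CDR3 entries would fall outside the documented rule (amino acid characters only, length between 7 and 22).

```r
# Toy input in the default MixTCRviz column layout (all values are made up)
tcr <- data.frame(
  TRAV = c("TRAV12-2", "TRAV21", "TRAV27"),
  TRAJ = c("TRAJ42", "TRAJ33", "TRAJ37"),
  cdr3_TRA = c("CAVNSDGQKLLF", "CAVX*F", "CAGGGSSNTGKLIF"),
  TRBV = c("TRBV6-5", "TRBV19", "TRBV19"),
  TRBJ = c("TRBJ2-7", "TRBJ2-7", "TRBJ1-2"),
  cdr3_TRB = c("CASSIRSSYEQYF", "CASSF", "CASSIRSAYGYTF"),
  model = "A0201_GILGFVFTL",
  species = "HomoSapiens",
  stringsAsFactors = FALSE
)

# Hypothetical helper: set to NA any CDR3 with non-amino acid characters
# or with a length outside [7, 22], mirroring the rule stated above
clean_cdr3 <- function(x) {
  ok <- grepl("^[ACDEFGHIKLMNPQRSTVWY]+$", x) & nchar(x) >= 7 & nchar(x) <= 22
  x[!ok] <- NA
  x
}

# Preview which entries would not be considered for each chain
dropped_TRA <- is.na(clean_cdr3(tcr$cdr3_TRA))
dropped_TRB <- is.na(clean_cdr3(tcr$cdr3_TRB))

# With the package installed, the data.frame can be passed directly:
# library(MixTCRviz)
# m <- MixTCRviz(input1 = tcr, output.path = "out")
```

MixTCRviz applies its own filtering, so this pre-check is optional; it is only meant to show what the input requirements look like on real columns.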
Below are some of the most commonly used parameters. Full documentation of the other parameters is available in the R package.
- output.path: name of the output directory (created if it does not already exist). Existing files with the same name will be overwritten. It can be left empty ONLY with the option return.object=2.
- input2 (default=NULL): .csv file or data.frame containing a second set of TCRs to be used in comparisons. Same format as input1, except that all TCRs are assumed to come from a single model, so any data in the "model" column is ignored. If input2 is provided, the comparison is performed with this second input, and not with the baseline repertoire. In this case, renormVJ is FALSE by default, which is convenient, for instance, to compare TCRs binding to two different epitopes. If input2 is meant to represent a specific baseline repertoire (e.g., from the same donor as the data provided in input1), users should set renormVJ=TRUE. In this case the comparison of the CDR3 length and CDR3 motifs takes into account the V/J usage of input1. Besides a .csv filename or a data.frame, input2 can also be a list or an .rds file with precomputed statistics, such as the .rds files generated by MixTCRviz in output.path/stats/.
- filename.output (default=NULL): name for the output file. If NULL, the value(s) in the 'model' column of input1 are used (i.e., input1$model). If no 'model' column is provided, the value in model.default is used.
- plot.oneline (default=0):
  - 0: Show the data on two lines (better for clarity).
  - 1: Show all plots in a single line (can be useful to compare different models).
  - 2: Show only V/J usage and length (i.e., do not show CDR3 motifs for the most frequent CDR3 length).
- interactive.plots (default=F):
  - F: Do not create an html file with interactive plots.
  - T: Create an html file with interactive plots. Only applicable if return.object != 2 and plot=T.
- renormVJ (default=T, unless data are provided in input2):
  - T: The comparison with the baseline repertoire (or the repertoire provided in input2) considers the V/J usage of input1.
  - F: The comparison with the baseline repertoire does not consider the V/J usage of input1.
- plot.all.length (default=F):
  - F: Plot the CDR3 motifs for the dominant length in input1.
  - T: Plot V usage, J usage and CDR3 motifs for all CDR3 lengths with enough TCRs (at least 5 TCRs, or at least 5% if the $countL object corresponds to normalized probabilities).
- use.allele (default=F):
  - F: Do not show the different alleles of V/J genes (e.g., TRAV12-2*01 and TRAV12-2*02). This is the recommended option, since allele information is not always reliable and establishing a correct baseline at the allele level is challenging.
  - T: Keep information about the allele of V/J genes. If this option is selected, users are strongly encouraged to provide a separate baseline in input2 from the same donor as the data in input1, since comparison with the default baseline can be misleading due to different V/J alleles across donors.
- model.default: Name of the model if no "model" column is given in input1. This determines the name used in the output files.
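As an illustration of how the parameters above combine, here is a hedged sketch that collects them in a list and calls MixTCRviz only if the package is installed and the test file shipped with the repository is present. The parameter names come from the list above; the chosen values and the temporary output directory are arbitrary.

```r
# Arguments taken from the parameter list above; the values are illustrative
args <- list(
  input1 = "test/test.csv",                            # test file from the repository
  output.path = file.path(tempdir(), "MixTCRviz_out"), # arbitrary output directory
  plot.oneline = 1,         # all plots on a single line
  plot.all.length = TRUE,   # also plot motifs for all CDR3 lengths with enough TCRs
  renormVJ = TRUE,          # account for the V/J usage of input1
  use.allele = FALSE        # gene-level analysis (recommended)
)

# Run only when the package and the test file are available
if (requireNamespace("MixTCRviz", quietly = TRUE) && file.exists(args$input1)) {
  m <- do.call(MixTCRviz::MixTCRviz, args)
}
```

Keeping the arguments in a list makes it easy to rerun the same analysis with a single setting changed, for example adding an `input2` entry to compare against a second TCR set.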
If the output of MixTCRviz is assigned to a variable (e.g., m <- MixTCRviz(input1="test/test.csv")), MixTCRviz returns a list with the plots, the processed data and the statistics for each model in input1.
If output.path is given, MixTCRviz also creates a directory (output.path). The output.path/ directory contains the TCR specificity profiles (e.g., pdf files) for each model.
- If interactive.plots==T, interactive html files are also created, and you can mouse over the V/J genes to see the names, the Z-score, the logFC and the CDR3 sequences.
- If output.stat==T, the output.path/stats/ contains .rds files with all the stats for each model.
- If output.processed.data==T, the output.path/processed_data/ contains .csv files with the actual data used to build the TCR specificity profiles.
- If plot.logo.length==1, the output.path/CDR3_length/ directory shows the V/J usage and CDR3 motifs for multiple lengths for both chains.
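The statistics saved under output.path/stats/ can be reloaded in a later session. The sketch below assumes only what is stated above (one or more .rds files in that directory); `read_stats` is a hypothetical helper name, and the internal structure of each loaded object is not described here.

```r
# Hypothetical helper: load every .rds file found in <output.path>/stats/
# and return them as a list named after the files (without extension)
read_stats <- function(output.path) {
  files <- list.files(file.path(output.path, "stats"),
                      pattern = "\\.rds$", full.names = TRUE)
  stats <- lapply(files, readRDS)
  names(stats) <- tools::file_path_sans_ext(basename(files))
  stats
}

# Usage, once MixTCRviz has been run with output.stat=T:
# stats <- read_stats("YOUR_OUTPUT_PATH")
# names(stats)
```

According to the description of input2, such an .rds file (or the list it contains) can also be supplied as input2 in a later comparison.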
By default, MixTCRviz uses column names c("TRAV","TRAJ","cdr3_TRA","TRBV","TRBJ","cdr3_TRB") to define a TCR, "species" to indicate the species and "model" to define groups of TCRs (e.g., binding to the same epitope).
- For single-chain data, only one chain can be provided. In those cases, it is recommended to specify the chain with chain="A" or chain="B".
- "model" can be skipped, in which case all TCRs will be analyzed together and the output file will take the name given in model.default (default="Model_default").
- "species" can be skipped, in which case all TCRs are assumed to come from the same species given in species.default (default="HomoSapiens"). If you have data coming from multiple species, you need to have the "species" column.
- Other column names are supported, including "Va", "V_alpha", "CDR3a", "CDR3A", "CDR3_alpha", etc. for data with both chains, or "V", "v_gene", "V-region", "aaSeqCDR3", "CDR3", etc. for single-chain data. See data_raw/TidyVJ/mapping_colnames.csv for the full list.
Other supported formats treating each chain in a different row include:
- VDJdb with the columns: c("V", "J", "CDR3")
- 10X Genomics format with columns: c("v_gene", "j_gene", "cdr3")
- Qiagen with the columns: c("V-region", "J-region", "CDR3 amino acid seq")
- Adaptive Biotech with the columns: c("vGeneName", "jGeneName", "aminoAcid")
- Adaptive Biotech v4 with the columns: c("v_resolved", "j_resolved", "amino_acid")
- AIRR with the columns: c("v_call", "j_call", "junction_aa")
- MiXCR with the columns: c("allVHitsWithScore", "allJHitsWithScore", "aaSeqCDR3") or c("allVGenes", "allJGenes", "aaSeqCDR3")
By default, both chains are treated independently, without reconstructing clones. To reconstruct alpha-beta clones, you can use build.clones=T. In that case, the input must contain exactly one column indicating the clone ID, labelled "clone_id", "cell_id", "cloneId", "barcode" or "complex.id".
- If working with epitope-specific TCRs, we encourage users to define model names which capture both the epitope sequence and the MHC restriction. Using only epitope sequences as "model" is possible, but can lead to issues when the same epitope is restricted to different MHC.
- V/J genes are key to the TCR specificity profiles in MixTCRviz and only V/J names compatible with the IMGT nomenclature can be considered. Although correct.gene.names==1 can correct several wrong V/J names, we strongly encourage users to use only V/J gene names compatible with IMGT.
- Some V/J genes in IMGT give rise to truncated V segments (e.g., TRAV8-5). All of them are pseudogenes. These are not supported in MixTCRviz and will be set to NA. Other pseudogenes / ORF are shown in grey in the plots.
- TRBV6-2 and TRBV6-3 have exactly the same nucleotide sequence, and therefore cannot be distinguished at the sequencing level. In MixTCRviz, these entries are mapped to a single 'TRBV6-2/6-3' gene.
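To make this mapping concrete, here is a small sketch of how both gene names could be collapsed into the merged label. `merge_trbv6` is a hypothetical helper written for illustration, not the code used inside MixTCRviz, and it assumes that any allele suffix is dropped when merging.

```r
# Hypothetical helper: collapse TRBV6-2 and TRBV6-3 (with or without an
# allele suffix such as *01) into the single label 'TRBV6-2/6-3'
merge_trbv6 <- function(v) {
  sub("^TRBV6-[23](\\*[0-9]+)?$", "TRBV6-2/6-3", v)
}

v_genes <- c("TRBV6-2", "TRBV6-3*01", "TRBV6-5", "TRBV19")
merged <- merge_trbv6(v_genes)
```

Other TRBV6 family members, such as TRBV6-5, are left untouched because the pattern only matches the two indistinguishable genes.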
- In the default setting, error bars on the baseline distributions of V/J segments represent the variability observed across multiple studies, encompassing different sequencing protocols, different centers and different donors.
- PCR / sequencing / TCR reconstruction errors frequently occur in TCR-Seq data. Although the option check.cdr3.mode = 1 can detect some of these errors, we encourage users to carefully check the quality of their CDR3 sequences.
- As with all statistical approaches, limited numbers of TCRs have a big impact on the interpretation of the results. Therefore, it is highly recommended to use sets of TCRs with enough sequences to be able to interpret the frequencies plotted in MixTCRviz.
- When comparing to the baseline TCR repertoire, we encourage users to use renormVJ=1, so that the comparison of CDR3 length distributions and motifs is not confounded by the specific V/J usage in input1 (i.e., the baseline shows the expected length distributions and CDR3 motifs knowing P(VJ) in input1). This option is often less relevant when comparing two TCR datasets (input1 and input2), which is why renormVJ defaults to 0 when input2 is provided, unless the user explicitly sets renormVJ=1.
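One way to read "expected length distribution knowing P(VJ) in input1" is as a reweighting of the baseline: P(L) = sum over V/J of P_input1(V,J) * P_baseline(L | V,J). The toy sketch below illustrates that idea with V genes only and made-up numbers. It is an assumption about the principle, not the actual MixTCRviz implementation.

```r
# Toy baseline: CDR3 lengths differ between two V genes (made-up numbers)
baseline <- data.frame(
  V = c("TRBV1", "TRBV1", "TRBV1", "TRBV1", "TRBV2", "TRBV2", "TRBV2", "TRBV2"),
  L = c(12, 12, 12, 13, 14, 14, 14, 13)
)
# Toy input1: V usage strongly biased towards TRBV2
input1_V <- c("TRBV2", "TRBV2", "TRBV2", "TRBV1")

# P(L | V) in the baseline: one row per V gene, each row sums to 1
p_L_given_V <- prop.table(table(baseline$V, baseline$L), margin = 1)

# P(V) in input1, using the same V gene order as the rows above
p_V_input1 <- prop.table(table(factor(input1_V, levels = rownames(p_L_given_V))))

# Renormalized baseline: sum_V P_input1(V) * P_baseline(L | V)
p_L_renorm <- colSums(p_L_given_V * as.numeric(p_V_input1))

# Raw baseline length distribution, ignoring the V usage of input1
p_L_raw <- prop.table(table(baseline$L))
```

In this toy case the raw baseline gives equal weight to lengths 12 and 14, whereas the renormalized baseline shifts towards length 14 because input1 mostly uses TRBV2. A length-14 enrichment in input1 would therefore be attributed to V usage rather than to the CDR3 itself.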
- The information about the species is important to choose an appropriate baseline. If you are using the same model (e.g., MHC_peptide) with both human and mouse TCRs, we recommend distinguishing both cases in the model column (e.g., A0201_LLWNGPMAV_HomoSapiens and A0201_LLWNGPMAV_MusMusculus) and indicating the species in the appropriate column.
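Model names of this form can be assembled from separate annotation columns. In the sketch below, the `mhc` and `epitope` columns are hypothetical names for whatever annotation is available in your data; only "model" and "species" are column names used by MixTCRviz.

```r
# Toy annotation table; 'mhc' and 'epitope' are hypothetical column names
tcr <- data.frame(
  mhc = c("A0201", "A0201"),
  epitope = c("LLWNGPMAV", "LLWNGPMAV"),
  species = c("HomoSapiens", "MusMusculus"),
  stringsAsFactors = FALSE
)

# Model name capturing the MHC restriction, the epitope and the species
tcr$model <- paste(tcr$mhc, tcr$epitope, tcr$species, sep = "_")
```

This keeps human and mouse TCRs for the same MHC/peptide pair in separate models, while the "species" column still drives the choice of baseline.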
- When using use.allele=T, it is important to realize that the presence or absence of specific alleles in a given sample also reflects the genetic background of the donor. For this reason, some alleles may appear to be enriched in the input TCRs versus default baseline repertoires, but this enrichment may not be linked to any signal of specificity (e.g., epitope specificity). In addition, determining the correct allele from TCR-Seq data can be challenging, and sequencing errors can easily result in wrong allele calls. We therefore recommend analyzing data at the gene level (use.allele=F) to avoid confounding factors related to genetic background or TCR reconstruction issues. Alternatively, the baseline TCR repertoire can also be sequenced in each patient, and used as input for MixTCRviz (input2 parameter).
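For reference, a gene-level name is simply the IMGT name without its allele suffix. The sketch below shows that reduction with a hypothetical helper, `strip_allele`. It illustrates what gene-level analysis means and is not a required preprocessing step, since use.allele=F is the default.

```r
# Hypothetical helper: drop the IMGT allele suffix (everything from '*' on)
strip_allele <- function(v) {
  sub("\\*.*$", "", v)
}

v_calls <- c("TRAV12-2*01", "TRAV12-2*02", "TRBV19")
v_genes <- strip_allele(v_calls)

# Number of distinct names before and after collapsing alleles
n_alleles <- length(unique(v_calls))
n_genes <- length(unique(v_genes))
```

Here the two TRAV12-2 alleles collapse into one gene, so allele-level differences between donors no longer appear as separate entries.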