Detection of a high-risk DLBCL group

Master's thesis by Lukas Gessl.

The problem

Chemotherapy with R-CHOP is the standard treatment for diffuse large B-cell lymphoma (DLBCL), the most common type of non-Hodgkin lymphoma, and cures about two thirds of patients. Survival for the remaining third, who have refractory or relapsed disease, remains poor. To date, pharma-sponsored randomized trials in the whole DLBCL population have failed to improve on R-CHOP. The International Prognostic Index (IPI), the only widely accepted risk-assessment tool for DLBCL and an easy clinical test, fails to identify a high-risk DLBCL subpopulation that is large enough and identified precisely enough to trigger research and enable clinical trials of new treatments that outperform R-CHOP in this subpopulation.

The solution

This thesis aims to develop a computational method that identifies DLBCL patients with progression-free survival (PFS) below two years with higher prevalence and significantly higher precision than the IPI, and to demonstrate this on independent data. It also asks under which circumstances we can do so reliably. By significantly higher precision, we mean that the 95% confidence interval of the precision of our model must not include the precision of the IPI on independent test data. We develop the models in a train-validation-test split of our data: we fit and validate a range of models on a training set, pick the model that validates best and test it on a test set.
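To make the significance criterion concrete, here is a minimal R sketch. The counts and the IPI precision are made-up numbers for illustration, not results from the thesis. The exact (Clopper-Pearson) interval from `binom.test()` is an assumption, since the text above does not say which interval the thesis uses.

```r
# Illustrative numbers only; they are NOT results from the thesis.
n_flagged <- 60        # test patients the model calls high-risk
n_flagged_true <- 45   # of those, patients with PFS below two years
ipi_precision <- 0.50  # hypothetical precision of the IPI on the same test set

# Precision: share of flagged patients who truly are high-risk.
precision <- n_flagged_true / n_flagged

# 95% confidence interval of the precision. Assumption: exact
# (Clopper-Pearson) binomial interval; the thesis may use another one.
ci <- binom.test(n_flagged_true, n_flagged, conf.level = 0.95)$conf.int

# The criterion from the text: the interval must not include the IPI's
# precision. For a *higher* precision it has to lie entirely above it.
significantly_higher <- ci[1] > ipi_precision

cat(sprintf("precision = %.2f, 95%% CI = [%.2f, %.2f]\n",
            precision, ci[1], ci[2]))
cat("significantly higher than IPI:", significantly_higher, "\n")
```

The comparison only uses the lower bound of the interval because the goal is a precision above that of the IPI, not merely different from it.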

The results

We apply our methods to three different data sets and to a large one that combines all three. We show that we can indeed deliver a model with the desired properties. Analysis after freezing the models and unlocking the test data suggests that the following play a key role for reliable internal validation and high test performance:

data sets with a large number of samples, even if they result from combining somewhat different, partly non-prospective data sets,
relying on already-existing molecular signatures rather than fitting new ones, and
deploying simple generalized linear models that can handle batch effects.

The structure of this repository

data is supposed to hold the data sets: the three we used for this thesis and, in data/all, a large one combining all three.
documents holds the thesis itself and the presentation slides for the progress talk and the final talk. Compiling the slides from source does not work because the included plot files are missing.
models is supposed to hold the models we train and validate as .rds files.
results holds validation and testing results as well as meta analysis in the form of tables and plots.
src holds all the source code to preprocess the data and reproduce the results of this thesis. It plays the central role in this repo, so we dedicate the next section to it.

Reproducing the results

We recommend reading the thesis first.

We moved all reusable code of this thesis into the R package patroklos. It is tailored to this thesis but also applies to a more general class of problems, namely machine-learning projects that aim to develop a model predicting thresholded survival in the classical train-validate-test split.

Prerequisites

Software

We used R version 4.4.1.
Install the latest version of patroklos from GitHub (see there for more). Installing it also ensures that almost all other R packages this repo depends on are installed.
Preprocessing needs the biomaRt package, version >= 2.60.0, from Bioconductor to map all gene names to HGNC symbols.
If you want to use the Fira Sans font by Mozilla in plots, set use_fira_sans <- TRUE in src/assess/config.R, install the font from GitHub and install the sysfonts package from CRAN (we used version 0.8.9). Otherwise set use_fira_sans <- FALSE in src/assess/config.R.
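Before running anything, it can help to check these prerequisites from an R session. The following is only a sketch: the helper `has_min_version()` is a hypothetical name and not part of this repo or of patroklos. The version numbers are the ones listed above.

```r
# Hypothetical helper (not part of this repo): TRUE if `pkg` is installed
# in at least version `min_version`.
has_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= min_version
}

# Report the state of the prerequisites named in the list above.
message("R version: ", getRversion(), " (the thesis used 4.4.1)")
message("patroklos installed: ",
        requireNamespace("patroklos", quietly = TRUE))
message("biomaRt >= 2.60.0: ", has_min_version("biomaRt", "2.60.0"))
# sysfonts is only needed if use_fira_sans is TRUE (the thesis used 0.8.9).
message("sysfonts installed: ",
        requireNamespace("sysfonts", quietly = TRUE))
```

The check only reports what is missing and installs nothing, so it is safe to run on a shared compute server.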

Data

Only one of the three data sets, the Schmitz data, is publicly available. src/prepro/schmitz.R will download it if it is not available locally. To gain access to the two other data sets (and hence to the combined data set), ask the system administrator of the Spang lab, Christian Kohler, for access to our compute servers, where we provide all three data sets on a mounted volume.

All in one run

With the above prerequisites fulfilled, you can now run

Rscript src/run_all.R

in your terminal from the root directory of this repo. In general, all scripts below src are expected to be run with the repo root directory as the current working directory.
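Because every script assumes the repo root as its working directory, a small guard can catch a wrong directory early. The sketch below is hypothetical and not part of the repository. It only relies on the path src/run_all.R mentioned above.

```r
# Hypothetical helper, not part of the repository.
# Returns TRUE if `path` contains src/run_all.R, i.e. looks like the repo root.
is_repo_root <- function(path = getwd()) {
  file.exists(file.path(path, "src", "run_all.R"))
}

# Stops with an informative message when called from the wrong directory.
assert_repo_root <- function(path = getwd()) {
  if (!is_repo_root(path)) {
    stop("Expected the repo root as working directory, got: ", path)
  }
  invisible(TRUE)
}
```

Calling `assert_repo_root()` at the top of an interactive session fails immediately with a clear message, before a relative path lookup fails somewhere deeper in a script.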

The structure of `src`

src/prepro holds the scripts preprocessing the four data sets into Data R6 objects, the data format patroklos works with.
src/models holds the scripts that define all trained and validated models with their hyperparameters for every data set. We do so by initializing a bunch of Model R6 objects, the model format patroklos works with.
src/train holds the scripts that fit the models defined below src/models and validate them Model-internally by calling patroklos::training_camp(). They store the fully trained models with their validated predictions below models.
src/assess holds the scripts that finalize validation, pick the best model according to validation on the respective training cohort and test it on the test cohort. Beyond the error, they calculate a bunch of model properties that show up as tables below results. The AssScalar R6 class from patroklos is the workhorse of this directory.
src/analyze holds the scripts that unfreeze the respective test data for all models to do some meta analysis about the trained models and the validation: reporting the non-zero coefficients of the picked models, plots on thresholding the continuous output of the picked models and plots on validation versus test error for all models. The Ass2d R6 class from patroklos and patroklos::val_vs_test() are the stars of this directory.
