Master's thesis by Lukas Gessl.
Chemotherapy with R-CHOP is the standard treatment for diffuse large B-cell lymphoma (DLBCL), the most common type of non-Hodgkin lymphoma, achieving a cure for about two thirds of patients. Survival for the remaining third with refractory or relapsed disease, however, remains poor. Pharma-sponsored randomized trials in the whole DLBCL population have to date failed to improve on R-CHOP. The International Prognostic Index (IPI), the only widely accepted risk-assessment tool for DLBCL and an easy clinical test, fails to identify a high-risk DLBCL subpopulation that is large and precise enough to trigger research and enable clinical trials for new treatments that outperform R-CHOP on this subpopulation.
This thesis aims to develop a computational method that identifies DLBCL patients with progression-free survival (PFS) below two years with higher prevalence and significantly higher precision than the IPI, and to demonstrate this on independent data. It also addresses the question of under which circumstances we can do so reliably. By a significantly higher precision, we mean that the 95% confidence interval of the precision of our model must not include the precision of the IPI on independent test data. We develop the models in a train-validation-test split of our data: we fit and validate a range of models on a training set, pick the model that validates best and test it on a test set.
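The significance criterion can be sketched in base R as follows. This is a minimal illustration, not the code used in the thesis: the function names, the toy data and the choice of the exact (Clopper-Pearson) interval via `binom.test()` are assumptions made here.

```r
# Precision (positive predictive value) with a 95% confidence interval.
# y_true: TRUE if PFS < 2 years; y_pred: TRUE if the model flags the patient
# as high risk. All names in this sketch are hypothetical.
precision_ci <- function(y_true, y_pred, conf_level = 0.95) {
  stopifnot(length(y_true) == length(y_pred), any(y_pred))
  n_flagged <- sum(y_pred)
  n_true_pos <- sum(y_true & y_pred)
  # Exact Clopper-Pearson interval; the interval type is an assumption
  ci <- binom.test(n_true_pos, n_flagged, conf.level = conf_level)$conf.int
  list(
    precision = n_true_pos / n_flagged,
    share_flagged = n_flagged / length(y_pred),
    lower = ci[1],
    upper = ci[2]
  )
}

# The model counts as significantly more precise than the IPI if the IPI's
# precision lies below the lower end of the model's confidence interval.
significantly_better <- function(model_ci, ipi_precision) {
  ipi_precision < model_ci$lower
}

# Made-up toy data, for illustration only
y_true <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
y_pred <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
res <- precision_ci(y_true, y_pred)
```

With only four flagged patients the interval is very wide, which illustrates why the size of the flagged subgroup matters for meeting this criterion.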
We apply our methods to three different data sets and to a large one that combines these three. We show that we can indeed deliver a model with the desired properties. Analysis after freezing the models and unlocking the test data suggests that, for a reliable internal validation and high test performance,
- data sets with a large number of samples, even if they result from combining somewhat different, partly non-prospective data sets,
- relying on already-existing molecular signatures rather than fitting new ones and
- deploying simple, generalized linear models that can handle batch effects
play a key role.
- The data directory is supposed to hold the data sets. We used three of them for this thesis and a big one combining all three of them in data/all.
- The documents directory holds presentation slides for the progress talk and the final talk (compiling them from source does not work due to missing included plot files) and the thesis itself.
- The models directory is supposed to hold the models we train and validate as .rds files.
- The results directory holds validation and testing results as well as meta-analysis in the form of tables and plots.
- The src directory holds all the source code to preprocess data and reproduce the results of this thesis. It plays the central role in this repo and we therefore dedicate the next section to it.
We recommend reading the thesis first.
We outsourced all reusable
code of this thesis to the R package patroklos, which
is tailored for this thesis and still applicable to a more general class of problems, namely
machine-learning projects that aim to develop a model predicting thresholded survival in the
classical train-validate-test split.
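To illustrate what "thresholded survival" means, the sketch below turns right-censored survival data into a binary label for a two-year threshold. It is written in base R and does not use patroklos. The function name and the handling of patients censored before the threshold (left as `NA`) are assumptions of this sketch, not a description of what patroklos does.

```r
# Turn right-censored survival data into a binary "event before threshold" label.
# time: follow-up time in years; event: 1 if progression was observed, 0 if censored.
# Hypothetical helper, for illustration only.
threshold_survival <- function(time, event, threshold = 2) {
  stopifnot(length(time) == length(event))
  label <- rep(NA, length(time))                  # logical NA by default
  label[time < threshold & event == 1] <- TRUE    # progressed before the threshold
  label[time >= threshold] <- FALSE               # progression-free at the threshold
  # Censored before the threshold: status at the threshold is unknown, stays NA
  label
}

# Made-up toy data
time  <- c(0.5, 1.2, 3.0, 2.5, 1.0)
event <- c(1, 0, 1, 0, 1)
labels <- threshold_survival(time, event)
```

The second patient is censored after 1.2 years, so their label stays `NA`; all other patients receive a definite label.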
- We used R version 4.4.1.
- Install the latest version of patroklos from GitHub (see there for more). Installing it will also make sure that almost all R packages it depends on are installed.
- Preprocessing needs the biomaRt package, version >= 2.60.0, from Bioconductor to map all gene names to HGNC symbols.
- If you want to use the Fira Sans font by Mozilla in plots, set use_fira_sans <- TRUE in src/assess/config.R, install the font from GitHub and install the sysfonts package from CRAN (we used version 0.8.9). Otherwise set use_fira_sans <- FALSE in src/assess/config.R.
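Before running anything, the requirements above can be checked with a short script such as the following. It is a hedged sketch: the helper name is hypothetical, nothing is installed, and treating R 4.4.1 as a minimum version is an assumption (the list above only states which version was used).

```r
# Check the prerequisites listed above without installing anything.
# Hypothetical helper; returns a named logical vector.
check_prerequisites <- function(min_biomart = "2.60.0") {
  has_pkg <- function(pkg) requireNamespace(pkg, quietly = TRUE)
  biomart_ok <- has_pkg("biomaRt") &&
    utils::packageVersion("biomaRt") >= min_biomart
  c(
    r_version = getRversion() >= "4.4.1",  # assumption: used version as minimum
    patroklos = has_pkg("patroklos"),
    biomaRt = biomart_ok,
    sysfonts = has_pkg("sysfonts")         # only needed if use_fira_sans is TRUE
  )
}

status <- check_prerequisites()
print(status)
```

Any `FALSE` entry points to the corresponding item in the list above.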
Only one of the three data sets, the Schmitz data, is publicly available.
src/prepro/schmitz.R will download it
if it is not available locally. To gain access to the two other data sets (and hence to the
combined data set), ask the system administrator of the Spang lab,
Christian Kohler, for access to our compute servers
where we provide all three data sets on a mounted volume.
With the above requisites fulfilled, you can now run
Rscript src/run_all.R
in your terminal from the root directory of this repo. In general, all scripts below src are
expected to be run with the repo root directory as the current working directory.
The structure of src
- src/prepro holds the scripts preprocessing the four data sets into Data R6 objects, the data format patroklos works with.
- src/models holds the scripts that define all trained and validated models with their hyperparameters for every data set. We do so by initializing a range of Model R6 objects, the model format patroklos works with.
- src/train holds the scripts that fit the models defined below src/models and validate them Model-internally by calling patroklos::training_camp(). They store the fully trained models with their validated predictions below models.
- src/assess holds the scripts that finalize validation, pick the best model according to validation on the respective training cohort and test it on the test cohort. Beyond the error, they calculate a range of model properties that show up as tables below results. The AssScalar R6 class from patroklos is the workhorse of this directory.
- src/analyze holds the scripts that unfreeze the respective test data for all models to do some meta-analysis of the trained models and the validation: reporting the non-zero coefficients of the picked models, plots on thresholding the continuous output of the picked models and plots on validation versus test error for all models. The Ass2d R6 class from patroklos and patroklos::val_vs_test() are the stars of this directory.