lgessl/master-thesis

Detection of a high-risk DLBCL group

Master thesis by Lukas Gessl.

The problem

Chemotherapy with R-CHOP is the standard treatment for diffuse large B-cell lymphoma (DLBCL), the most common type of non-Hodgkin lymphoma, achieving a cure for about two thirds of patients. Survival for the remaining third with refractory or relapsed disease, however, remains poor. Pharma-sponsored randomized trials in the whole DLBCL population have so far failed to improve on R-CHOP. The International Prognostic Index (IPI), the only widely accepted risk-assessment tool for DLBCL and an easy clinical test, fails to identify a high-risk DLBCL subpopulation that is large enough and identified precisely enough to trigger research and enable clinical trials for new treatments that outperform R-CHOP on this subpopulation.

The solution

This thesis aims to develop a computational method that identifies DLBCL patients with progression-free survival (PFS) below two years with higher prevalence and significantly higher precision than the IPI, and to demonstrate this on independent data. It also addresses the question of under which circumstances we can do so reliably. By significantly higher precision, we mean that the 95% confidence interval of our model's precision on independent test data must not include the precision of the IPI. We develop the models in a train-validation-test split of our data: we fit and validate a range of models on a training set, pick the model that validates best and test it on a test set.
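The significance criterion above can be illustrated with a minimal sketch in base R. It is not the thesis' actual code: the helper names are hypothetical, the exact (Clopper-Pearson) interval from `stats::binom.test()` is an assumption about how the confidence interval might be computed, "prevalence" is taken to mean the share of patients the classifier flags as high-risk, and the data are made up.

```r
# Hypothetical helper (not part of patroklos): prevalence and precision of a
# binary high-risk classifier with an exact 95% confidence interval for the
# precision. The choice of the Clopper-Pearson interval is an assumption.
precision_ci <- function(predicted_high_risk, pfs_below_2y, conf_level = 0.95) {
    stopifnot(length(predicted_high_risk) == length(pfs_below_2y))
    n_flagged <- sum(predicted_high_risk)
    n_true_pos <- sum(predicted_high_risk & pfs_below_2y)
    ci <- stats::binom.test(n_true_pos, n_flagged, conf.level = conf_level)$conf.int
    list(
        prevalence = n_flagged / length(predicted_high_risk),
        precision = n_true_pos / n_flagged,
        lower = ci[1],
        upper = ci[2]
    )
}

# Criterion from the text: the interval must not contain the IPI's precision.
# Since we want to beat the IPI, the interval has to lie entirely above it.
significantly_more_precise <- function(model, ipi_precision) {
    model$lower > ipi_precision
}

# Toy data, made up for illustration only
predicted <- c(rep(TRUE, 40), rep(FALSE, 160))
truth <- c(rep(TRUE, 30), rep(FALSE, 10), rep(TRUE, 20), rep(FALSE, 140))
res <- precision_ci(predicted, truth)
verdict <- significantly_more_precise(res, ipi_precision = 0.5)
```

Here `ipi_precision = 0.5` is a placeholder value, not a result from the thesis.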

The results

We apply our methods to three different data sets and to a big one that combines these three. We show that we can indeed deliver a model with the desired properties. Analyses after freezing the models and unlocking the test data suggest that, for a reliable internal validation and high test performance,

  • data sets with a large number of samples, even if they result from combining somewhat different, partly non-prospective data sets,
  • relying on already-existing molecular signatures rather than fitting new ones and
  • deploying simple, generalized linear models that can handle batch effects

play a key role.

The structure of this repository

  • data is supposed to hold the data sets. We used three of them for this thesis, plus a big one in data/all that combines all three.
  • documents holds the presentation slides for the progress talk and the final talk (compiling them from source does not work because the included plot files are missing) and the thesis itself.
  • models is supposed to hold the models we train and validate as .rds files.
  • results holds validation and testing results as well as meta-analysis in the form of tables and plots.
  • src holds all the source code to preprocess data and reproduce the results of this thesis. It plays the central role in this repo, so we dedicate the next section to it.

Reproducing the results

We recommend reading the thesis first.

We moved all reusable code of this thesis into the R package patroklos, which is tailored to this thesis yet applicable to a more general class of problems, namely machine-learning projects that aim to develop a model predicting thresholded survival in the classical train-validate-test split.
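As a rough illustration of the train-validate-test idea, the following base-R sketch selects among a few candidate models by validation error and touches the test cohort exactly once. It does not use patroklos; the simulated data, the candidate formulas and the helper `error_rate()` are all hypothetical, and carving a separate validation part out of the training data is a simplification of the model-internal validation the thesis describes.

```r
set.seed(1)

# Simulated data: y plays the role of "PFS below two years" (made up)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x1))
dat <- data.frame(y, x1, x2)

# Split once; "fit" and "validate" together form the training cohort
cohort <- sample(rep(c("fit", "validate", "test"), times = c(150, 75, 75)))
fit_set <- dat[cohort == "fit", ]
val_set <- dat[cohort == "validate", ]
test_set <- dat[cohort == "test", ]

# Hypothetical candidate models, all simple logistic regressions
candidates <- list(only_x1 = y ~ x1, only_x2 = y ~ x2, both = y ~ x1 + x2)
fits <- lapply(candidates, function(f) glm(f, family = binomial, data = fit_set))

# Misclassification rate when thresholding predicted probabilities at 0.5
error_rate <- function(fit, newdata) {
    pred <- predict(fit, newdata = newdata, type = "response") > 0.5
    mean(pred != (newdata$y == 1))
}

# Pick the model that validates best, then test only that one
val_error <- sapply(fits, error_rate, newdata = val_set)
best <- names(which.min(val_error))
test_error <- error_rate(fits[[best]], test_set)
```

Keeping the test cohort out of model selection is what makes `test_error` an honest estimate for the picked model.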

Prerequisites

Software

  • We used R version 4.4.1.
  • Install the latest version of patroklos from GitHub (see there for more). Installing it also pulls in almost all other R packages you need.
  • Preprocessing needs the biomaRt package, version >= 2.60.0, from Bioconductor to map all gene names to HGNC symbols.
  • If you want to use the Fira Sans font by Mozilla in plots, set use_fira_sans <- TRUE in src/assess/config.R, install the font from GitHub and install the sysfonts package from CRAN (we used version 0.8.9). Otherwise set use_fira_sans <- FALSE in src/assess/config.R.

Data

Only one of the three data sets, the Schmitz data, is publicly available. src/prepro/schmitz.R will download it if it is not available locally. To gain access to the two other data sets (and hence to the combined data set), ask the system administrator of the Spang lab, Christian Kohler, for access to our compute servers, where we provide all three data sets on a mounted volume.
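The "download only if missing" behaviour can be sketched as follows. This is not the code in src/prepro/schmitz.R: the helper `download_if_missing()`, the URL and the file path are hypothetical placeholders, and the demonstration uses a stub downloader so that it runs offline.

```r
# Hypothetical helper: fetch a file only if there is no local copy yet.
# By default it would use utils::download.file(); the downloader can be
# swapped out, which the offline demonstration below relies on.
download_if_missing <- function(url, dest, download = utils::download.file) {
    if (file.exists(dest)) {
        message("Found ", dest, ", skipping download.")
        return(invisible(FALSE))
    }
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    download(url, destfile = dest, mode = "wb")
    invisible(TRUE)
}

# Offline demonstration with a stub instead of a real download
fake_download <- function(url, destfile, mode) writeLines("stub", destfile)
dest <- file.path(tempfile("demo-data-"), "expression.tsv")
first <- download_if_missing("https://example.org/expression.tsv", dest, fake_download)
second <- download_if_missing("https://example.org/expression.tsv", dest, fake_download)
```

The first call creates the file, the second one finds it and skips the download.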

All in one run

With the above prerequisites fulfilled, you can now run

Rscript src/run_all.R 

in your terminal from the root directory of this repo. In general, all scripts below src are expected to be run with the repo root directory as the current working directory.
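Because the scripts assume the repo root as the working directory, a small guard at the top of a script can catch a wrong working directory early. The sketch below is a hypothetical addition, not something the repo is known to contain; `assert_repo_root()` and its notion of "looks like the root" (a src directory exists) are assumptions.

```r
# Hypothetical guard: stop early if a directory does not look like the
# repository root, i.e. if it lacks one of the required subdirectories.
assert_repo_root <- function(dir = getwd(), required = c("src")) {
    absent <- required[!dir.exists(file.path(dir, required))]
    if (length(absent) > 0) {
        stop(
            "Expected to run from the repo root, but '", dir, "' lacks: ",
            paste(absent, collapse = ", ")
        )
    }
    invisible(TRUE)
}

# Demonstration on a temporary directory that mimics the repo layout
root <- tempfile("repo-")
dir.create(file.path(root, "src"), recursive = TRUE)
ok <- assert_repo_root(root)
```

Inside a script, calling `assert_repo_root()` without arguments would check the current working directory.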

The structure of src

  • src/prepro holds the scripts preprocessing the four data sets into Data R6 objects, the data format patroklos works with.
  • src/models holds the scripts that define all trained and validated models with their hyperparameters for every data set. We do so by initializing a set of Model R6 objects, the model format patroklos works with.
  • src/train holds the scripts that fit the models defined below src/models and validate them Model-internally by calling patroklos::training_camp(). They store the fully trained models with their validated predictions below models.
  • src/assess holds the scripts that finalize validation, pick the best model according to validation on the respective training cohort and test it on the test cohort. Beyond the error, they calculate a number of model properties that show up as tables below results. The AssScalar R6 class from patroklos is the workhorse of this directory.
  • src/analyze holds the scripts that unfreeze the respective test data for all models to do some meta-analysis of the trained models and the validation: reporting the non-zero coefficients of the picked models, plots on thresholding the continuous output of the picked models and plots on validation versus test error for all models. The Ass2d R6 class from patroklos and patroklos::val_vs_test() are the stars of this directory.
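To make "thresholding the continuous output" concrete, here is a minimal base-R sketch that traces prevalence and precision of the high-risk group over all candidate thresholds. It does not use Ass2d or any other patroklos function; `threshold_curve()` is a hypothetical helper, the data are made up, and the sketch assumes that a higher model output means higher risk.

```r
# Hypothetical helper: for every candidate threshold, flag samples with a
# score at or above the threshold as high-risk and record the share of
# flagged samples (prevalence) and the share of flagged samples that truly
# have PFS below two years (precision).
threshold_curve <- function(score, pfs_below_2y) {
    stopifnot(length(score) == length(pfs_below_2y))
    thresholds <- sort(unique(score), decreasing = TRUE)
    rows <- lapply(thresholds, function(t) {
        flagged <- score >= t
        data.frame(
            threshold = t,
            prevalence = mean(flagged),
            precision = mean(pfs_below_2y[flagged])
        )
    })
    do.call(rbind, rows)
}

# Toy data, made up for illustration only
score <- c(2.1, 1.7, 1.2, 0.4, 0.3, -0.5, -0.9, -1.3)
truth <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
curve <- threshold_curve(score, truth)
```

Plotting `precision` against `prevalence` from `curve` shows the trade-off between the size of the high-risk group and how reliably it is identified.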

About

My master thesis conducted at the Chair of Statistical Bioinformatics at the University of Regensburg
