The primary functions of this repository include:
- Data Preprocessing: Parsing raw AKTA result files (including `.res`, `.zip`, `.csv`, and `.xlsx` formats) to extract relevant experimental parameters and chromatogram data.
- Feature Engineering: Transforming the extracted data into structured feature sets suitable for machine learning, including calculating peak metrics and collating run information.
- Predictive Modeling: Building and training machine learning models to predict purification outcomes (e.g., "Total Capsids") based on process parameters.
- Bayesian Optimization: Using the trained model as a surrogate within the BayBE framework to recommend new experimental conditions aimed at optimizing purification yield and purity.
$ conda env create --name viral --file=viral.yml
$ conda activate viral
This folder contains scripts used to parse and extract relevant data from historical chromatography runs, primarily from AKTA systems. Key scripts include:
- `extract_matrix.py`: Extracts features from `.csv` files (likely exported from AKTA results).
- `extract_xlsx_data.py`: Extracts features from `.xlsx` files.
- `extract_peaks.py` & `extract_xlsx_peaks.py`: Extract peak-specific metrics from `.csv` and `.xlsx` files respectively.
- `pycorn-bin.py`: A utility (likely based on the PyCORN library) used for processing AKTA `.res` or `.zip` result files, potentially converting them to `.xlsx` or `.csv`. See `extract_zips.sh` for a usage example.
- `utils.py` & `utils_xlsx.py`: Contain helper functions for data loading, parsing filenames (e.g., `get_resin_and_serotype`, `get_column_volume`), calculating metrics, and potentially plotting (`show_peaks`).
The goal of these scripts is to generate structured datasets suitable for machine learning.
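To make "peak-specific metrics" concrete, here is a minimal sketch of how a peak height, peak position, and integrated area could be computed from an exported chromatogram. It is not taken from `extract_peaks.py`: the function name `peak_metrics`, the column names `volume_ml` and `uv_mau`, and the sample data are all hypothetical, and the repository's scripts may define peaks differently.

```python
import csv
import io


def peak_metrics(rows, x_key="volume_ml", y_key="uv_mau"):
    """Summarize a single chromatogram trace.

    `rows` is a list of dicts such as those produced by csv.DictReader.
    The column names are assumptions made for this illustration only.
    """
    xs = [float(r[x_key]) for r in rows]
    ys = [float(r[y_key]) for r in rows]
    if len(xs) < 2:
        raise ValueError("need at least two data points")

    # Index of the tallest point in the trace.
    i_max = max(range(len(ys)), key=ys.__getitem__)

    # Trapezoidal integration of the signal over the whole trace.
    area = sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
        for i in range(len(xs) - 1)
    )
    return {
        "peak_height": ys[i_max],
        "peak_position": xs[i_max],
        "area": area,
    }


# Made-up trace standing in for an exported AKTA .csv file.
SAMPLE_CSV = """volume_ml,uv_mau
0.0,0.0
1.0,10.0
2.0,40.0
3.0,10.0
4.0,0.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
metrics = peak_metrics(rows)
print(metrics)
```

One row of such metrics per run, joined with the run's process parameters, gives the kind of structured dataset described above.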
This folder contains scripts for building predictive models based on the data extracted in the data_extraction phase.
- `train.py`: Script to train models (specifically Gaussian Process models as indicated). It loads processed data, splits it, potentially scales features (`log_transform`), trains a model (`train_gp_model`), makes predictions (`gp_predict`), and evaluates performance (`get_metrics`).
- `models.py`: Likely defines the model architectures and training/prediction functions (e.g., Gaussian Process related functions).
- `utils.py`: Contains helper functions for modeling, such as transformations (`log_transform`, `inverse_log_transform`) and metric calculation (`get_metrics`).
- `ML_analysis.ipynb`: A Jupyter Notebook for exploratory data analysis and potentially model experimentation.
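The transformation and evaluation helpers named above can be sketched as follows. The names `log_transform`, `inverse_log_transform`, and `get_metrics` are reused for readability, but the signatures and bodies are assumptions rather than the repository's code. In particular, the choice of `log1p`/`expm1` and the set of metrics (MAE, RMSE, R²) are illustrative guesses.

```python
import math


def log_transform(values):
    """Compress a wide-ranging target; log1p is used so zeros stay finite."""
    return [math.log1p(v) for v in values]


def inverse_log_transform(values):
    """Undo log_transform so predictions are back on the original scale."""
    return [math.expm1(v) for v in values]


def get_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equal-length sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        # R^2 is undefined when the true values are constant.
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


# Made-up numbers, purely to show the call pattern.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
scores = get_metrics(y_true, y_pred)
round_trip = inverse_log_transform(log_transform([0.0, 5.0, 1.0e11]))
print(scores, round_trip)
```

If a model is trained on log-transformed targets, predictions would typically be passed through the inverse transform before computing metrics, so that errors are reported in the original units.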
This folder contains scripts related to using the trained models for Bayesian optimization campaigns to suggest new experimental conditions.
- `surrogate_model.py`: Defines the Gaussian Process surrogate model (`gp_model`) used within the optimization framework (BayBE). It specifies the kernel structure (using `DotProductKernel`, `RQKernel`, `MaternKernel`).
- Each folder contains the serotypes that were purified in the downstream optimization campaign (AAV2, AAV5, AAV9).
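For readers unfamiliar with the three kernel families named above, the sketch below writes out their standard textbook forms in plain Python. It does not use BayBE and does not reproduce `surrogate_model.py`. The function names, the hyperparameter defaults, the Matérn smoothness of 5/2, and the decision to add the three kernels together are all assumptions made for illustration.

```python
import math


def _squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def dot_product_kernel(x, y, variance=1.0):
    """Linear kernel: captures trends that grow with the inputs."""
    return variance * sum(a * b for a, b in zip(x, y))


def rq_kernel(x, y, lengthscale=1.0, alpha=1.0):
    """Rational quadratic kernel: a mixture of RBF kernels over many lengthscales."""
    d2 = _squared_distance(x, y)
    return (1.0 + d2 / (2.0 * alpha * lengthscale ** 2)) ** (-alpha)


def matern52_kernel(x, y, lengthscale=1.0):
    """Matern kernel with smoothness 5/2 (the smoothness value is an assumption)."""
    r = math.sqrt(_squared_distance(x, y)) / lengthscale
    s = math.sqrt(5.0) * r
    return (1.0 + s + s * s / 3.0) * math.exp(-s)


def combined_kernel(x, y):
    """Sum of the three kernels; a sum of valid kernels is itself a valid kernel.

    How the repository actually combines them is not stated in this README.
    """
    return dot_product_kernel(x, y) + rq_kernel(x, y) + matern52_kernel(x, y)


# Two hypothetical, already-scaled process-parameter vectors.
a = [0.0, 0.0]
b = [1.0, 1.0]
print(combined_kernel(a, b), combined_kernel(b, a))
```

The linear term lets the surrogate follow a global trend, while the stationary terms model local deviations from it; that is one plausible reason to mix these families, though the README does not state the authors' motivation.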
- Data Extraction: Raw AKTA result files (e.g., `.res`, `.zip`) are processed using scripts in `data_extraction` (using `pycorn-bin.py` via `extract_zips.sh`) to generate intermediate `.csv` or `.xlsx` files.
- Feature Engineering: `extract_matrix.py`, `extract_xlsx_data.py`, and `extract_peaks.py` parse these intermediate files to create feature matrices and target variable datasets.
- Model Training: The `Modeling/train.py` script uses the generated datasets to train models.
- Optimization: The Gaussian process model (`surrogate_model.py`) is used as a surrogate in `Optimization campaigns` scripts (e.g., `AAV2_AAVA3_campaign.py`) to recommend new experimental conditions expected to optimize the target ("Total Capsids").
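The final step, turning surrogate predictions into a recommendation, can be illustrated with the expected improvement criterion, a common acquisition function in Bayesian optimization. This README does not say which acquisition function the campaigns use, so treat the sketch as a generic example and not as a description of the BayBE configuration. The candidate names and the predicted means and standard deviations are made up.

```python
import math


def expected_improvement(mean, std, best_so_far):
    """Expected improvement over best_so_far for a maximization problem.

    `mean` and `std` are the surrogate's predictive mean and standard
    deviation at one candidate condition.
    """
    if std <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best_so_far) * cdf + std * pdf


# Hypothetical predictions (mean, std) of the target for three candidates.
candidates = {
    "condition_A": (1.00, 0.10),
    "condition_B": (0.90, 0.50),
    "condition_C": (1.05, 0.05),
}
best_observed = 1.0

scores = {
    name: expected_improvement(mu, sigma, best_observed)
    for name, (mu, sigma) in candidates.items()
}
recommended = max(scores, key=scores.get)
print(recommended, scores)
```

In this made-up example the candidate with the lowest predicted mean scores highest because its uncertainty is large, which shows how the criterion trades off exploiting good predictions against exploring poorly characterized conditions.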