PROFET reconstructs continuous gene expression dynamics from static, time-stamped single-cell RNA sequencing (scRNA-seq) snapshots. Unlike conventional methods that rely on discrete timepoints or assume linear transitions, PROFET models cell state evolution as a principled generative process. It has been validated on both synthetic and experimental datasets and applied to uncover treatment-induced heterogeneity in breast cancer. By recovering dynamic expression trajectories from static scRNA-seq data, PROFET provides a scalable and principled tool for modeling cell state transitions in development, disease, and therapeutic response.
- Step 1: Particle transport (GPA). Constructs optimal transport plans between empirical distributions at consecutive timepoints using a Lipschitz-regularised KL divergence minimisation, producing temporally smooth and distribution-consistent particle trajectories (`run_GPA.py`, TensorFlow).
- Step 2: Force matching. Fits a time-dependent neural ODE velocity field to the particle flows from Step 1, yielding a continuous global vector field (`run_ForceMatching.py`, TensorFlow). At inference, the fitted field is loaded via `models/velocityfield.py` (PyTorch) and integrated with a forward-Euler ODE solver.
- Step 3: Downstream analysis. The inferred continuous trajectory is used for four types of biological analysis, all implemented in `util/downstream.py`:
  - Trajectory visualisation and subtrajectory classification: reconstructed cell trajectories are visualised in PCA space and classified into subgroups based on either fate (target time point clustering) or ancestral state (source time point clustering), revealing distinct cell fate decisions.
  - Gene expression dynamics (EMT, mESC): per-gene expression is reconstructed over continuous time from the trajectory, enabling comparison of dynamic gene programmes across subtrajectories via average dynamics, violin plots, fold change, and KDE distribution comparisons at held-out intermediate timepoints.
  - Phenotypic shift heterogeneity (breast cancer datasets): cells are classified into Low / Medium / High phenotypic shift groups based on displacement in PCA space before and after treatment, and per-gene expression dynamics are reconstructed within each group to characterise transcriptional diversity in treatment response.
  - Fate analysis (LARRY, Axolotl): cell fates are predefined from published studies. For LARRY, fate labels are provided at day 6 for three groups (neutrophils, monocytes, and others); for Axolotl, at day 7 for four groups (BE, IE, RCP, and CT cells). Each inferred trajectory is assigned to the nearest fate group centroid at the corresponding reference timepoint (day 6 for LARRY, day 7 for Axolotl), and per-gene expression dynamics are reconstructed within each fate class.
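The forward-Euler integration mentioned in Step 2 can be sketched without any deep-learning framework. The snippet below is an illustration only, not the code in `models/velocityfield.py`: the function name `euler_integrate`, its arguments, and the toy velocity field are assumptions made for this example. It returns the position at every time step, similar in spirit to a list of trajectory snapshots.

```python
from typing import Callable, List, Sequence

VelocityField = Callable[[Sequence[float], float], Sequence[float]]


def euler_integrate(velocity: VelocityField, x0: Sequence[float],
                    t0: float, t1: float, n_steps: int) -> List[List[float]]:
    """Integrate dx/dt = velocity(x, t) from t0 to t1 with forward Euler.

    Returns n_steps + 1 positions, starting with x0 itself.
    """
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    dt = (t1 - t0) / n_steps
    x = [float(v) for v in x0]
    path = [list(x)]
    for k in range(n_steps):
        t = t0 + k * dt
        v = velocity(x, t)
        # Forward-Euler update: x_{k+1} = x_k + dt * v(x_k, t_k)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        path.append(list(x))
    return path


def toy_field(x: Sequence[float], t: float) -> Sequence[float]:
    """Hypothetical time-dependent field standing in for a trained model."""
    return [1.0, 2.0 * t]


path = euler_integrate(toy_field, x0=[0.0, 0.0], t0=0.0, t1=1.0, n_steps=100)
print(len(path), path[-1])
```

Forward Euler is first-order accurate, so the second coordinate of the final point only approaches the exact value of 1 as `n_steps` grows.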
git clone https://github.com/HyeminGu/PROFET.git
cd PROFET
pip install -r requirements.txt

Key dependencies: torch, tensorflow, geomloss, scikit-learn, numpy, pandas, matplotlib, seaborn, scipy.
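After installation, a quick check that the key dependencies are importable can save a failed notebook run. The sketch below is not part of the repository; the helper name `missing_packages` is an assumption. It relies only on the standard library and does not import the packages themselves. Note that scikit-learn is imported as `sklearn`.

```python
from importlib.util import find_spec
from typing import Dict, List

# Mapping from the pip package name to the module name used in `import`.
# scikit-learn is the one listed dependency whose import name differs.
KEY_DEPENDENCIES: Dict[str, str] = {
    "torch": "torch",
    "tensorflow": "tensorflow",
    "geomloss": "geomloss",
    "scikit-learn": "sklearn",
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scipy": "scipy",
}


def missing_packages(dependencies: Dict[str, str]) -> List[str]:
    """Return the pip names of packages whose top-level module cannot be found."""
    return [pip_name for pip_name, module in dependencies.items()
            if find_spec(module) is None]


missing = missing_packages(KEY_DEPENDENCIES)
print("Missing packages:", ", ".join(missing) if missing else "none")
```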
PROFET/  ← project root
│
├── PROFET/  ← core code (lib_dir in notebooks)
│   ├── run_GPA.py  ← Step 1: particle transport
│   ├── run_ForceMatching.py  ← Step 2: velocity field training
│   ├── models/
│   │   ├── velocityfield.py  ← PyTorch VelocityField (load + ODE integrate)
│   │   └── discriminator.py  ← GPA discriminator network
│   └── configs/
│       ├── GPA.yaml  ← default GPA hyperparameters
│       └── GPA-Toy1.yaml  ← toy-data config
│
├── util/  ← shared Python utilities
│   ├── utils.py  ← data I/O, PCA, animation, W2 metric
│   └── downstream.py  ← all downstream analysis functions
│                        (gene dynamics, trajectory visualisation,
│                        subtrajectory classification)
│
├── notebooks/  ← one self-contained notebook per dataset
│   ├── Emt_72.ipynb  ← EMT
│   ├── Stem_cell_differentiation.ipynb  ← mESC
│   ├── MCF7_Cell_Line.ipynb  ← MCF7 breast cancer cell line
│   ├── Patient_PA3.ipynb  ← Patient PA3 (BMC cohort)
│   ├── Patient_862.ipynb  ← Patient 862 (NatMed cohort)
│   ├── Patient_887.ipynb  ← Patient 887 (NatMed cohort)
│   ├── Synthetic.ipynb  ← synthetic trajectory benchmark
│   ├── LARRY_3000_benchmark.ipynb  ← LARRY dataset benchmark
│   ├── Axolotl_data_2000.ipynb  ← Axolotl limb regeneration
│   └── OU_process-GPA.ipynb  ← Ornstein-Uhlenbeck toy example
│
├── benchmarks/  ← benchmark models for comparison
│   ├── cellot/
│   ├── DeepRUOT/
│   ├── MIOFlow/
│   ├── MMFM/
│   ├── PI-SDE/
│   ├── prescient/
│   ├── TIGON/
│   ├── TrajectoryNet/
│   └── VGFM/
│
├── data/  ← raw data and preprocessed .pkl files
│            (not included in the repository)
├── assets/  ← outputs: GIFs, plots, model weights
├── requirements.txt
├── LICENSE
└── README.md
PROFET has been applied and benchmarked across nine datasets spanning multiple biological contexts:
| Notebook | Dataset | Context |
|---|---|---|
| `Emt_72.ipynb` | EMT (72 genes, 12,588 cells) | Epithelial-to-mesenchymal transition; 6 timepoints (days 0–8); trains on days 0, 4; holds out days 1, 2, 3, 8 |
| `Stem_cell_differentiation.ipynb` | mESC differentiation (100 genes, 456 cells) | Mouse embryonic stem cell differentiation; 5 timepoints (days 0–4); trains on days 0, 2, 4; holds out days 1, 3 |
| `MCF7_Cell_Line.ipynb` | MCF7 breast cancer cell line (117 genes, 14,160 cells) | Palbociclib treatment response (NDPR cohort); day 0 → day 1 |
| `Patient_PA3.ipynb` | Patient PA3 (116 genes, 4,692 cells) | Palbociclib treatment (BMC cohort); day 0 → day 1 |
| `Patient_862.ipynb` | Patient 862 (115 genes, 17,260 cells) | Palbociclib treatment (NatMed cohort); day 0 → day 1 |
| `Patient_887.ipynb` | Patient 887 (115 genes, 10,174 cells) | Palbociclib treatment (NatMed cohort); day 0 → day 1 |
| `LARRY_3000_benchmark.ipynb` | LARRY (3,000 genes, 49,302 cells) | Lineage-tracing benchmark; 3 timepoints (days 2, 4, 6); trains on days 2, 6; holds out day 4 |
| `Axolotl_data_2000.ipynb` | Axolotl limb regeneration (2,000 genes, 18,648 cells) | 5 timepoints (days 0–4); trains on days 0, 2, 4; holds out days 1, 3 |
| `Synthetic.ipynb` | Synthetic trajectory (26 genes, 1,195 cells) | 5 timepoints (days 0–4); trains on days 0, 2, 4; holds out days 1, 3 |
An Ornstein-Uhlenbeck toy example (OU_process-GPA.ipynb) is also provided.
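The train / hold-out convention used for the multi-timepoint datasets is simple set arithmetic: every timepoint not used for training is held out for evaluation. The sketch below spells this out for three of the datasets listed above; the dictionary `SPLITS` and the helper `held_out` are illustrative names, not part of the PROFET code base.

```python
from typing import Dict, List, Sequence, Tuple

# (all timepoints, training timepoints), copied from the dataset list.
SPLITS: Dict[str, Tuple[List[int], List[int]]] = {
    "EMT": ([0, 1, 2, 3, 4, 8], [0, 4]),
    "mESC": ([0, 1, 2, 3, 4], [0, 2, 4]),
    "LARRY": ([2, 4, 6], [2, 6]),
}


def held_out(all_timepoints: Sequence[int], training: Sequence[int]) -> List[int]:
    """Return the timepoints that are not used for training, in their original order."""
    unknown = set(training) - set(all_timepoints)
    if unknown:
        raise ValueError(f"training timepoints not in dataset: {sorted(unknown)}")
    train_set = set(training)
    return [t for t in all_timepoints if t not in train_set]


for name, (timepoints, training) in SPLITS.items():
    print(name, "held out:", held_out(timepoints, training))
```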
Each notebook is self-contained and walks through the full pipeline for one dataset.
notebooks/<Dataset>.ipynb
│
├── 1. Preprocessing
│       Input:  raw gene expression matrix (.txt) + cell time annotation (.txt)
│       Output: preprocessed dataset saved as data/<name>_preprocessed.pkl
│               PCA variance ratio plot saved to data/
│
├── 2. PROFET
│       Step 1 (GPA)
│           Input:  preprocessed .pkl (projected PCA coordinates)
│           Output: GPA transport plan saved as assets/<name>/KL-Lipschitz_...pickle
│       Step 2 (Force Matching)
│           Input:  GPA .pickle file(s) from Step 1
│           Output: velocity field weights + hyperparameters saved to assets/<name>/<exp_memo>/
│       ODE integration
│           Input:  velocity field from assets/<name>/<exp_memo>/
│           Output: X1_trpts (list of cell positions at each time step)
│
├── 3. Trajectory Visualization & Subtrajectory Classification
│       Input:  X1_trpts, pca, mats (per-timepoint expression matrices)
│       Output: static trajectory plots (.png, with/without snapshots)
│               animated subtrajectory GIFs (.gif)
│               cluster label CSV ({exp_memo}_X1_hat_clusters.csv or _X2_hat_clusters.csv)
│
└── 4. Downstream Analysis
        EMT / mESC
            Input:  X1_trpts, cluster label CSV, gene expression matrices
            Output: per-gene average dynamics plots, violin plots by subtrajectory,
                    fold change / p-value CSVs and plots, single-cell trajectory plots,
                    KDE distribution comparisons at intermediate timepoints
        Breast cancer (MCF7 / PA3 / 862 / 887)
            Input:  X1_trpts, gene expression matrices
            Output: displacement distribution plots and CSVs,
                    Low / Medium / High phenotypic shift classification plots,
                    per-gene single-cell dynamics by shift class
        LARRY / Axolotl
            Input:  X1_trpts, predefined fate labels (neutrophils / monocytes / others for LARRY;
                    BE / IE / RCP / CT for Axolotl), gene expression matrices
            Output: fate-classified subtrajectory plots,
                    per-gene dynamics by fate class
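The fate assignment used for LARRY and Axolotl (each inferred trajectory goes to the nearest fate group centroid at the reference timepoint) can be illustrated with a small standard-library sketch. The function names `fate_centroids` and `assign_fates` and the 2D toy coordinates are assumptions for this example; they are not the functions in `util/downstream.py`, and Euclidean distance is assumed as the notion of "nearest".

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence

Point = Sequence[float]


def fate_centroids(points: Sequence[Point], labels: Sequence[str]) -> Dict[str, List[float]]:
    """Average the coordinates of the labelled reference cells within each fate group."""
    groups: Dict[str, List[Point]] = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {label: [sum(axis) / len(members) for axis in zip(*members)]
            for label, members in groups.items()}


def assign_fates(endpoints: Sequence[Point], centroids: Dict[str, List[float]]) -> List[str]:
    """Give each trajectory endpoint the label of its nearest centroid (Euclidean)."""
    return [min(centroids, key=lambda label: math.dist(point, centroids[label]))
            for point in endpoints]


# Toy reference cells at the reference timepoint (hypothetical coordinates).
reference = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 5.2], [-4.0, 3.0]]
labels = ["neutrophils", "neutrophils", "monocytes", "monocytes", "others"]
centroids = fate_centroids(reference, labels)

# Toy endpoints of three inferred trajectories (hypothetical coordinates).
endpoints = [[0.3, -0.1], [4.0, 4.5], [-3.0, 2.0]]
fates = assign_fates(endpoints, centroids)
print(fates)
```

Once every trajectory carries a fate label, per-gene dynamics can be averaged within each fate class.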
Raw datasets are available for download at: https://drive.google.com/drive/folders/1ba-skCOxvosDQTWz1Rq3GlClk-NH8-eV
Preprocessed datasets (.pkl files) are available under a CC BY 4.0 license for download at:
https://doi.org/10.5281/zenodo.21014564
Place each dataset under data/:
| Dataset | Pickle file | Timepoints | Genes | Total cells | Training timepoints | Held-out timepoints |
|---|---|---|---|---|---|---|
| EMT | `emt_72_preprocessed.pkl` | 0, 1, 2, 3, 4, 8 | 72 | 12,588 | 0, 4 | 1, 2, 3, 8 |
| Stem cell differentiation (mESC) | `stem_cell_differentiation_preprocessed.pkl` | 0, 1, 2, 3, 4 | 100 | 456 | 0, 2, 4 | 1, 3 |
| MCF7 cell line | `MCF7_Cell_Line_preprocessed.pkl` | 0, 1 | 117 | 14,160 | 0, 1 | none |
| Patient PA3 | `Patient_PA3_preprocessed.pkl` | 0, 1 | 116 | 4,692 | 0, 1 | none |
| Patient 862 | `Patient_862_preprocessed.pkl` | 0, 1 | 115 | 17,260 | 0, 1 | none |
| Patient 887 | `Patient_887_preprocessed.pkl` | 0, 1 | 115 | 10,174 | 0, 1 | none |
| LARRY benchmark | `LARRY_3000_benchmark_preprocessed.pkl` | 2, 4, 6 | 3,000 | 49,302 | 2, 6 | 4 |
| Axolotl limb regeneration | `Axolotl_data_2000_preprocessed.pkl` | 0, 1, 2, 3, 4 | 2,000 | 18,648 | 0, 2, 4 | 1, 3 |
| Synthetic | `synthetic_preprocessed.pkl` | 0, 1, 2, 3, 4 | 26 | 1,195 | 0, 2, 4 | 1, 3 |
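Before opening a notebook, it can help to confirm that the expected pickle files are in place. The following sketch only checks for file presence under `data/`; it makes no assumption about what the pickles contain. The helper name `missing_pickles` is hypothetical; the file names are those listed in the table.

```python
from pathlib import Path
from typing import List, Sequence, Union

EXPECTED_PICKLES = [
    "emt_72_preprocessed.pkl",
    "stem_cell_differentiation_preprocessed.pkl",
    "MCF7_Cell_Line_preprocessed.pkl",
    "Patient_PA3_preprocessed.pkl",
    "Patient_862_preprocessed.pkl",
    "Patient_887_preprocessed.pkl",
    "LARRY_3000_benchmark_preprocessed.pkl",
    "Axolotl_data_2000_preprocessed.pkl",
    "synthetic_preprocessed.pkl",
]


def missing_pickles(data_dir: Union[str, Path],
                    expected: Sequence[str] = EXPECTED_PICKLES) -> List[str]:
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


# Run from the project root so that "data" resolves to PROFET/data/.
print("Missing files:", missing_pickles("data"))
```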
| Function | Description |
|---|---|
| `load_preprocessed_data` | Load a saved .pkl dataset |
| `save_preprocessed_data` | Save preprocessed data to .pkl |
| `reduce_dimension` | Fit full-rank PCA and save variance plot |
| `visualize_data` | Per-timepoint 2D PCA scatter plots |
| `generate_animation` | Animated GIF of trajectory + optional vector field |
| `generate_W2distance_plot` | W₂ distance between predicted trajectory and data over time |
| `W2` | Sinkhorn W₂ between two sample sets |
| `save_trajectories` | Save a list of trajectory snapshots to a .pkl file |
| `ResourceMonitor` | Context manager measuring wall-clock time and peak GPU / CPU memory |
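For intuition about the W₂ metric, the one-dimensional case has a closed form: for two samples of equal size, the optimal coupling matches sorted values, so W₂ is the root mean square of the differences between order statistics. The sketch below implements that special case only. It is a deliberately swapped-in technique, not the Sinkhorn computation performed by `W2`, and an entropy-regularised Sinkhorn estimate would generally differ slightly from this exact value. The function name `w2_1d` is an assumption.

```python
import math
from typing import Sequence


def w2_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Exact 2-Wasserstein distance between two equal-size 1D samples.

    Valid only in one dimension with uniform weights and equal sample sizes.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    # In 1D the optimal transport plan pairs the i-th smallest x with the i-th smallest y.
    squared = sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys)))
    return math.sqrt(squared / len(xs))


print(w2_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0]))
```

Shifting every point of a sample by a constant `c` yields a distance of exactly `|c|`, which makes a convenient sanity check.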
`util/downstream.py` contains all downstream analysis and visualization functions, organized in two sections:
Gene Expression Dynamics
- `Average_gene_dynamics_whole_saveonly`: mean trajectory with 95% CI
- `Average_gene_dynamics_whole_saveonly_with_violin_plot_sample1_EMT`: violin plots by subtrajectory (EMT)
- `Average_gene_dynamics_whole_saveonly_with_violin_plot_sample_3_stem`: violin plots by subtrajectory (mESC)
- `Average_gene_dynamics_whole_saveonly_single_trajectory_EMT/mESC`: single-cell trajectories
- `Average_gene_dynamics_whole_saveonly_single_trajectory_Axolotl`: single-cell trajectories (Axolotl)
- `Average_gene_dynamics_whole_saveonly_with_violin_plot_Axolotl`: violin plots by subtrajectory (Axolotl)
- `Average_gene_dynamics_whole_saveonly_single_trajectory_NDPR_breast_cancer`: single-cell (MCF7)
- `Average_gene_dynamics_whole_saveonly_single_trajectory_clinical`: single-cell (PA3, 862, 887)
- `Compute_and_Plot_FoldChange_MeanDiff_PValues`: fold change, mean difference, p-values
- `difference_of_means_emt` / `difference_of_means_stem`: between-subgroup statistics
- `Compare_Distribution_Trajectories_Intermediate_EMT/mESC`: KDE comparisons at intermediate times
- `plot_X1_hat_displacement_distribution`: displacement histogram (breast cancer)
- `generate_static_cluster_plot_deviation_colormap_MCF7/PA3/862/887`: phenotypic shift classification
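The displacement-based phenotypic shift classification used for the breast cancer datasets can be sketched as follows. Displacement is taken here as the Euclidean distance between a cell's PCA coordinates before and after treatment. The cut-offs between Low, Medium, and High are not stated in this README, so the rank-based tertile split below is purely an assumption, as are the function names and the toy coordinates.

```python
import math
from typing import List, Sequence


def displacements(start: Sequence[Sequence[float]],
                  end: Sequence[Sequence[float]]) -> List[float]:
    """Euclidean distance travelled by each cell between two timepoints."""
    return [math.dist(a, b) for a, b in zip(start, end)]


def classify_shift(values: Sequence[float]) -> List[str]:
    """Label each value Low / Medium / High by rank tertile (assumed thresholds)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [""] * n
    for rank, i in enumerate(order):
        if 3 * rank < n:
            labels[i] = "Low"
        elif 3 * rank < 2 * n:
            labels[i] = "Medium"
        else:
            labels[i] = "High"
    return labels


# Toy PCA coordinates for six cells at day 0 and day 1 (hypothetical values).
day0 = [[0.0, 0.0]] * 6
day1 = [[0.1, 0.0], [3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [0.5, 0.0], [6.0, 8.0]]
shift = displacements(day0, day1)
groups = classify_shift(shift)
print(list(zip(shift, groups)))
```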
Trajectory Visualization & Subtrajectory Classification
- `generate_static_trajectory_plots_three_timepoints`: static plots, 3 training timepoints
- `generate_static_trajectory_plots_two_timepoints`: static plots, 2 training + 1 test
- `generate_static_trajectory_plots_two_timepoints_no_middle`: static plots, 2 training, no test
- `generate_static_cluster_plot_target`: static subtrajectory plot, clustered by fate
- `generate_static_cluster_plot_target_with_dfcluster_selected_clusters`: fate-classified plot using pre-labelled clusters with optional subset selection
- `generate_static_cluster_plot_target_LARRY_benchmark_fate`: fate-classified subtrajectory plot for the LARRY benchmark
- `generate_static_cluster_plot_source`: static subtrajectory plot, clustered by ancestor
- `classify_X1_hat`: animated fate classification
- `classify_X2_hat`: animated ancestral classification
Nine trajectory inference methods are included under benchmarks/ for comparison:
cellot, DeepRUOT, MIOFlow, MMFM, PI-SDE, prescient, TIGON, TrajectoryNet, VGFM.
Figure 2 and Supplementary Table 4 compare the performance of PROFET with that of the benchmark models. The benchmark models are evaluated on a shared set of datasets. Each entry below lists the PCA dimensions tested and the models evaluated on that dataset.
- Stem cell differentiation (mESC): 456 cells, 100 genes; train on days 0, 2, 4; hold out days 1, 3
  - Dimensions: 2, 4, 8, 16
  - Models: MMFM, MIOFlow (PCA / GAE / PHATE), prescient, VGFM, DeepRUOT, PI-SDE, TIGON, TrajectoryNet
- EMT (72 genes): 12,588 cells; train on days 0, 4; hold out day 2
  - Dimensions: 2
  - Models: MMFM, MIOFlow (PCA / GAE / PHATE), prescient, VGFM, DeepRUOT, PI-SDE, TIGON, TrajectoryNet, cellot
- LARRY benchmark (3,000 genes): 49,302 cells; train on days 2, 6; hold out day 4
  - Dimensions: 2
  - Models: MMFM, MIOFlow (PCA / GAE / PHATE), prescient, VGFM, DeepRUOT, PI-SDE, TIGON, TrajectoryNet, cellot
- Axolotl limb regeneration (2,000 genes): 18,648 cells; train on days 0, 2, 4; hold out days 1, 3
  - Dimensions: 2
  - Models: MMFM, MIOFlow (PCA / GAE / PHATE), prescient, VGFM, DeepRUOT, PI-SDE, TIGON
cellot is further evaluated on the two-timepoint datasets:
- MCF7 breast cancer cell line: 14,160 cells, 117 genes; day 0 → day 1; Dimensions: 2
- Patient 862: 17,260 cells, 115 genes; day 0 → day 1; Dimensions: 2
- Patient 887: 10,174 cells, 115 genes; day 0 → day 1; Dimensions: 2
- Patient PA3: 4,692 cells, 116 genes; day 0 → day 1; Dimensions: 2
Each benchmark model has a run.sh script that lists all experiments. Data is loaded from the shared data/ directory and results are saved to assets/<model_name>/.
cd benchmarks/<model_name>
# Run all experiments for a model
./run.sh
# Run a single experiment by ID
./run.sh 3

Notebook-based models (MIOFlow, prescient, VGFM, DeepRUOT, TrajectoryNet) use papermill to inject parameters and execute notebooks from the command line. Script-based models (MMFM, PI-SDE, TIGON, cellot) accept parameters directly via argparse. See each model's run.sh for the full list of experiments and their configurations.
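When several benchmark runs need to be launched from Python rather than the shell, the two invocations of `run.sh` (all experiments, or a single experiment by ID) can be wrapped as shown below. This is a convenience sketch, not part of the repository: `build_command` and `run_benchmark` are hypothetical helpers, and the only behaviour assumed of `run.sh` is the optional experiment ID argument.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_command(experiment_id: Optional[int] = None) -> List[str]:
    """Build the run.sh command line; with no ID, all experiments are run."""
    command = ["./run.sh"]
    if experiment_id is not None:
        command.append(str(experiment_id))
    return command


def run_benchmark(model_name: str, experiment_id: Optional[int] = None,
                  root: Path = Path("benchmarks")) -> int:
    """Run run.sh inside benchmarks/<model_name> and return its exit code."""
    result = subprocess.run(build_command(experiment_id),
                            cwd=root / model_name, check=False)
    return result.returncode


print(build_command())
print(build_command(3))
```

From the project root, a call such as `run_benchmark("TIGON", 3)` would then correspond to `cd benchmarks/TIGON && ./run.sh 3`.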
If you use PROFET in your research, please cite:
@article{cheng2026profet,
title={PROFET Predicts Continuous Gene Expression Dynamics
from scRNA-seq Data to Elucidate Resistance to Cancer Therapy},
author={Cheng, YC and Gu, H and McDonald, TO and Wu, W and Tripathi, S and Guarducci, C and Russo, D and Abravanel, DL and Bailey, M and Wang, Y and Zhang, Y and Pantazis, Y and Levine, H and Jeselsohn, R and Katsoulakis, MA and Michor, F},
journal={Cell Systems},
note={In press},
year={2026}
}