Final Project for the Network Machine Learning course at EPFL (EE-452).

Authors:
- Matteo Santelmo - SCIPER: 376844
- Stefano Viel - SCIPER: 377251

The repository is structured as follows:
- data/: contains the original dataset used for the project and is used to store the processed data.
- notebooks/: contains the Jupyter notebooks used for the project:
  - data_exploration.ipynb provides some insights on the dataset.
  - baselines.ipynb contains the code used to train and evaluate the two baselines.
  - results_analysis.ipynb contains the code used to analyze the results of the models obtained via grid search and the final experiments.
- src/: contains the source code of the project, in particular:
  - models.py contains the implementation of the GCN-based Encoder-Decoder architecture used for the project.
  - evaluation_metrics.py contains the implementation of the evaluation metrics.
  - matrix_factorization.py implements the Matrix Factorization baseline.
- scripts/: contains the scripts used to run the experiments.
- report/: contains the final report of the project.

First, install the required packages. We recommend creating a virtual environment and installing the packages there, which you can do by running the following commands:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

If any problem arises during or after the installation, we recommend following the installation instructions on the PyTorch and PyTorch Geometric websites, since the correct way to install these packages can depend on your system configuration.
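
If you are unsure whether the installation succeeded, a quick check along the following lines can help. This is a minimal sketch and not part of the repository: it assumes the two packages are importable under their usual names `torch` and `torch_geometric`, and the helper name `check_packages` is hypothetical.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each top-level module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Assumption: PyTorch and PyTorch Geometric are imported as torch and torch_geometric.
    for name, found in check_packages(["torch", "torch_geometric"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

A package reported as missing is a hint to reinstall it following the official instructions for your platform.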

You can now run all the code using the scripts provided in the scripts/ folder. Run any Python script with the --help flag to see its available options.
To create and store both the Heterogeneous Graph and the training-validation-test splits you can use:
mkdir -p ./data/splitted_data
python scripts/create_datasets.py --save_dir ./data/splitted_data
# by adding the --add_extra_data option, the graph will also contain author and language nodes

To train a model, run scripts/trainer.py with the appropriate arguments. This automatically creates a folder in the specified output directory containing the model files (both the last and the best model), the TensorBoard logs and a configuration file with the hyperparameters used. For example:
python scripts/trainer.py \
--data_path ./data/splitted_data \
--output_dir ./output \
--num_conv_layers 2 \
--hidden_channels 256 \
--num_decoder_layers 3 \
--sampler_type link-neighbor \
--num_epochs 10 \
--batch_size 1024 \
--encoder_arch SAGE \
--validation_steps -1 \
--lr 0.00025 \
--loss mse \
--device cuda:0 \
--verbose

Finally, to evaluate your models, use scripts/evaluator.py with the appropriate arguments, depending on where your model and data are stored. This script creates a metrics.json file in the model folder containing the values of the evaluation metrics.
python scripts/evaluator.py \
--model_folder ./output \
--data_folder ./data/splitted_data
# adding --evaluate_last makes the evaluator consider the last model instead of the best one

The script example.sh runs the whole pipeline with some default parameters and different models.

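
The train-then-evaluate sequence above can also be driven from Python, for instance to sweep a few hyperparameters. The sketch below only builds the command lines, using flags that appear in the examples above. The helper names and the grid values are assumptions made for illustration, it is not the content of example.sh, and any flag not listed is left to whatever default the scripts define (check --help).

```python
import itertools
import shlex

DATA_DIR = "./data/splitted_data"


def trainer_command(output_dir, num_conv_layers, hidden_channels, lr):
    """Build the argument list for one scripts/trainer.py run."""
    return [
        "python", "scripts/trainer.py",
        "--data_path", DATA_DIR,
        "--output_dir", output_dir,
        "--num_conv_layers", str(num_conv_layers),
        "--hidden_channels", str(hidden_channels),
        "--lr", str(lr),
    ]


def evaluator_command(model_folder):
    """Build the argument list for evaluating the model stored in model_folder."""
    return [
        "python", "scripts/evaluator.py",
        "--model_folder", model_folder,
        "--data_folder", DATA_DIR,
    ]


def grid_commands(output_dir, layers, channels, lrs):
    """One trainer command per combination of the given hyperparameter values."""
    return [
        trainer_command(output_dir, n, h, lr)
        for n, h, lr in itertools.product(layers, channels, lrs)
    ]


if __name__ == "__main__":
    # Hypothetical grid: 2 x 2 x 1 = 4 training runs.
    for cmd in grid_commands("./output", [2, 3], [128, 256], [0.00025]):
        print(shlex.join(cmd))
    print(shlex.join(evaluator_command("./output")))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to actually launch the run. Printing the commands first makes it easy to inspect the sweep before spending GPU time on it.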
| Model | MAP@15 | Precision@5 | Recall@5 | F1@5 |
|---|---|---|---|---|
| Random Baseline | 0.471 | 0.472 | 0.332 | 0.379 |
| Matrix Factorization | 0.489 | 0.494 | 0.312 | 0.371 |
| EncDec with SAGE | 0.551 | 0.552 | 0.347 | 0.414 |
| EncDec with SAGE + Additional Nodes | 0.593 | 0.596 | 0.380 | 0.450 |
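
For readers unfamiliar with the column headers, the sketch below shows one common way to compute Precision@k, Recall@k, F1@k and average precision at k for a single user, with MAP@k being the mean of the latter over users. It follows textbook definitions and may differ in detail from src/evaluation_metrics.py, for example in how relevance is decided or how average precision is normalised. All function names are hypothetical.

```python
def precision_recall_f1_at_k(ranked_items, relevant_items, k):
    """Precision@k, Recall@k and F1@k for one user's ranked recommendation list."""
    relevant = set(relevant_items)
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    precision = hits / k  # divides by k even if fewer than k items were ranked
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def average_precision_at_k(ranked_items, relevant_items, k):
    """AP@k: precision@i summed over the ranks i <= k that hold a relevant item."""
    relevant = set(relevant_items)
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    # Assumption: normalise by the number of relevant items that could fit in the top k.
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0


def mean_average_precision_at_k(rankings, relevants, k):
    """MAP@k: mean of AP@k over users (one ranking and one relevant set per user)."""
    aps = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps) if aps else 0.0
```

With this normalisation, a user whose relevant items all appear at the top of the list gets an AP@k of 1, so the metric rewards placing relevant items early rather than merely including them.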