This guide is a good starting point for the requirements.
- Install docker and sudoless docker. More info is available in the RCP documentation on containers and on preparing environments.
- Install kubernetes
  - Follow the kubernetes instructions on wiki.rcp.epfl.ch to install kubernetes.
  - If running `kubectl version` gives a `The connection to the server localhost:8080 was refused...` message, you might need to create a `.kube/config` file and run `curl https://wiki.rcp.epfl.ch/public/files/kube-config.yaml -o ~/.kube/config && chmod 600 ~/.kube/config` to configure the cluster.
- Install runai using the instructions in the wiki
  - Log in to the RunAI platform using `runai login`. You should be able to run `runai whoami` afterwards.
- `registry.rcp.epfl.ch`
  - Go to registry.rcp.epfl.ch and log in.
  - Create your project with the UI. Your project should be `lts4-$USERNAME`.
  - Log in to the registry with docker by running `docker login registry.rcp.epfl.ch`.
- (Optional) Create a wandb secret and name it `wandb-secret`. This is needed for the wandb integration. Follow this link: https://wiki.rcp.epfl.ch/en/home/CaaS/FAQ/how-to-use-secret-wandb
- For Visual Studio Code integration, follow this link: https://wiki.rcp.epfl.ch/en/home/CaaS/FAQ/how-to-vscode
- `haas`
  - Make sure you have access to the `haas` storage by running `ssh $USERNAME@haas001.rcp.epfl.ch` (or `ssh $USERNAME@jumphost.rcp.epfl.ch`, which is the recommended host).
  - Go to your mounted volume (should be `/mnt/lts4/scratch` for most) and create a directory with your name via `mkdir -p /mnt/lts4/scratch/home/$USERNAME`. The launch script assumes that you have done so.
Now you can proceed with the next steps, building your docker image, pushing it to the registry and launching jobs.
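Before moving on, it can help to confirm that the command-line tools from the list above are reachable. The following is a minimal sketch, not part of this repository: `missing_tools` is a hypothetical helper name, and it only checks that `docker`, `kubectl` and `runai` are on your PATH, not that they are configured or logged in.

```shell
# Hypothetical helper: print the name of every listed command
# that cannot be found on PATH. Always returns 0.
missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
    return 0
}

# The three CLIs named in the requirements list.
missing=$(missing_tools docker kubectl runai)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "docker, kubectl and runai are all on PATH"
fi
```

If a tool is reported as missing, revisit the corresponding installation step before building images or submitting jobs.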
First, you must recover and save your LDAP accreditation codes. You can use the `ldap_fetch.sh` script as follows, where GASPAR is your EPFL username:

    ./ldap_fetch.sh GASPAR
    # Optional (to include wandb): ./ldap_fetch.sh GASPAR --wandb

This will store your credentials in the `~/.profile` file, and make them available at startup by sourcing it from your `.bashrc` or `.zshrc` file.
It will also define the RUNAI_OPTIONS environment variable, which will allow you to launch jobs with runai submit.
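A quick way to confirm that the variables were loaded into your current shell is sketched below. `unset_vars` is a hypothetical helper, not something `ldap_fetch.sh` provides, and only `RUNAI_OPTIONS` is checked because it is the one variable named above.

```shell
# Hypothetical helper: print the name of every listed variable
# that is unset or empty in the current shell. Always returns 0.
unset_vars() {
    local name value
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
    return 0
}

missing=$(unset_vars RUNAI_OPTIONS)
if [ -n "$missing" ]; then
    echo "Unset or empty: $missing (try opening a new shell or sourcing ~/.profile)"
else
    echo "RUNAI_OPTIONS is set"
fi
```

You can pass additional variable names to the same function if your setup relies on them.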
The base image uses a specific pytorch image for reproducibility, adds several libraries, and adds the current user.
If you want to add more template images, create a directory in the dockerfiles directory and add a Dockerfile there.
Then, make a PR.
Then, run the following line to push your image to the registry (if you only want to build the image without pushing it to the registry, omit the push).

    # Before running this command, make sure to change $GASPAR to your EPFL username, or declare it as
    # an environment variable
    ./publish.sh --path=dockerfiles/base \
        --img=NAME_OF_YOUR_IMAGE \
        --version=1 \
        --push=True

The official way to launch and interact with jobs is through the RunAI command line interface, in particular `runai submit`, whose available options are documented in the RunAI CLI reference.
You need to pass `$RUNAI_OPTIONS`, which is set in your `~/.profile` by the `ldap_fetch.sh` script.
Remark: If you're not a permanent member of LTS4 (PhD or Postdoc), verify that your `EPFL_SCRATCH_HOME` is correctly set: running `echo $EPFL_SCRATCH_HOME` should print `/mnt/lts4/scratch/students/<gaspar>`.
You can specify a fraction of the GPU to use with the `--gpus` flag:

    runai submit $RUNAI_OPTIONS \
        --name <name-job> \
        --image registry.rcp.epfl.ch/lts4-$EPFL_USER/<name-image> \
        --gpus 0.8 \
        --interactive -- sleep infinity

Suppose you want to launch the script `train.py` in the `src` directory of your scratch home folder (stored on haas) with the arguments `--arg1=1 --arg2=2`. You can use the following command:

    runai submit $RUNAI_OPTIONS \
        --name <name-job> \
        --gpus 1 \
        --image registry.rcp.epfl.ch/lts4-$EPFL_USER/<name-image> \
        --command -- /bin/bash -c 'cd $SCRATCH_HOME && python src/train.py --arg1=1 --arg2=2'

More detailed information is coming soon; take a look at the `launch.py` script for now.
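A note on the quoting in the `runai submit` command above: the script is wrapped in single quotes, so `$SCRATCH_HOME` reaches `/bin/bash -c` unexpanded and is resolved inside the job rather than by the shell you submit from. The sketch below illustrates the difference locally. The path value is a made-up placeholder, and whether the two environments actually differ for you is an assumption.

```shell
# Placeholder value standing in for the machine you submit from.
SCRATCH_HOME=/value/on/submitting/machine

# Single quotes: the variable name is passed through literally.
single='cd $SCRATCH_HOME && pwd'
# Double quotes: the local shell substitutes its own value first.
double="cd $SCRATCH_HOME && pwd"

printf 'single quotes pass: %s\n' "$single"
printf 'double quotes pass: %s\n' "$double"
```

The first line printed still contains `$SCRATCH_HOME`, while the second contains the local placeholder path, which is usually not what you want inside the container.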
To use the launch script from anywhere, you can add an alias to your `.bashrc` or `.zshrc` file.

    # Add the following line to your .bashrc or .zshrc
    # ...for bash
    echo 'alias rcplaunch="python /path/to/launch.py"' >> ~/.bashrc
    source ~/.bashrc
    # ...for zsh
    echo 'alias rcplaunch="python /path/to/launch.py"' >> ~/.zshrc
    source ~/.zshrc

Remark: If you're not a permanent member of LTS4 (PhD or Postdoc), include the flag `--student` in the command lines below.
You can specify a fraction of the GPU to use with the `--gpus` flag:

    python launch.py \
        --name=<name-job> \
        --gpus=0.8 \
        --image=registry.rcp.epfl.ch/lts4-$EPFL_USER/<name-image> \
        --interactive

    # Run a command in the job
    python launch.py \
        --name=NAME_OF_JOB \
        --gpus=1 \
        --cpus=20 \
        --image=registry.rcp.epfl.ch/lts4-$EPFL_USER/<name-image> \
        --command='cd path/to/code && python train.py --arg1=1 --arg2=2'

    # Same command with the --dry-run flag added
    python launch.py \
        --name=NAME_OF_JOB \
        --gpus=1 \
        --cpus=20 \
        --image=registry.rcp.epfl.ch/lts4-$EPFL_USER/<name-image> \
        --command='cd path/to/code && python train.py --arg1=1 --arg2=2' \
        --dry-run

The status of a job can be checked with the command `runai logs job-name`. If a run fails, runai will launch it again up to 6 times in pods with the name `job-name-0-n`. To check the logs of a specific run, you can run `runai logs job-name --pod job-name-0-n`, where `n` is the number of the pod you want to access.
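When a job has been retried, the per-pod command above has to be repeated for each pod index. The sketch below only prints those commands so you can copy the one you need. `pod_log_commands` is a hypothetical helper, `my-job` is a placeholder job name, and the last index you pass is an assumption about how many retries happened.

```shell
# Hypothetical helper: print one `runai logs` command per pod index,
# from 0 up to and including last_index. Nothing is executed.
pod_log_commands() {
    local job="$1" last_index="$2" n=0
    while [ "$n" -le "$last_index" ]; do
        printf 'runai logs %s --pod %s-0-%s\n' "$job" "$job" "$n"
        n=$((n + 1))
    done
}

# Prints the commands for pods my-job-0-0 through my-job-0-2.
pod_log_commands my-job 2
```

Printing rather than running the commands keeps the sketch safe to try on a machine where `runai` is not yet configured.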
This guide builds upon https://github.com/epfml/getting-started.