This FAQ covers common questions about the EPFL RCP cluster and this setup.
If your question isn't answered here or in the main README and architecture explainer:
- Ask colleagues: Reach out on the Slack channels `#-cluster` or `#-it`
- Report issues: Open a ticket to supportrcp@epfl.ch for technical problems
- Contribute: Add common problems and solutions to this FAQ!
TL;DR: Keep everything in `/mloscratch/homes/<your_username>`, including training data.
Explanation: The storage system has multiple types:
- mloscratch: High-performance storage that can be mounted on pods (use this for everything)
- mlodata1: Long-term replicated storage for permanent artifacts (papers, final results)
Only mloscratch can be mounted on pods, so all your code and training data must be there.
See also: File Management guide
Use the HaaS machine to transfer files between mlodata1 and mloscratch:
`ssh <gaspar_username>@haas001.rcp.epfl.ch`

See also: HaaS Machine guide
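For larger transfers, rsync is more robust than a plain copy because it can resume interrupted transfers. A sketch, assuming both volumes are mounted on the HaaS machine (the mount points below are illustrative; check the actual paths on the machine):

```shell
# Run on haas001 after logging in via ssh.
# Mount points are illustrative -- verify them on the machine first.
rsync -avh --progress \
  /mnt/mloscratch/homes/<your_username>/results/ \
  /mnt/mlodata1/<your_username>/results/
```

The trailing slashes matter to rsync: with them, the *contents* of the source directory are copied into the destination rather than the directory itself.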
Likely cause: Wrong PVC (Persistent Volume Claim) name
This can happen when a job is submitted with an outdated PVC name. For example, RCP-Prod renamed the storage PVC from `runai-mlo-$GASPAR_USERNAME-scratch` to `mlo-scratch`.
Solution:
- Check the run:ai web interface – your job may still be listed there
- Resubmit the job with the correct PVC name
Note
Jobs with wrong PVC names may end up in an unmanageable state and cannot be deleted or stopped. Resubmission is the easiest fix.
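When resubmitting, attach the renamed PVC explicitly. A hedged sketch using the run:ai CLI (the exact flag syntax varies across CLI versions, so check `runai submit --help`; the image name and mount path here are illustrative):

```shell
# Resubmit with the current PVC name attached explicitly.
# Flag syntax differs between run:ai CLI versions -- see `runai submit --help`.
runai submit my-job \
  --image my-registry/my-image:latest \
  --gpu 1 \
  --existing-pvc claimname=mlo-scratch,path=/mloscratch
```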
Possible causes:
- Cluster is busy: wait a bit longer and check the dashboard
- Incorrect resources requested:
  - Verify CPU, memory, and GPU requests are within limits
  - Check the node type is correct (e.g., don't use `G10` on the RCP cluster)
  - Use `runai describe job <name>` to see detailed status
Check cluster usage: https://portal.rcp.epfl.ch/
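To dig into why a job is stuck, the run:ai CLI can report both your jobs' states and the scheduler events behind them. A minimal sketch (job name is a placeholder):

```shell
runai list jobs            # overview of your jobs and their current states
runai describe job my-job  # scheduler events, e.g. why the job is unschedulable
```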
Explanation: VS Code opens the pod's home folder (~/) by default, not scratch.
Solution: Navigate to `/mloscratch/homes/<your_username>` after connecting.
See also: VS Code guide
Yes! See the Creating Custom Docker Images guide in the main README.
Quick steps:
- Get registry access at https://ic-registry.epfl.ch/
- Modify `docker/Dockerfile`
- Build and push with `docker/publish.sh`
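Under the hood this amounts to a standard build-and-push cycle; a rough manual equivalent is sketched below (the `<project>` segment of the image path is illustrative — use the registry project you were granted access to):

```shell
# <project> and <image> are placeholders; substitute your own.
docker login ic-registry.epfl.ch
docker build -f docker/Dockerfile -t ic-registry.epfl.ch/<project>/<image>:latest .
docker push ic-registry.epfl.ch/<project>/<image>:latest
```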
See the comprehensive reference in the main README:
For advanced users: `csub.py` wraps `runai submit` and passes most flags through 1:1. See the run:ai docs:
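A submission through the wrapper then looks like the sketch below; run `python csub.py --help` for the authoritative flag list (only `-n` and `--node_type` are shown elsewhere in this FAQ, the passthrough behaviour is as described above):

```shell
# Flags csub.py does not consume are forwarded to `runai submit` mostly unchanged.
python csub.py -n my-job --node_type h200
```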
Cause: Incorrect user/group permissions or umask settings
Solution:
- Verify UID/GID in `.env`:
  - `LDAP_UID`: your numeric user ID
  - `LDAP_GID`: 83070 (the runai-mlo group)
- Set umask:
  `echo "umask 007" >> ~/.zshrc`
  `source ~/.zshrc`
  This ensures group-writable permissions.
- Still having issues? Contact `#-it` or `#-cluster` on Slack
Background: The Hugging Face cache (`HF_HOME=/mloscratch/hf_cache`) is shared between users to avoid redundant downloads. Correct permissions are essential for shared access.
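The effect of `umask 007` is easy to check locally: with it set, new directories come out as mode 770 and new files as 660, i.e. group-writable but closed to others:

```shell
umask 007
mkdir -p demo_dir
touch demo_dir/file.txt
stat -c '%a' demo_dir           # -> 770 (rwxrwx---)
stat -c '%a' demo_dir/file.txt  # -> 660 (rw-rw----)
```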
Understanding GPU allocation: When you request 1 GPU, you get the full GPU and all its RAM. An OOM error means you're saturating the GPU's memory.
Debugging steps:
- Check memory usage:
  `nvidia-smi` (basic GPU monitoring)
  `nvtop` (interactive GPU monitoring)
- Optimize your code:
  - Reduce batch size
  - Enable gradient checkpointing
  - Use mixed precision training (fp16/bf16)
  - Free unused tensors
- Use larger GPUs:
  - Switch to A100-80GB or H200-140GB
  - Example: `python csub.py -n job --node_type h200 ...`
GPU memory by type:
- V100: 32GB
- A100-40GB: 40GB
- A100-80GB: 80GB
- H100: 80GB
- H200: 140GB
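To judge whether a model fits in the figures above, a back-of-the-envelope estimate from the parameter count helps. A sketch assuming mixed-precision Adam — fp16 weights and gradients plus fp32 master weights and two optimizer moments, roughly 16 bytes per parameter, activations excluded:

```shell
PARAMS=7000000000    # e.g. a 7B-parameter model
BYTES_PER_PARAM=16   # 2+2 bytes fp16 weights/grads + 4+4+4 bytes fp32 master/Adam moments
echo "$(( PARAMS * BYTES_PER_PARAM / 1024 ** 3 )) GiB"   # -> 104 GiB
```

Even before counting activations, a 7B model at this precision exceeds a single 80GB card — which is why the larger H200 nodes (or multi-GPU training) matter.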
- Slack: `#-cluster` or `#-it` channels
- Support: supportrcp@epfl.ch
- Main docs: README, Architecture, Workflows