
Frequently Asked Questions (FAQ)

This FAQ covers common questions about the EPFL RCP cluster and this setup.

Getting Help

If your question isn't answered here or in the main README and architecture explainer:

  • Ask colleagues: Reach out on Slack channels #-cluster or #-it
  • Report issues: Open a ticket by emailing supportrcp@epfl.ch for technical problems
  • Contribute: Add common problems and solutions to this FAQ!

Storage and File Management

Where should I store my files and training data?

TL;DR: Keep everything in /mloscratch/homes/<your_username>, including training data.

Explanation: The cluster provides multiple storage systems:

  • mloscratch: High-performance storage that can be mounted on pods (use this for everything)
  • mlodata1: Long-term replicated storage for permanent artifacts (papers, final results)

Only mloscratch can be mounted on pods, so all your code and training data must be there.

See also: File Management guide

How do I move data onto the cluster or between storage systems?

Use the HaaS machine to transfer files between mlodata1 and mloscratch:

ssh <gaspar_username>@haas001.rcp.epfl.ch
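
Once logged in, both storage systems are mounted on the HaaS machine and you can copy between them with rsync. A minimal sketch, assuming the mount points below (they are assumptions; verify with df -h first):

    # Assumed mount points on haas001 -- verify with `df -h` before running
    rsync -avh --progress /mnt/mlodata1/<your_folder>/ \
        /mnt/mloscratch/homes/<your_username>/<your_folder>/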

See also: HaaS Machine guide


Job Management

My job doesn't show up in runai list jobs

Likely cause: Wrong PVC (Persistent Volume Claim) name

This can happen when a job is submitted with an outdated PVC name. For example, RCP-Prod renamed the scratch storage from runai-mlo-$GASPAR_USERNAME-scratch to mlo-scratch.

Solution:

  1. Check the run:ai web interface – your job may still be listed there
  2. Resubmit the job with the correct PVC name

Note

Jobs with wrong PVC names may end up in an unmanageable state and cannot be deleted or stopped. Resubmission is the easiest fix.
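
For reference, a resubmission with the correct PVC could look like this sketch (the image is a placeholder, and the exact --pvc syntax differs across run:ai CLI versions, so confirm with runai submit --help):

    runai submit my-job \
        --image <your_image> \
        --gpu 1 \
        --pvc mlo-scratch:/mloscratch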

My job has been "Pending" for a long time

Possible causes:

  1. Cluster is busy – wait a bit longer and check the dashboard

  2. Incorrect resources requested

    • Verify CPU, memory, and GPU requests are within limits
    • Check node type is correct (e.g., don't use G10 on RCP cluster)
    • Use runai describe job <name> to see detailed status

Check cluster usage: https://portal.rcp.epfl.ch/
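
Both commands below are used elsewhere in this guide; describe prints the scheduling events (e.g., insufficient resources) that explain why a job stays pending:

    runai list jobs              # overview of your jobs and their statuses
    runai describe job <name>    # detailed status and scheduling events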


VS Code

VS Code opens an empty window when connecting to my pod

Explanation: VS Code opens the pod's home folder (~/) by default, not scratch.

Solution: Navigate to /mloscratch/homes/<your_username> after connecting.

See also: VS Code guide


Docker Images

Can I create my own Docker images?

Yes! See the Creating Custom Docker Images guide in the main README.

Quick steps:

  1. Get registry access at https://ic-registry.epfl.ch/
  2. Modify docker/Dockerfile
  3. Build and push with docker/publish.sh
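
If you want to run the steps by hand instead of via docker/publish.sh, the flow is roughly this sketch (the mlo/<your_image> path under the IC registry is an assumption; substitute your own project and image name):

    docker login ic-registry.epfl.ch                # authenticate with your registry credentials
    docker build -t ic-registry.epfl.ch/mlo/<your_image>:latest -f docker/Dockerfile .
    docker push ic-registry.epfl.ch/mlo/<your_image>:latest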

csub.py

What are the available csub.py arguments?

See the comprehensive reference in the main README:

csub.py Usage and Arguments

For advanced users: csub.py wraps runai submit and passes most flags through 1:1. See the run:ai documentation for the full flag reference.
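
As a sketch, a typical submission could look like the line below; -n and --node_type appear elsewhere in this FAQ, while --command is an assumption, so run python csub.py --help for the authoritative list:

    python csub.py -n my-experiment --node_type h200 --command "python train.py"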


Permissions and Errors

I get permission errors for /mloscratch/hf_cache/...

Cause: Incorrect user/group permissions or umask settings

Solution:

  1. Verify UID/GID in .env:

    • LDAP_UID: Your numeric user ID
    • LDAP_GID: 83070 (runai-mlo group)
  2. Set umask:

    echo "umask 007" >> ~/.zshrc
    source ~/.zshrc

    This ensures group-writable permissions.

  3. Still having issues? Contact #-it or #-cluster on Slack
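
To verify that steps 1–2 took effect, two quick checks (the GID and group name come from step 1 above):

    id       # UID should match LDAP_UID; groups should include 83070 (runai-mlo)
    umask    # should print 0007 after sourcing ~/.zshrc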

Background: The Hugging Face cache (HF_HOME=/mloscratch/hf_cache) is shared between users to avoid redundant downloads. Correct permissions are essential for shared access.


GPU and Memory

I keep getting CUDA out of memory errors

Understanding GPU allocation: When you request 1 GPU, you get the full GPU and all of its memory. An out-of-memory (OOM) error means your job is exhausting that memory.

Debugging steps:

  1. Check memory usage:

    nvidia-smi  # Basic GPU monitoring
    nvtop       # Interactive GPU monitoring
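    # For continuous logging of memory use (standard nvidia-smi flags):
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5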
  2. Optimize your code:

    • Reduce batch size
    • Enable gradient checkpointing
    • Use mixed precision training (fp16/bf16)
    • Free unused tensors
  3. Use larger GPUs:

    • Switch to A100-80GB or H200-140GB
    • Example: python csub.py -n job --node_type h200 ...

GPU memory by type:

  • V100: 32GB
  • A100-40GB: 40GB
  • A100-80GB: 80GB
  • H100: 80GB
  • H200: 140GB

Still Need Help?