feat(executor): pass GPU device requests to spawned method containers #94
Burhanuddin98 wants to merge 2 commits into
Adds `device_requests=[DeviceRequest(count=-1, capabilities=[['gpu']])]` to `client.containers.run()` in `LocalExecutor`. The backend itself stays CPU-only; only the spawned solver method containers receive GPU passthrough.
On hosts with nvidia-container-toolkit installed, each spawned container sees the host GPU. Verified end-to-end with a GPU-compiled PFFDTD c_cuda binary: `nvidia-smi` inside the spawned container reports the device and GPU utilization tracks the solver kernel during the run.
On hosts without the toolkit, the request is silently ignored and the container runs CPU-only -- so existing CPU-only methods like pyroomacoustics keep working unchanged.
Open question for review: `count=-1` (all visible GPUs) is unconditional here. A future refinement could read a per-method `requiresGpu` flag from `methods-config.json` and use `count=1` instead, sparing CPU-only methods a CUDA context reservation on multi-GPU laptops. Kept out of this PR to minimize the change surface and let maintainers steer the design.
mberz left a comment
Thanks for your PR.
I've just tested it on macOS on ARM and get the following error when trying to boot the container:
backend-1 | [2026-05-12 10:53:34,419: ERROR/MainProcess] Failed to start Docker container: 500 Server Error for http+docker://localhost/v1.52/containers/8a02c31369a7b5430a1a5c42bfd9aab75ffb7561359f50eb638e5bd61e5df4d0/start: Internal Server Error ("could not select device driver "" with capabilities: [[gpu]]")
It seems that when the GPU capability is not available, the containers fail to start.
Do you have a solution for that?
Addresses Marco's review on PR choras-org#94 (macOS arm test):

> [2026-05-12 10:53:34,419: ERROR/MainProcess] Failed to start Docker
> container: 500 Server Error [...] ("could not select device driver ""
> with capabilities: [[gpu]]")

My original claim that the device request would be silently ignored on hosts without nvidia-container-toolkit was wrong. The Docker daemon returns a hard 500 instead. Apple Silicon, macOS, Linux without nvidia runtime, and Windows without WSL+nvidia all hit this.

Fix: try the GPU-on containers.run() first; on the specific "could not select device driver" / [[gpu]] APIError, log a warning and retry the run() without device_requests. Any other 500 (port conflict, image-pull failure, OOM, etc.) still propagates as before.

Result:
- Hosts with nvidia-container-toolkit: spawned container sees GPU (verified end-to-end with PFFDTD c_cuda binary on RTX 2060).
- Hosts without GPU: spawned container runs CPU-only, no user-visible change vs the pre-PR behaviour. Backend logs one warning per spawn.
Hi Marco, thanks for catching that on macOS arm. You were right: my "silently ignored" assumption was wrong; the daemon returns a hard 500 instead.
Pushed commit 0ace611 to the same branch. It tries the GPU request first and catches the specific `could not select device driver ... [[gpu]]` APIError, retrying CPU-only. Any other 500 (port conflict, image pull, OOM, etc.) still propagates as before.
Verified on the Linux+nvidia path that GPU passthrough still works end-to-end with the PFFDTD c_cuda binary. Ready for re-review.
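For readers following the thread, the try-GPU-then-retry logic described in this reply can be sketched roughly as below. This is a minimal illustration, not the code in commit 0ace611: the names `run_with_gpu_fallback` and `is_missing_gpu_driver` are hypothetical, the `run` callable and `error_type` are passed in so the sketch stays free of a Docker dependency, and matching on the error text assumes the exception's string form contains the daemon message quoted in the review.

```python
import logging
from typing import Any, Callable, Sequence

logger = logging.getLogger(__name__)

# Substrings copied from the daemon error quoted in the review above.
DRIVER_MARKER = "could not select device driver"
GPU_MARKER = "[[gpu]]"


def is_missing_gpu_driver(exc: BaseException) -> bool:
    """Return True if the error text looks like the 'no GPU driver' failure."""
    text = str(exc)
    return DRIVER_MARKER in text and GPU_MARKER in text


def run_with_gpu_fallback(
    run: Callable[..., Any],
    image: str,
    gpu_requests: Sequence[Any],
    error_type: type = Exception,
    **kwargs: Any,
) -> Any:
    """Try a GPU-enabled run first; retry CPU-only on the missing-driver error.

    Hypothetical helper: `run` stands in for a container-start callable and
    `error_type` for the exception class that callable raises.
    """
    try:
        return run(image, device_requests=list(gpu_requests), **kwargs)
    except error_type as exc:
        if not is_missing_gpu_driver(exc):
            raise  # port conflicts, pull failures, OOM, etc. propagate unchanged
        logger.warning("No GPU driver on this host; retrying %s CPU-only", image)
        return run(image, **kwargs)
```

With the Docker SDK for Python, `run` would presumably be `client.containers.run`, `gpu_requests` would be `[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])]`, and `error_type` would be `docker.errors.APIError`.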
Summary
Adds `device_requests=[DeviceRequest(count=-1, capabilities=[["gpu"]])]` to the `client.containers.run()` call in `LocalExecutor`. The backend itself stays CPU-only; only the spawned solver method containers receive GPU passthrough.
Motivation
The workshop methods that benefit from GPU acceleration (Hamilton's PFFDTD via the c_cuda binary, edg-acoustics' CuPy path) cannot reach the host GPU without this change: the backend dispatches via the Docker socket, and `containers.run()` defaults to no device requests.
What this PR does
One block of changes in `app/services/executors/local_executor.py`:
```python
device_requests=[
docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
],
```
Verification
End-to-end run on Windows + Docker Desktop + RTX 2060 Max-Q:
Open question for maintainers
`count=-1` (all visible GPUs) is unconditional here. A possible refinement is to read a per-method `requiresGpu` flag from `methods-config.json` and use `count=1` instead, sparing CPU-only methods a CUDA context reservation on multi-GPU laptops. Left out of this PR to keep the change minimal and let the design choice rest with the maintainers.
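One way the per-method refinement might look is sketched below. It is not part of this PR and makes several assumptions: the helper name `gpu_run_kwargs` is hypothetical, each method entry is assumed to be a mapping with an optional boolean `requiresGpu` key (the real `methods-config.json` schema is not shown here), and `make_request` defaults to `dict` so the sketch runs without the Docker SDK installed.

```python
from typing import Any, Callable, Dict, Mapping


def gpu_run_kwargs(
    method_entry: Mapping[str, Any],
    make_request: Callable[..., Any] = dict,
) -> Dict[str, Any]:
    """Build extra run() keyword arguments for one method entry.

    Assumption: `method_entry` is one parsed entry from methods-config.json
    with an optional boolean `requiresGpu` key (hypothetical schema).
    """
    if not method_entry.get("requiresGpu", False):
        # CPU-only method: add nothing, so no GPU is requested at all.
        return {}
    # GPU method: request a single GPU instead of all visible ones.
    return {"device_requests": [make_request(count=1, capabilities=[["gpu"]])]}
```

In the executor, `make_request` would presumably be `docker.types.DeviceRequest`, and the returned dictionary would be unpacked into the `client.containers.run()` call. Methods without the flag would then never trigger a device request, which would also sidestep the missing-driver error on hosts without a GPU.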
Test plan