feat(executor): pass GPU device requests to spawned method containers #94

Open: Burhanuddin98 wants to merge 2 commits into choras-org:dev from Burhanuddin98:feat/gpu-passthrough-method-containers
@Burhanuddin98

Summary

Adds `device_requests=[DeviceRequest(count=-1, capabilities=[["gpu"]])]` to the `client.containers.run()` call in `LocalExecutor`. The backend itself stays CPU-only; only the spawned solver method containers receive GPU passthrough.

Motivation

The workshop methods that benefit from GPU acceleration (Hamilton's PFFDTD via the c_cuda binary, edg-acoustics' CuPy path) cannot reach the host GPU without this: the backend dispatches them through the Docker socket, and `containers.run()` passes no device requests by default.

What this PR does

One block of changes in `app/services/executors/local_executor.py`:

```python
device_requests=[
    docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
],
```
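
For context, the fragment above is a keyword argument to `client.containers.run()`. A minimal sketch of how it might sit in a spawn helper is shown here; `spawn_method_container` and its parameters are hypothetical names for illustration, not the actual `LocalExecutor` code, and the Docker client is passed in so the sketch can be exercised with a stub.

```python
def spawn_method_container(client, image, command, device_request=None):
    """Start a method container, optionally attaching one device request.

    `client` is expected to behave like the object returned by
    docker.from_env(); `device_request` would typically be
    docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]]).
    """
    run_kwargs = {"command": command, "detach": True}
    if device_request is not None:
        # Only the spawned method container receives the GPU request;
        # the backend container itself is unaffected.
        run_kwargs["device_requests"] = [device_request]
    return client.containers.run(image, **run_kwargs)
```

With the real SDK, a call could look like `spawn_method_container(docker.from_env(), image, command, docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]]))`, where `image` and `command` come from the method being dispatched.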

Verification

End-to-end run on Windows + Docker Desktop + RTX 2060 Max-Q:

  • `nvidia-smi` inside the spawned method container correctly reports the host GPU.
  • A GPU-compiled PFFDTD c_cuda binary (Hamilton's `fdtd_main_gpu_double.x`) executes inside the spawned container; GPU utilization tracks the kernel during the run (`~40%` on the chosen mesh).
  • On a host without `nvidia-container-toolkit`, the device request is silently ignored and the container runs CPU-only — existing CPU-only methods (`pyroomacoustics`) keep working unchanged.

Open question for maintainers

`count=-1` (all visible GPUs) is unconditional here. A possible refinement is to read a per-method `requiresGpu` flag from `methods-config.json` and use `count=1` instead, sparing CPU-only methods a CUDA context reservation on multi-GPU laptops. Left out of this PR to keep the change minimal and let the design choice rest with the maintainers.
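
A rough sketch of that per-method refinement, for discussion only: the layout of `methods-config.json` is assumed here (a list of method entries, each optionally carrying a boolean `requiresGpu`), and `gpu_request_params` is a hypothetical helper. It returns keyword arguments for `docker.types.DeviceRequest` rather than the object itself, so it runs without the Docker SDK.

```python
import json

# Hypothetical config layout; the real methods-config.json may differ.
SAMPLE_CONFIG = """
[
  {"name": "pffdtd", "requiresGpu": true},
  {"name": "pyroomacoustics"}
]
"""


def gpu_request_params(method_config):
    """Return DeviceRequest keyword arguments, or None for a CPU-only method."""
    if not method_config.get("requiresGpu", False):
        return None
    # count=1 asks for a single GPU instead of all visible ones (count=-1).
    return {"count": 1, "capabilities": [["gpu"]]}


methods = {entry["name"]: entry for entry in json.loads(SAMPLE_CONFIG)}
params_by_method = {name: gpu_request_params(cfg) for name, cfg in methods.items()}
```

The executor would then pass `device_requests=[docker.types.DeviceRequest(**params)]` only when `params` is not `None`, and omit the argument entirely for CPU-only methods.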

Test plan

  • `nvidia-smi` inside spawned container reports GPU
  • PFFDTD c_cuda binary runs to completion inside spawned container
  • CPU-only method (pyroomacoustics-style) still works on the same host
  • Pending maintainer review for the count=-1 / per-method-flag design choice

Adds device_requests=[DeviceRequest(count=-1, capabilities=[['gpu']])] to client.containers.run() in LocalExecutor. The backend itself stays CPU-only; only the spawned solver method containers receive GPU passthrough.

On hosts with nvidia-container-toolkit installed, each spawned container sees the host GPU. Verified end-to-end with a GPU-compiled PFFDTD c_cuda binary: nvidia-smi inside the spawned container reports the device and GPU utilization tracks the solver kernel during the run.

On hosts without the toolkit, the request is silently ignored and the container runs CPU-only -- so existing CPU-only methods like pyroomacoustics keep working unchanged.

Open question for review: count=-1 (all visible GPUs) is unconditional here. A future refinement could read a per-method requiresGpu flag from methods-config.json and use count=1 instead, sparing CPU-only methods a CUDA context reservation on multi-GPU laptops. Kept out of this PR to minimize the change surface and let maintainers steer the design.

@mberz moved this from Backlog to Require review in CHORAS planning on May 12, 2026
@mberz self-requested a review on May 12, 2026, 08:54

@mberz left a comment

Contributor

Thanks for your PR.

I've just tested it on macOS on ARM and received the following error when trying to boot the container:

backend-1 | [2026-05-12 10:53:34,419: ERROR/MainProcess] Failed to start Docker container: 500 Server Error for http+docker://localhost/v1.52/containers/8a02c31369a7b5430a1a5c42bfd9aab75ffb7561359f50eb638e5bd61e5df4d0/start: Internal Server Error ("could not select device driver "" with capabilities: [[gpu]]")

It seems that when the GPU capability is not available, the containers fail to start.
Do you have a solution for that?

Addresses Marco's review on PR choras-org#94 (macOS arm test):

> [2026-05-12 10:53:34,419: ERROR/MainProcess] Failed to start Docker
> container: 500 Server Error [...] ("could not select device driver ""
> with capabilities: [[gpu]]")

My original claim that the device request would be silently ignored on
hosts without nvidia-container-toolkit was wrong. The Docker daemon
returns a hard 500 instead. Apple Silicon, macOS, Linux without nvidia
runtime, and Windows without WSL+nvidia all hit this.

Fix: try the GPU-on containers.run() first; on the specific
"could not select device driver" / [[gpu]] APIError, log a warning and
retry the run() without device_requests. Any other 500 (port conflict,
image-pull failure, OOM, etc.) still propagates as before.

Result:
  - Hosts with nvidia-container-toolkit: spawned container sees GPU
    (verified end-to-end with PFFDTD c_cuda binary on RTX 2060).
  - Hosts without GPU: spawned container runs CPU-only, no user-visible
    change vs the pre-PR behaviour. Backend logs one warning per spawn.
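
The retry logic described in this commit message can be sketched roughly as follows. This is not the code from the commit: `run_with_gpu_fallback` is a hypothetical name, `run` stands in for `client.containers.run`, and `error_type` stands in for `docker.errors.APIError`; both are injected so the sketch runs without the Docker SDK.

```python
import logging

logger = logging.getLogger(__name__)

# Substring of the daemon error quoted in the review above.
GPU_DRIVER_ERROR = "could not select device driver"


def run_with_gpu_fallback(run, image, gpu_request, error_type=Exception, **run_kwargs):
    """Try a GPU-enabled run first; retry CPU-only on the missing-driver error."""
    try:
        return run(image, device_requests=[gpu_request], **run_kwargs)
    except error_type as exc:
        message = str(exc)
        if GPU_DRIVER_ERROR not in message or "gpu" not in message:
            # Port conflicts, image-pull failures, OOM, etc. still propagate.
            raise
        logger.warning("No GPU device driver on this host; retrying CPU-only")
        return run(image, **run_kwargs)
```

Matching on the error text is brittle if the daemon wording changes, which is one reason a per-method opt-in flag may be worth considering alongside this fallback.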
@Burhanuddin98

Author

Hi Marco, thanks for catching that on macOS arm. You were right: my "silently ignored" assumption was wrong; the daemon returns a hard 500 instead. Pushed commit 0ace611 to the same branch. It tries the GPU request first and catches the specific `could not select device driver ... [[gpu]]` `APIError`, retrying CPU-only. Any other 500 (port conflict, image pull, OOM, etc.) still propagates as before. Verified on the Linux+nvidia path that GPU passthrough still works end-to-end with the PFFDTD c_cuda binary. Ready for re-review.

@mberz linked an issue on Jun 18, 2026 that may be closed by this pull request
Linked issue: ENH: GPU support for simulation methods
