[pull] master from ray-project:master by pull[bot] · Pull Request #1075 · garymm/ray

pull · 2026-06-16T01:18:15Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

## Why are these changes needed?

Custom search algorithms (OptunaSearch, HyperOpt, Ax, etc.) run trials **sequentially** even when `max_concurrent_trials` is set, because `_get_max_pending_trials()` hardcodes `max_pending_trials=1` for all non-`BasicVariantGenerator` searchers.

When a user sets `max_concurrent_trials=N`, the searcher is wrapped in `ConcurrencyLimiter(searcher, max_concurrent=N)`, but the pending trial limit ignores this and still returns 1. This means the scheduler only ever queues one trial at a time, regardless of available cluster resources.

The only workaround is the undocumented env var `TUNE_MAX_PENDING_TRIALS_PG`, which most users don't know about.

### Verified impact (Ray 2.54.1, 4 workers, 16 trials × 2s sleep, TPE sampler):

| Config | Time | Result |
|--------|------|--------|
| `max_concurrent_trials=8` (before fix) | 60.4s | sequential |
| `max_concurrent_trials=8` (after fix) | 12.0s | **parallel** |

## Changes

In `_get_max_pending_trials()`: when the search algorithm wraps a `ConcurrencyLimiter` (via `max_concurrent_trials`), extract and use its `max_concurrent` value as the pending trial limit. Without `max_concurrent_trials`, behavior is unchanged (`return 1`). The `TUNE_MAX_PENDING_TRIALS_PG` env var override continues to work.

**Total change: 3 lines of logic + 1 import.**

## Related issue numbers

Fixes optuna/optuna#5611
Related: optuna/optuna#5958
Related: #48547

## Checks

- [x] I've signed all my commits with `--signoff` (DCO)
- [x] I've run `pre-commit run ruff -a`; all hooks pass
- [x] I've added tests in `test_controller_search_alg_integration.py`
- [x] Changes are backward compatible (no behavior change without `max_concurrent_trials`)

---------

Signed-off-by: Sebastian Schwartz <sebastian.schwartz@chicagotrading.com>
Co-authored-by: Sebastian Schwartz <sebastian.schwartz@chicagotrading.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
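The pending-trial rule described under "Changes" can be sketched in a few lines. This is a minimal illustration, not the code from the commit: `FakeSearcher`, `FakeConcurrencyLimiter`, and `pending_trial_limit` are hypothetical stand-ins, the `BasicVariantGenerator` branch is left out, and parsing the `TUNE_MAX_PENDING_TRIALS_PG` override as a plain integer is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class FakeSearcher:
    """Hypothetical stand-in for a custom searcher such as OptunaSearch."""

    name: str = "optuna"


@dataclass
class FakeConcurrencyLimiter:
    """Hypothetical stand-in for the wrapper created by max_concurrent_trials."""

    searcher: FakeSearcher
    max_concurrent: int


def pending_trial_limit(search_alg, env: Optional[Mapping[str, str]] = None) -> int:
    """Return how many trials may be pending for a non-default searcher."""
    env = os.environ if env is None else env

    # Assumption: the env var override, when present, is a plain integer.
    override = env.get("TUNE_MAX_PENDING_TRIALS_PG")
    if override is not None:
        return int(override)

    # The fix: honor the limiter's max_concurrent instead of a hardcoded 1.
    if isinstance(search_alg, FakeConcurrencyLimiter):
        return search_alg.max_concurrent

    # No max_concurrent_trials set: behavior is unchanged.
    return 1


bare = FakeSearcher()
limited = FakeConcurrencyLimiter(bare, max_concurrent=8)
print(pending_trial_limit(bare, env={}))     # 1
print(pending_trial_limit(limited, env={}))  # 8
```

With the limiter in place the scheduler may queue up to `max_concurrent` trials, which matches the "before" and "after" rows in the table above.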

## Description

The way we got the per-process used GPU memory that's displayed on the actor dashboard was wrong.

<img width="3058" height="834" alt="image" src="https://github.com/user-attachments/assets/34e87301-4f96-48c2-8145-886dbbd37c64" />

We would use memutil from nvmlDeviceGetProcessesUtilizationInfo, then divide by 100 and multiply by the total memory on the GPU. But memutil is actually the percentage of time the memory bus is busy, not the percentage of memory used by the process...

Someone else ran into this same misunderstanding in the past - https://forums.developer.nvidia.com/t/memory-metric-reported-by-nvml-and-nvdia-smi-seems-to-differ/75282

Nvidia docs (https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t%5B/url%5D) define memutil as

```
Percent of time over the past sample period during which global (device) memory was being read or written.
```

The fallback actually got the used memory for the process correctly, so now I'm just using that to get the used memory, while I use the newer API to just get SM utilization.

Fixes #61456

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
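To make the mix-up concrete, here is a small arithmetic sketch with made-up numbers. `ProcessSample`, `misread_memory_mib`, and `reported_memory_mib` are hypothetical names and do not call NVML; the sketch only contrasts the old formula (memutil percent times total memory) with a direct per-process used-memory figure in bytes.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class ProcessSample:
    """Hypothetical per-process sample; all values here are illustrative."""

    pid: int
    mem_util_percent: int   # % of time the memory bus was busy, NOT % of memory used
    used_memory_bytes: int  # bytes of GPU memory actually held by the process


def misread_memory_mib(sample: ProcessSample, total_memory_bytes: int) -> float:
    """Old, incorrect estimate: treat bus-busy time as a share of total memory."""
    return sample.mem_util_percent / 100 * total_memory_bytes / MIB


def reported_memory_mib(sample: ProcessSample) -> float:
    """Corrected figure: use the per-process used-memory value directly."""
    return sample.used_memory_bytes / MIB


# A process holding 8 GiB on a 16 GiB device whose memory bus is mostly idle.
sample = ProcessSample(pid=1234, mem_util_percent=5, used_memory_bytes=8 * 1024 * MIB)
total = 16 * 1024 * MIB

print(round(misread_memory_mib(sample, total), 1))  # 819.2
print(reported_memory_mib(sample))                  # 8192.0
```

In this example the old formula under-reports by a factor of ten, because the two quantities are unrelated: a process can hold a lot of memory while rarely touching it.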

…a vLLM plugin (#64067)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…64109)

Multi-GPU tests (core and rllib) on the T4 fleet fail at the first NCCL collective with CUDA error 217 (cudaErrorPeerAccessUnsupported). NCCL's cuMem path (default since 2.24) attempts GPU peer access during transport setup, which is unsupported on T4 and errors out on the current driver.

Set NCCL_CUMEM_HOST_ENABLE=0 in the shared GPU base image to force the legacy host-buffer path, restoring multi-GPU collectives.

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
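The commit sets the variable in the base image. For a local reproduction outside that image, the same setting could be applied per process. A minimal sketch, assuming the test is launched as a subprocess; `with_legacy_nccl_host_path` is a hypothetical helper and not part of Ray:

```python
import os
from typing import Dict, Mapping


def with_legacy_nccl_host_path(base_env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of base_env with NCCL_CUMEM_HOST_ENABLE forced to 0."""
    env = dict(base_env)  # copy, so the caller's mapping is not mutated
    env["NCCL_CUMEM_HOST_ENABLE"] = "0"
    return env


# Build the environment a child process would receive.
child_env = with_legacy_nccl_host_path(os.environ)
print(child_env["NCCL_CUMEM_HOST_ENABLE"])  # 0
```

The returned mapping can be passed as `env=` to `subprocess.run`. The variable has to be in place before NCCL initializes in the child process.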

…64059)

## Problem

The `download("image_url")` example in `doc/source/data/loading-data.rst` downloads ImageNet JPEGs from the public `ray-example-data` bucket using **ambient AWS credentials**. On CI the PR-runner role lacks `s3:GetObject` on that bucket, so the download is denied:

```
<Error><Code>AccessDenied</Code> ... not authorized to perform: s3:GetObject on arn:aws:s3:::ray-example-data/imagenet/train/n01440764/n01440764_10026.JPEG
FAILED doc/source/data/loading-data.rst::loading-data.rst - SystemExit: 15
//doc:source/data/loading-data TIMEOUT in 3 out of 3 in 303.2s
```

This fails the `:database: data: doc tests` premerge job on **every PR that actually re-runs that doctest** (i.e. a Bazel cache miss). Confirmed identical across unrelated PRs: premerge builds 68131, 68134, 68140.

## Fix

Every other read on the page already uses anonymous access (`s3://anonymous@…`), but the image URLs are baked into the metadata parquet without it. Pass an anonymous pyarrow S3 filesystem to `download()` so the example reads the public bucket **unsigned**, independent of runner credentials.

Prerequisite (fixes anonymous downloads): #64089

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
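For reference, the `s3://anonymous@…` convention used by the other reads on the page can be illustrated with a small URI helper. This is not the approach taken in the fix, which passes a filesystem object instead. `to_anonymous_s3_uri` is a hypothetical function and uses only the standard library:

```python
from urllib.parse import urlsplit, urlunsplit


def to_anonymous_s3_uri(uri: str) -> str:
    """Insert 'anonymous@' before the bucket name of an s3:// URI."""
    parts = urlsplit(uri)
    if parts.scheme != "s3":
        raise ValueError(f"not an s3:// URI: {uri!r}")
    if parts.username is not None:
        # Already carries a userinfo component; leave it untouched.
        return uri
    return urlunsplit(parts._replace(netloc=f"anonymous@{parts.netloc}"))


signed = "s3://ray-example-data/imagenet/train/n01440764/n01440764_10026.JPEG"
print(to_anonymous_s3_uri(signed))
# s3://anonymous@ray-example-data/imagenet/train/n01440764/n01440764_10026.JPEG
```

Rewriting every URL stored in the metadata parquet would change the example's data, which is one reason a filesystem-level setting is the less invasive option here.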

saschwartz and others added 5 commits June 15, 2026 15:13

[serve.llm] MoRIIO cross-node: advertise worker node-internal IP via …

357b390

…a vLLM plugin (#64067) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

pull Bot locked and limited conversation to collaborators Jun 16, 2026

pull Bot added the ⤵️ pull label Jun 16, 2026

pull Bot merged commit 4adc6e6 into garymm:master Jun 16, 2026


pull[bot] merged 5 commits into garymm:master from ray-project:master


5 participants
