[pull] master from ray-project:master#1075

Merged
pull[bot] merged 5 commits into
garymm:masterfrom
ray-project:master
Jun 16, 2026
@pull pull Bot commented Jun 16, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

saschwartz and others added 5 commits June 15, 2026 15:13
## Why are these changes needed?

Custom search algorithms (OptunaSearch, HyperOpt, Ax, etc.) run trials
**sequentially** even when `max_concurrent_trials` is set, because
`_get_max_pending_trials()` hardcodes `max_pending_trials=1` for all
non-`BasicVariantGenerator` searchers.

When a user sets `max_concurrent_trials=N`, the searcher is wrapped in
`ConcurrencyLimiter(searcher, max_concurrent=N)`, but the pending trial
limit ignores this and still returns 1. This means the scheduler only
ever queues one trial at a time, regardless of available cluster
resources.

The only workaround is the undocumented env var
`TUNE_MAX_PENDING_TRIALS_PG`, which most users don't know about.

### Verified impact (Ray 2.54.1, 4 workers, 16 trials × 2s sleep, TPE sampler):

| Config | Time | Result |
|--------|------|--------|
| `max_concurrent_trials=8` (before fix) | 60.4s | sequential |
| `max_concurrent_trials=8` (after fix) | 12.0s | **parallel** |

## Changes

In `_get_max_pending_trials()`: when the search algorithm is wrapped in a
`ConcurrencyLimiter` (via `max_concurrent_trials`), extract its
`max_concurrent` value and use it as the pending trial limit. Without
`max_concurrent_trials`, behavior is unchanged (`return 1`). The
`TUNE_MAX_PENDING_TRIALS_PG` env var override continues to work.

**Total change: 3 lines of logic + 1 import.**
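
The decision described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Ray source: the three classes are stand-ins for the Tune classes of the same name, `default_for_basic` is a hypothetical placeholder for whatever limit `BasicVariantGenerator` normally gets, and the env var handling is simplified to a plain integer.

```python
import os
from dataclasses import dataclass


class Searcher:
    """Stand-in for a custom search algorithm such as OptunaSearch."""


class BasicVariantGenerator(Searcher):
    """Stand-in for Tune's default searcher."""


@dataclass
class ConcurrencyLimiter(Searcher):
    """Stand-in for the wrapper applied when max_concurrent_trials is set."""

    searcher: Searcher
    max_concurrent: int


def get_max_pending_trials(search_alg: Searcher, default_for_basic: int = 16) -> int:
    # The env var override still wins (simplified: assumed to be an integer).
    override = os.environ.get("TUNE_MAX_PENDING_TRIALS_PG")
    if override is not None:
        return int(override)

    # Hypothetical placeholder for the existing BasicVariantGenerator branch.
    if isinstance(search_alg, BasicVariantGenerator):
        return default_for_basic

    # New behavior: reuse the limiter's cap as the pending trial limit.
    if isinstance(search_alg, ConcurrencyLimiter):
        return search_alg.max_concurrent

    # Unchanged behavior for unwrapped custom searchers.
    return 1
```

Under these assumptions, a searcher wrapped with `max_concurrent=8` yields a pending limit of 8, while an unwrapped custom searcher still yields 1.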

## Related issue numbers

Fixes optuna/optuna#5611
Related: optuna/optuna#5958
Related: #48547

## Checks

- [x] I've signed all my commits with `--signoff` (DCO)
- [x] I've run `pre-commit run ruff -a` — all hooks pass
- [x] I've added tests in `test_controller_search_alg_integration.py`
- [x] Changes are backward compatible (no behavior change without
`max_concurrent_trials`)

---------

Signed-off-by: Sebastian Schwartz <sebastian.schwartz@chicagotrading.com>
Co-authored-by: Sebastian Schwartz <sebastian.schwartz@chicagotrading.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Description
The per-process used GPU memory displayed on the actor dashboard was
computed incorrectly.
<img width="3058" height="834" alt="image"
src="https://github.com/user-attachments/assets/34e87301-4f96-48c2-8145-886dbbd37c64"
/>

We took memutil from `nvmlDeviceGetProcessesUtilizationInfo`, divided it
by 100, and multiplied by the total memory on the GPU. But memutil is
actually the percentage of time the memory bus is busy, not the
percentage of memory used by the process.

Someone else ran into this same misunderstanding in the past:
https://forums.developer.nvidia.com/t/memory-metric-reported-by-nvml-and-nvdia-smi-seems-to-differ/75282

The Nvidia docs
(https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t)
define memutil as
```
Percent of time over the past sample period during which global (device) memory was being read or written.
```

The fallback path already read the used memory for the process
correctly, so I now use it for used memory and use the newer API only
for SM utilization.
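
To make the difference concrete, here is a small sketch comparing the old calculation with the corrected one. It does not call NVML: `ProcessSample` is a hypothetical stand-in whose fields loosely mirror the values discussed above, and the numbers in the usage note are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProcessSample:
    """Hypothetical per-process sample; not an NVML type."""

    pid: int
    sm_util: int          # percent of time the SMs were busy
    mem_util: int         # percent of time device memory was read or written
    used_gpu_memory: int  # bytes actually allocated by the process


def memory_from_mem_util(sample: ProcessSample, total_memory_bytes: int) -> int:
    # Old, incorrect approach: treats a bus-activity percentage
    # as if it were the fraction of memory in use.
    return sample.mem_util * total_memory_bytes // 100


def memory_from_allocation(sample: ProcessSample) -> int:
    # Corrected approach: report the allocation size directly.
    return sample.used_gpu_memory
```

For example, a process holding 8 GiB on a 16 GiB GPU with a mostly idle memory bus (`mem_util=5`) would have been reported at about 0.8 GiB by the old calculation, while the corrected one reports the full 8 GiB.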

Fixes #61456

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
…a vLLM plugin (#64067)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…64109)

Multi-GPU tests (core and rllib) on the T4 fleet fail at the first NCCL
collective with CUDA error 217 (cudaErrorPeerAccessUnsupported). NCCL's
cuMem path (default since 2.24) attempts GPU peer access during
transport setup, which is unsupported on T4 and errors out on the
current driver.

Set NCCL_CUMEM_HOST_ENABLE=0 in the shared GPU base image to force the
legacy host-buffer path, restoring multi-GPU collectives.
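
The change itself lives in the base image. As a rough per-process counterpart, the same variable can be placed in a child process's environment before NCCL initializes. The helper name below is hypothetical and only illustrates the idea.

```python
import os
from typing import Mapping, Optional


def with_nccl_cumem_host_disabled(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of `env` (default: the current environment) with
    NCCL_CUMEM_HOST_ENABLE forced to "0". Hypothetical helper."""
    merged = dict(os.environ if env is None else env)
    merged["NCCL_CUMEM_HOST_ENABLE"] = "0"
    return merged
```

The returned mapping could then be passed as `env=` to `subprocess.run` when launching a multi-GPU test, leaving the parent environment untouched.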

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…64059)

## Problem

The `download("image_url")` example in
`doc/source/data/loading-data.rst` downloads ImageNet JPEGs from the
public `ray-example-data` bucket using **ambient AWS credentials**. On
CI the PR-runner role lacks `s3:GetObject` on that bucket, so the
download is denied:

```
<Error><Code>AccessDenied</Code> ... not authorized to perform: s3:GetObject on
  arn:aws:s3:::ray-example-data/imagenet/train/n01440764/n01440764_10026.JPEG
FAILED doc/source/data/loading-data.rst::loading-data.rst - SystemExit: 15
//doc:source/data/loading-data   TIMEOUT in 3 out of 3 in 303.2s
```

This fails the `:database: data: doc tests` premerge job on **every PR
that actually re-runs that doctest** (i.e. a Bazel cache miss).
Confirmed identical across unrelated PRs — premerge builds 68131, 68134,
68140.

## Fix

Every other read on the page already uses anonymous access
(`s3://anonymous@…`), but the image URLs are baked into the metadata
parquet without it. Pass an anonymous pyarrow S3 filesystem to
`download()` so the example reads the public bucket **unsigned**,
independent of runner credentials.
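
To illustrate what "unsigned" means here, the sketch below distinguishes the two URI forms and builds the plain HTTPS URL that a credential-free read of a public object would use. It is not how the fix is implemented (the fix passes an anonymous pyarrow S3 filesystem, such as `pyarrow.fs.S3FileSystem(anonymous=True)`); both helper names are hypothetical.

```python
from urllib.parse import quote, urlsplit


def is_anonymous_s3_uri(uri: str) -> bool:
    # True for the s3://anonymous@bucket/key form used elsewhere on the page.
    parts = urlsplit(uri)
    return parts.scheme == "s3" and parts.username == "anonymous"


def unsigned_https_url(uri: str) -> str:
    # Build a virtual-hosted-style URL; fetching it sends no AWS signature,
    # so it only succeeds when the object is publicly readable.
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.hostname:
        raise ValueError(f"not an S3 URI: {uri!r}")
    key = quote(parts.path.lstrip("/"))
    return f"https://{parts.hostname}.s3.amazonaws.com/{key}"
```

A URI without the `anonymous@` prefix, like the image URLs stored in the metadata parquet, falls through to whatever credentials the runner has, which is why the CI role's missing permission surfaced as `AccessDenied`.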

Prerequisite (fixes anonymous downloads):
#64089

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@pull pull Bot locked and limited conversation to collaborators Jun 16, 2026
@pull pull Bot added the ⤵️ pull label Jun 16, 2026
@pull pull Bot merged commit 4adc6e6 into garymm:master Jun 16, 2026
5 participants