[pull] master from ray-project:master#1075
Merged
## Why are these changes needed?

Custom search algorithms (OptunaSearch, HyperOpt, Ax, etc.) run trials **sequentially** even when `max_concurrent_trials` is set, because `_get_max_pending_trials()` hardcodes `max_pending_trials=1` for all non-`BasicVariantGenerator` searchers.

When a user sets `max_concurrent_trials=N`, the searcher is wrapped in `ConcurrencyLimiter(searcher, max_concurrent=N)`, but the pending trial limit ignores this and still returns 1. This means the scheduler only ever queues one trial at a time, regardless of available cluster resources. The only workaround is the undocumented env var `TUNE_MAX_PENDING_TRIALS_PG`, which most users don't know about.

### Verified impact (Ray 2.54.1, 4 workers, 16 trials × 2s sleep, TPE sampler):

| Config | Time | Result |
|--------|------|--------|
| `max_concurrent_trials=8` (before fix) | 60.4s | sequential |
| `max_concurrent_trials=8` (after fix) | 12.0s | **parallel** |

## Changes

In `_get_max_pending_trials()`: when the search algorithm is a `ConcurrencyLimiter` wrapper (created via `max_concurrent_trials`), extract its `max_concurrent` value and use it as the pending trial limit. Without `max_concurrent_trials`, behavior is unchanged (`return 1`). The `TUNE_MAX_PENDING_TRIALS_PG` env var override continues to work.

**Total change: 3 lines of logic + 1 import.**

## Related issue numbers

Fixes optuna/optuna#5611
Related: optuna/optuna#5958
Related: #48547

## Checks

- [x] I've signed all my commits with `--signoff` (DCO)
- [x] I've run `pre-commit run ruff -a`: all hooks pass
- [x] I've added tests in `test_controller_search_alg_integration.py`
- [x] Changes are backward compatible (no behavior change without `max_concurrent_trials`)

---------

Signed-off-by: Sebastian Schwartz <sebastian.schwartz@chicagotrading.com>
Co-authored-by: Sebastian Schwartz <sebastian.schwartz@chicagotrading.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
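The pending-trial rule described in the commit message above can be sketched roughly as follows. This is a self-contained illustration, not Ray's code: the classes are hypothetical stand-ins that only borrow the Ray names, the `env` parameter is added for testability, the env var is assumed to hold an integer, and `BASIC_VARIANT_PLACEHOLDER` is an invented value because the commit message does not say what the `BasicVariantGenerator` branch returns.

```python
import os
from typing import Mapping, Optional


class Searcher:
    """Hypothetical stand-in for a Tune search algorithm (not Ray's class)."""


class BasicVariantGenerator(Searcher):
    """Hypothetical stand-in for the default grid/random searcher."""


class ConcurrencyLimiter(Searcher):
    """Hypothetical stand-in for the wrapper created by max_concurrent_trials."""

    def __init__(self, searcher: Searcher, max_concurrent: int) -> None:
        self.searcher = searcher
        self.max_concurrent = max_concurrent


# Placeholder only: the source does not state the real value for this branch.
BASIC_VARIANT_PLACEHOLDER = 16


def get_max_pending_trials(
    search_alg: Searcher, env: Optional[Mapping[str, str]] = None
) -> int:
    """Return how many trials may be pending at once (illustrative sketch)."""
    env = os.environ if env is None else env

    # The env var override keeps working and takes precedence.
    # Assumption: when set, it holds an integer.
    override = env.get("TUNE_MAX_PENDING_TRIALS_PG")
    if override is not None:
        return int(override)

    if isinstance(search_alg, BasicVariantGenerator):
        return BASIC_VARIANT_PLACEHOLDER

    # New behavior: honor the limit the user asked for.
    if isinstance(search_alg, ConcurrencyLimiter):
        return search_alg.max_concurrent

    # Unchanged behavior for custom searchers without max_concurrent_trials.
    return 1


custom = Searcher()
print(get_max_pending_trials(custom, env={}))
print(get_max_pending_trials(ConcurrencyLimiter(custom, 8), env={}))
```

With an empty environment, the bare custom searcher stays at one pending trial while the wrapped one is allowed eight, which matches the before/after rows of the table.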
## Description

The per-process used GPU memory displayed on the actor dashboard was computed incorrectly.

<img width="3058" height="834" alt="image" src="https://github.com/user-attachments/assets/34e87301-4f96-48c2-8145-886dbbd37c64" />

We would take memutil from nvmlDeviceGetProcessesUtilizationInfo, divide by 100, and multiply by the total memory on the GPU. But memutil is actually the percentage of time the memory bus is busy, not the percentage of memory used by the process.

Someone else ran into this same misunderstanding in the past - https://forums.developer.nvidia.com/t/memory-metric-reported-by-nvml-and-nvdia-smi-seems-to-differ/75282

Nvidia docs (https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t%5B/url%5D) define memutil as

```
Percent of time over the past sample period during which global (device) memory was being read or written.
```

The fallback path already got the used memory for the process correctly, so now I use that path for used memory and use the newer API only for SM utilization.

Fixes #61456

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
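To make the discrepancy described above concrete, here is a small arithmetic sketch. It does not call NVML; the function names and all sample numbers are invented for illustration, and the assumption that the per-process figure arrives as a byte count (as in NVML's `usedGpuMemory` field) is mine, since the commit message does not name the fallback call.

```python
GIB = 1024 ** 3


def used_bytes_from_mem_util(mem_util_percent: int, total_bytes: int) -> int:
    """Old, incorrect estimate: treats memory-bus busy time as a memory fraction."""
    return mem_util_percent * total_bytes // 100


def used_bytes_from_process_info(used_gpu_memory_bytes: int) -> int:
    """Corrected approach: report the per-process byte count directly."""
    return used_gpu_memory_bytes


# Invented sample: a process holding 10 GiB on a 16 GiB GPU whose memory bus
# was busy only 5% of the sample period.
total_bytes = 16 * GIB
mem_util_percent = 5
actual_used_bytes = 10 * GIB

wrong = used_bytes_from_mem_util(mem_util_percent, total_bytes)
right = used_bytes_from_process_info(actual_used_bytes)

print(f"estimate from memutil: {wrong / GIB:.2f} GiB")
print(f"per-process value:     {right / GIB:.2f} GiB")
```

A mostly idle process with a large allocation shows up as nearly empty under the old formula, which is the kind of mismatch the dashboard displayed.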
…a vLLM plugin (#64067)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…64109)

Multi-GPU tests (core and rllib) on the T4 fleet fail at the first NCCL collective with CUDA error 217 (cudaErrorPeerAccessUnsupported). NCCL's cuMem path (default since 2.24) attempts GPU peer access during transport setup, which is unsupported on T4 and errors out on the current driver.

Set NCCL_CUMEM_HOST_ENABLE=0 in the shared GPU base image to force the legacy host-buffer path, restoring multi-GPU collectives.

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
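The commit above applies the setting in the shared GPU base image. For a single job, a per-process equivalent might look like the sketch below. The helper name `disable_nccl_cumem_host` is hypothetical, and the assumption is that the variable is in place before NCCL initializes in the process that reads the environment.

```python
import os
from typing import MutableMapping


def disable_nccl_cumem_host(env: MutableMapping[str, str]) -> None:
    """Set the variable named in the commit so NCCL uses the host-buffer path."""
    env["NCCL_CUMEM_HOST_ENABLE"] = "0"


# Build an environment for a child process without touching the current one.
child_env = dict(os.environ)
disable_nccl_cumem_host(child_env)
print(child_env["NCCL_CUMEM_HOST_ENABLE"])
```

The resulting mapping can be passed as `env=child_env` to `subprocess.run`, or the helper can be applied to `os.environ` itself at the very start of a training script.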
…64059)

## Problem

The `download("image_url")` example in `doc/source/data/loading-data.rst` downloads ImageNet JPEGs from the public `ray-example-data` bucket using **ambient AWS credentials**. On CI the PR-runner role lacks `s3:GetObject` on that bucket, so the download is denied:

```
<Error><Code>AccessDenied</Code> ... not authorized to perform: s3:GetObject on arn:aws:s3:::ray-example-data/imagenet/train/n01440764/n01440764_10026.JPEG
FAILED doc/source/data/loading-data.rst::loading-data.rst - SystemExit: 15
//doc:source/data/loading-data TIMEOUT in 3 out of 3 in 303.2s
```

This fails the `:database: data: doc tests` premerge job on **every PR that actually re-runs that doctest** (i.e. a Bazel cache miss). Confirmed identical across unrelated PRs: premerge builds 68131, 68134, 68140.

## Fix

Every other read on the page already uses anonymous access (`s3://anonymous@…`), but the image URLs are baked into the metadata parquet without it. Pass an anonymous pyarrow S3 filesystem to `download()` so the example reads the public bucket **unsigned**, independent of runner credentials.

Prerequisite (fixes anonymous downloads): #64089

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)