[pull] master from ray-project:master by pull[bot] · Pull Request #1071 · garymm/ray

pull · 2026-06-12T19:18:17Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

…terrupts (#63663) This PR builds on #61102 and resolves this stack trace mentioned in the PR. For quick overview, this is the bug: After many workers were OOM-killed, regular task workers crashed with: Check failed: objects_valid 1 return objects expected, 1 returned. Object at idx 0 was not stored. The CHECK in `TaskReceiver::HandleTaskExecutionResult` fires when the task execution handler returns `Status::OK()` but a return object slot is still `nullptr`. The linked PR explains how the exception could be caused due to a double `ray.cancel()` on the task. Although I couldn't find double cancellation triggered on any worker when checking the failing job logs, double `ray.cancel()` could nonetheless generate this exception (repro in the added test case in the PR). So while the exact trigger of this incident might be something else that produces the same `status=OK` / `return_objects[i].second=nullptr` state, the fixes proposed in the PR are definitely good safeguards that should be added. --------- Signed-off-by: Kartica Modi <karticamodi@gmail.com> Signed-off-by: kartica.modi <kartica.modi@anyscale.com>

…ion and deduplicate label deprecation warnings (#63955) ## Motivation This PR addresses two operational issues in the KubeRay autoscaler: 1. **Log Rotation Flexibility**: Operators need to customize log rotation settings based on their deployment environment's disk space and retention policies. Hardcoded values don't work well across different environments (development vs production, small vs large clusters). 2. **Log Noise Reduction**: The repeated deprecation warning for `rayStartParams.labels` clutters logs, making it harder to identify genuine issues. Since KubeRay v1.5+ recommends using the top-level `Labels` field, the warning should inform users once without spamming. ## Implementation Details ### Files Modified - `python/ray/autoscaler/_private/kuberay/run_autoscaler.py` - `python/ray/autoscaler/_private/kuberay/autoscaling_config.py` ### Changes #### 1. Environment Variable Configuration for Log Rotation **File**: `run_autoscaler.py` Added support for `RAY_ROTATION_MAX_BYTES` and `RAY_ROTATION_BACKUP_COUNT` environment variables: ```python # Before (hardcoded) setup_component_logger( max_bytes=LOGGING_ROTATE_BYTES, backup_count=LOGGING_ROTATE_BACKUP_COUNT, ) # After (environment variable override) max_bytes = int(os.getenv("RAY_ROTATION_MAX_BYTES", LOGGING_ROTATE_BYTES)) backup_count = int(os.getenv("RAY_ROTATION_BACKUP_COUNT", LOGGING_ROTATE_BACKUP_COUNT)) setup_component_logger( max_bytes=max_bytes, backup_count=backup_count, ) ``` **Usage Example**: ```yaml # In RayCluster CR spec: headGroupSpec: rayStartParams: RAY_ROTATION_MAX_BYTES: "52428800" # 50MB RAY_ROTATION_BACKUP_COUNT: "10" # Keep 10 backups ``` #### 2. Deduplicate Label Deprecation Warning **File**: `autoscaling_config.py` Added `log_once()` to ensure the warning is printed only once: ```python # Before (repeated on every iteration) if labels_str: logger.warning(...) # After (printed once) if labels_str and log_once("raystartparams_labels_warning"): logger.warning(...) ``` The `log_once()` function from `ray.util.debug` uses an internal flag to ensure the message is logged only the first time the condition is met. ### Breaking Changes None. This is a backward-compatible enhancement: - Environment variables are optional (fallback to existing constants) - Warning behavior is reduced (from repeated to once), which is an improvement ## Verification ### Test Log Rotation Configuration 1. Deploy a RayCluster with custom environment variables: ```bash kubectl set env deployment/raycluster-kuberay-head \ RAY_ROTATION_MAX_BYTES=10485760 \ RAY_ROTATION_BACKUP_COUNT=5 ``` 2. Generate log activity and verify rotation occurs at 10MB instead of default 3. Verify only 5 backup files are retained ### Test Label Warning Deduplication 1. Deploy a RayCluster with deprecated `rayStartParams.labels`: ```yaml workerGroupSpecs: - rayStartParams: labels: "{\"app\": \"myapp\"}" ``` 2. Wait for multiple autoscaler iterations (e.g., 1 minute) 3. Check autoscaler logs: ```bash kubectl logs <head-pod> -c autoscaler | grep "Ignoring labels" ``` 4. **Expected**: Warning appears only once 5. **Previous behavior**: Warning appeared every 5 seconds ### Unit Tests No new tests added - this is a configuration enhancement that doesn't change core autoscaler logic. Existing tests should continue to pass. ## Related Issues Close #63954 --------- Signed-off-by: daiping8 <dai.ping88@zte.com.cn>

Ray Data implements data processing operators with streaming generators in a special way that yields 2 times in a single iteration: first, the block, then the block metadata. This works with today’s task-based streaming generator backpressure if the quota is 2 (`_generator_backpressure_num_objects`=2), which allows both yields to arrive at the caller without blocking. However, Ray Data is moving to actor-based backpressure, where multiple streaming generators on the same actor share the same backpressure quota. The problem is that with shared backpressure quota, one yield from a streaming generator can be blocked by another yield in another streaming generator. For example, if a caller launches 2 concurrent streaming generators on the actor at the same time, they will not be able to yield the metadata after yielding the block until the caller takes those blocks out because the quota is already drained. However, the caller can’t handle those blocks alone without their metadata. It will need another iteration to take all the metadata out. The result of this is that we will have bad caller performance since it needs more round trips to make progress on these concurrent generators. ### Solution To solve this problem, we need a way to atomically count multiple yields for backpressure. Introducing `_num_objects_per_yield`, a new private option that declares the number of ray references to unpack for each yield, similar to the current num_returns on a Ray task. ```python @ray.remote(_num_objects_per_yield =2) def generator(): for _ in range(10): yield block, meta gen = generator.remore() assert ray.get(gen._next_sync()) == block assert ray.get(gen._next_sync()) == meta ``` --------- Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

…n observability (#63807)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

karticam and others added 5 commits June 12, 2026 09:04

[train][Preemption handling 1/n] Add preemption watcher for node-drai…

2f0866c

…n observability (#63807)

[serve.llm] Add MoRIIO KV-connector backend for prefill/decode (#63951)

90c5a3a

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

pull Bot locked and limited conversation to collaborators Jun 12, 2026

pull Bot added the ⤵️ pull label Jun 12, 2026

pull Bot merged commit 90c5a3a into garymm:master Jun 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[pull] master from ray-project:master#1071

[pull] master from ray-project:master#1071
pull[bot] merged 5 commits into
garymm:masterfrom
ray-project:master

pull Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

pull Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

pull Bot commented Jun 12, 2026 •

edited

Loading