Skip to content

[pull] master from ray-project:master#1071

Merged
pull[bot] merged 5 commits into
garymm:masterfrom
ray-project:master
Jun 12, 2026
Merged

[pull] master from ray-project:master#1071
pull[bot] merged 5 commits into
garymm:masterfrom
ray-project:master

Conversation

@pull

@pull pull Bot commented Jun 12, 2026

Copy link
Copy Markdown

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

karticam and others added 5 commits June 12, 2026 09:04
…terrupts (#63663)

This PR builds on #61102 and
resolves this stack trace mentioned in the PR.

For quick overview, this is the bug: 
After many workers were OOM-killed, regular task workers crashed with:
Check failed: objects_valid 1 return objects expected, 1 returned.
Object at idx 0 was not stored.

The CHECK in `TaskReceiver::HandleTaskExecutionResult` fires when the
task execution handler returns `Status::OK()` but a return object slot
is still `nullptr`.

The linked PR explains how the exception could be caused due to a double
`ray.cancel()` on the task. Although I couldn't find double cancellation
triggered on any worker when checking the failing job logs, double
`ray.cancel()` could nonetheless generate this exception (repro in the
added test case in the PR). So while the exact trigger of this incident
might be something else that produces the same `status=OK` /
`return_objects[i].second=nullptr` state, the fixes proposed in the PR
are definitely good safeguards that should be added.

---------

Signed-off-by: Kartica Modi <karticamodi@gmail.com>
Signed-off-by: kartica.modi <kartica.modi@anyscale.com>
…ion and deduplicate label deprecation warnings (#63955)

## Motivation

This PR addresses two operational issues in the KubeRay autoscaler:

1. **Log Rotation Flexibility**: Operators need to customize log
rotation settings based on their deployment environment's disk space and
retention policies. Hardcoded values don't work well across different
environments (development vs production, small vs large clusters).

2. **Log Noise Reduction**: The repeated deprecation warning for
`rayStartParams.labels` clutters logs, making it harder to identify
genuine issues. Since KubeRay v1.5+ recommends using the top-level
`Labels` field, the warning should inform users once without spamming.

## Implementation Details

### Files Modified
- `python/ray/autoscaler/_private/kuberay/run_autoscaler.py`
- `python/ray/autoscaler/_private/kuberay/autoscaling_config.py`

### Changes

#### 1. Environment Variable Configuration for Log Rotation
**File**: `run_autoscaler.py`

Added support for `RAY_ROTATION_MAX_BYTES` and
`RAY_ROTATION_BACKUP_COUNT` environment variables:

```python
# Before (hardcoded)
setup_component_logger(
    max_bytes=LOGGING_ROTATE_BYTES,
    backup_count=LOGGING_ROTATE_BACKUP_COUNT,
)

# After (environment variable override)
max_bytes = int(os.getenv("RAY_ROTATION_MAX_BYTES", LOGGING_ROTATE_BYTES))
backup_count = int(os.getenv("RAY_ROTATION_BACKUP_COUNT", LOGGING_ROTATE_BACKUP_COUNT))
setup_component_logger(
    max_bytes=max_bytes,
    backup_count=backup_count,
)
```

**Usage Example**:
```yaml
# In RayCluster CR
spec:
  headGroupSpec:
    rayStartParams:
      RAY_ROTATION_MAX_BYTES: "52428800"  # 50MB
      RAY_ROTATION_BACKUP_COUNT: "10"     # Keep 10 backups
```

#### 2. Deduplicate Label Deprecation Warning
**File**: `autoscaling_config.py`

Added `log_once()` to ensure the warning is printed only once:

```python
# Before (repeated on every iteration)
if labels_str:
    logger.warning(...)

# After (printed once)
if labels_str and log_once("raystartparams_labels_warning"):
    logger.warning(...)
```

The `log_once()` function from `ray.util.debug` uses an internal flag to
ensure the message is logged only the first time the condition is met.

### Breaking Changes
None. This is a backward-compatible enhancement:
- Environment variables are optional (fallback to existing constants)
- Warning behavior is reduced (from repeated to once), which is an
improvement

## Verification

### Test Log Rotation Configuration
1. Deploy a RayCluster with custom environment variables:
   ```bash
   kubectl set env deployment/raycluster-kuberay-head \
     RAY_ROTATION_MAX_BYTES=10485760 \
     RAY_ROTATION_BACKUP_COUNT=5
   ```
2. Generate log activity and verify rotation occurs at 10MB instead of
default
3. Verify only 5 backup files are retained

### Test Label Warning Deduplication
1. Deploy a RayCluster with deprecated `rayStartParams.labels`:
   ```yaml
   workerGroupSpecs:
   - rayStartParams:
       labels: "{\"app\": \"myapp\"}"
   ```
2. Wait for multiple autoscaler iterations (e.g., 1 minute)
3. Check autoscaler logs:
   ```bash
   kubectl logs <head-pod> -c autoscaler | grep "Ignoring labels"
   ```
4. **Expected**: Warning appears only once
5. **Previous behavior**: Warning appeared every 5 seconds

### Unit Tests
No new tests added - this is a configuration enhancement that doesn't
change core autoscaler logic. Existing tests should continue to pass.

## Related Issues
Close #63954

---------

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Ray Data implements data processing operators with streaming generators
in a special way that yields 2 times in a single iteration: first, the
block, then the block metadata. This works with today’s task-based
streaming generator backpressure if the quota is 2
(`_generator_backpressure_num_objects`=2), which allows both yields to
arrive at the caller without blocking.

However, Ray Data is moving to actor-based backpressure, where multiple
streaming generators on the same actor share the same backpressure
quota. The problem is that with shared backpressure quota, one yield
from a streaming generator can be blocked by another yield in another
streaming generator. For example, if a caller launches 2 concurrent
streaming generators on the actor at the same time, they will not be
able to yield the metadata after yielding the block until the caller
takes those blocks out because the quota is already drained. However,
the caller can’t handle those blocks alone without their metadata. It
will need another iteration to take all the metadata out. The result of
this is that we will have bad caller performance since it needs more
round trips to make progress on these concurrent generators.

### Solution
To solve this problem, we need a way to atomically count multiple yields
for backpressure. Introducing `_num_objects_per_yield`, a new private
option that declares the number of ray references to unpack for each
yield, similar to the current num_returns on a Ray task.

```python
@ray.remote(_num_objects_per_yield =2)
def generator():
    for _ in range(10):
        yield block, meta

gen = generator.remore()
assert ray.get(gen._next_sync()) == block
assert ray.get(gen._next_sync()) == meta
```

---------

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@pull pull Bot locked and limited conversation to collaborators Jun 12, 2026
@pull pull Bot added the ⤵️ pull label Jun 12, 2026
@pull pull Bot merged commit 90c5a3a into garymm:master Jun 12, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants