[pull] master from ray-project:master#1071
Merged
…terrupts (#63663)

This PR builds on #61102 and resolves the stack trace mentioned in that PR. For a quick overview, this is the bug: after many workers were OOM-killed, regular task workers crashed with:

`Check failed: objects_valid 1 return objects expected, 1 returned. Object at idx 0 was not stored.`

The CHECK in `TaskReceiver::HandleTaskExecutionResult` fires when the task execution handler returns `Status::OK()` but a return object slot is still `nullptr`. The linked PR explains how this failure can be caused by a double `ray.cancel()` on the task. I couldn't find double cancellation triggered on any worker when checking the failing job logs, but a double `ray.cancel()` can nonetheless produce this failure (repro in the test case added in the PR). So while the exact trigger of this incident might be something else that produces the same `status=OK` / `return_objects[i].second=nullptr` state, the fixes proposed in the PR are good safeguards that should be added.

---------

Signed-off-by: Kartica Modi <karticamodi@gmail.com>
Signed-off-by: kartica.modi <kartica.modi@anyscale.com>
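The invariant behind that CHECK can be modelled in a few lines of Python. This is only an illustrative sketch, not Ray's C++ code: `check_return_objects` and the `(object_id, value)` tuples are hypothetical names chosen here to mirror the `status=OK` / `return_objects[i].second=nullptr` state described above.

```python
def check_return_objects(status_ok, return_objects):
    """Toy model of the invariant: an OK status implies every return slot is filled.

    `return_objects` is a list of (object_id, value) pairs, where a value of
    None plays the role of a nullptr slot. Hypothetical helper, not a Ray API.
    """
    if not status_ok:
        # Error paths report the failure through the status instead.
        return
    for idx, (_object_id, value) in enumerate(return_objects):
        if value is None:
            raise AssertionError(
                f"{len(return_objects)} return objects expected, "
                f"{len(return_objects)} returned. "
                f"Object at idx {idx} was not stored."
            )


def fails(status_ok, return_objects):
    """Return True if the invariant check rejects this state."""
    try:
        check_return_objects(status_ok, return_objects)
    except AssertionError:
        return True
    return False
```

In this model, a state such as `fails(True, [("obj-0", None)])` is rejected, while the same empty slot paired with a non-OK status passes the check, because the error is already carried by the status.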
…ion and deduplicate label deprecation warnings (#63955)

## Motivation

This PR addresses two operational issues in the KubeRay autoscaler:

1. **Log Rotation Flexibility**: Operators need to customize log rotation settings based on their deployment environment's disk space and retention policies. Hardcoded values don't work well across different environments (development vs production, small vs large clusters).
2. **Log Noise Reduction**: The repeated deprecation warning for `rayStartParams.labels` clutters logs, making it harder to identify genuine issues. Since KubeRay v1.5+ recommends using the top-level `Labels` field, the warning should inform users once without spamming.

## Implementation Details

### Files Modified

- `python/ray/autoscaler/_private/kuberay/run_autoscaler.py`
- `python/ray/autoscaler/_private/kuberay/autoscaling_config.py`

### Changes

#### 1. Environment Variable Configuration for Log Rotation

**File**: `run_autoscaler.py`

Added support for `RAY_ROTATION_MAX_BYTES` and `RAY_ROTATION_BACKUP_COUNT` environment variables:

```python
# Before (hardcoded)
setup_component_logger(
    max_bytes=LOGGING_ROTATE_BYTES,
    backup_count=LOGGING_ROTATE_BACKUP_COUNT,
)

# After (environment variable override)
max_bytes = int(os.getenv("RAY_ROTATION_MAX_BYTES", LOGGING_ROTATE_BYTES))
backup_count = int(os.getenv("RAY_ROTATION_BACKUP_COUNT", LOGGING_ROTATE_BACKUP_COUNT))
setup_component_logger(
    max_bytes=max_bytes,
    backup_count=backup_count,
)
```

**Usage Example**:

```yaml
# In RayCluster CR
spec:
  headGroupSpec:
    rayStartParams:
      RAY_ROTATION_MAX_BYTES: "52428800"  # 50MB
      RAY_ROTATION_BACKUP_COUNT: "10"     # Keep 10 backups
```

#### 2. Deduplicate Label Deprecation Warning

**File**: `autoscaling_config.py`

Added `log_once()` to ensure the warning is printed only once:

```python
# Before (repeated on every iteration)
if labels_str:
    logger.warning(...)

# After (printed once)
if labels_str and log_once("raystartparams_labels_warning"):
    logger.warning(...)
```

The `log_once()` function from `ray.util.debug` uses an internal flag to ensure the message is logged only the first time the condition is met.

### Breaking Changes

None. This is a backward-compatible enhancement:

- Environment variables are optional (fallback to existing constants)
- Warning behavior is reduced (from repeated to once), which is an improvement

## Verification

### Test Log Rotation Configuration

1. Deploy a RayCluster with custom environment variables:
   ```bash
   kubectl set env deployment/raycluster-kuberay-head \
     RAY_ROTATION_MAX_BYTES=10485760 \
     RAY_ROTATION_BACKUP_COUNT=5
   ```
2. Generate log activity and verify rotation occurs at 10MB instead of default
3. Verify only 5 backup files are retained

### Test Label Warning Deduplication

1. Deploy a RayCluster with deprecated `rayStartParams.labels`:
   ```yaml
   workerGroupSpecs:
   - rayStartParams:
       labels: "{\"app\": \"myapp\"}"
   ```
2. Wait for multiple autoscaler iterations (e.g., 1 minute)
3. Check autoscaler logs:
   ```bash
   kubectl logs <head-pod> -c autoscaler | grep "Ignoring labels"
   ```
4. **Expected**: Warning appears only once
5. **Previous behavior**: Warning appeared every 5 seconds

### Unit Tests

No new tests added - this is a configuration enhancement that doesn't change core autoscaler logic. Existing tests should continue to pass.

## Related Issues

Close #63954

---------

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
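The two behaviours described in #63955 (environment variable override with a fallback, and a warning emitted only once) can be sketched without a Ray installation. This is a minimal sketch under stated assumptions: `rotation_settings` and `warn_about_labels` are hypothetical helpers, the default constants are placeholder values, the warning text is illustrative, and the local `log_once` is a simplified stand-in for `ray.util.debug.log_once`.

```python
import logging
import os

logger = logging.getLogger("autoscaler_sketch")

# Placeholder defaults for illustration; the real constants are defined in Ray.
LOGGING_ROTATE_BYTES = 512 * 1024 * 1024
LOGGING_ROTATE_BACKUP_COUNT = 5

_logged_keys = set()


def log_once(key):
    """Return True only the first time `key` is seen (simplified stand-in)."""
    if key in _logged_keys:
        return False
    _logged_keys.add(key)
    return True


def rotation_settings(environ=os.environ):
    """Read rotation settings from the environment, falling back to defaults."""
    max_bytes = int(environ.get("RAY_ROTATION_MAX_BYTES", LOGGING_ROTATE_BYTES))
    backup_count = int(
        environ.get("RAY_ROTATION_BACKUP_COUNT", LOGGING_ROTATE_BACKUP_COUNT)
    )
    return max_bytes, backup_count


def warn_about_labels(labels_str):
    """Warn about deprecated labels at most once; return True if a warning was emitted."""
    if labels_str and log_once("raystartparams_labels_warning"):
        logger.warning("Ignoring labels in rayStartParams (illustrative message)")
        return True
    return False
```

Passing a plain dict as `environ` makes the fallback easy to exercise: an empty dict yields the defaults, while `{"RAY_ROTATION_MAX_BYTES": "10485760", "RAY_ROTATION_BACKUP_COUNT": "5"}` yields `(10485760, 5)`.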
Ray Data implements data processing operators with streaming generators
in a special way that yields twice in a single iteration: first the
block, then the block metadata. This works with today’s task-based
streaming generator backpressure if the quota is 2
(`_generator_backpressure_num_objects=2`), which allows both yields to
arrive at the caller without blocking.
However, Ray Data is moving to actor-based backpressure, where multiple
streaming generators on the same actor share the same backpressure
quota. With a shared quota, a yield from one streaming generator can be
blocked by a yield from another. For example, if a caller launches 2
concurrent streaming generators on the actor at the same time, each
yields its block and drains the quota, so neither can yield its
metadata until the caller takes those blocks out. However, the caller
can’t process those blocks without their metadata, so it needs another
iteration to take all the metadata out. The result is poor caller
performance, since more round trips are needed to make progress on
these concurrent generators.
### Solution
To solve this problem, we need a way to atomically count multiple yields
for backpressure. This PR introduces `_num_objects_per_yield`, a new
private option that declares the number of Ray references to unpack for
each yield, similar to the current `num_returns` on a Ray task.
```python
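import ray

# `block` and `meta` stand in for a Ray Data block and its metadata.
@ray.remote(_num_objects_per_yield=2)
def generator():
    for _ in range(10):
        yield block, meta

gen = generator.remote()
# Each yield is unpacked into two references: the block, then its metadata.
assert ray.get(gen._next_sync()) == block
assert ray.get(gen._next_sync()) == meta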
```
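To see why atomic counting helps, the shared quota can be simulated in plain Python. This is a toy model and not Ray's backpressure implementation: `drain_once` and `complete_pairs` are hypothetical helpers, and the round-robin scheduling is an assumption made for illustration.

```python
from collections import deque


def drain_once(num_generators, quota, atomic):
    """Return the objects delivered before the shared quota blocks every generator."""
    # Each generator wants to emit a block followed by its metadata.
    pending = [deque([("block", g), ("meta", g)]) for g in range(num_generators)]
    delivered = []
    progress = True
    while progress:
        progress = False
        for queue in pending:  # round-robin over the generators
            if not queue:
                continue
            # Atomic mode reserves room for the whole (block, meta) pair at once.
            need = len(queue) if atomic else 1
            if quota - len(delivered) >= need:
                for _ in range(need):
                    delivered.append(queue.popleft())
                progress = True
    return delivered


def complete_pairs(delivered):
    """Count generators whose block and metadata both reached the caller."""
    blocks = {g for kind, g in delivered if kind == "block"}
    metas = {g for kind, g in delivered if kind == "meta"}
    return len(blocks & metas)
```

With 2 generators and a quota of 2, per-object counting delivers two blocks and no metadata, so the caller holds zero usable pairs. Counting each yield atomically delivers one complete block and metadata pair, so the caller can make progress in the same round trip.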
---------
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
…n observability (#63807)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)