[pull] master from ray-project:master by pull[bot] · Pull Request #1078 · garymm/ray

pull · 2026-06-16T19:18:20Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

## Summary

- Treat only `None` as an unspecified resource isolation override.
- Reject explicit zero values for `system_reserved_cpu` and `system_reserved_memory` instead of silently using defaults.
- Add regression coverage for disabled and enabled resource isolation paths.

## Test plan

- `python3 -m py_compile python/ray/_private/resource_isolation_config.py python/ray/tests/resource_isolation/test_resource_isolation_config.py`
- `bazel test //python/ray/tests/resource_isolation:test_resource_isolation_config` (attempted locally, blocked by sandboxed download of `rules_python` from GitHub)

Made with [Cursor](https://cursor.com)

Signed-off-by: enginarslan1 <heyengin@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
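The `None`-versus-zero distinction in the summary above can be sketched as follows. This is a minimal illustration, not the code in `resource_isolation_config.py`: the helper names `resolve_override` and `truthiness_fallback` and the default values are hypothetical, and the truthiness check is only an assumption about how a zero could end up silently replaced by a default.

```python
from typing import Optional


def truthiness_fallback(value: Optional[float], default: float) -> float:
    """Hypothetical pitfall: `or` treats an explicit 0 the same as None."""
    return value or default


def resolve_override(name: str, value: Optional[float], default: float) -> float:
    """Hypothetical helper: only None means "unspecified"."""
    if value is None:
        # Nothing was passed, so fall back to the default.
        return default
    if value == 0:
        # An explicit zero is rejected instead of being replaced silently.
        raise ValueError(f"{name} must be non-zero when specified, got {value!r}")
    return value
```

With this shape, `resolve_override("system_reserved_cpu", None, 1.0)` falls back to the default, while passing `0` raises instead of quietly returning `1.0`.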

…ica gRPC server stop (#64022)

## Why are these changes needed?

When a replica shuts down gracefully, it stops its inter-deployment gRPC server with `server.stop(grace)` (#63995). A handle-path dispatch whose stream lands in the stop window (bytes already on the wire when `stop()` fires, GOAWAY not yet processed by the router's channel) is cancelled by the server core **before the handler ever runs**. The router sees:

```
replica_result.py:449 await self._call.wait_for_connection()
grpc.aio.AioRpcError: StatusCode.CANCELLED ("received from peer")
```

`get_rejection_response()` already converts `UNAVAILABLE` into `ActorUnavailableError` precisely so the router retries such dispatches on another replica (the rejection protocol guarantees the request never started executing when no `accepted` initial metadata was received). `CANCELLED` falls past that branch and propagates as a raw `AioRpcError`, which surfaces as an application 500 to the client.

Observed in sustained load testing (~18K RPS) with graceful autoscale downscales: 3 requests out of 16.7M failed with exactly this traceback, all dispatched to replicas at the instant they stopped their gRPC server. The requests provably never executed (no `accepted` metadata; CANCELLED at connection establishment), so they were safe to retry, and the router already has the machinery to do so (`ActorUnavailableError` → invalidate cache → re-route).

## What does this change do?

Extends the existing pre-accept error mapping in `gRPCReplicaResult.get_rejection_response()` to treat peer-originated `CANCELLED` the same as `UNAVAILABLE`:

- Only applies when **no `accepted` initial metadata** was received (checked above in the same except block), i.e., the replica never accepted the request, so retrying cannot double-execute.
- A unary call that fails mid-execution carries `accepted=1` metadata in the error and continues to be returned as accepted (NOT retried). This is unchanged and covered by a new test.
- A locally-cancelled call surfaces as `asyncio.CancelledError` (handled by the earlier except branch), not `AioRpcError`, so this branch only sees peer-originated cancellation.
- No router changes: `_route_and_send_request_once` already catches `ActorUnavailableError`, invalidates the queue-len cache entry, and retries routing.

## Related issue number

Follow-up to #63995.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: harshit <harshit@anyscale.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
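The decision order described in the bullets above (check for `accepted` metadata first, then map pre-accept `UNAVAILABLE` or `CANCELLED` to a retry) can be sketched without a gRPC dependency. Everything here is a stand-in: `StatusCode`, `RpcError`, and `classify_rpc_error` are hypothetical names for illustration and are not the Ray Serve or `grpc` APIs.

```python
from enum import Enum, auto
from typing import Dict, Optional


class StatusCode(Enum):
    """Stand-in for grpc.StatusCode so the sketch runs without grpcio."""
    UNAVAILABLE = auto()
    CANCELLED = auto()
    INTERNAL = auto()


class RpcError(Exception):
    """Stand-in for an RPC error that may carry initial metadata."""

    def __init__(self, code: StatusCode, initial_metadata: Optional[Dict[str, str]] = None):
        super().__init__(code.name)
        self.code = code
        self.initial_metadata = initial_metadata or {}


# Codes that are safe to retry only when the replica never accepted the request.
RETRYABLE_BEFORE_ACCEPT = {StatusCode.UNAVAILABLE, StatusCode.CANCELLED}


def classify_rpc_error(err: RpcError) -> str:
    """Return "accepted", "retry", or "propagate" for a failed dispatch."""
    if err.initial_metadata.get("accepted") == "1":
        # The request started executing; retrying could double-execute it.
        return "accepted"
    if err.code in RETRYABLE_BEFORE_ACCEPT:
        # Never accepted: the caller can route to another replica.
        return "retry"
    return "propagate"
```

The ordering matters: a `CANCELLED` error that carries `accepted=1` is still classified as accepted, which mirrors the "NOT retried" case in the list above.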

… ingress mode (#64123)

Before, the receive task would be left on the queue because it was never canceled in the request timeout path. This PR fixes that and adds a test for it.

---------

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
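The general pattern behind this kind of fix, cancelling a pending receive task on every exit path including timeout, can be sketched with plain `asyncio`. This is an assumption-laden illustration: `run_request`, its parameters, and the use of `asyncio.Queue` are hypothetical and do not reflect the actual Ray Serve code.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional, Tuple


async def run_request(
    queue: "asyncio.Queue[Any]",
    handler: Callable[[], Awaitable[Any]],
    timeout_s: float,
) -> Tuple[Optional[Any], "asyncio.Future[Any]"]:
    """Run `handler` with a timeout while a receive task waits on `queue`."""
    # Hypothetical receive task that waits for the next incoming message.
    receive_task = asyncio.ensure_future(queue.get())
    try:
        result = await asyncio.wait_for(handler(), timeout_s)
        return result, receive_task
    except asyncio.TimeoutError:
        # Timeout path: without the cleanup below, the task would stay pending.
        return None, receive_task
    finally:
        # Cancel on every exit path and wait for the cancellation to settle.
        receive_task.cancel()
        await asyncio.gather(receive_task, return_exceptions=True)
```

Putting the cancellation in `finally` rather than only in the success branch is the design point: the timeout path then cannot leave a waiter registered on the queue.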

…3830)

## Description

`SpilledObjectReader::ReadFromDataSection` / `ReadFromMetadataSection` serve a spilled object to a remote node (`ObjectManager::PushFromFilesystem`, on the multi-threaded `rpc_service_` pool). The old code read each chunk **one byte at a time** via `std::istreambuf_iterator` + `push_back`. For the default 5 MB chunk that's ~5M iterations through the streambuf interface per chunk. This is pure CPU waste, and it is worst exactly when the node is already under memory pressure (spilling).

## What changed

`AppendFileSection` now:

1. Reads the chunk in a **single bulk `ifstream::read`** instead of the per-byte `istreambuf_iterator` loop.
2. Grows the output buffer with `absl::strings_internal::STLStringResizeUninitialized` instead of `std::string::resize`, so the bytes that `read` immediately overwrites aren't first zero-filled (skips a redundant full memory pass).

Behavior is otherwise unchanged: it appends to the tail of `output` (data then metadata for a straddling chunk), reopens the file per call (so it stays thread-safe and retry-friendly), and returns `false` on a short/failed read.

## Benchmark

Microbenchmark replicating `ChunkObjectReader::GetChunk` (5 MB chunks, t3a.xlarge / 4 vCPU), one object, page-cache **warm** (the common spill→pull case):

| read strategy | 1 thread | 4 threads |
|----------------------------------------|-------------------|------------------|
| baseline (byte-by-byte) | 0.16 GB/s | 0.56 GB/s |
| **this PR** (bulk + uninitialized buf) | **5.86 GB/s (36x)** | **8.29 GB/s (15x)** |

- **Cold cache**: ~2x single-thread; disk-bound and converges across strategies at higher concurrency, but no regression.
- Scale-invariant (re-checked at 8.6 GB: same GB/s).
- Also evaluated **cached-fd + `pread`** (per an earlier review suggestion): perf-neutral here (+0–2%, within noise; an `open()` is negligible next to a 5 MB read), so I kept the simpler per-call `ifstream`. The win is the bulk read + skipping the zero-fill.

## Related issues

> Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

## Additional information

> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

---------

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: Yuanzhuo Yang <89326662+ShockYoungCHN@users.noreply.github.com>
Co-authored-by: Yuanzhuo Yang <89326662+ShockYoungCHN@users.noreply.github.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
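As a rough analogue of the change described above, the following Python sketch contrasts a per-byte read loop with a single bulk read. It is not the C++ `AppendFileSection` code: the function names and signatures are hypothetical, Python has no direct equivalent of the uninitialized-resize step, and leaving `output` untouched on a short read is a choice made for this sketch rather than something the PR states.

```python
def append_section_bytewise(path: str, offset: int, size: int, output: bytearray) -> bool:
    """Baseline analogue: one read call per byte."""
    staged = bytearray()
    with open(path, "rb") as f:  # reopen per call, as the PR describes
        f.seek(offset)
        for _ in range(size):
            byte = f.read(1)
            if not byte:
                return False  # short read: fewer than `size` bytes available
            staged += byte
    output += staged
    return True


def append_section_bulk(path: str, offset: int, size: int, output: bytearray) -> bool:
    """Bulk analogue: one read call for the whole section."""
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read(size)
    if len(chunk) != size:
        return False  # short read: report failure, leave `output` untouched
    output += chunk  # append to the tail of the existing buffer
    return True
```

Both functions produce the same bytes; the difference is the number of calls through the file-object interface, which is the overhead the PR removes on the C++ side.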

## Description

> Briefly describe what this PR accomplishes and why it's needed.

This PR adds support for capturing JAX execution profiles directly from the Ray Dashboard. This is particularly useful for debugging performance issues in JAX-based workloads running on TPU clusters.

To maintain consistency, this implementation follows the existing pattern established by the GPU profiler ([GpuProfilingManager](https://github.com/ray-project/ray/blob/master/python/ray/dashboard/modules/reporter/gpu_profile_manager.py#L20)) in the Ray dashboard.

Changes:

1. `reporter_head.py`: Added a new HTTP endpoint `/worker/jax_profile` to the dashboard head. This endpoint receives profiling requests and routes them via gRPC to the appropriate `ReporterAgent` on the target node.
2. `reporter_agent.py`: Added handling for the `JaxProfiling` gRPC request and instantiated the `JaxProfilingManager`.
3. `reporter.proto`: Defined `JaxProfilingRequest` and `JaxProfilingReply` messages and added the `JaxProfiling` RPC method to the `ReporterService`.
4. `jax_profile_manager.py`: Created this new manager to handle the actual profiling interaction. It uses `tensorflow.python.profiler.profiler_client` to connect to the JAX profiler server running on the worker process.
5. Unit tests to verify success and failure paths of the profiling manager, including mocking the TensorFlow profiler client.

How to verify:

1. Trigger profiling via curl or browser:

```
curl "http://<dashboard-ip>:<port>/worker/jax_profile?pid=<worker_pid>&port=<jax_port>&ip=<worker_ip>&duration=5"
```

2. Expected response:

```
{"result": true, "msg": "JAX profiling finished.", "data": {"traceDirectory": "profiles"}}
```

3. View in TensorBoard: Copy the generated `.xplane.pb` file from the worker pod's log directory and view it using the TensorBoard profile plugin.

## Related issues

## Additional information

Captured JAX profile, rendered via TensorBoard:

Overview:
<img width="3456" height="2160" alt="image" src="https://github.com/user-attachments/assets/869df329-7f1d-4649-bfe0-dc1b2eb22d1f" />

Trace view:
<img width="3428" height="1852" alt="image" src="https://github.com/user-attachments/assets/df7a40c1-1f23-47e1-9fa4-df41cebd7e3d" />

Signed-off-by: Richa Banker <richabanker@google.com>
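The verification steps above can also be scripted. The sketch below only builds the request URL and parses a reply of the shape shown in the description; it does not contact a dashboard. The helper names are hypothetical, and the handling of a failed reply (a falsy `result` with a `msg`) is an assumption, since the description only shows the success case.

```python
import json
from urllib.parse import urlencode


def build_jax_profile_url(
    dashboard_ip: str,
    dashboard_port: int,
    worker_pid: int,
    jax_port: int,
    worker_ip: str,
    duration_s: int,
) -> str:
    """Build the /worker/jax_profile URL with the query parameters from the PR."""
    query = urlencode(
        {"pid": worker_pid, "port": jax_port, "ip": worker_ip, "duration": duration_s}
    )
    return f"http://{dashboard_ip}:{dashboard_port}/worker/jax_profile?{query}"


def parse_jax_profile_reply(body: str) -> str:
    """Return the trace directory from a reply, or raise if profiling failed."""
    reply = json.loads(body)
    if not reply.get("result"):
        # Assumed failure shape: a falsy "result" plus a human-readable "msg".
        raise RuntimeError(reply.get("msg", "JAX profiling failed"))
    return reply["data"]["traceDirectory"]
```

The resulting URL can be passed to `curl` or any HTTP client; the address and port values are placeholders to be replaced with the cluster's own.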

enginarslan1 and others added 5 commits June 16, 2026 09:21

pull Bot locked and limited conversation to collaborators Jun 16, 2026

pull Bot added the ⤵️ pull label Jun 16, 2026

pull Bot merged commit 8b22e57 into garymm:master Jun 16, 2026

pull[bot] merged 5 commits into garymm:master from ray-project:master

5 participants
