[pull] master from ray-project:master #1078

Merged: pull[bot] merged 5 commits into garymm:master from ray-project:master on Jun 16, 2026
@pull pull Bot commented Jun 16, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

enginarslan1 and others added 5 commits June 16, 2026 09:21
## Summary
- Treat only `None` as an unspecified resource isolation override.
- Reject explicit zero values for `system_reserved_cpu` and
`system_reserved_memory` instead of silently using defaults.
- Add regression coverage for disabled and enabled resource isolation
paths.
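
A minimal sketch of the `None`-versus-zero distinction described above. The function name and the default value are hypothetical and are not taken from `resource_isolation_config.py`; rejecting negative values as well as zero is an additional assumption.

```python
from typing import Optional

# Hypothetical default, used only for illustration.
DEFAULT_SYSTEM_RESERVED_CPU = 1.0


def resolve_system_reserved_cpu(override: Optional[float]) -> float:
    """Return the effective CPU reservation for an optional override."""
    # Only None means "not specified". A truthiness check such as
    # `if not override` would also treat an explicit 0 as unspecified
    # and silently fall back to the default.
    if override is None:
        return DEFAULT_SYSTEM_RESERVED_CPU
    # Assumption: negative values are rejected along with zero.
    if override <= 0:
        raise ValueError(
            f"system_reserved_cpu must be positive, got {override}"
        )
    return float(override)
```

With this shape, `resolve_system_reserved_cpu(None)` falls back to the default, while `resolve_system_reserved_cpu(0)` raises instead of being mistaken for "unset".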

## Test plan
- `python3 -m py_compile
python/ray/_private/resource_isolation_config.py
python/ray/tests/resource_isolation/test_resource_isolation_config.py`
- `bazel test
//python/ray/tests/resource_isolation:test_resource_isolation_config`
(attempted locally, blocked by sandboxed download of `rules_python` from
GitHub)

Made with [Cursor](https://cursor.com)

Signed-off-by: enginarslan1 <heyengin@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ica gRPC server stop (#64022)

## Why are these changes needed?

When a replica shuts down gracefully, it stops its inter-deployment gRPC
server with `server.stop(grace)` (#63995). A handle-path dispatch whose
stream lands in the stop window — bytes already on the wire when
`stop()` fires, GOAWAY not yet processed by the router's channel — is
cancelled by the server core **before the handler ever runs**. The
router sees:

```
replica_result.py:449  await self._call.wait_for_connection()
grpc.aio.AioRpcError: StatusCode.CANCELLED ("received from peer")
```

`get_rejection_response()` already converts `UNAVAILABLE` into
`ActorUnavailableError` precisely so the router retries such dispatches
on another replica (the rejection protocol guarantees the request never
started executing when no `accepted` initial metadata was received).
`CANCELLED` falls past that branch and propagates as a raw
`AioRpcError`, which surfaces as an application 500 to the client.

Observed in sustained load testing (~18K RPS) with graceful autoscale
downscales: 3 requests out of 16.7M failed with exactly this traceback,
all dispatched to replicas at the instant they stopped their gRPC
server. The requests provably never executed (no `accepted` metadata;
CANCELLED at connection establishment) — they were safe to retry, and
the router already has the machinery to do so (`ActorUnavailableError` →
invalidate cache → re-route).

## What does this change do?

Extends the existing pre-accept error mapping in
`gRPCReplicaResult.get_rejection_response()` to treat peer-originated
`CANCELLED` the same as `UNAVAILABLE`:

- Only applies when **no `accepted` initial metadata** was received
(checked above in the same except block) — i.e., the replica never
accepted the request, so retrying cannot double-execute.
- A unary call that fails mid-execution carries `accepted=1` metadata in
the error and continues to be returned as accepted (NOT retried) —
unchanged, and covered by a new test.
- A locally-cancelled call surfaces as `asyncio.CancelledError` (handled
by the earlier except branch), not `AioRpcError`, so this branch only
sees peer-originated cancellation.
- No router changes: `_route_and_send_request_once` already catches
`ActorUnavailableError`, invalidates the queue-len cache entry, and
retries routing.
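
The decision described in the bullets above can be sketched roughly as follows. This is not the code in `gRPCReplicaResult.get_rejection_response()`: `Status`, `FakeRpcError`, and `classify_dispatch_error` are hypothetical stand-ins so the example runs without gRPC or Ray installed, and the `accepted` value of `"1"` follows the `accepted=1` wording above.

```python
import enum
from typing import Mapping, Optional


class Status(enum.Enum):
    """Stand-in for grpc.StatusCode (only the codes needed here)."""

    UNAVAILABLE = "UNAVAILABLE"
    CANCELLED = "CANCELLED"
    INTERNAL = "INTERNAL"


class FakeRpcError(Exception):
    """Hypothetical stand-in for grpc.aio.AioRpcError."""

    def __init__(
        self,
        code: Status,
        initial_metadata: Optional[Mapping[str, str]] = None,
    ):
        super().__init__(code.name)
        self.code = code
        self.initial_metadata = dict(initial_metadata or {})


# Codes that are safe to retry when the replica never accepted the request.
PRE_ACCEPT_RETRYABLE = frozenset({Status.UNAVAILABLE, Status.CANCELLED})


def classify_dispatch_error(err: FakeRpcError) -> str:
    """Decide what the caller should do with a failed dispatch."""
    # The replica accepted the request, so it may have started executing:
    # report it as accepted and do not retry.
    if err.initial_metadata.get("accepted") == "1":
        return "accepted"
    # No `accepted` metadata: the request never ran, so a peer-originated
    # CANCELLED is handled like UNAVAILABLE and routed elsewhere.
    if err.code in PRE_ACCEPT_RETRYABLE:
        return "retry_on_another_replica"
    return "propagate"
```

In the real code path the retry outcome corresponds to raising `ActorUnavailableError`, which the router already knows how to handle.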

## Related issue number

Follow-up to #63995.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: harshit <harshit@anyscale.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
… ingress mode (#64123)

Previously, the receive task was left on the queue because the request
timeout path never canceled it. This PR cancels the task in that path and
adds a test for it.
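
As a rough illustration of the pattern (not the actual Serve code), the sketch below waits on a receive task with a timeout and cancels it when the timeout fires. The function name `receive_with_timeout` and the use of a plain `asyncio.Queue` are assumptions made for this example.

```python
import asyncio


async def receive_with_timeout(queue: asyncio.Queue, timeout_s: float):
    """Return the next item from `queue`, or raise on timeout."""
    receive_task = asyncio.ensure_future(queue.get())
    done, _pending = await asyncio.wait({receive_task}, timeout=timeout_s)
    if receive_task in done:
        return receive_task.result()
    # Timeout path: cancel the pending receive so it does not stay
    # registered on the queue and consume a later message.
    receive_task.cancel()
    try:
        await receive_task
    except asyncio.CancelledError:
        pass
    raise asyncio.TimeoutError()
```

Without the `cancel()` call, the orphaned task would still be waiting on the queue and could take the next item put on it after the request had already timed out.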

---------

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
…3830)

## Description
`SpilledObjectReader::ReadFromDataSection` / `ReadFromMetadataSection`
serve a spilled object to a remote node
(`ObjectManager::PushFromFilesystem`, on the multi-threaded
`rpc_service_` pool). The old code read each chunk **one byte at a
time** via `std::istreambuf_iterator` + `push_back`. For the default 5
MB chunk that's ~5M iterations through the streambuf interface per chunk
— pure CPU waste, worst exactly when the node is already under memory
pressure (spilling).

## What changed

`AppendFileSection` now:
1. Reads the chunk in a **single bulk `ifstream::read`** instead of the
per-byte `istreambuf_iterator` loop.
2. Grows the output buffer with
`absl::strings_internal::STLStringResizeUninitialized` instead of
`std::string::resize`, so the bytes that `read` immediately overwrites
aren't first zero-filled (skips a redundant full memory pass).

Behavior is otherwise unchanged: it appends to the tail of `output`
(data then metadata for a straddling chunk), reopens the file per call
(so it stays thread-safe and retry-friendly), and returns `false` on a
short/failed read.
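
The change itself is in C++. As a loose Python analogue of the bulk-read contract described above (append to the tail of the output, reopen the file per call, return `False` on a short read), one might write the sketch below. The function name and signature are hypothetical, and Python has no counterpart to the uninitialized resize, so only the single-read part is illustrated.

```python
def append_file_section(
    path: str, offset: int, size: int, output: bytearray
) -> bool:
    """Append `size` bytes starting at `offset` in `path` to `output`."""
    # Reopen the file on every call, mirroring the per-call open above.
    with open(path, "rb") as f:
        f.seek(offset)
        # One bulk read instead of a loop that pulls one byte at a time.
        data = f.read(size)
    # A short read means the section is truncated or missing.
    if len(data) != size:
        return False
    output += data
    return True
```

Calling it twice with the same `output` appends the second section after the first, which matches the "data then metadata" ordering for a chunk that straddles both sections.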

## Benchmark

Microbenchmark replicating `ChunkObjectReader::GetChunk` (5 MB chunks,
t3a.xlarge / 4 vCPU), one object, page-cache **warm** (the common
spill→pull case):

| read strategy | 1 thread | 4 threads |
|----------------------------------------|-------------------|------------------|
| baseline (byte-by-byte) | 0.16 GB/s | 0.56 GB/s |
| **this PR** (bulk + uninitialized buf) | **5.86 GB/s (36x)** | **8.29 GB/s (15x)** |

- **Cold cache**: ~2x single-thread; disk-bound and converges across
strategies at higher concurrency — but no regression.
- Scale-invariant (re-checked at 8.6 GB: same GB/s).
- Also evaluated **cached-fd + `pread`** (per an earlier review
suggestion): perf-neutral here (+0–2%, within noise — an `open()` is
negligible next to a 5 MB read), so I kept the simpler per-call
`ifstream`. The win is the bulk read + skipping the zero-fill.

---------

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: Yuanzhuo Yang <89326662+ShockYoungCHN@users.noreply.github.com>
Co-authored-by: Yuanzhuo Yang <89326662+ShockYoungCHN@users.noreply.github.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
## Description
This PR adds support for capturing JAX execution profiles directly from
the Ray Dashboard. This is particularly useful for debugging performance
issues in JAX-based workloads running on TPU clusters.
To maintain consistency, this implementation follows the existing
pattern established by the GPU profiler
([GpuProfilingManager](https://github.com/ray-project/ray/blob/master/python/ray/dashboard/modules/reporter/gpu_profile_manager.py#L20))
in the Ray dashboard.

Changes:
1. `reporter_head.py`: Added a new HTTP endpoint, `/worker/jax_profile`, to
the dashboard head. This endpoint receives profiling requests and routes
them via gRPC to the appropriate `ReporterAgent` on the target node.
2. `reporter_agent.py`: Added handling for the `JaxProfiling` gRPC request
and instantiated the `JaxProfilingManager`.
3. `reporter.proto`: Defined the `JaxProfilingRequest` and `JaxProfilingReply`
messages and added the `JaxProfiling` RPC method to the `ReporterService`.
4. `jax_profile_manager.py`: Created this new manager to handle the actual
profiling interaction. It uses
`tensorflow.python.profiler.profiler_client` to connect to the JAX
profiler server running on the worker process.
5. Unit tests: Verify the success and failure paths of the profiling
manager, including mocking the TensorFlow profiler client.

How to verify:
1. Trigger profiling via curl or browser:
```
curl "http://<dashboard-ip>:<port>/worker/jax_profile?pid=<worker_pid>&port=<jax_port>&ip=<worker_ip>&duration=5"
```
2. Expected response:
```
{"result": true, "msg": "JAX profiling finished.", "data": {"traceDirectory": "profiles"}}
```
3. View in TensorBoard: copy the generated `.xplane.pb` file from the
worker pod's log directory and open it with the TensorBoard profile
plugin.
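
For scripted checks, the request URL from step 1 and the response from step 2 could be handled along these lines. This is only a sketch: the helper names are hypothetical, and the query parameters and response fields are taken from the curl example and the expected response above.

```python
import json
from urllib.parse import urlencode


def build_jax_profile_url(
    dashboard_ip: str,
    dashboard_port: int,
    pid: int,
    jax_port: int,
    worker_ip: str,
    duration_s: int = 5,
) -> str:
    """Build the /worker/jax_profile URL used in the curl example."""
    query = urlencode(
        {"pid": pid, "port": jax_port, "ip": worker_ip, "duration": duration_s}
    )
    return f"http://{dashboard_ip}:{dashboard_port}/worker/jax_profile?{query}"


def profiling_succeeded(response_body: str) -> bool:
    """Return True when the JSON response reports `"result": true`."""
    return json.loads(response_body).get("result") is True
```

The URL can then be passed to any HTTP client, for example `urllib.request.urlopen`, and the response body checked with `profiling_succeeded`.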

## Additional information

Captured JAX profile, rendered via TensorBoard:

Overview: 
<img width="3456" height="2160" alt="image"
src="https://github.com/user-attachments/assets/869df329-7f1d-4649-bfe0-dc1b2eb22d1f"
/>

Trace view:
<img width="3428" height="1852" alt="image"
src="https://github.com/user-attachments/assets/df7a40c1-1f23-47e1-9fa4-df41cebd7e3d"
/>

Signed-off-by: Richa Banker <richabanker@google.com>
@pull pull Bot locked and limited conversation to collaborators Jun 16, 2026
@pull pull Bot added the ⤵️ pull label Jun 16, 2026
@pull pull Bot merged commit 8b22e57 into garymm:master Jun 16, 2026
5 participants