[pull] master from ray-project:master #1069
Merged
#63890)

obstore's S3Store defaults region to us-east-1 and does not follow AWS PermanentRedirect responses, so any obstore-routed S3 request against a bucket in a different region fails non-retryably with BareRedirect.

- `_split_obstore_uri` rewrites `https://s3.<region>.amazonaws.com/<bucket>/<key>` to `s3://<bucket>` + `<key>` so StoreRegistry can apply region discovery.
- `_discover_aws_bucket_region` resolves a bucket's region via `pyarrow.fs.resolve_s3_region` (already a required Ray Data dependency), cached per bucket. PyArrow issues the `x-amz-bucket-region` HEAD probe and handles the legacy global endpoint / IMDS edge cases; we additionally cache negative results so unresolvable buckets are probed at most once. The probe runs outside the cache lock, and the write-back never lets a `None` result overwrite a region a concurrent thread already cached (a real region always wins), so racing first-time lookups can't intermittently disable region injection.
- `StoreRegistry.get` injects the discovered region for `s3://` and `s3a://` URLs, skipping injection when the caller already supplied a region or a custom endpoint (MinIO/R2/etc.).
- All obstore call sites (the HEAD size probe `_resolve_size`, the actor HEAD path `_head_one`, ranged downloads `_fetch_ranged`, and whole-file GET `_fetch`) go through `_split_obstore_uri`, so a path-style cross-region URL no longer slips past the rewrite (which previously left the size probe on the regional HTTPS store, returning size 0 and wrongly skipping ranged downloads).

GCS and Azure are unaffected: neither encodes region in the endpoint (GCS uses a global endpoint addressed by bucket name; Azure is keyed by storage account), so they have no cross-region redirect failure mode.

---------

Signed-off-by: Goutam <goutam@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

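The cache write-back rule described in this commit (probe outside the lock, cache negative results, never let `None` overwrite a real region) can be sketched roughly as below. This is not the Ray Data implementation: `BucketRegionCache` and the injected `probe` callable are hypothetical names introduced for illustration, with `probe` standing in for a call such as `pyarrow.fs.resolve_s3_region`.

```python
import threading
from typing import Callable, Dict, Optional


class BucketRegionCache:
    """Illustrative per-bucket region cache with negative caching."""

    def __init__(self, probe: Callable[[str], Optional[str]]):
        # `probe` is a hypothetical stand-in for the real region lookup.
        self._probe = probe
        self._lock = threading.Lock()
        self._regions: Dict[str, Optional[str]] = {}

    def get(self, bucket: str) -> Optional[str]:
        # Fast path: any cached answer, including a cached None, is final.
        with self._lock:
            if bucket in self._regions:
                return self._regions[bucket]

        # Slow path: run the network probe WITHOUT holding the lock, so
        # lookups for other buckets are not serialized behind it.
        try:
            region = self._probe(bucket)
        except Exception:
            region = None  # treat a failed probe as "unresolvable"

        with self._lock:
            # Write back only if we found a real region, or if nobody has
            # cached anything yet. A None result therefore never replaces
            # a region that a concurrent thread stored in the meantime.
            if region is not None or bucket not in self._regions:
                self._regions[bucket] = region
            return self._regions[bucket]
```

Two racing first-time lookups may both run the probe, which costs one extra request, but whichever thread finds a real region decides the cached value.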
…tream ray.get (#64014)

`_next_sync` documents *"if an object is not available within the given timeout, it returns a nil object reference"*, but its end-of-stream handling calls `ray.get(generator_ref)` with **no timeout** (to distinguish a normal end of the stream from a task failure).

The bug: after all yielded refs are consumed, the next `_next_sync(timeout_s=...)` call reaches that get. The generator's return object normally resolves locally, but if it lives in plasma and its node died, the get blocks the calling thread until lineage reconstruction re-runs the task, which needs a free CPU. On a saturated cluster this can deadlock: the blocked caller (e.g. Ray Data's scheduling thread) is what consumes outputs and releases the CPUs held by output-backpressured tasks, so reconstruction can never start. Serve's `to_object_ref(timeout_s=...)` similarly blocks past the user's requested timeout when a replica node dies.

- Apply the caller's `timeout_s` to the end-of-stream get; report a timeout as a nil ref (retry), per the documented contract.
- `timeout_s=None` (and `-1`) keep the blocking behavior, so `__next__` and other timeout-less callers are unchanged.
- Regression test: stream exhausted + return object lost with its node → `_next_sync(timeout_s=0)` returns nil instead of blocking (hangs forever without the fix), and the stream terminates normally once the node is restored.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

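The contract this fix restores can be illustrated without Ray. The toy below is an assumption-laden sketch, not the actual `_next_sync` code: `ToyStream` and `NIL` are hypothetical, a `concurrent.futures.Future` stands in for the generator's return object, and yielded items are treated as always available so that only the end-of-stream wait is modeled.

```python
from collections import deque
from concurrent.futures import Future
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Iterable, Optional

NIL = object()  # hypothetical stand-in for a nil object reference


class ToyStream:
    """Toy model of a streaming generator consumer (illustrative only)."""

    def __init__(self, items: Iterable[object], return_obj: Future):
        self._items = deque(items)
        # Stand-in for the generator's return object. It stays unresolved
        # while "lost", and carries an exception if the task failed.
        self._return_obj = return_obj

    def next_sync(self, timeout_s: Optional[float] = None):
        if self._items:
            return self._items.popleft()

        # End of stream: the return object tells a normal end apart from
        # a task failure. The caller's timeout now bounds this wait;
        # None and -1 keep the old blocking behavior.
        wait = None if timeout_s in (None, -1) else timeout_s
        try:
            self._return_obj.result(timeout=wait)  # re-raises task errors
        except FutureTimeout:
            return NIL  # not available in time: caller should retry
        raise StopIteration
```

With an unresolved return object, `next_sync(timeout_s=0)` returns `NIL` immediately instead of blocking; once the object resolves, the next call ends the stream with `StopIteration`.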
…64034)

The Windows base image build (ci/ray_ci/windows/build_base.sh) crashes when running `conda update -c conda-forge ca-certificates certifi`:

    AttributeError: module 'lib' has no attribute 'X509_V_FLAG_CB_ISSUER_CHECK'

Upgrading the conda base env to python 3.10 (`conda install python=...`) pulls cryptography>=38, which removed `_lib.X509_V_FLAG_CB_ISSUER_CHECK`. pyopenssl is not part of that transaction, so the stale py3.8-era pyopenssl is left behind and still references the removed attribute at import. The next conda invocation imports requests -> urllib3.contrib.pyopenssl -> OpenSSL.crypto and detonates before conda can run, failing the base image build.

Co-resolve pyopenssl 23.2.0 in the same conda install transaction so it stays compatible with the cryptography 38.x that gets installed.

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… id (#64044)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

…in (#64021)

## Why are these changes needed?

`RAY_SERVE_PORT_QUARANTINE_S` holds a released direct-ingress replica port out of the allocation pool so that stale routing state pointing at the old replica drains before another replica can inherit the port. It currently defaults to **10 seconds**.

The consumers that hold stale routing state the longest are soft-stopped (reloaded-out) HAProxy worker processes: they run no health checks (see [haproxy#3330](haproxy/haproxy#3330)) and keep routing to their frozen server list until `hard-stop-after` fires, which is **120s by default** (`RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S`) and commonly configured higher.

With the current defaults the quarantine is 12x shorter than the window it exists to outlive: a freed port can be handed to a *different app's* replica at +10s while old workers keep sending it the previous app's traffic for up to +120s.

Observed in sustained load testing: a just-freed direct-ingress port was recycled into another app's replica inside the stale-worker window, and a soft-stopped worker routed the old app's traffic to it, surfacing as unretried wrong-app 404s at the client. Health checks cannot catch this (they validate the address is serving, not which app is serving).

## What does this change do?

Derives the default quarantine from the hard-stop window instead of a fixed 10s:

```python
RAY_SERVE_PORT_QUARANTINE_S = get_env_float_non_negative(
    "RAY_SERVE_PORT_QUARANTINE_S",
    float(RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S + 30),
)
```

The `+30s` margin covers the broadcast/coalesce/reload latency that elapses before an old worker's hard-stop clock starts (the clock runs from the worker's *orphaning* at reload, which can lag the port release). An explicit `RAY_SERVE_PORT_QUARANTINE_S` still overrides, and `0` still disables quarantining entirely.

Sizing rule this encodes (must hold for correctness, now holds by default):

```
port quarantine >= hard-stop-after + reload propagation lag
```

Signed-off-by: harshit <harshit@anyscale.com>
Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>

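To make the sizing rule concrete, here is a minimal, self-contained sketch of how the derived default behaves. It is not Ray Serve's code: `read_non_negative_float` and `resolve_port_quarantine_s` are hypothetical helpers written for illustration (the real `get_env_float_non_negative` may validate differently), and the environment is passed in as a mapping so the arithmetic is easy to check.

```python
import os
from typing import Mapping, Optional

RELOAD_LAG_MARGIN_S = 30.0  # the +30s margin described in the commit


def read_non_negative_float(env: Mapping[str, str], name: str, default: float) -> float:
    """Hypothetical helper: read a float from env, rejecting negatives."""
    raw = env.get(name)
    value = default if raw is None else float(raw)
    if value < 0:
        raise ValueError(f"{name} must be non-negative, got {value}")
    return value


def resolve_port_quarantine_s(env: Optional[Mapping[str, str]] = None) -> float:
    """Derive the quarantine default from the hard-stop window."""
    env = os.environ if env is None else env
    # 120s is the hard-stop default stated in the commit message.
    hard_stop_s = read_non_negative_float(
        env, "RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S", 120.0
    )
    # An explicit RAY_SERVE_PORT_QUARANTINE_S wins; 0 disables quarantine.
    return read_non_negative_float(
        env, "RAY_SERVE_PORT_QUARANTINE_S", hard_stop_s + RELOAD_LAG_MARGIN_S
    )
```

With nothing set, the result is 120 + 30 = 150 seconds; raising the hard-stop window to 300 seconds moves the default to 330 seconds without touching the quarantine variable, so the inequality keeps holding as operators tune `hard-stop-after`.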
See Commits and Changes for more details.
Created by pull[bot] (v2.0.0-alpha.4)