
[pull] master from ray-project:master#1068

Merged
pull[bot] merged 3 commits into
garymm:master from
ray-project:master
Jun 11, 2026
pull[bot] commented Jun 11, 2026


See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

kouroshHakha and others added 3 commits June 11, 2026 08:39
)

---------

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The assertion assumed that `a` and `b` would always be scheduled and
that none of the `c` tasks would be, which is not guaranteed.

Added a condition to wait until `a` and `b` are scheduled before
submitting the `c`s.

I couldn't use `SignalActor` here because it would add task metrics.
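
The wait-then-submit ordering described above can be sketched with a plain
polling helper. This is only an illustration under stated assumptions:
`wait_until`, `mark_scheduled`, and the `scheduled` list are hypothetical
stand-ins, not the helpers or scheduler state used in the actual test.

```python
import threading
import time
from typing import Callable, List


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float = 10.0,
    interval_s: float = 0.01,
) -> None:
    """Poll `predicate` until it returns True, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while not predicate():
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met before timeout")
        time.sleep(interval_s)


# Hypothetical stand-in for "tasks the scheduler has placed so far".
scheduled: List[str] = []


def mark_scheduled(name: str) -> None:
    scheduled.append(name)


# `a` and `b` are scheduled asynchronously, some time after submission.
for name, delay_s in (("a", 0.05), ("b", 0.10)):
    threading.Timer(delay_s, mark_scheduled, args=(name,)).start()

# Block until both are scheduled; only then submit the `c` tasks.
wait_until(lambda: {"a", "b"} <= set(scheduled))
for i in range(3):
    mark_scheduled(f"c{i}")
```

Polling on observable state, rather than coordinating through an extra actor,
keeps the synchronization itself out of whatever the test is counting.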

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
We have observed that idle worker memory usage, as reported by the
per-component memory usage metric, grows over time. A previous
investigation concluded that idle workers can "leak" memory because
imported libraries cache memory regions between tasks, resulting in a
large idle worker footprint between runs. We have also found that
memory leaks in popular libraries can cause idle worker memory growth:
apache/arrow#39808.

However, a recent investigation into the trivial workload below, which
uses no libraries, still showed idle worker memory growth over time.
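```py
import time

import ray
from tqdm import tqdm

ray.init()


@ray.remote
def produce():
    # Build and return a 0.5 GiB payload.
    a = b"0" * int(0.5 * 1024**3)
    time.sleep(5)
    return a


@ray.remote
def consume(data):  # renamed from `bytes` to avoid shadowing the builtin
    time.sleep(5)
    return "done"


refs = []  # keeps every produced object referenced for the whole run
for _ in tqdm(range(16)):
    ref = produce.remote()
    refs.append(ref)
    # Fan out four consumers over the same object, then wait for all of them.
    consume_refs = [consume.remote(ref) for _ in range(4)]
    print(ray.get(consume_refs))
```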
Per-component metric showing idle worker memory growth:
<img width="1687" height="398" alt="image"
src="https://github.com/user-attachments/assets/a8201713-d64a-44ff-bfd3-d9e070dd88b3"
/>

This growth is attributed to how "unique set size" is calculated in the
query for the per-component memory usage metric. Previously, we computed
the unique set size via the following query:
```
(sum(ray_component_rss_mb{instance=~".+",SessionName=~".+",ray_io_cluster=~".*",ClusterId=~"ID"} * 1024 * 1024) by (Component)) - (sum(ray_component_mem_shared_bytes{instance=~".+",SessionName=~".+",ray_io_cluster=~".*",ClusterId=~"ID"}) by (Component))
```
The problematic part is the `* 1024 * 1024`, which attempts to convert
the RSS back into bytes. However, the RSS was actually recorded in
megabytes instead of mebibytes, so the conversion overestimates it. As
the RSS of the worker grows, the error grows with it, as shown by the
graph below, which compares the `pre-existing query (where we
incorrectly convert megabytes into bytes)` with the actual `unique set
size`.
<img width="1189" height="590" alt="image"
src="https://github.com/user-attachments/assets/5537ac8a-efb2-4160-926c-9259540e135b"
/>
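
To make the size of the error concrete, here is a small worked example of the
arithmetic. It is a sketch under one assumption: that "recorded in megabytes"
means the metric value was the byte count divided by 10**6. The function
names are hypothetical and only mirror the structure of the query above.

```python
MB = 1000**2   # megabyte (decimal), assumed unit of the recorded RSS
MiB = 1024**2  # mebibyte (binary), the unit the query's `* 1024 * 1024` implies


def uss_old_query(rss_bytes: int, shared_bytes: int) -> float:
    """Mirror the pre-existing query: (rss_mb * 1024 * 1024) - shared_bytes."""
    rss_reported_mb = rss_bytes / MB  # assumption: value stored as bytes / 10**6
    return rss_reported_mb * MiB - shared_bytes


def uss_actual(rss_bytes: int, shared_bytes: int) -> int:
    """Unique set size with both terms in bytes."""
    return rss_bytes - shared_bytes


def overestimate(rss_bytes: int, shared_bytes: int) -> float:
    """How many bytes the old query adds on top of the actual value."""
    return uss_old_query(rss_bytes, shared_bytes) - uss_actual(rss_bytes, shared_bytes)


if __name__ == "__main__":
    shared = 900_000_000
    for rss in (1_000_000_000, 2_000_000_000, 4_000_000_000):
        print(rss, uss_actual(rss, shared), uss_old_query(rss, shared))
```

Under this assumption the old query inflates the RSS term by a constant factor
of 1.048576, so the absolute error is roughly 4.9% of RSS and is independent
of the shared-memory term. A worker whose RSS is mostly shared memory
therefore shows an error that is large relative to its true unique set size.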

This PR fixes the incorrect conversion by establishing bytes as the unit
for all per-component memory metrics. The updated graph of idle worker
memory usage in the per-component memory usage panel is shown below:
<img width="1689" height="398" alt="image"
src="https://github.com/user-attachments/assets/09437ccd-b7a3-43f4-a7cc-7a430710164e"
/>


---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
pull[bot] locked and limited conversation to collaborators Jun 11, 2026
pull[bot] added the ⤵️ pull label Jun 11, 2026
pull[bot] merged commit 51eb67f into garymm:master Jun 11, 2026

3 participants