[pull] master from ray-project:master#1067

Merged
pull[bot] merged 2 commits into garymm:master from ray-project:master
Jun 11, 2026
pull[bot] commented Jun 11, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

sampan-s-nayak and others added 2 commits June 11, 2026 14:51
Co-authored-by: sampan <sampan@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Metrics from the TPU device plugin have some quirks that cause issues for the dashboard:

1. Tensor core and memory bandwidth utilization are indexed by host (i.e. zero to N-1, where N is chips per host), while the other metrics are "runtime metrics" and indexed across the complete running slice. The two sets of indices occasionally line up precisely, but almost never do.
2. V7X clusters I've tested against only have the host metrics, which means the dashboard must gracefully degrade to show memory utilization without absolute byte info.

To work around this, the reporter agent can re-index the non-host metrics to share the same indices as the host metrics by assigning them in-order.
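
As an illustration of the in-order re-indexing described above, here is a minimal sketch. It is not the reporter agent's actual code: the function name `reindex_runtime_metrics`, the dict-based input shape, and the choice to drop unmatched entries are all assumptions made for the example.

```python
from typing import Dict, Iterable, TypeVar

V = TypeVar("V")


def reindex_runtime_metrics(
    host_indices: Iterable[int],
    runtime_metrics: Dict[int, V],
) -> Dict[int, V]:
    """Re-key slice-wide runtime metrics onto host-local chip indices.

    Hypothetical helper: both index sets are sorted, then paired
    positionally, so the k-th runtime index takes the k-th host index.
    """
    host_order = sorted(host_indices)
    runtime_order = sorted(runtime_metrics)
    # zip stops at the shorter sequence, so any unmatched entries are
    # dropped. That policy is an assumption of this sketch.
    return {
        host_idx: runtime_metrics[runtime_idx]
        for host_idx, runtime_idx in zip(host_order, runtime_order)
    }


# Host metrics report chips 0..3; runtime metrics report slice-wide ids 8..11.
memory_used = {8: 10, 9: 20, 10: 30, 11: 40}
print(reindex_runtime_metrics([0, 1, 2, 3], memory_used))
```

Sorting both sides first keeps the pairing stable regardless of the order in which the metrics were scraped.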

Additionally, I've cleaned up the TPU path for the accelerator rows so that it degrades gracefully when some metrics are missing.
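
The graceful degradation described above could look roughly like the following sketch. It only models the display logic in Python; the real accelerator rows live in the dashboard frontend, and the function name `format_tpu_memory`, its parameters, and the exact output strings are hypothetical.

```python
from typing import Optional

GIB = 1024 ** 3


def format_tpu_memory(
    utilization_pct: Optional[float],
    used_bytes: Optional[int],
    total_bytes: Optional[int],
) -> str:
    """Render a memory cell, falling back as metrics go missing."""
    # Preferred: absolute byte info is available, so show GiB and a percentage.
    if used_bytes is not None and total_bytes:
        pct = 100.0 * used_bytes / total_bytes
        return f"{used_bytes / GIB:.1f}GiB/{total_bytes / GIB:.1f}GiB ({pct:.1f}%)"
    # Degraded: only a utilization percentage is reported (host metrics only).
    if utilization_pct is not None:
        return f"{utilization_pct:.1f}%"
    # Nothing usable was reported for this chip.
    return "N/A"


print(format_tpu_memory(None, 8 * GIB, 16 * GIB))
print(format_tpu_memory(42.0, None, None))
print(format_tpu_memory(None, None, None))
```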

## Related issues

- Addresses bugs in #63774
- To unblock chip-to-pid mapping in #63976

## Additional information

Example dashboard with correct chip ids and absolute memory:
<img width="1473" height="1057" alt="image"
src="https://github.com/user-attachments/assets/06b8d492-4db8-4247-99ea-72daef66716b"
/>


Example dashboard missing absolute GiB memory (some 0.0 rows show different visual bars because the dashboard was animating when the screenshot was taken):
<img width="1378" height="1120" alt="image"
src="https://github.com/user-attachments/assets/f807bcbf-0c7a-4fa7-a8e3-94c25ad16e0b"
/>

---------

Signed-off-by: Spencer Peterson <spencerjp@google.com>
pull[bot] locked and limited conversation to collaborators Jun 11, 2026
pull[bot] added the ⤵️ pull label Jun 11, 2026
pull[bot] merged commit ec864a2 into garymm:master Jun 11, 2026