Fix CI artifact download failures in sGPU/mGPU test jobs #602

Draft
VeeraRajasekhar wants to merge 5 commits into dev from veergopu/upgrade_ci_to_rock

VeeraRajasekhar commented May 29, 2026

Fixes intermittent CI failures in sGPU Tests (mi30x) and mGPU Torch (mi35x) jobs caused by partial artifact downloads on slow self-hosted runners.

Root Cause

All four build outputs were combined into a single te-rocm-wheels artifact (~743 MB). GitHub Actions upload-artifact@v4 stores files alphabetically in a zip, so the archive order was:

  1. transformer_engine-*.whl (~1 MB)
  2. transformer_engine_rocm7-*.whl (~700 MB)
  3. transformer_engine_rocm_jax-*.tar.gz (~20 MB)
  4. transformer_engine_rocm_torch-*.tar.gz (~20 MB)

On slower self-hosted runners, downloading the 743 MB archive took 5–9+ minutes. The download-artifact@v4 action reported success even when the download was truncated, because the .whl files (items 1 and 2) had already been extracted by that point. The .tar.gz sdist files (items 3 and 4), which come after the ~700 MB wheel in the zip, were never extracted. With no sdist on disk, the TE_FW_PKG variable ended up empty and the "Install packages" step failed silently.

Fix

Split the single artifact into two:

  • te-rocm-wheels: .whl files only (~700 MB)
  • te-rocm-sdists: .tar.gz sdist files only (~40 MB)

The sdists are now downloaded in a separate, independent step. Because the sdist artifact is small (~40 MB), it downloads quickly and completely on all runners regardless of network speed, preventing the partial-extraction failure.
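As a rough illustration of the split described above, the two upload steps could look like the sketch below. The `dist/` directory, the step names, and the `if-no-files-found: error` setting are assumptions made for illustration and are not taken from the PR diff. Only the artifact names come from this description.

```yaml
# Sketch only: dist/ and the step names are hypothetical, not from the PR diff.
- name: Upload wheel packages
  uses: actions/upload-artifact@v4
  with:
    name: te-rocm-wheels
    path: dist/*.whl
    # Assumption: fail the build job if no wheel was produced.
    if-no-files-found: error

- name: Upload sdist packages
  uses: actions/upload-artifact@v4
  with:
    name: te-rocm-sdists
    path: dist/*.tar.gz
    # Assumption: fail the build job if no sdist was produced.
    if-no-files-found: error
```

Keeping the sdists in their own artifact means they no longer sit behind the large wheel in a shared zip.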

Changes

  • .github/workflows/rocm-wheels-build.yml: Replaced the single upload-artifact step with two steps — one for .whl packages (te-rocm-wheels) and one for .tar.gz sdists (te-rocm-sdists).
  • .github/workflows/rocm-ci.yml: Added a Download sdist packages step in both sgpu_tests and mgpu_tests jobs to download the new te-rocm-sdists artifact alongside the existing wheel download.
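A minimal sketch of the download side in a test job is shown below. It assumes both artifacts are produced in the same workflow run and are downloaded into a hypothetical `artifacts/` directory. The final verification step is not part of this PR. It is only one possible way to turn a missing sdist into an explicit error instead of an empty TE_FW_PKG later on.

```yaml
# Sketch only: artifacts/ is a hypothetical path, and the last step is an
# optional guard that this PR does not add.
- name: Download wheel packages
  uses: actions/download-artifact@v4
  with:
    name: te-rocm-wheels
    path: artifacts

- name: Download sdist packages
  uses: actions/download-artifact@v4
  with:
    name: te-rocm-sdists
    path: artifacts

- name: Verify sdists are present
  shell: bash
  run: |
    # With nullglob, an unmatched pattern expands to an empty array.
    shopt -s nullglob
    sdists=(artifacts/*.tar.gz)
    if (( ${#sdists[@]} == 0 )); then
      echo "::error::no .tar.gz sdist found in artifacts/"
      exit 1
    fi
    printf 'found sdist: %s\n' "${sdists[@]}"
```

If the build and test jobs run in different workflow runs, download-artifact@v4 would also need its `run-id` and `github-token` inputs, which this sketch leaves out.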

VeeraRajasekhar self-assigned this May 29, 2026
VeeraRajasekhar added the "ci-level 3" (CI test level 3) label May 29, 2026
Copilot AI changed the title from "upgrade ci to TheRock" to "Fix CI artifact download failures in sGPU/mGPU test jobs" Jun 3, 2026