Skip to content

Merge develop branch into candidate#311

Closed
nileshnegi wants to merge 9 commits into
candidatefrom
merge/develop/candidate/05-22
Closed

Merge develop branch into candidate#311
nileshnegi wants to merge 9 commits into
candidatefrom
merge/develop/candidate/05-22

Conversation

@nileshnegi
Copy link
Copy Markdown
Collaborator

Motivation

candidate and develop branches have diverged.

Technical Details

Test Plan

Test Result

Submission Checklist

gilbertlee-amd and others added 9 commits March 26, 2026 00:29
gcc 8.5 (AlmaLinux 8 base, used by quay.io/pypa/manylinux_2_28_x86_64)
is older than what Ubuntu 22.04 ships and doesn't pull in some headers
transitively or expose std::filesystem from the default libstdc++.

- Drop unused #include <barrier> from TransferBench.hpp. <barrier> is
  C++20 and requires libstdc++ >= 10. std::barrier was never used.

- Add explicit #include <iomanip>, <numeric>, <limits> to Utilities.hpp,
  NicPeerToPeer.hpp, AllToAll.hpp, AllToAllN.hpp. Older libstdc++ does
  not pull these in via other standard headers.

- Link libstdc++fs unconditionally for std::filesystem. gcc < 9 ships
  it as a separate static archive in a gcc-private libdir (e.g.
  /usr/lib/gcc/x86_64-redhat-linux/8/libstdc++fs.a), which CMake's
  find_library does not search. Bare target_link_libraries(... stdc++fs)
  lets the compiler driver find it. On gcc 9+ it resolves to a no-op
  stub archive shipped for compatibility, so it's harmless elsewhere.

Verified by reproducing the build inside quay.io/pypa/manylinux_2_28_x86_64
locally.

Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
* Add GitHub Actions workflow for relocatable package builds via TheRock

Adds a CI workflow that builds DEB/RPM/TGZ packages of TransferBench
against the TheRock nightly ROCm SDK, modeled on the
ROCmValidationSuite packaging workflow. Packages install to
/opt/rocm/extras-<MAJOR> with $ORIGIN-relative RPATH so they are
relocatable.

- build_packages_local.sh: single source of truth for both local and
  CI builds. Detects Ubuntu vs AlmaLinux/manylinux, installs deps,
  fetches TheRock SDK tarball, configures CMake with relocatable
  RPATH and the new BUILD_RELOCATABLE_PACKAGE option, builds, and
  invokes CPack for DEB/RPM/TGZ.
- .github/workflows/build-relocatable-packages.yml: parallel
  Ubuntu 22.04 + manylinux_2_28 jobs triggered on push, PR, daily
  cron, and workflow_dispatch. OIDC-based S3 upload gated on
  AWS_S3_BUCKET being set; apt/yum repo metadata generated for
  non-PR builds. Build report artifact summarizes S3 paths.
- .github/workflows/README_BUILD_PACKAGES.md: workflow docs covering
  triggers, local usage, S3 layout, IAM trust policy, and apt/yum
  install snippets.
- CMakeLists.txt: new BUILD_RELOCATABLE_PACKAGE option that bypasses
  rocm_install/rocm_create_package, names the package
  amdrocm<MAJOR>-transferbench, and honors caller-set install prefix
  and CPACK_*_PACKAGE_RELEASE env vars. Default cmake .. behavior is
  unchanged.
)

* ci: fix S3 metadata sync without requiring DeleteObject permission

Follow RVS pattern for apt/yum repo metadata generation:
- Download existing packages from S3 first
- Regenerate metadata for all packages (existing + new)
- Sync everything back (metadata files overwrite naturally)
- Remove --delete flag that required s3:DeleteObject permission

This fixes the AccessDenied error when syncing yum repodata:
"User: arn:aws:sts::317668459450:assumed-role/therock-rocm-transferbench-releases-s3-oidc/GitHubActions
is not authorized to perform: s3:DeleteObject"

Changes:
- apt metadata: Use aws s3 sync instead of individual cp commands
- yum metadata: Remove --delete flag, sync down existing RPMs first
- Both: Add --no-progress flag and human-readable output
- Both: Copy current build packages explicitly before metadata generation

Reference: ROCm/ROCmValidationSuite workflow (lines 175-204, 342-370)

* ci: address Copilot review feedback

- Remove || true from S3 sync commands to catch real errors
- Restrict apt sync to *.deb only (exclude .tar.gz downloads)
- Update RPM log message to reflect that RPMs + repodata are synced

Copilot feedback rationale:
1. Silent failures (|| true) can lead to incomplete metadata generation
2. Empty S3 prefixes typically succeed anyway, so explicit handling unnecessary
3. Downloading only *.deb files is more efficient and faster
4. Log message should accurately reflect what's being synced
* ci: add CodeQL security scanning workflow

Add CodeQL static analysis workflow following ROCm project standards
(amdsmi/aqlprofile pattern). Scans C/C++ code for security vulnerabilities.

- Runs on develop/mainline branch pushes and PRs
- Weekly scheduled scan on Fridays
- Uses security-extended query suite
- Builds with minimal dependencies (no NIC/MPI) for faster analysis

Part of TheRock component onboarding requirements.

* fix: use ROCm container for CodeQL build

CodeQL analysis needs ROCm/HIP installed to build TransferBench.
Switch to rocm/dev-ubuntu-22.04 container following aqlprofile pattern.

- Add git installation in container
- Configure git safe directory
- Add CMAKE_PREFIX_PATH=/opt/rocm for hip-config.cmake discovery

* ci: add 'candidate' branch to CodeQL trigger list

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* ci: address Copilot review feedback on CodeQL workflow

- Add -y flag to apt-add-repository to avoid interactive prompts
- Replace safe.directory wildcard with GITHUB_WORKSPACE for minimal permissions

Rationale:
1. Interactive prompts can hang CI jobs waiting for user input
2. Using '*' for safe.directory is unnecessarily permissive; GITHUB_WORKSPACE
   provides sufficient access while maintaining defense-in-depth

Note: Container image intentionally remains unpinned per maintainer preference

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* ci: align relocatable package flow with RVS reference

Brings TransferBench's TheRock-SDK-based package build into closer parity
with the ROCmValidationSuite reference flow:

- CMakeLists.txt: auto-compute patch version from `git describe --tags
  --match v<MAJOR>.<MINOR>.*` (commits since the last matching tag),
  falling back to the hardcoded patch when no tag is reachable. No
  behavior change today since no v1.66.* tag exists yet; once one is
  created, builds become 1.66.<commits-since>.
- CMakeLists.txt: move relocatable RPATH defaults out of
  build_packages_local.sh and into the BUILD_RELOCATABLE_PACKAGE block
  so plain `cmake -DBUILD_RELOCATABLE_PACKAGE=ON ..` produces the same
  RPATH as a packaged build. RPATH list now includes /opt/rocm/lib/llvm,
  /opt/rocm/core-<MAJOR>/lib, and the matching llvm path.
- build_packages_local.sh: package release tag now follows the RVS
  format (`r<libpatch>.<yyyymmdd>` for default builds,
  `r<libpatch>.<yyyymmdd>.<src-branch>.<commit>` for PRs,
  GITHUB_RUN_NUMBER for release/* branches). UTC date captured once so
  long builds can't straddle midnight. Drops the over-permissive `rel*`
  prefix matcher.
- build_packages_local.sh: drop the now-redundant -DCMAKE_INSTALL_RPATH
  / -DCMAKE_SKIP_RPATH / -DCMAKE_INSTALL_RPATH_USE_LINK_PATH flags
  (CMakeLists.txt is now the single source of truth).
- docs/install/INSTALL_TGZ.rst: new user-facing TGZ install guide
  covering ROCm pre-install, runtime deps, extraction, PATH /
  LD_LIBRARY_PATH setup, and persistent profile.d configuration.
- docs/install/install.rst: cross-link to the new TGZ doc.
- README_BUILD_PACKAGES.md: replace the inline TGZ install snippet with
  a link to the new doc plus a minimal smoke test for CI maintainers.

* docs: drop --help (does not exist on develop) for plain TransferBench

Address @gilbertlee-amd review on PR #287:

  --help doesn't exist. In candidate we already introduce the "help"
  preset [...]. For CI moving forward, we likely want to use
  ./TransferBench smoketest. However, that's also in candidate branch
  still.

On develop today, running TransferBench with no arguments prints
version, usage, the available preset list, and the detected topology
then exits 0 (Client.cpp:41 — `if (argc <= 1)` branch). That is a valid
load-time smoke test for the binary.

Replace `TransferBench --help` with `TransferBench` in:
- .github/workflows/README_BUILD_PACKAGES.md (CI maintainer smoke test)
- docs/install/INSTALL_TGZ.rst (end-user verify step)

Add a forward-looking note in the README that once the `help` and
`smoketest` presets graduate from candidate to develop, the smoke test
should switch to `TransferBench smoketest`.
* ci: fix package versioning across DEB/RPM/TGZ builds

Two issues surfaced from run 25393634415 on release/test_artifact:

1. manylinux container falls back to TRANSFERBENCH_VERSION_PATCH="02"
   because git's "dubious ownership" guard rejects the host-UID-owned
   workspace when the container runs as root. Same commit produced
   1.66.11 on Ubuntu and 1.66.02 on manylinux. Mark the repo safe for
   the container's root user before CMake's git probe runs.

2. CPack writes <name>-<version>-Linux.{deb,rpm,tar.gz}, omitting the
   release tag. Successive release/* runs overwrite each other in S3
   and the run number is invisible from the filename. Switch DEB/RPM
   to canonical CPack filenames (which embed release + arch) and
   thread the release tag into CPACK_ARCHIVE_FILE_NAME for the TGZ.

* ci: pass package release as -D arg, not just env

Run 25819267330 confirmed the version-sync fix works (both jobs report
1.66.13) and that DEB/RPM filenames now embed the release. But the
Ubuntu TGZ filename still came out as `…-1.66.13-Linux.tar.gz` — no
release tag — while manylinux's TGZ embedded it correctly.

Root cause: CMakeLists derived the release tag from
`ENV{CPACK_RPM_PACKAGE_RELEASE}` at configure time. Ubuntu invokes the
script via `sudo -E …`, and although the script then `export`s those
env vars, they don't reliably reach CMake's environment on the sudo
path. The DEB filename still worked because `DEB-DEFAULT` is processed
by CPack at package time, not configure time, so the env var
propagated to the `cpack -G DEB` child directly.

Fix: pass `PKG_RELEASE` as `-DTRANSFERBENCH_PACKAGE_RELEASE=…` from
the build script. CMake reads it directly from the cache, bypassing
env propagation entirely. Env-var fallbacks remain so direct cmake
invocations (without the wrapper script) keep working.

* ci: temporary debug print to diagnose Ubuntu TGZ release-tag miss

* ci: dump CPackConfig.cmake archive name to diagnose Ubuntu TGZ

* ci: also set CPACK_PACKAGE_FILE_NAME for TGZ on older CMake

CMake 3.26 (manylinux container) honors CPACK_ARCHIVE_FILE_NAME for
the TGZ generator. CMake 3.22 (Ubuntu 22.04 system cmake) falls back
to CPACK_PACKAGE_FILE_NAME instead, ignoring my ARCHIVE_FILE_NAME
override. Diagnostic dump confirmed both jobs wrote identical
CPackConfig.cmake — the divergence is purely in cpack runtime
behavior between the two CMake versions.

Set both to the same suffixed value. DEB/RPM are unaffected because
they use the explicit DEB-DEFAULT / RPM-DEFAULT canonical-naming
tokens, which take precedence over CPACK_PACKAGE_FILE_NAME.

Also remove the temporary debug prints from CMakeLists.txt and the
CPackConfig.cmake dump from build_packages_local.sh.

* ci: address Copilot PR review on packaging changes

- Replace `git config --global --add safe.directory` with the GIT_CONFIG_*
  env-var triple so the safe.directory entry is scoped to this build (and
  inherited by CMake's `execute_process(git …)` children) instead of being
  written to root's ~/.gitconfig under sudo. Also drops the `|| true` so a
  setup failure surfaces immediately instead of letting the version probe
  silently fall back.

- Quote `"${VAR}"` in `STREQUAL ""` comparisons. CMake's `if(AND ...)` does
  not short-circuit, and bare-identifier dereferencing inside `if()` depends
  on policy CMP0054. Quoting makes the comparison unambiguous.

- Fail fast when BUILD_RELOCATABLE_PACKAGE is requested on CMake older than
  3.13. The block uses CPACK_ARCHIVE_FILE_NAME (3.13+) and the
  DEB-DEFAULT / RPM-DEFAULT canonical-naming sentinels (3.6+); on a 3.5
  CMake those would silently produce a literal "DEB-DEFAULT" filename. The
  project-wide cmake_minimum_required of 3.5 is left alone (out of scope
  here); the gate is local to the relocatable-package path.
The relocatable DEB/RPM hard-depended on hsa-rocr, but ROCm 7.13 no
longer ships a package by that name -- install fails on TheRock 7.13
with "amdrocm7-transferbench depends on hsa-rocr; however: Package
hsa-rocr is not configured yet" (ROCM-24669).

The dep also never made sense for the relocatable artifact: it is
built from the TheRock SDK and meant to install on systems where ROCm
came from a tarball and is not tracked by apt/dpkg at all. Any apt
dep on a ROCm component breaks that audience.

Drop hsa-rocr from CPACK_DEBIAN_PACKAGE_DEPENDS,
CPACK_RPM_PACKAGE_REQUIRES, and the legacy
rocm_package_add_dependencies call. Wire a shared scriptlet
(packaging/postinst-check-hsa.sh) as the DEB postinst and the RPM
%post that probes ldconfig + the standard /opt/rocm{,-*,/extras-*,
/core-*}/lib prefixes for libhsa-runtime64.so.1 and prints a stderr
warning when none of them have it. Always exits 0 so install never
fails on its account -- the warning surfaces the missing runtime at
install time instead of as a dynamic-linker error on first run.

numactl / libnuma1 stay in the dep list since those are real OS
packages that exist independent of any ROCm install.
@nileshnegi nileshnegi requested a review from a team as a code owner May 23, 2026 04:01
Copilot AI review requested due to automatic review settings May 23, 2026 04:01
@nileshnegi nileshnegi requested a review from a team as a code owner May 23, 2026 04:01
@nileshnegi nileshnegi removed request for a team May 23, 2026 04:02
@nileshnegi nileshnegi requested a review from gilbertlee-amd May 23, 2026 04:02
@nileshnegi nileshnegi closed this May 23, 2026
@nileshnegi nileshnegi review requested due to automatic review settings May 23, 2026 04:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants