Merge develop branch into candidate#311
Closed
nileshnegi wants to merge 9 commits into
Closed
Conversation
gcc 8.5 (AlmaLinux 8 base, used by quay.io/pypa/manylinux_2_28_x86_64) is older than what Ubuntu 22.04 ships and doesn't pull in some headers transitively or expose std::filesystem from the default libstdc++. - Drop unused #include <barrier> from TransferBench.hpp. <barrier> is C++20 and requires libstdc++ >= 10. std::barrier was never used. - Add explicit #include <iomanip>, <numeric>, <limits> to Utilities.hpp, NicPeerToPeer.hpp, AllToAll.hpp, AllToAllN.hpp. Older libstdc++ does not pull these in via other standard headers. - Link libstdc++fs unconditionally for std::filesystem. gcc < 9 ships it as a separate static archive in a gcc-private libdir (e.g. /usr/lib/gcc/x86_64-redhat-linux/8/libstdc++fs.a), which CMake's find_library does not search. Bare target_link_libraries(... stdc++fs) lets the compiler driver find it. On gcc 9+ it resolves to a no-op stub archive shipped for compatibility, so it's harmless elsewhere. Verified by reproducing the build inside quay.io/pypa/manylinux_2_28_x86_64 locally. Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
* Add GitHub Actions workflow for relocatable package builds via TheRock Adds a CI workflow that builds DEB/RPM/TGZ packages of TransferBench against the TheRock nightly ROCm SDK, modeled on the ROCmValidationSuite packaging workflow. Packages install to /opt/rocm/extras-<MAJOR> with $ORIGIN-relative RPATH so they are relocatable. - build_packages_local.sh: single source of truth for both local and CI builds. Detects Ubuntu vs AlmaLinux/manylinux, installs deps, fetches TheRock SDK tarball, configures CMake with relocatable RPATH and the new BUILD_RELOCATABLE_PACKAGE option, builds, and invokes CPack for DEB/RPM/TGZ. - .github/workflows/build-relocatable-packages.yml: parallel Ubuntu 22.04 + manylinux_2_28 jobs triggered on push, PR, daily cron, and workflow_dispatch. OIDC-based S3 upload gated on AWS_S3_BUCKET being set; apt/yum repo metadata generated for non-PR builds. Build report artifact summarizes S3 paths. - .github/workflows/README_BUILD_PACKAGES.md: workflow docs covering triggers, local usage, S3 layout, IAM trust policy, and apt/yum install snippets. - CMakeLists.txt: new BUILD_RELOCATABLE_PACKAGE option that bypasses rocm_install/rocm_create_package, names the package amdrocm<MAJOR>-transferbench, and honors caller-set install prefix and CPACK_*_PACKAGE_RELEASE env vars. Default cmake .. behavior is unchanged.
) * ci: fix S3 metadata sync without requiring DeleteObject permission Follow RVS pattern for apt/yum repo metadata generation: - Download existing packages from S3 first - Regenerate metadata for all packages (existing + new) - Sync everything back (metadata files overwrite naturally) - Remove --delete flag that required s3:DeleteObject permission This fixes the AccessDenied error when syncing yum repodata: "User: arn:aws:sts::317668459450:assumed-role/therock-rocm-transferbench-releases-s3-oidc/GitHubActions is not authorized to perform: s3:DeleteObject" Changes: - apt metadata: Use aws s3 sync instead of individual cp commands - yum metadata: Remove --delete flag, sync down existing RPMs first - Both: Add --no-progress flag and human-readable output - Both: Copy current build packages explicitly before metadata generation Reference: ROCm/ROCmValidationSuite workflow (lines 175-204, 342-370) * ci: address Copilot review feedback - Remove || true from S3 sync commands to catch real errors - Restrict apt sync to *.deb only (exclude .tar.gz downloads) - Update RPM log message to reflect that RPMs + repodata are synced Copilot feedback rationale: 1. Silent failures (|| true) can lead to incomplete metadata generation 2. Empty S3 prefixes typically succeed anyway, so explicit handling unnecessary 3. Downloading only *.deb files is more efficient and faster 4. Log message should accurately reflect what's being synced
* ci: add CodeQL security scanning workflow Add CodeQL static analysis workflow following ROCm project standards (amdsmi/aqlprofile pattern). Scans C/C++ code for security vulnerabilities. - Runs on develop/mainline branch pushes and PRs - Weekly scheduled scan on Fridays - Uses security-extended query suite - Builds with minimal dependencies (no NIC/MPI) for faster analysis Part of TheRock component onboarding requirements. * fix: use ROCm container for CodeQL build CodeQL analysis needs ROCm/HIP installed to build TransferBench. Switch to rocm/dev-ubuntu-22.04 container following aqlprofile pattern. - Add git installation in container - Configure git safe directory - Add CMAKE_PREFIX_PATH=/opt/rocm for hip-config.cmake discovery * ci: add 'candidate' branch to CodeQL trigger list * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * ci: address Copilot review feedback on CodeQL workflow - Add -y flag to apt-add-repository to avoid interactive prompts - Replace safe.directory wildcard with GITHUB_WORKSPACE for minimal permissions Rationale: 1. Interactive prompts can hang CI jobs waiting for user input 2. Using '*' for safe.directory is unnecessarily permissive; GITHUB_WORKSPACE provides sufficient access while maintaining defense-in-depth Note: Container image intentionally remains unpinned per maintainer preference --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* ci: align relocatable package flow with RVS reference Brings TransferBench's TheRock-SDK-based package build into closer parity with the ROCmValidationSuite reference flow: - CMakeLists.txt: auto-compute patch version from `git describe --tags --match v<MAJOR>.<MINOR>.*` (commits since the last matching tag), falling back to the hardcoded patch when no tag is reachable. No behavior change today since no v1.66.* tag exists yet; once one is created, builds become 1.66.<commits-since>. - CMakeLists.txt: move relocatable RPATH defaults out of build_packages_local.sh and into the BUILD_RELOCATABLE_PACKAGE block so plain `cmake -DBUILD_RELOCATABLE_PACKAGE=ON ..` produces the same RPATH as a packaged build. RPATH list now includes /opt/rocm/lib/llvm, /opt/rocm/core-<MAJOR>/lib, and the matching llvm path. - build_packages_local.sh: package release tag now follows the RVS format (`r<libpatch>.<yyyymmdd>` for default builds, `r<libpatch>.<yyyymmdd>.<src-branch>.<commit>` for PRs, GITHUB_RUN_NUMBER for release/* branches). UTC date captured once so long builds can't straddle midnight. Drops the over-permissive `rel*` prefix matcher. - build_packages_local.sh: drop the now-redundant -DCMAKE_INSTALL_RPATH / -DCMAKE_SKIP_RPATH / -DCMAKE_INSTALL_RPATH_USE_LINK_PATH flags (CMakeLists.txt is now the single source of truth). - docs/install/INSTALL_TGZ.rst: new user-facing TGZ install guide covering ROCm pre-install, runtime deps, extraction, PATH / LD_LIBRARY_PATH setup, and persistent profile.d configuration. - docs/install/install.rst: cross-link to the new TGZ doc. - README_BUILD_PACKAGES.md: replace the inline TGZ install snippet with a link to the new doc plus a minimal smoke test for CI maintainers. * docs: drop --help (does not exist on develop) for plain TransferBench Address @gilbertlee-amd review on PR #287: --help doesn't exist. In candidate we already introduce the "help" preset [...]. For CI moving forward, we likely want to use ./TransferBench smoketest. However, that's also in candidate branch still. On develop today, running TransferBench with no arguments prints version, usage, the available preset list, and the detected topology then exits 0 (Client.cpp:41 — `if (argc <= 1)` branch). That is a valid load-time smoke test for the binary. Replace `TransferBench --help` with `TransferBench` in: - .github/workflows/README_BUILD_PACKAGES.md (CI maintainer smoke test) - docs/install/INSTALL_TGZ.rst (end-user verify step) Add a forward-looking note in the README that once the `help` and `smoketest` presets graduate from candidate to develop, the smoke test should switch to `TransferBench smoketest`.
* ci: fix package versioning across DEB/RPM/TGZ builds
Two issues surfaced from run 25393634415 on release/test_artifact:
1. manylinux container falls back to TRANSFERBENCH_VERSION_PATCH="02"
because git's "dubious ownership" guard rejects the host-UID-owned
workspace when the container runs as root. Same commit produced
1.66.11 on Ubuntu and 1.66.02 on manylinux. Mark the repo safe for
the container's root user before CMake's git probe runs.
2. CPack writes <name>-<version>-Linux.{deb,rpm,tar.gz}, omitting the
release tag. Successive release/* runs overwrite each other in S3
and the run number is invisible from the filename. Switch DEB/RPM
to canonical CPack filenames (which embed release + arch) and
thread the release tag into CPACK_ARCHIVE_FILE_NAME for the TGZ.
* ci: pass package release as -D arg, not just env
Run 25819267330 confirmed the version-sync fix works (both jobs report
1.66.13) and that DEB/RPM filenames now embed the release. But the
Ubuntu TGZ filename still came out as `…-1.66.13-Linux.tar.gz` — no
release tag — while manylinux's TGZ embedded it correctly.
Root cause: CMakeLists derived the release tag from
`ENV{CPACK_RPM_PACKAGE_RELEASE}` at configure time. Ubuntu invokes the
script via `sudo -E …`, and although the script then `export`s those
env vars, they don't reliably reach CMake's environment on the sudo
path. The DEB filename still worked because `DEB-DEFAULT` is processed
by CPack at package time, not configure time, so the env var
propagated to the `cpack -G DEB` child directly.
Fix: pass `PKG_RELEASE` as `-DTRANSFERBENCH_PACKAGE_RELEASE=…` from
the build script. CMake reads it directly from the cache, bypassing
env propagation entirely. Env-var fallbacks remain so direct cmake
invocations (without the wrapper script) keep working.
* ci: temporary debug print to diagnose Ubuntu TGZ release-tag miss
* ci: dump CPackConfig.cmake archive name to diagnose Ubuntu TGZ
* ci: also set CPACK_PACKAGE_FILE_NAME for TGZ on older CMake
CMake 3.26 (manylinux container) honors CPACK_ARCHIVE_FILE_NAME for
the TGZ generator. CMake 3.22 (Ubuntu 22.04 system cmake) falls back
to CPACK_PACKAGE_FILE_NAME instead, ignoring my ARCHIVE_FILE_NAME
override. Diagnostic dump confirmed both jobs wrote identical
CPackConfig.cmake — the divergence is purely in cpack runtime
behavior between the two CMake versions.
Set both to the same suffixed value. DEB/RPM are unaffected because
they use the explicit DEB-DEFAULT / RPM-DEFAULT canonical-naming
tokens, which take precedence over CPACK_PACKAGE_FILE_NAME.
Also remove the temporary debug prints from CMakeLists.txt and the
CPackConfig.cmake dump from build_packages_local.sh.
* ci: address Copilot PR review on packaging changes
- Replace `git config --global --add safe.directory` with the GIT_CONFIG_*
env-var triple so the safe.directory entry is scoped to this build (and
inherited by CMake's `execute_process(git …)` children) instead of being
written to root's ~/.gitconfig under sudo. Also drops the `|| true` so a
setup failure surfaces immediately instead of letting the version probe
silently fall back.
- Quote `"${VAR}"` in `STREQUAL ""` comparisons. CMake's `if(AND ...)` does
not short-circuit, and bare-identifier dereferencing inside `if()` depends
on policy CMP0054. Quoting makes the comparison unambiguous.
- Fail fast when BUILD_RELOCATABLE_PACKAGE is requested on CMake older than
3.13. The block uses CPACK_ARCHIVE_FILE_NAME (3.13+) and the
DEB-DEFAULT / RPM-DEFAULT canonical-naming sentinels (3.6+); on a 3.5
CMake those would silently produce a literal "DEB-DEFAULT" filename. The
project-wide cmake_minimum_required of 3.5 is left alone (out of scope
here); the gate is local to the relocatable-package path.
The relocatable DEB/RPM hard-depended on hsa-rocr, but ROCm 7.13 no
longer ships a package by that name -- install fails on TheRock 7.13
with "amdrocm7-transferbench depends on hsa-rocr; however: Package
hsa-rocr is not configured yet" (ROCM-24669).
The dep also never made sense for the relocatable artifact: it is
built from the TheRock SDK and meant to install on systems where ROCm
came from a tarball and is not tracked by apt/dpkg at all. Any apt
dep on a ROCm component breaks that audience.
Drop hsa-rocr from CPACK_DEBIAN_PACKAGE_DEPENDS,
CPACK_RPM_PACKAGE_REQUIRES, and the legacy
rocm_package_add_dependencies call. Wire a shared scriptlet
(packaging/postinst-check-hsa.sh) as the DEB postinst and the RPM
%post that probes ldconfig + the standard /opt/rocm{,-*,/extras-*,
/core-*}/lib prefixes for libhsa-runtime64.so.1 and prints a stderr
warning when none of them have it. Always exits 0 so install never
fails on its account -- the warning surfaces the missing runtime at
install time instead of as a dynamic-linker error on first run.
numactl / libnuma1 stay in the dep list since those are real OS
packages that exist independent of any ROCm install.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
candidateanddevelopbranches have diverged.Technical Details
Test Plan
Test Result
Submission Checklist