[Feat/Perf] Hermetic UCX Build with CUDA Support and AM Performance Optimizations by MoFHeka · Pull Request #20 · MoFHeka/execution-ucx

MoFHeka · 2026-05-11T14:15:57Z

Description:
This PR establishes a fully hermetic build process for UCX within the Bazel environment and introduces significant performance optimizations for the Active Message (AM) runtime. It also enables GPU-aware communication through properly integrated CUDA support.

Key Changes:

Hermetic UCX Build: Refactored build_ucx.sh and Bazel build rules to eliminate all host-system dependency fallbacks. UCX is now built strictly within the Bazel sandbox using the project's hermetic toolchain.
CUDA Integration: Switched to @rules_ml_toolchain for robust CUDA detection. Ensured libcuda.so and related headers are correctly exposed to the UCX build process, enabling seamless GPU memory communication.
AM Runtime Optimization:
- Refactored ucx_am_context and ucx_connection for lower latency and better resource management.
- Optimized callback handling in AM send operations to reduce overhead.
- Fixed socket address initialization (sin_zero zero-out) to ensure reliable connection establishment.
Performance Benchmarking:
- Introduced //axon/python:benchmark_active_message to validate throughput and latency across both host and device memory.
- Standardized performance measurement metrics for performance regression tracking.
Engineering Quality: Modularized third_party/openucx build logic and consolidated .bazelrc configurations for better maintainability and reproducibility.

…k tool

…mory safety. Which is "pure virtual method called" may happened in new benchmark script axon/python/tests/benchmark_active_message.py.

…nt throughput This commit completely eliminates dynamic heap allocations in the hot path for Active Message operations, significantly reducing allocator lock contention under high concurrency. Key optimizations: 1. Replaced `std::unique_ptr` with `std::optional` for `CqeEntryCallback` and `DirectEntryCallback` in `ucx_am_context` sender/receiver operations, embedding the callback lifecycle directly into the unifex operation state. 2. Refactored `InvokeContext` (Python bindings) to utilize stack-allocated `std::array` instead of `std::vector` for the standard fast-path (<= 4 tensors), falling back to dynamic allocation only for large RPC arguments. 3. Updated `UcxConnection` to accept raw callback pointers instead of transferring ownership via `std::unique_ptr`. Performance gains (measured via concurrent throughput benchmark): - Eager-Tiny (64 B): 342.0 QPS -> 487.8 QPS (+42.6%) - Eager-Small (4 KiB): 288.6 QPS -> 450.6 QPS (+56.1%) These architectural zero-copy/stack-only changes significantly elevate the RPC throughput ceiling and reduce tail latency jitter in highly concurrent distributed training workloads.

…dd active message benchmark ablation suite.

…for UCX and update CUDA configuration logic.

gemini-code-assist

Code Review

This pull request refactors the Bazel build system to implement a hermetic ML toolchain for C++ and CUDA, modularizes configuration files, and introduces a custom native build rule for UCX. It also adds a Python performance benchmark and improves GIL safety during interpreter shutdown. Feedback focuses on improving the UCX build script by respecting Bazel's compiler environment and resource management, ensuring strict hermeticity for CUDA tools, and optimizing Python object lifecycle management during shutdown.

…pre-built .deb libraries and building missing components from source. Fix test bug.

… suppress unused parameter warnings.

…ild process with component stripping.

…add Python package configuration for axon.

…or safety

…nd remove obsolete scripts

…and introduce utility scripts for development and benchmarking.

… hooks.

MoFHeka · 2026-05-15T09:09:32Z

/gemini review

gemini-code-assist

Code Review

This pull request implements a major update to the build system and memory management architecture, transitioning to modular Bazel configurations and a hermetic ML toolchain. A key feature is the introduction of UcxAllocatorContext to safely manage the lifecycle of memory resources, preventing segmentation faults when objects outlive the AxonRuntime. The Python bindings have also been restructured for better distribution and performance. Technical feedback identifies several critical issues, including a data race on the is_alive lifecycle flag, a memory leak in the UcxMemoryResourceManager, and potential hermeticity breaks in the UCX build scripts where host tools or incorrect linker flags might be used.

gemini-code-assist · 2026-05-15T09:14:07Z

+
+struct UcxAllocatorContext {
+  UcxMemoryResourceManager* mr;
+  bool is_alive;


The is_alive flag is accessed from multiple threads (e.g., the thread destroying the UcxMemoryResourceManager and threads destroying UcxBuffer objects) without synchronization. This constitutes a data race. Please change is_alive to std::atomic<bool> to ensure thread-safe access.

Suggested change

bool is_alive;

std::atomic<bool> is_alive;

gemini-code-assist · 2026-05-15T09:14:07Z

-  UcxMemoryResourceManager() = default;
-  virtual ~UcxMemoryResourceManager() = default;
+  UcxMemoryResourceManager()
+    : allocator_context_(new UcxAllocatorContext{this, true}) {}


The allocator_context_ is allocated with new in the constructor but is never deleted in the destructor (line 51 only sets is_alive = false). This results in a memory leak of the UcxAllocatorContext object every time a UcxMemoryResourceManager is destroyed. While the object is small, it should be properly managed, for example by using a std::unique_ptr if ownership is strictly with the MR, or a std::shared_ptr if buffers need to keep the context alive after the MR is gone.

gemini-code-assist · 2026-05-15T09:14:07Z

+        action_name = "c-compile",
+        variables = cc_common.create_compile_variables(
+            feature_configuration = feature_configuration,
+            cc_toolchain = cc_toolchain,


The UCX build process involves compiling many C files. The current rule only passes ctx.fragments.cpp.copts to the environment. For a C project like UCX, you should also include ctx.fragments.cpp.conlyopts to ensure that C-specific flags (like -std=gnu17 defined in .bazelrc) are correctly applied during the hermetic build.

Suggested change

cc_toolchain = cc_toolchain,

user_compile_flags = ctx.fragments.cpp.copts + ctx.fragments.cpp.conlyopts,

gemini-code-assist · 2026-05-15T09:14:07Z

+	elif command -v nvcc &>/dev/null; then
+		export NVCC=$(command -v nvcc)
+	fi


The fallback to the host's nvcc via command -v nvcc violates the goal of a fully hermetic build process. If CUDA is enabled but the hermetic nvcc is not found in the Bazel sandbox, the build should fail rather than picking up a potentially incompatible version from the host system.

gemini-code-assist · 2026-05-15T09:14:07Z

+		$CC -shared -fPIC -Wno-error \
+			-I"$XPMEM_INC" -I"$XPMEM_LIB_DIR" \
+			-o "$EXECROOT/sysroot/lib/libxpmem.so" \
+			"$XPMEM_SRC"
+		echo "  Built libxpmem.so"


When building libxpmem.so from source, the script is missing the $LDFLAGS environment variable. This might cause the compiler to use the host linker instead of the hermetic LLVM linker (ld.lld) configured earlier in the script, potentially leading to ABI issues or non-reproducible builds.

Suggested change

$CC -shared -fPIC -Wno-error \

-I"$XPMEM_INC" -I"$XPMEM_LIB_DIR" \

-o "$EXECROOT/sysroot/lib/libxpmem.so" \

"$XPMEM_SRC"

echo " Built libxpmem.so"

$CC -shared -fPIC $LDFLAGS -Wno-error \

-I"$XPMEM_INC" -I"$XPMEM_LIB_DIR" \

-o "$EXECROOT/sysroot/lib/libxpmem.so" \

"$XPMEM_SRC"

MoFHeka added 6 commits May 9, 2026 17:29

[Feat] Non-Bloack send and recv

899d631

feat: optimize active message eager path and add performance benchmar…

d47c2e3

…k tool

[Feat] Propagate UcxMemoryResourceManager via shared_ptr to ensure me…

6a65d26

…mory safety. Which is "pure virtual method called" may happened in new benchmark script axon/python/tests/benchmark_active_message.py.

[Refactor] Update memory manager references to const shared_ptr and a…

3efa9bf

…dd active message benchmark ablation suite.

[Refactor] Replace rules_foreign_cc with custom native build process …

4b37555

…for UCX and update CUDA configuration logic.

gemini-code-assist Bot reviewed May 11, 2026

View reviewed changes

MoFHeka added 4 commits May 12, 2026 00:15

[Fix] Initialize sin_zero field in sockaddr_in structures within tests.

38590ae

[Refactor] Transition to hermetic dependency management by importing …

e176f8d

…pre-built .deb libraries and building missing components from source. Fix test bug.

[Feat] Upgrade toolchain to hermetic LLVM 20, add helper scripts, and…

e8ab9e2

… suppress unused parameter warnings.

[Feat] Introduce environment setup scripts and harden hermetic UCX bu…

6777197

…ild process with component stripping.

MoFHeka force-pushed the axon_performance branch from 25be188 to 6777197 Compare May 13, 2026 01:50

MoFHeka added 6 commits May 13, 2026 17:45

[Refactor] Introduce runtime plugin separation for UCX libraries and …

b57d26f

…add Python package configuration for axon.

chore: update test_utils_lib imports.

8a5052b

[Fix] so path

43f3a6d

[Refactor] Replace UcxAllocatorContext* with std::reference_wrapper f…

73a9bf7

…or safety

[Refactor] suppress unused parameter warnings with [[maybe_unused]] a…

7448dcb

…nd remove obsolete scripts

[Feat] update UcxAmData constructors to accept UCX contexts

cc5784d

MoFHeka force-pushed the axon_performance branch from f59339f to cc5784d Compare May 13, 2026 18:04

Florentino-Ding and others added 3 commits May 14, 2026 21:51

[Fix] Fix template symbol error and ld path error.

cc21e11

[Refactor] Optimize wheel file packaging, remove legacy setup logic, …

3b64af5

…and introduce utility scripts for development and benchmarking.

[Refactor] Improve UCX memory management, fix DLOpen flags for memory…

914bb65

… hooks.

gemini-code-assist Bot reviewed May 15, 2026

View reviewed changes

[Feat] Enforce hermetic nvcc, and make allocator context thread-safe.

6938512

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feat/Perf] Hermetic UCX Build with CUDA Support and AM Performance Optimizations#20

[Feat/Perf] Hermetic UCX Build with CUDA Support and AM Performance Optimizations#20
MoFHeka wants to merge 20 commits into
mainfrom
axon_performance

MoFHeka commented May 11, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

MoFHeka commented May 15, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 15, 2026

Uh oh!

gemini-code-assist Bot May 15, 2026

Uh oh!

gemini-code-assist Bot May 15, 2026

Uh oh!

gemini-code-assist Bot May 15, 2026

Uh oh!

gemini-code-assist Bot May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	cc_toolchain = cc_toolchain,
	user_compile_flags = ctx.fragments.cpp.copts + ctx.fragments.cpp.conlyopts,

Conversation

MoFHeka commented May 11, 2026

Key Changes:

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

MoFHeka commented May 15, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants