Skip to content

[Feat/Perf] Hermetic UCX Build with CUDA Support and AM Performance Optimizations#20

Open
MoFHeka wants to merge 20 commits into
mainfrom
axon_performance
Open

[Feat/Perf] Hermetic UCX Build with CUDA Support and AM Performance Optimizations#20
MoFHeka wants to merge 20 commits into
mainfrom
axon_performance

Conversation

@MoFHeka
Copy link
Copy Markdown
Owner

@MoFHeka MoFHeka commented May 11, 2026

Description:
This PR establishes a fully hermetic build process for UCX within the Bazel environment and introduces significant performance optimizations for the Active Message (AM) runtime. It also enables GPU-aware communication through properly integrated CUDA support.

Key Changes:

  • Hermetic UCX Build: Refactored build_ucx.sh and Bazel build rules to eliminate all host-system dependency fallbacks. UCX is now built strictly within the Bazel sandbox using the project's hermetic toolchain.
  • CUDA Integration: Switched to @rules_ml_toolchain for robust CUDA detection. Ensured libcuda.so and related headers are correctly exposed to the UCX build process, enabling seamless GPU memory communication.
  • AM Runtime Optimization:
    • Refactored ucx_am_context and ucx_connection for lower latency and better resource management.
    • Optimized callback handling in AM send operations to reduce overhead.
    • Fixed socket address initialization (sin_zero zero-out) to ensure reliable connection establishment.
  • Performance Benchmarking:
    • Introduced //axon/python:benchmark_active_message to validate throughput and latency across both host and device memory.
    • Standardized performance measurement metrics for performance regression tracking.
  • Engineering Quality: Modularized third_party/openucx build logic and consolidated .bazelrc configurations for better maintainability and reproducibility.

MoFHeka added 6 commits May 9, 2026 17:29
…mory safety. Which is "pure virtual method called" may happened in new benchmark script axon/python/tests/benchmark_active_message.py.
…nt throughput

This commit completely eliminates dynamic heap allocations in the hot path for Active Message operations, significantly reducing allocator lock contention under high concurrency.

Key optimizations:
1. Replaced `std::unique_ptr` with `std::optional` for `CqeEntryCallback` and `DirectEntryCallback` in `ucx_am_context` sender/receiver operations, embedding the callback lifecycle directly into the unifex operation state.
2. Refactored `InvokeContext` (Python bindings) to utilize stack-allocated `std::array` instead of `std::vector` for the standard fast-path (<= 4 tensors), falling back to dynamic allocation only for large RPC arguments.
3. Updated `UcxConnection` to accept raw callback pointers instead of transferring ownership via `std::unique_ptr`.

Performance gains (measured via concurrent throughput benchmark):
- Eager-Tiny (64 B): 342.0 QPS -> 487.8 QPS (+42.6%)
- Eager-Small (4 KiB): 288.6 QPS -> 450.6 QPS (+56.1%)

These architectural zero-copy/stack-only changes significantly elevate the RPC throughput ceiling and reduce tail latency jitter in highly concurrent distributed training workloads.
…for UCX and update CUDA configuration logic.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the Bazel build system to implement a hermetic ML toolchain for C++ and CUDA, modularizes configuration files, and introduces a custom native build rule for UCX. It also adds a Python performance benchmark and improves GIL safety during interpreter shutdown. Feedback focuses on improving the UCX build script by respecting Bazel's compiler environment and resource management, ensuring strict hermeticity for CUDA tools, and optimizing Python object lifecycle management during shutdown.

Comment thread third_party/openucx/build_ucx.sh Outdated
Comment thread third_party/openucx/build_ucx.sh Outdated
Comment thread third_party/openucx/build_ucx.sh Outdated
Comment thread third_party/openucx/build_ucx.sh
Comment thread third_party/openucx/native_ucx_build.bzl
Comment thread axon/python/src/bindings_runtime.cpp
@MoFHeka MoFHeka force-pushed the axon_performance branch from 25be188 to 6777197 Compare May 13, 2026 01:50
@MoFHeka MoFHeka force-pushed the axon_performance branch from f59339f to cc5784d Compare May 13, 2026 18:04
@MoFHeka
Copy link
Copy Markdown
Owner Author

MoFHeka commented May 15, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a major update to the build system and memory management architecture, transitioning to modular Bazel configurations and a hermetic ML toolchain. A key feature is the introduction of UcxAllocatorContext to safely manage the lifecycle of memory resources, preventing segmentation faults when objects outlive the AxonRuntime. The Python bindings have also been restructured for better distribution and performance. Technical feedback identifies several critical issues, including a data race on the is_alive lifecycle flag, a memory leak in the UcxMemoryResourceManager, and potential hermeticity breaks in the UCX build scripts where host tools or incorrect linker flags might be used.

Comment thread ucx_context/ucx_memory_resource.hpp Outdated

struct UcxAllocatorContext {
UcxMemoryResourceManager* mr;
bool is_alive;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The is_alive flag is accessed from multiple threads (e.g., the thread destroying the UcxMemoryResourceManager and threads destroying UcxBuffer objects) without synchronization. This constitutes a data race. Please change is_alive to std::atomic<bool> to ensure thread-safe access.

Suggested change
bool is_alive;
std::atomic<bool> is_alive;

Comment thread ucx_context/ucx_memory_resource.hpp Outdated
UcxMemoryResourceManager() = default;
virtual ~UcxMemoryResourceManager() = default;
UcxMemoryResourceManager()
: allocator_context_(new UcxAllocatorContext{this, true}) {}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The allocator_context_ is allocated with new in the constructor but is never deleted in the destructor (line 51 only sets is_alive = false). This results in a memory leak of the UcxAllocatorContext object every time a UcxMemoryResourceManager is destroyed. While the object is small, it should be properly managed, for example by using a std::unique_ptr if ownership is strictly with the MR, or a std::shared_ptr if buffers need to keep the context alive after the MR is gone.

action_name = "c-compile",
variables = cc_common.create_compile_variables(
feature_configuration = feature_configuration,
cc_toolchain = cc_toolchain,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The UCX build process involves compiling many C files. The current rule only passes ctx.fragments.cpp.copts to the environment. For a C project like UCX, you should also include ctx.fragments.cpp.conlyopts to ensure that C-specific flags (like -std=gnu17 defined in .bazelrc) are correctly applied during the hermetic build.

Suggested change
cc_toolchain = cc_toolchain,
user_compile_flags = ctx.fragments.cpp.copts + ctx.fragments.cpp.conlyopts,

Comment thread third_party/openucx/build_ucx.sh Outdated
Comment on lines +128 to +130
elif command -v nvcc &>/dev/null; then
export NVCC=$(command -v nvcc)
fi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The fallback to the host's nvcc via command -v nvcc violates the goal of a fully hermetic build process. If CUDA is enabled but the hermetic nvcc is not found in the Bazel sandbox, the build should fail rather than picking up a potentially incompatible version from the host system.

Comment thread third_party/openucx/build_ucx.sh Outdated
Comment on lines +257 to +261
$CC -shared -fPIC -Wno-error \
-I"$XPMEM_INC" -I"$XPMEM_LIB_DIR" \
-o "$EXECROOT/sysroot/lib/libxpmem.so" \
"$XPMEM_SRC"
echo " Built libxpmem.so"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When building libxpmem.so from source, the script is missing the $LDFLAGS environment variable. This might cause the compiler to use the host linker instead of the hermetic LLVM linker (ld.lld) configured earlier in the script, potentially leading to ABI issues or non-reproducible builds.

Suggested change
$CC -shared -fPIC -Wno-error \
-I"$XPMEM_INC" -I"$XPMEM_LIB_DIR" \
-o "$EXECROOT/sysroot/lib/libxpmem.so" \
"$XPMEM_SRC"
echo " Built libxpmem.so"
$CC -shared -fPIC $LDFLAGS -Wno-error \
-I"$XPMEM_INC" -I"$XPMEM_LIB_DIR" \
-o "$EXECROOT/sysroot/lib/libxpmem.so" \
"$XPMEM_SRC"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants