[Feat/Perf] Hermetic UCX Build with CUDA Support and AM Performance Optimizations#20
[Feat/Perf] Hermetic UCX Build with CUDA Support and AM Performance Optimizations#20MoFHeka wants to merge 20 commits into
Conversation
…mory safety. Which is "pure virtual method called" may happened in new benchmark script axon/python/tests/benchmark_active_message.py.
…nt throughput This commit completely eliminates dynamic heap allocations in the hot path for Active Message operations, significantly reducing allocator lock contention under high concurrency. Key optimizations: 1. Replaced `std::unique_ptr` with `std::optional` for `CqeEntryCallback` and `DirectEntryCallback` in `ucx_am_context` sender/receiver operations, embedding the callback lifecycle directly into the unifex operation state. 2. Refactored `InvokeContext` (Python bindings) to utilize stack-allocated `std::array` instead of `std::vector` for the standard fast-path (<= 4 tensors), falling back to dynamic allocation only for large RPC arguments. 3. Updated `UcxConnection` to accept raw callback pointers instead of transferring ownership via `std::unique_ptr`. Performance gains (measured via concurrent throughput benchmark): - Eager-Tiny (64 B): 342.0 QPS -> 487.8 QPS (+42.6%) - Eager-Small (4 KiB): 288.6 QPS -> 450.6 QPS (+56.1%) These architectural zero-copy/stack-only changes significantly elevate the RPC throughput ceiling and reduce tail latency jitter in highly concurrent distributed training workloads.
…dd active message benchmark ablation suite.
…for UCX and update CUDA configuration logic.
There was a problem hiding this comment.
Code Review
This pull request refactors the Bazel build system to implement a hermetic ML toolchain for C++ and CUDA, modularizes configuration files, and introduces a custom native build rule for UCX. It also adds a Python performance benchmark and improves GIL safety during interpreter shutdown. Feedback focuses on improving the UCX build script by respecting Bazel's compiler environment and resource management, ensuring strict hermeticity for CUDA tools, and optimizing Python object lifecycle management during shutdown.
…pre-built .deb libraries and building missing components from source. Fix test bug.
… suppress unused parameter warnings.
…ild process with component stripping.
…add Python package configuration for axon.
…nd remove obsolete scripts
…and introduce utility scripts for development and benchmarking.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements a major update to the build system and memory management architecture, transitioning to modular Bazel configurations and a hermetic ML toolchain. A key feature is the introduction of UcxAllocatorContext to safely manage the lifecycle of memory resources, preventing segmentation faults when objects outlive the AxonRuntime. The Python bindings have also been restructured for better distribution and performance. Technical feedback identifies several critical issues, including a data race on the is_alive lifecycle flag, a memory leak in the UcxMemoryResourceManager, and potential hermeticity breaks in the UCX build scripts where host tools or incorrect linker flags might be used.
|
|
||
| struct UcxAllocatorContext { | ||
| UcxMemoryResourceManager* mr; | ||
| bool is_alive; |
There was a problem hiding this comment.
The is_alive flag is accessed from multiple threads (e.g., the thread destroying the UcxMemoryResourceManager and threads destroying UcxBuffer objects) without synchronization. This constitutes a data race. Please change is_alive to std::atomic<bool> to ensure thread-safe access.
| bool is_alive; | |
| std::atomic<bool> is_alive; |
| UcxMemoryResourceManager() = default; | ||
| virtual ~UcxMemoryResourceManager() = default; | ||
| UcxMemoryResourceManager() | ||
| : allocator_context_(new UcxAllocatorContext{this, true}) {} |
There was a problem hiding this comment.
The allocator_context_ is allocated with new in the constructor but is never deleted in the destructor (line 51 only sets is_alive = false). This results in a memory leak of the UcxAllocatorContext object every time a UcxMemoryResourceManager is destroyed. While the object is small, it should be properly managed, for example by using a std::unique_ptr if ownership is strictly with the MR, or a std::shared_ptr if buffers need to keep the context alive after the MR is gone.
| action_name = "c-compile", | ||
| variables = cc_common.create_compile_variables( | ||
| feature_configuration = feature_configuration, | ||
| cc_toolchain = cc_toolchain, |
There was a problem hiding this comment.
The UCX build process involves compiling many C files. The current rule only passes ctx.fragments.cpp.copts to the environment. For a C project like UCX, you should also include ctx.fragments.cpp.conlyopts to ensure that C-specific flags (like -std=gnu17 defined in .bazelrc) are correctly applied during the hermetic build.
| cc_toolchain = cc_toolchain, | |
| user_compile_flags = ctx.fragments.cpp.copts + ctx.fragments.cpp.conlyopts, |
| elif command -v nvcc &>/dev/null; then | ||
| export NVCC=$(command -v nvcc) | ||
| fi |
There was a problem hiding this comment.
| $CC -shared -fPIC -Wno-error \ | ||
| -I"$XPMEM_INC" -I"$XPMEM_LIB_DIR" \ | ||
| -o "$EXECROOT/sysroot/lib/libxpmem.so" \ | ||
| "$XPMEM_SRC" | ||
| echo " Built libxpmem.so" |
There was a problem hiding this comment.
When building libxpmem.so from source, the script is missing the $LDFLAGS environment variable. This might cause the compiler to use the host linker instead of the hermetic LLVM linker (ld.lld) configured earlier in the script, potentially leading to ABI issues or non-reproducible builds.
| $CC -shared -fPIC -Wno-error \ | |
| -I"$XPMEM_INC" -I"$XPMEM_LIB_DIR" \ | |
| -o "$EXECROOT/sysroot/lib/libxpmem.so" \ | |
| "$XPMEM_SRC" | |
| echo " Built libxpmem.so" | |
| $CC -shared -fPIC $LDFLAGS -Wno-error \ | |
| -I"$XPMEM_INC" -I"$XPMEM_LIB_DIR" \ | |
| -o "$EXECROOT/sysroot/lib/libxpmem.so" \ | |
| "$XPMEM_SRC" |
Description:
This PR establishes a fully hermetic build process for UCX within the Bazel environment and introduces significant performance optimizations for the Active Message (AM) runtime. It also enables GPU-aware communication through properly integrated CUDA support.
Key Changes:
build_ucx.shand Bazel build rules to eliminate all host-system dependency fallbacks. UCX is now built strictly within the Bazel sandbox using the project's hermetic toolchain.@rules_ml_toolchainfor robust CUDA detection. Ensuredlibcuda.soand related headers are correctly exposed to the UCX build process, enabling seamless GPU memory communication.ucx_am_contextanducx_connectionfor lower latency and better resource management.sin_zerozero-out) to ensure reliable connection establishment.//axon/python:benchmark_active_messageto validate throughput and latency across both host and device memory.third_party/openucxbuild logic and consolidated.bazelrcconfigurations for better maintainability and reproducibility.