
Handling dependency differences (CUDA / Boost) for wire-cell-toolkit@spng in Spack builds #39

@physnerds

Context

We are currently developing and testing the SPNG branch of Wire-Cell Toolkit on a GPU machine (wcgpu) using a Spack-based environment (wcwc installation).

As we prepare to scale and deploy to other machines, we are relying heavily on Spack installations of wire-cell-toolkit@spng.


Problem

We are encountering subtle but critical dependency differences across environments, particularly:

  • CUDA version differences (e.g., CUDA 11 vs CUDA 12+)
  • Boost version differences

These differences require small but necessary source-level changes in SPNG for successful builds.


Example: NVTX changes in CUDA ≥ 12

In CUDA 12 and newer:

  • NVTX is now header-only

  • Headers moved to:

    <nvtx3/nvToolsExt.h>
    
  • The libnvToolsExt library no longer exists

Required changes in SPNG:

// Old
#include <nvToolsExt.h>

// New
#include <nvtx3/nvToolsExt.h>

And remove linking against:

-lnvToolsExt

Current Workaround (Spack patch via package.py)

We are applying temporary fixes using filter_file() inside the patch() function:

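# Replace the line that adds the nvToolsExt library with a no-op (literal match).
filter_file(
    "            libs += ['nvToolsExt']",
    "            pass  # nvToolsExt skipped; NVTX3 is header-only in modern CUDA",
    "waft/libtorch.py",
    string=True,
)

# Remove the explicit -lnvToolsExt linker flag.
filter_file(
    r",-lnvToolsExt",
    "",
    "waft/libtorch.py",
)

# Skip the Boost 'system' entry before the existing 'python' check (literal match).
filter_file(
    "if lib == 'python':",
    "if lib == 'system':\n\t\t\t\tcontinue\n\t\t\tif lib == 'python':",
    "waft/boost.py",
    string=True,
)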
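
One way to make the NVTX part of this workaround less environment-specific is to apply it only when the CUDA version calls for it. The following is a minimal sketch, not the actual recipe: `cuda_major`, `strip_nvtoolsext` and `SAMPLE` are hypothetical names, and `SAMPLE` is an invented stand-in for the relevant lines of waft/libtorch.py. It uses plain string replacement to mimic the literal `filter_file` substitutions above so it can run without Spack.

```python
def cuda_major(version: str) -> int:
    """Return the major component of a CUDA version string such as '12.4'."""
    return int(version.split(".")[0])


def strip_nvtoolsext(source: str, cuda_version: str) -> str:
    """Drop nvToolsExt linkage from build-script text when CUDA >= 12.

    For older CUDA the text is returned unchanged, so the legacy
    library is still linked.
    """
    if cuda_major(cuda_version) < 12:
        return source
    # Literal replacements, mirroring filter_file(..., string=True).
    source = source.replace(
        "libs += ['nvToolsExt']",
        "pass  # nvToolsExt skipped; NVTX3 is header-only in modern CUDA",
    )
    return source.replace(",-lnvToolsExt", "")


# Hypothetical stand-in for the relevant lines of waft/libtorch.py.
SAMPLE = (
    "if use_cuda:\n"
    "    libs += ['nvToolsExt']\n"
    "flags = '-Wl,--no-as-needed,-lnvToolsExt'\n"
)
```

In a real package.py the version would presumably come from the concretized spec (for example `self.spec["cuda"].version`) and the substitutions would stay as `filter_file` calls inside the version check.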

Why this is a problem

  • These fixes are environment-specific hacks
  • They are hard to maintain when the software is built on different machines and architectures
  • When the patched part of the source code changes, the corresponding patches must also be updated in the wire-cell-spack repository
  • They introduce maintenance overhead in Spack recipes
  • They reduce portability

Suggested Solutions

Should we add conditional compilation flags to the SPNG codebase? As the code is built on more machines and against more dependency versions, the number of conditionals may grow, but in the long run this is probably more stable than patches in package.py. Examples (based on experience building on the wcgpu and sgpu machines, which have different CUDA versions):

  1. Add conditional compilation in SPNG

    • Detect CUDA version

    • Switch include paths:

      #include <cuda.h>  // defines CUDA_VERSION
      #if CUDA_VERSION >= 12000
        #include <nvtx3/nvToolsExt.h>
      #else
        #include <nvToolsExt.h>
      #endif

  2. Remove explicit linkage to nvToolsExt (already done in TorchnvToolks)

    • Treat NVTX as optional / header-only where applicable

  3. Improve build system (waft)

    • Detect NVTX availability dynamically
    • Avoid hardcoding -lnvToolsExt (based on CUDA version available)
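
The dynamic detection suggested for the build system could look roughly like the sketch below. It assumes a conventional CUDA toolkit layout (`include/`, `lib64/` or `lib/` under the toolkit root), and `nvtx_settings` and `_has_legacy_nvtx_lib` are hypothetical helper names, not existing waft functions. It keys on which files are present rather than on a version number.

```python
from pathlib import Path


def _has_legacy_nvtx_lib(cuda_home: Path) -> bool:
    """True if a libnvToolsExt shared library exists under lib64/ or lib/."""
    for libdir in ("lib64", "lib"):
        directory = cuda_home / libdir
        if directory.is_dir() and any(directory.glob("libnvToolsExt.so*")):
            return True
    return False


def nvtx_settings(cuda_home: str) -> dict:
    """Choose the NVTX header and link libraries for a CUDA installation.

    Preference order (an assumption of this sketch):
      1. header-only NVTX3, nothing to link
      2. legacy header plus libnvToolsExt
      3. NVTX unavailable, treated as optional
    """
    root = Path(cuda_home)
    if (root / "include" / "nvtx3" / "nvToolsExt.h").is_file():
        return {"include": "nvtx3/nvToolsExt.h", "libs": []}
    if (root / "include" / "nvToolsExt.h").is_file() and _has_legacy_nvtx_lib(root):
        return {"include": "nvToolsExt.h", "libs": ["nvToolsExt"]}
    return {"include": None, "libs": []}
```

A configure step could call `nvtx_settings(cuda_home)` once and append the returned `libs` instead of a hardcoded `-lnvToolsExt`, while the returned `include` could drive a preprocessor define on the C++ side.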

Temporary Recommendation

Until a proper fix is implemented:

  • Continue using Spack package.py patches (filter_file) as a workaround
  • Track all such dependency-related patches centrally
  • Should we add a comment in the parts of the SPNG code that the patches touch, so that anyone changing them knows to update the package.py in wire-cell-spack?
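
As a complement to source comments, the patch targets could be tracked in one place and checked automatically. The sketch below is only an illustration: `PATCH_ANCHORS` and `stale_patches` are hypothetical names, and the anchor strings are copied from the `filter_file` calls in this issue. It reports every anchor that no longer appears in the source tree, which is the situation where a patch in package.py would silently stop matching.

```python
from pathlib import Path

# (file relative to the source root, literal text the patch expects to find)
PATCH_ANCHORS = [
    ("waft/libtorch.py", "libs += ['nvToolsExt']"),
    ("waft/libtorch.py", ",-lnvToolsExt"),
    ("waft/boost.py", "if lib == 'python':"),
]


def stale_patches(source_root, anchors=PATCH_ANCHORS):
    """Return the (file, text) pairs whose text is missing from the source tree."""
    stale = []
    for rel_path, needle in anchors:
        path = Path(source_root) / rel_path
        # A missing file counts as stale, the same as missing text.
        if not path.is_file() or needle not in path.read_text():
            stale.append((rel_path, needle))
    return stale
```

Run against a checkout of the SPNG branch, an empty result would mean all tracked patches still have something to match; a non-empty result would list the patches that need attention in wire-cell-spack.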
