Design issue: [[git_repos]] mount_as creates fragile version mismatch #850

@rutayan-nv

Summary

The [[git_repos]] mechanism with mount_as creates a fragile setup where scripts from one version can call core libraries from another version, leading to hard-to-debug errors.

Problem

When using [[git_repos]] with mount_as, a partial override occurs:

  • The container ships a built-in package (e.g., Megatron-Bridge v0.4.0rc0 at /opt/Megatron-Bridge)
  • An external git clone (e.g., v0.3.1) provides the entry scripts via PYTHONPATH
  • The v0.3.1 scripts then import core modules from the container's v0.4.0rc0

This causes:

  1. ModuleNotFoundError - Different module structure between versions
  2. API mismatches - Functions/parameters differ between versions
  3. Silent failures - No validation that git repo version is compatible with container
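
One way to confirm the partial override is to check, from inside the container, where each package actually resolves from. The snippet below is only a diagnostic sketch and is not part of CloudAI: `module_locations`, `is_under`, and `classify` are hypothetical helper names, and `/opt/Megatron-Bridge` is the container path from the example above.

```python
import importlib.util
from pathlib import Path
from typing import List


def module_locations(name: str) -> List[Path]:
    """Return the filesystem locations a top-level module resolves to.

    Uses find_spec, so the module itself is not imported.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return []
    if spec.submodule_search_locations:
        # Regular or namespace package: one or more directories.
        return [Path(p).resolve() for p in spec.submodule_search_locations]
    if spec.has_location and spec.origin:
        # Plain single-file module.
        return [Path(spec.origin).resolve()]
    return []  # built-in or frozen module, no file on disk


def is_under(path: Path, root: str) -> bool:
    """True if path is root itself or lives somewhere below it."""
    root_path = Path(root).resolve()
    return path == root_path or root_path in path.parents


def classify(name: str, container_root: str) -> str:
    """Say whether a module comes from the container copy, elsewhere, or both."""
    locations = module_locations(name)
    if not locations:
        return "not found"
    inside = [is_under(p, container_root) for p in locations]
    if all(inside):
        return "container"
    if any(inside):
        return "mixed"
    return "external"


if __name__ == "__main__":
    # "megatron" is the top-level package named in the error below (assumed layout).
    for module in ("megatron", "json"):
        print(module, "->", classify(module, "/opt/Megatron-Bridge"))
```

If the entry script lives in the external clone while `classify` reports "container" or "mixed" for the packages it imports, that is the version mix described here.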

Observed Errors

ModuleNotFoundError: No module named 'megatron.core'
ValueError: Currently there is no support for Pipeline parallelism with CPU offloading

Root Causes

  1. Partial mounting - mount_as overwrites some paths but not others
  2. Two sources of truth - [[git_repos]] commit vs container's built-in version
  3. Implicit dependencies - No enforcement that versions match

Proposed Solutions

  1. Version validation - Validate that the git repo commit is compatible with the container's built-in version
  2. Full override or none - mount_as must override the entire package or nothing
  3. Container-only mode - Warn if [[git_repos]] targets a package that is already in the container
  4. Deprecate partial mounts - Remove support for mounting over container paths
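
As a rough illustration of solution 1, the check could be as simple as comparing the release numbers of the two versions before launching a job. This is a sketch under assumptions, not an existing CloudAI API: `release_tuple` and `check_compatible` are hypothetical names, and the policy used here (major and minor must match) is just one possible rule. How the two version strings are obtained is left out.

```python
import re
from typing import Optional, Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Extract the numeric release part, e.g. 'v0.4.0rc0' -> (0, 4, 0)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def check_compatible(container_version: str, repo_version: str) -> Optional[str]:
    """Return None if the versions are compatible, otherwise an error message.

    Assumed policy: the major.minor components must be identical.
    """
    container_mm = release_tuple(container_version)[:2]
    repo_mm = release_tuple(repo_version)[:2]
    if container_mm != repo_mm:
        return (
            f"git repo version {repo_version} does not match "
            f"container version {container_version} (major.minor differ)"
        )
    return None


if __name__ == "__main__":
    # Versions taken from the Environment section of this issue.
    problem = check_compatible("0.4.0rc0", "v0.3.1")
    if problem is not None:
        print("WARNING:", problem)
```

With the versions from this report, the check returns an error message because 0.4 and 0.3 differ, so the mismatch would surface before any script runs instead of as a ModuleNotFoundError mid-job.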

Environment

  • CloudAI version: v1.6.beta6
  • Container: nvcr.io/nvidian/nemo:26.04.rc2 (Megatron-Bridge v0.4.0rc0)
  • External repo: Megatron-Bridge v0.3.1
