Skip to content

Reorg/bucketize#2215

Open
jperez999 wants to merge 14 commits into
NVIDIA:mainfrom
jperez999:reorg/bucketize
Open

Reorg/bucketize#2215
jperez999 wants to merge 14 commits into
NVIDIA:mainfrom
jperez999:reorg/bucketize

Conversation

@jperez999

@jperez999 jperez999 commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Reorganize nemo_retriever package into purpose-named buckets

Restructures the flat ~465-module nemo_retriever package (≈40 top-level dirs)
into cohesive, purpose-named buckets so a module's path tells you what it does.
Pure code-movement — no behavior changes. Internal imports are rewritten to
the new paths; every old import path also keeps working via a re-export shim.

Buckets

Bucket Contents
cli/ All Typer commands: root app, pipeline/local, per-modality CLIs
graph/ Graph, executors, graph construction, query/retrieval graph, stage framework
operators/ Every Operator/actor/stage class (base, graph ops, extract actors, embed/rerank/dedup/vdb)
models/ HF local loaders, NIM clients, LLM/judge clients, embedding inference
common/ Shared logic: params, schemas, io, vdb client, vendored api/, per-modality helpers
ingestor/ Ingest orchestration + planning/manifest/results/config
service/ FastAPI service, routers, workers, service-mode ingestor
tools/ Eval/dev tooling: evaluation, skill_eval, recall, benchmark
tabular_data/ Self-contained relational vertical (kept at its original path); its 2 graph operators move into tabular_data/operators/

Key decisions

  • Strict split of per-modality dirs (pdf, chart, audio, …): actor classes → operators/, helpers/config → common/, model clients stay in models/.
  • Query/VDB → graph: retriever.pygraph/, vdb operators → operators/, vdb client → common/.
  • api/ kept cohesive under common/api/ (vendored), rather than shredded.
  • tabular_data/ keeps its original path — only its two graph operators move in, so nemo_retriever.tabular_data.* imports are unchanged.
  • Harness is intentionally NOT bucketed. PR Remove legacy nv-ingest harness and root Compose stack #2193 removed a legacy tools/harness, and test_legacy_tools_harness_is_removed forbids that path token; harness also resolves its project-level data dir (nemo_retriever/harness/) via Path(__file__).parents[3], which only holds at its existing depth. So harness stays at src/nemo_retriever/harness/ and tools/ contains only evaluation/skill_eval/recall/benchmark.

Backward compatibility

  • 341 alias shims left at old import paths (each delegates to common/_shim.py, emits a DeprecationWarning, aliases the module). Public API and the retriever/retriever-harness entry points are unchanged.
  • All shims are catalogued in nemo_retriever/src/nemo_retriever/SHIMS.md (old → new → phase), so the eventual pass.

How it was sequenced / validated

  • 12 commits: phase 0 (relative→absolute imports), one PR per bucket (PR1–PR9), PR10 (rewrite internal imports tp harness in-package + fix tests).
  • Every phase validated with a full import-sweep diffed against a pre-reorg baseline (no new failures).
  • Full test suite: 2031 passed, 32 skipped, 1 failed. The 1 failure (test_pipeline_helpers::TestBuildIngestor::test_service_mode_wires_extract_embed_and_chunking) is pre-existing — confirmed it fails identically on main. Zero reorg regressions.
  • CLI verified end-to-end (retriever --help/--version; lazy subapps load).

Test updates required (beyond import swaps)

Moving the code surfaced cases plain import-path edits don't cover, all fixed here:

  • mock.patch("...") string targets repointed to the new modules (shims aren't transparent under attribute trav
  • from PKG import submod forms repointed (these attribute-traversed onto bare shim objects and caused fresh-pr workers / test collection).
  • Multi-hop shim chains resolved transitively (api → common.api → models.nim).
  • Tests that reference moved files by path updated (media_interface load, nemotron_ocr_v2 source read, evaluation/README.md, skill_eval config).
  • ingestor/__init__ re-exports the full module-level surface (e.g. IngestorCreateParams) the old `nemo_retri.

Incidental fixes required by the moves

  • .gitignore exception so the package's models/ source dir is tracked (the global models/ rule targets weight artifacts; mirrors the existing service/models exception).
  • operators/__init__ resolves its re-exports lazily to avoid an init-time import cycle.

Not in this PR (follow-up)

Deleting the shims + SHIMS.md — deferred to a later pass after a deprecation window. Internal code already usee shims serve only external importers until then.

Key changes from the earlier draft: tabular/ → tabular_data/ (kept at original path), harness removed from tools/ (with the rationale), shim count updated to 341, added the test-updates and incidental-fixes sections, and the verified 2031 passed / 1 pre-existing result.

root and others added 10 commits June 6, 2026 03:12
Mechanical prep for the 9-bucket reorganization (see REORG_PLAN.md): converts
all 161 relative imports across 69 files to absolute nemo_retriever.* form so
that subsequent file moves don't break sibling references.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Relocates params, io, version, config_utils, the vdb client layer, the vendored
api/ subtree (kept cohesive under common/api/), service data contracts
(-> common/schemas), policy, and all per-modality helper/config modules into the
new common/ bucket. Every old import path is preserved by an auto-generated
re-export shim that delegates to common/_shim.py (all recorded in
nemo_retriever/SHIMS.md). Tree stays fully importable, verified by an import
sweep diffed against the pre-reorg baseline (no new failures).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Moves model/ (HF local loaders + factories), nim/ (NIM client), llm/ (LLM &
judge clients), the text_embed inference layer (-> models/inference/), the HF
cache/registry utils, and the api NIM primitives/util subtrees (-> models/nim/)
into the new models/ bucket. Adds a .gitignore exception so the package's
models/ source dir is tracked (the global models/ rule is for weight artifacts;
mirrors the existing service/models exception). Old paths preserved via shims
(SHIMS.md); import sweep green vs baseline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Completes the graph/operators split: operator base classes + archetype and the
concrete graph operators move out of graph/ into operators/ and
operators/graph_ops/. Every per-modality actor moves to operators/extract/<mod>/,
embedding actors to operators/embed/, and rerank/dedup/vdb operators plus the
eval operators land at the top level. Modality dirs keep their __init__ as thin
packages re-exporting via shims. Old paths preserved (SHIMS.md); sweep green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Brings retriever.py (-> graph/retriever.py), retriever_graph_utils.py
(-> graph/retriever_utils.py) and application/pipeline/ (-> graph/stages/) into
the graph/ bucket; the executor/graph framework files already lived there.

Also breaks an import cycle exposed by the move: operators/__init__ now resolves
its compat re-exports lazily, and operators/{base,content,embedding}.py point at
the canonical operators.* locations instead of routing back through graph shims.
Sweep green vs baseline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ingestor.py becomes the ingestor/ package (impl in ingestor/core.py; the package
__init__ re-exports create_ingestor/ingestor/Ingestor so nemo_retriever.ingestor
keeps its public API). graph_ingestor, branch_extraction and the ingest_* config/
manifest/plans/results modules move under ingestor/ (renamed without the ingest_
prefix), and the sample ingest-config.yaml moves alongside. Old module paths
preserved via shims (SHIMS.md); sweep green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ServiceIngestor (the remote service-mode ingestor client) joins the service/
bucket. The rest of service/ already was the bucket; service models and policy
moved to common/ in PR1. Old path shimmed; sweep green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Moves the root Typer app (adapters/cli -> cli/), the pipeline/local subcommand
packages, the image/compare dev tools, and every per-modality CLI module
(stage.py/cli.py/commands.py/__main__.py) into cli/<modality>/. Repoints the
package __main__ entrypoint at cli.main. Verified the root app and the moved
lazy subapps (pdf, pipeline, local, audio, txt, chart) all load; sweep green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Relocates evaluation, skill_eval, recall (incl. graph/{beir,recall}_eval.py),
harness (+portal), and the benchmark utils into the tools/ bucket, including
their non-Python assets (skill_eval prompts/configs, harness portal static).
Updates the skill_eval package-data path to nemo_retriever.tools.skill_eval and
verified prompt resources still resolve. Old paths shimmed; sweep green; eval/
skill-eval/recall/benchmark subapps load.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Keeps the self-contained relational vertical at its existing tabular_data/ path
(a stable, unchanged import location) and brings its two graph operators
(tabular_fetch_embeddings, tabular_schema_extract) in from graph/ to
tabular_data/operators/, so the whole vertical lives under one bucket. Old graph
paths shimmed; sweep green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Julio Perez <jperez@nvidia.com>
@jperez999 jperez999 requested review from a team as code owners June 8, 2026 14:15
@jperez999 jperez999 requested a review from nkmcalli June 8, 2026 14:15
@copy-pr-bot

copy-pr-bot Bot commented Jun 8, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

root and others added 2 commits June 8, 2026 14:47
Rewrites 1845 internal import lines across 395 files from old shim paths to the
new bucket paths (driven by SHIMS.md/move_map). Required for correctness, not
just cleanup: moved modules importing siblings through old shim paths created
import cycles in fresh processes (Ray workers) because the old package __init__
re-imported the half-initialized moved module. Also fixes ingestor/__init__ to
re-export the full module-level surface (e.g. IngestorCreateParams) the old
nemo_retriever.ingestor module exposed. Import sweep green; previously-failing
Ray audio benchmark test now passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Harness is NOT bucketed: PR NVIDIA#2193 deliberately removed a legacy tools/harness and
a guard test forbids that path token, so harness stays at
src/nemo_retriever/harness/ (its config.py resolves a project-level data dir via
parents[3], which only works at that depth). tools/ keeps evaluation, skill_eval,
recall, benchmark.

Also fixes the remaining reorg test failures (all beyond plain import swaps):
- resolve multi-hop shim chains (api->common.api->models.nim) so source imports
  and mock.patch() targets hit the final module, not an intermediate shim;
- repoint 'from PKG import submod' forms that attribute-traversed onto bare shim
  objects (this fixed fresh-process import cycles, e.g. Ray, test collection);
- update tests referencing moved files by path (media_interface load,
  nemotron_ocr_v2 source read, evaluation/README.md, skill_eval config);
- caption model_profiles: reload the new module path so the monkeypatch applies,
  keep the deprecation match on the old source-emitted string.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Julio Perez <jperez@nvidia.com>
root and others added 2 commits June 8, 2026 17:29
pyproject declares a dynamic version via attr='nemo_retriever.version.get_build_version'.
setuptools reads that during an isolated build, before the package is importable,
by exec-ing version.py. PR1 had moved version.py to common/ and left a shim that
does 'from nemo_retriever.common._shim import alias' -> ModuleNotFoundError at
build time. version.py is build-special and must stay a self-contained, real
module at src/nemo_retriever/version.py. Reverted the move; repointed
nemo_retriever.common.version references back to nemo_retriever.version.

Verified: setuptools read_configuration(expand=True) resolves the dynamic
version; import sweep green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Julio Perez <jperez@nvidia.com>
test_legacy_tools_harness_is_removed scans the whole repo for the forbidden
'tools/harness' token. It failed locally only because a developer's untracked
tmp/skill_eval/ scratch (installed package copies from before the harness revert)
still contained the string; the tracked source is clean and CI passes.

- Add 'tmp' to IGNORED_SCAN_DIRS (alongside .venv/artifacts/lancedb).
- Prune ignored dirs during os.walk instead of rglob-then-filter, so the scan
  never descends into large generated/scratch trees (151GB tmp/ -> 0.5s, was a
  timeout). Semantics unchanged; invariant preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Julio Perez <jperez@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant