Reorg/bucketize#2215
Open
jperez999 wants to merge 14 commits into
Open
Conversation
Mechanical prep for the 9-bucket reorganization (see REORG_PLAN.md): converts all 161 relative imports across 69 files to absolute nemo_retriever.* form so that subsequent file moves don't break sibling references. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Relocates params, io, version, config_utils, the vdb client layer, the vendored api/ subtree (kept cohesive under common/api/), service data contracts (-> common/schemas), policy, and all per-modality helper/config modules into the new common/ bucket. Every old import path is preserved by an auto-generated re-export shim that delegates to common/_shim.py (all recorded in nemo_retriever/SHIMS.md). Tree stays fully importable, verified by an import sweep diffed against the pre-reorg baseline (no new failures). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Moves model/ (HF local loaders + factories), nim/ (NIM client), llm/ (LLM & judge clients), the text_embed inference layer (-> models/inference/), the HF cache/registry utils, and the api NIM primitives/util subtrees (-> models/nim/) into the new models/ bucket. Adds a .gitignore exception so the package's models/ source dir is tracked (the global models/ rule is for weight artifacts; mirrors the existing service/models exception). Old paths preserved via shims (SHIMS.md); import sweep green vs baseline. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Completes the graph/operators split: operator base classes + archetype and the concrete graph operators move out of graph/ into operators/ and operators/graph_ops/. Every per-modality actor moves to operators/extract/<mod>/, embedding actors to operators/embed/, and rerank/dedup/vdb operators plus the eval operators land at the top level. Modality dirs keep their __init__ as thin packages re-exporting via shims. Old paths preserved (SHIMS.md); sweep green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Brings retriever.py (-> graph/retriever.py), retriever_graph_utils.py
(-> graph/retriever_utils.py) and application/pipeline/ (-> graph/stages/) into
the graph/ bucket; the executor/graph framework files already lived there.
Also breaks an import cycle exposed by the move: operators/__init__ now resolves
its compat re-exports lazily, and operators/{base,content,embedding}.py point at
the canonical operators.* locations instead of routing back through graph shims.
Sweep green vs baseline.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ingestor.py becomes the ingestor/ package (impl in ingestor/core.py; the package __init__ re-exports create_ingestor/ingestor/Ingestor so nemo_retriever.ingestor keeps its public API). graph_ingestor, branch_extraction and the ingest_* config/ manifest/plans/results modules move under ingestor/ (renamed without the ingest_ prefix), and the sample ingest-config.yaml moves alongside. Old module paths preserved via shims (SHIMS.md); sweep green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ServiceIngestor (the remote service-mode ingestor client) joins the service/ bucket. The rest of service/ already was the bucket; service models and policy moved to common/ in PR1. Old path shimmed; sweep green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Moves the root Typer app (adapters/cli -> cli/), the pipeline/local subcommand packages, the image/compare dev tools, and every per-modality CLI module (stage.py/cli.py/commands.py/__main__.py) into cli/<modality>/. Repoints the package __main__ entrypoint at cli.main. Verified the root app and the moved lazy subapps (pdf, pipeline, local, audio, txt, chart) all load; sweep green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Relocates evaluation, skill_eval, recall (incl. graph/{beir,recall}_eval.py),
harness (+portal), and the benchmark utils into the tools/ bucket, including
their non-Python assets (skill_eval prompts/configs, harness portal static).
Updates the skill_eval package-data path to nemo_retriever.tools.skill_eval and
verified prompt resources still resolve. Old paths shimmed; sweep green; eval/
skill-eval/recall/benchmark subapps load.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Keeps the self-contained relational vertical at its existing tabular_data/ path (a stable, unchanged import location) and brings its two graph operators (tabular_fetch_embeddings, tabular_schema_extract) in from graph/ to tabular_data/operators/, so the whole vertical lives under one bucket. Old graph paths shimmed; sweep green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Julio Perez <jperez@nvidia.com>
Rewrites 1845 internal import lines across 395 files from old shim paths to the new bucket paths (driven by SHIMS.md/move_map). Required for correctness, not just cleanup: moved modules importing siblings through old shim paths created import cycles in fresh processes (Ray workers) because the old package __init__ re-imported the half-initialized moved module. Also fixes ingestor/__init__ to re-export the full module-level surface (e.g. IngestorCreateParams) the old nemo_retriever.ingestor module exposed. Import sweep green; previously-failing Ray audio benchmark test now passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Harness is NOT bucketed: PR NVIDIA#2193 deliberately removed a legacy tools/harness and a guard test forbids that path token, so harness stays at src/nemo_retriever/harness/ (its config.py resolves a project-level data dir via parents[3], which only works at that depth). tools/ keeps evaluation, skill_eval, recall, benchmark. Also fixes the remaining reorg test failures (all beyond plain import swaps): - resolve multi-hop shim chains (api->common.api->models.nim) so source imports and mock.patch() targets hit the final module, not an intermediate shim; - repoint 'from PKG import submod' forms that attribute-traversed onto bare shim objects (this fixed fresh-process import cycles, e.g. Ray, test collection); - update tests referencing moved files by path (media_interface load, nemotron_ocr_v2 source read, evaluation/README.md, skill_eval config); - caption model_profiles: reload the new module path so the monkeypatch applies, keep the deprecation match on the old source-emitted string. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Julio Perez <jperez@nvidia.com>
6a4a159 to
3d08b62
Compare
pyproject declares a dynamic version via attr='nemo_retriever.version.get_build_version'. setuptools reads that during an isolated build, before the package is importable, by exec-ing version.py. PR1 had moved version.py to common/ and left a shim that does 'from nemo_retriever.common._shim import alias' -> ModuleNotFoundError at build time. version.py is build-special and must stay a self-contained, real module at src/nemo_retriever/version.py. Reverted the move; repointed nemo_retriever.common.version references back to nemo_retriever.version. Verified: setuptools read_configuration(expand=True) resolves the dynamic version; import sweep green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Julio Perez <jperez@nvidia.com>
test_legacy_tools_harness_is_removed scans the whole repo for the forbidden 'tools/harness' token. It failed locally only because a developer's untracked tmp/skill_eval/ scratch (installed package copies from before the harness revert) still contained the string; the tracked source is clean and CI passes. - Add 'tmp' to IGNORED_SCAN_DIRS (alongside .venv/artifacts/lancedb). - Prune ignored dirs during os.walk instead of rglob-then-filter, so the scan never descends into large generated/scratch trees (151GB tmp/ -> 0.5s, was a timeout). Semantics unchanged; invariant preserved. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Julio Perez <jperez@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reorganize
nemo_retrieverpackage into purpose-named bucketsRestructures the flat ~465-module
nemo_retrieverpackage (≈40 top-level dirs)into cohesive, purpose-named buckets so a module's path tells you what it does.
Pure code-movement — no behavior changes. Internal imports are rewritten to
the new paths; every old import path also keeps working via a re-export shim.
Buckets
cli/graph/operators/models/common/api/, per-modality helpersingestor/service/tools/tabular_data/tabular_data/operators/Key decisions
operators/, helpers/config →common/, model clients stay inmodels/.graph:retriever.py→graph/, vdb operators →operators/, vdb client →common/.api/kept cohesive undercommon/api/(vendored), rather than shredded.tabular_data/keeps its original path — only its two graph operators move in, sonemo_retriever.tabular_data.*imports are unchanged.tools/harness, andtest_legacy_tools_harness_is_removedforbids that path token; harness also resolves its project-level data dir (nemo_retriever/harness/) viaPath(__file__).parents[3], which only holds at its existing depth. So harness stays atsrc/nemo_retriever/harness/andtools/contains only evaluation/skill_eval/recall/benchmark.Backward compatibility
common/_shim.py, emits aDeprecationWarning, aliases the module). Public API and theretriever/retriever-harnessentry points are unchanged.nemo_retriever/src/nemo_retriever/SHIMS.md(old → new → phase), so the eventual pass.How it was sequenced / validated
test_pipeline_helpers::TestBuildIngestor::test_service_mode_wires_extract_embed_and_chunking) is pre-existing — confirmed it fails identically onmain. Zero reorg regressions.retriever --help/--version; lazy subapps load).Test updates required (beyond import swaps)
Moving the code surfaced cases plain import-path edits don't cover, all fixed here:
mock.patch("...")string targets repointed to the new modules (shims aren't transparent under attribute travfrom PKG import submodforms repointed (these attribute-traversed onto bare shim objects and caused fresh-pr workers / test collection).api → common.api → models.nim).evaluation/README.md, skill_eval config).ingestor/__init__re-exports the full module-level surface (e.g.IngestorCreateParams) the old `nemo_retri.Incidental fixes required by the moves
.gitignoreexception so the package'smodels/source dir is tracked (the globalmodels/rule targets weight artifacts; mirrors the existingservice/modelsexception).operators/__init__resolves its re-exports lazily to avoid an init-time import cycle.Not in this PR (follow-up)
Deleting the shims +
SHIMS.md— deferred to a later pass after a deprecation window. Internal code already usee shims serve only external importers until then.Key changes from the earlier draft: tabular/ → tabular_data/ (kept at original path), harness removed from tools/ (with the rationale), shim count updated to 341, added the test-updates and incidental-fixes sections, and the verified 2031 passed / 1 pre-existing result.