align workflow snippets with create_ingestor API #2212

Open
kheiss-uwzoo wants to merge 4 commits into NVIDIA:main from kheiss-uwzoo:kheiss/snippets

Conversation

@kheiss-uwzoo kheiss-uwzoo commented Jun 5, 2026

Collaborator

Summary

  • Migrate extraction examples from Ingestor() to create_ingestor(run_mode="batch") and fix broken Python fences in audio-video.md.
  • Show ASRParams.audio_endpoints in the Helm Parakeet example so batch graph ingest reaches the in-cluster NIM (audio:50051); note that endpoint auto-wiring applies to the retriever service, not graph ingest (Greptile).
  • Correct custom-metadata.md vdb_upload parameters (vdb_kwargs, sidecar meta triplet) and document remote lancedb_uri for service mode.
  • Update GraphIngestor.vdb_upload guidance and document ingest() return types (ray.data.Dataset in batch; pandas.DataFrame in inprocess); rename the workflow snippet variable to result (Greptile).
  • Add Python API guide cross-links, update Helm anchor, and apply NVIDIA style fixes (refer to, through, such as).
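The vdb_upload corrections in the third bullet can be sketched as a small argument builder. This is a hypothetical sketch, not code from the library: `build_vdb_upload_args` and `is_remote_uri` are illustrative helpers, the scheme-based test for a "remote" URI is an assumption, and so is the all-or-nothing check on the meta triplet. Only the keyword names (`vdb_kwargs`, `lancedb_uri`, `table_name`, `meta_dataframe`, `meta_source_field`, `meta_fields`) come from this PR.

```python
from urllib.parse import urlparse


def is_remote_uri(uri: str) -> bool:
    """Heuristic (assumption): a URI with a non-file scheme such as s3:// is remote."""
    scheme = urlparse(uri).scheme
    # Single-letter schemes are most likely Windows drive letters, not URLs.
    return len(scheme) > 1 and scheme != "file"


def build_vdb_upload_args(run_mode, lancedb_uri, table_name,
                          meta_dataframe=None, meta_source_field=None,
                          meta_fields=None):
    """Assemble keyword arguments in the shape this PR documents for vdb_upload."""
    # The PR notes that service mode needs a remote lancedb_uri.
    if run_mode == "service" and not is_remote_uri(lancedb_uri):
        raise ValueError("service mode expects a remote lancedb_uri")

    args = {"vdb_kwargs": {"lancedb_uri": lancedb_uri, "table_name": table_name}}

    # Assumption: the sidecar meta triplet is passed all together or not at all.
    triplet = {
        "meta_dataframe": meta_dataframe,
        "meta_source_field": meta_source_field,
        "meta_fields": meta_fields,
    }
    provided = [name for name, value in triplet.items() if value is not None]
    if provided and len(provided) != len(triplet):
        raise ValueError("pass meta_dataframe, meta_source_field, and meta_fields together")
    if provided:
        args.update(triplet)
    return args
```

Under these assumptions the result would be unpacked into the call, for example `ingestor.vdb_upload(**args)`. Whether `vdb_op` must also be passed is not covered here.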

Test plan

  • python skills/scripts/validate_code_blocks.py docs/docs/extraction/ (code fences parse)
  • Spot-check rendered pages: audio-video, custom-metadata, embedding, workflow-document-ingestion
  • Confirm internal links resolve in docs build
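The first check in the test plan, that code fences parse, can be approximated as follows. The actual `skills/scripts/validate_code_blocks.py` is not shown in this PR, so this is only a rough sketch of the idea: `find_broken_python_fences` is a hypothetical helper that extracts closed Python fences from a Markdown string and runs `ast.parse` on each one.

```python
import ast
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly in Markdown

# Capture the body of every closed fence that is tagged as python.
FENCE_RE = re.compile(
    rf"^{FENCE}python[^\n]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def find_broken_python_fences(markdown: str) -> list[tuple[int, str]]:
    """Return (fence_index, error_message) for each fence that fails to parse."""
    errors = []
    for index, match in enumerate(FENCE_RE.finditer(markdown), start=1):
        try:
            ast.parse(match.group(1))
        except SyntaxError as exc:
            errors.append((index, exc.msg))
    return errors
```

This sketch checks syntax only and silently skips a fence that is never closed, so it would not catch every kind of broken fence.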

Update extraction docs to use create_ingestor, correct vdb_upload and sidecar metadata examples, fix GraphIngestor guidance, and tighten cross-links and style.
@kheiss-uwzoo kheiss-uwzoo marked this pull request as ready for review June 8, 2026 18:15
@kheiss-uwzoo kheiss-uwzoo requested review from a team as code owners June 8, 2026 18:15
@kheiss-uwzoo kheiss-uwzoo requested review from edknv and jperez999 June 8, 2026 18:15
@kheiss-uwzoo kheiss-uwzoo self-assigned this Jun 8, 2026
@kheiss-uwzoo kheiss-uwzoo added the doc Improvements or additions to documentation label Jun 8, 2026
greptile-apps Bot commented Jun 8, 2026

Contributor

Greptile Summary

This PR updates six extraction documentation pages to align code snippets with the create_ingestor(run_mode="batch") API, replacing the legacy Ingestor() constructor throughout. It also corrects vdb_upload parameter names in custom-metadata.md, adds the missing ASRParams.audio_endpoints wiring in the Helm Parakeet example, fixes a broken Python fence in audio-video.md, and applies NVIDIA style guide fixes (refer to, through, such as).

  • API migration: all Ingestor() calls across embedding.md, audio-video.md, and workflow-document-ingestion.md are replaced with create_ingestor(run_mode="batch"), with from nemo_retriever import create_ingestor imports added where missing.
  • custom-metadata.md corrections: vdb_upload now uses vdb_kwargs={"lancedb_uri": ..., "table_name": ...} with the sidecar meta triplet (meta_dataframe, meta_source_field, meta_fields), and documents that service mode requires a remote lancedb_uri.
  • workflow-document-ingestion.md: renames chunks → result and corrects the return-type comment to distinguish ray.data.Dataset (batch) from pandas.DataFrame (inprocess).
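Because `ingest()` returns a different type in each run mode, downstream code may want to normalize the result. A minimal sketch follows, assuming the caller is happy to materialize a batch result in memory: `to_dataframe` is a hypothetical helper, not part of the library. It relies on the fact that `ray.data.Dataset` has a `to_pandas()` method while `pandas.DataFrame` does not.

```python
def to_dataframe(result):
    """Return a pandas.DataFrame for either documented ingest() return type.

    batch mode     -> ray.data.Dataset, converted with to_pandas()
    inprocess mode -> pandas.DataFrame, returned unchanged
    """
    # Duck-typed check: only the Ray dataset exposes to_pandas().
    if hasattr(result, "to_pandas"):
        # Note: this pulls every row into local memory.
        return result.to_pandas()
    return result
```

For large batch runs it is probably better to keep working with the Ray dataset directly, since `to_pandas()` collects all rows in one process.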

Confidence Score: 5/5

Documentation-only changes that correct API names, parameter signatures, and return-type comments — no production code paths are modified.

All code examples are syntactically correct and consistent with the GraphIngestor implementation (vdb_upload is confirmed implemented). The only finding is a single leftover Ingestor reference in vdbs.md that is inconsistent with the updated line in the same file, but it does not affect runtime behavior.

docs/docs/extraction/vdbs.md — line 80 still names the old Ingestor class while line 129 was updated to GraphIngestor.vdb_upload.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/docs/extraction/audio-video.md | Migrates both Helm and hosted-inference snippets from Ingestor() to create_ingestor(run_mode="batch"); adds missing audio_endpoints for in-cluster Parakeet; fixes broken Python fence; adds results = ingestor.ingest() call; applies NVIDIA style changes. |
| docs/docs/extraction/custom-metadata.md | Corrects vdb_upload parameter names to vdb_kwargs, meta_dataframe, meta_source_field, and meta_fields; adds guidance on remote lancedb_uri for service mode; simplifies the On this page TOC to match the consolidated section structure. |
| docs/docs/extraction/embedding.md | Replaces all three Ingestor() calls with create_ingestor(run_mode="batch") and adds missing imports; minor style and punctuation fixes; removes duplicate link from related topics. |
| docs/docs/extraction/workflow-document-ingestion.md | Renames chunks variable to result; corrects the return-type comment to distinguish ray.data.Dataset (batch) from pandas.DataFrame (inprocess); updates description of GraphIngestor.vdb_upload to reflect that it is now implemented. |
| docs/docs/extraction/vdbs.md | Updates VDB upload wording to use create_ingestor(...) / GraphIngestor.vdb_upload; adds Python API guide cross-link; one sentence at line 80 still names the old Ingestor class while line 129 in the same file now correctly says GraphIngestor.vdb_upload. |
| docs/docs/extraction/nimclient.md | Adds a single cross-reference sentence pointing readers to the Python API guide for ingest and pipeline APIs used in NimClient UDFs. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["create_ingestor(run_mode=...)"] --> B["GraphIngestor (batch / inprocess)"]
    A --> C["ServiceIngestor (service)"]
    B --> D[".files(...)"]
    D --> E[".extract(...)"]
    E --> F[".embed(...)"]
    F --> G[".vdb_upload(...) now documented"]
    G --> H[".ingest() → ray.data.Dataset (batch) or pandas.DataFrame (inprocess)"]
    C --> D2[".files(...)"]
    D2 --> E2[".extract(...)"]
    E2 --> F2[".embed(...)"]
    F2 --> G2[".vdb_upload(vdb_kwargs={lancedb_uri, table_name}, meta_*)"]
    G2 --> H2[".ingest_async().result()"]
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
docs/docs/extraction/vdbs.md:80
This sentence still refers to the old `Ingestor` class while line 129 of the same file (updated by this PR) uses `GraphIngestor.vdb_upload`. The stale name creates a contradiction inside one document, which could confuse readers following the migration guidance.

```suggestion
When using `GraphIngestor.vdb_upload`, pass `vdb_op="lancedb"` or a `LanceDB` instance so uploads target LanceDB. If you omit `vdb_op`, the library still defaults the string argument to `"milvus"` for backward compatibility, which is not the LanceDB operator—always pass `vdb_op="lancedb"` when you intend LanceDB.
```
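A leftover reference like this one can be found mechanically. The sketch below is a hypothetical helper, not part of this repository: it flags lines that mention the bare `Ingestor` name while ignoring `GraphIngestor`, `ServiceIngestor`, and `create_ingestor`.

```python
import re

# The look-behind rejects names that merely end in "Ingestor" (GraphIngestor,
# ServiceIngestor); the match is case-sensitive, so create_ingestor is ignored.
LEGACY_INGESTOR_RE = re.compile(r"(?<![A-Za-z_])Ingestor\b")


def find_legacy_ingestor_lines(text: str) -> list[int]:
    """Return 1-based line numbers that mention the bare Ingestor name."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if LEGACY_INGESTOR_RE.search(line)
    ]
```

Hits are only candidates for review, because a page may mention the legacy class on purpose, for example in migration notes.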

Reviews (3): Last reviewed commit: "Merge branch 'main' into kheiss/snippets"

Comment thread docs/docs/extraction/audio-video.md
Comment thread docs/docs/extraction/workflow-document-ingestion.md Outdated
@kheiss-uwzoo kheiss-uwzoo changed the title from "docs(extraction): align workflow snippets with create_ingestor API" to "align workflow snippets with create_ingestor API" Jun 8, 2026
@kheiss-uwzoo kheiss-uwzoo requested a review from randerzander June 8, 2026 19:57
…st return type

Show audio_endpoints in the Helm Parakeet example and rename the workflow snippet variable so batch mode readers are not misled by a DataFrame name.
Labels

doc Improvements or additions to documentation


1 participant