From 019547d656e50318b76bab0c945d72df98da7eb2 Mon Sep 17 00:00:00 2001 From: Kurt Heiss Date: Tue, 2 Jun 2026 15:54:40 -0700 Subject: [PATCH 1/2] docs(extraction): backport PR #2194 extraction doc fixes to main PR #2194 merged into 26.05 on 2026-06-02 but never reached main. This backport keeps main aligned with the release branch and the published docs.nvidia.com site after Randy's follow-up review. Timeline: - Friday: 26.05 docs built for docs.nvidia upload; branch differed from NRL GitHub Pages source and the uploaded docs were incorrect. - Saturday: diff main vs 26.05 produced PR #2179 to sync extraction docs. - Monday: PR #2179 merged and docs uploaded to the public site. - Follow-up: Randy opened PR #2194 on 26.05 with additional fixes found after the #2179 sync. Those fixes landed on 26.05 only. - This commit: cherry-pick of c5b257e4 onto main (five extraction doc files only). Changes from #2194: - Fix audio-video.md indented code block rendering - Restore custom-metadata example service variables and storage prose - Move caption scope admonition to multimodal-extraction.md - Trim redundant Helm/OCR deploy detail per review feedback - Restore FAQ Docker Compose note and support-matrix section anchors --- docs/docs/extraction/audio-video.md | 21 ++++++------------- docs/docs/extraction/custom-metadata.md | 10 ++++++++- docs/docs/extraction/faq.md | 4 ++-- docs/docs/extraction/multimodal-extraction.md | 17 ++++++++++++--- .../prerequisites-support-matrix.md | 17 +++------------ 5 files changed, 34 insertions(+), 35 deletions(-) diff --git a/docs/docs/extraction/audio-video.md b/docs/docs/extraction/audio-video.md index c85df6e598..872edccf90 100644 --- a/docs/docs/extraction/audio-video.md +++ b/docs/docs/extraction/audio-video.md @@ -61,17 +61,11 @@ This pipeline enables retrieval at the speech segment level when you enable segm ## Run Parakeet on the cluster (Helm) { #run-parakeet-on-the-cluster-helm } -Use the following procedure to run the NIM on your own 
infrastructure. Self-hosted Parakeet runs on Kubernetes via the [NeMo Retriever Helm chart](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md). Enable the ASR NIM per [Optional Helm NIMs](prerequisites-support-matrix.md#optional-helm-nims-not-auto-wired-by-default) and the [Helm chart — NIM operator sub-stack](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md#nim-operator-sub-stack); pin the workload to a dedicated GPU and wire the ASR endpoint in your pipeline. +Use the following procedure for self-hosted Parakeet on your cluster. For chart enablement, GPU placement, ffmpeg, and endpoint wiring, see [Optional Helm NIMs](prerequisites-support-matrix.md#optional-helm-nims-not-auto-wired-by-default) and [Audio and video (Parakeet ASR)](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md#audio-video-parakeet) in the Helm chart README, plus [Deployment options](deployment-options.md). -After deploy, call the pipeline from Python: +1. Deploy or upgrade per that Helm guide and [Deployment options](deployment-options.md). - Pin the Parakeet workload to the dedicated GPU with your Helm values or the [NIM Operator](https://docs.nvidia.com/nim-operator/latest/index.html) (for example, node selectors, resource limits, or device requests appropriate to your cluster). - -1. Deploy or upgrade with the [NeMo Retriever Helm chart](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md) and enable Parakeet for your release (see [Optional Helm NIMs](prerequisites-support-matrix.md#optional-helm-nims-not-auto-wired-by-default)). Follow [Deployment options](deployment-options.md). - -2. 
If the service will process audio or video files, set `service.installFfmpeg=true` in the Helm chart when your cluster allows runtime package installation; for air-gapped clusters, see [Air-gapped and disconnected deployment](deployment-options.md#air-gapped-deployment) and the [Helm chart README](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md#1-service-image) for `service.image` overrides. - -3. After the services are running, interact with the pipeline from Python. +2. After the services are running, interact with the pipeline from Python. - The `Ingestor` object initializes the ingestion process. - The `files` method specifies the input files to process. @@ -87,14 +81,11 @@ After deploy, call the pipeline from Python: asr_params=ASRParams(segment_audio=True), ) ) -) -``` - -To generate one extracted element for each sentence-like ASR segment, include `extract_audio_params={"segment_audio": True}` when calling `.extract(...)`. This option applies when audio extraction runs with a self-hosted Parakeet NIM or using build.nvidia.com hosted inference, but has no effect when using the local Hugging Face Parakeet model. + ``` - To generate one extracted element for each sentence-like ASR segment, pass `asr_params=ASRParams(segment_audio=True)` to `.extract_audio(...)`. This option applies when audio extraction runs with a self-hosted Parakeet NIM or using build.nvidia.com hosted inference, but has no effect when using the local Hugging Face Parakeet model. +To generate one extracted element for each sentence-like ASR segment, pass `asr_params=ASRParams(segment_audio=True)` to `.extract_audio(...)`. This option applies when audio extraction runs with a self-hosted Parakeet NIM or using build.nvidia.com hosted inference, but has no effect when using the local Hugging Face Parakeet model. 
- For more Python examples, refer to [Python Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb). +For more Python examples, refer to [Python Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb). ## Parakeet with hosted inference (build.nvidia.com) { #parakeet-hosted-inference-build-nvidia } diff --git a/docs/docs/extraction/custom-metadata.md b/docs/docs/extraction/custom-metadata.md index bf47dcd1c5..02097bd14f 100644 --- a/docs/docs/extraction/custom-metadata.md +++ b/docs/docs/extraction/custom-metadata.md @@ -37,6 +37,10 @@ meta_df = pd.DataFrame( } ) +hostname = "localhost" +table_name = "nemo_retriever_collection" +lancedb_uri = "./lancedb_data" + ingestor = ( create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670") .files(["data/woods_frost.pdf", "data/multimodal_test.pdf"]) @@ -59,6 +63,10 @@ results = ingestor.ingest_async().result() Merge values from `meta_df` (or `file_path`) into each document's `content_metadata` before `vdb_upload`, or follow the step-by-step pattern in [metadata_and_filtered_search.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb), so category, department, and timestamp are present on the chunks LanceDB indexes. +## How metadata is stored { #how-metadata-is-stored } + +During ingestion, each chunk's `content_metadata` is serialized as a **compact JSON string** (no spaces after `:` or `,`) in the LanceDB table's `metadata` column. Sidecar columns from `meta_dataframe`, `meta_source_field`, and `meta_fields` are merged into that JSON object before upload, so custom keys live in the same string—not separate columns. That is why `Retriever.query` filters often use `metadata LIKE '%\"key\":\"value\"%'`. 
For operator behavior and predicate examples, see [Vector DB operators and LanceDB — Metadata filtering](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering). + ## Best Practices The following are the best practices when you work with custom metadata: @@ -122,7 +130,7 @@ For a runnable end-to-end flow (ingest, `Retriever.query`, and both filter modes When you ingest through the **retriever service**, upload the sidecar with [`POST /v1/ingest/sidecar`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/routers/ingest.py#L1040-L1129) (multipart file; response [`SidecarUploadResponse`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/responses.py#L60-L68)), then pass the returned `sidecar_id` as `meta_dataframe_id` with `meta_source_field` and `meta_fields` in `pipeline.vdb_upload_params` on [`POST /v1/ingest`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/requests.py#L15-L32) ([`PipelineSpec`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/pipeline_spec.py#L55-L78)). Request and response shapes, form fields, and auth headers are in the service OpenAPI UI at `/docs` (or `/openapi.json`) on your retriever base URL (for example `http://localhost:7670/docs` after `retriever service start`). Do not send a raw local path as `meta_dataframe` on the service spec. 
-## How metadata is stored { #how-metadata-is-stored } +## Related content { #related-content } - [Vector databases](vdbs.md) — canonical LanceDB upload and retrieval guide - [metadata_and_filtered_search.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb) — CLI and graph ingest with sidecar metadata diff --git a/docs/docs/extraction/faq.md b/docs/docs/extraction/faq.md index 6b3010db52..8e8a965caf 100644 --- a/docs/docs/extraction/faq.md +++ b/docs/docs/extraction/faq.md @@ -25,7 +25,7 @@ For chart-labeled PDF regions and other caption scope limits, see [Are PDF chart ## Are PDF chart or figure regions captioned when Omni is enabled? -No. Chart-labeled PDF regions are not routed through Omni captioning. See [Image captioning](prerequisites-support-matrix.md#image-captioning-2605) for scope, validation, and what the caption stage covers. +No. Chart-labeled PDF regions are not routed through Omni captioning. See [Charts and infographics](multimodal-extraction.md#charts-and-infographics) and [Image captioning](multimodal-extraction.md#image-captioning) for scope, validation, and what the caption stage covers. ## When should I consider advanced visual parsing? @@ -40,7 +40,7 @@ For more information, refer to [Nemotron Parse](https://build.nvidia.com/nvidia/ For [self-hosted deployments](deployment-options.md#when-to-self-host-nims), you should set the environment variables `NGC_API_KEY` and `NIM_NGC_API_KEY`. For more information, refer to [Authentication and API keys](api-keys.md). -For advanced scenarios, you might want to set environment variables for NIM container paths, tags, and batch sizes on the ingestion runtime. Configure them in your Helm values, Kubernetes `Secret`/`ConfigMap`, or follow [Environment variables](environment-config.md). +For advanced scenarios, you might want to set environment variables for NIM container paths, tags, and batch sizes on the ingestion runtime. 
Configure them in your Helm values, Kubernetes `Secret`/`ConfigMap`, or follow [Environment variables](environment-config.md). If you use **Docker Compose** locally for experiments only, see the unsupported developer page [docker.md](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/docker.md) — **not** a supported deployment substitute for Helm. ### Library Mode diff --git a/docs/docs/extraction/multimodal-extraction.md b/docs/docs/extraction/multimodal-extraction.md index 5e6a5a4fb5..3b75d53bb2 100644 --- a/docs/docs/extraction/multimodal-extraction.md +++ b/docs/docs/extraction/multimodal-extraction.md @@ -49,7 +49,7 @@ NeMo Retriever Library detects tables as structured page elements, processes the Charts and infographic regions are classified with other page layout elements (tables, text blocks, titles) and processed through layout detection and OCR. `extract_charts` and `extract_infographics` are enabled by default. Outputs use the same metadata schema as other extracted objects. -Chart-labeled PDF regions are **not** routed through the Omni caption stage; they remain on the layout-and-OCR path. For scope and validation guidance, see [Image captioning](prerequisites-support-matrix.md#image-captioning-2605). +Chart-labeled PDF regions are **not** routed through the Omni caption stage; they remain on the layout-and-OCR path. For caption scope and validation, see [Image captioning](#image-captioning). For natural-language infographic descriptions, optionally enable [image captioning](#image-captioning) and set `caption_infographics=True` when you need VLM captions on infographic regions. @@ -63,7 +63,7 @@ For natural-language infographic descriptions, optionally enable [image captioni Scanned PDFs and image-only pages rely on OCR and hybrid paths that combine native text extraction with OCR when needed. For extract methods such as `ocr` and `pdfium_hybrid`, refer to the [Python API reference](nemo-retriever-api-reference.md). 
-OCR artifacts depend on how you deploy. **Helm / NIM:** the production chart uses **Nemotron OCR v1** (`nvcr.io/nim/nvidia/nemotron-ocr-v1:1.3.0`). **Local Hugging Face inference:** the default engine is **Nemotron OCR v2**, which operates in **multilingual** mode by default. For CLI flags and API parameters, see [Nemotron OCR v2 — language mode](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/docs/cli/README.md#nemotron-ocr-v2-language-mode). For Kubernetes defaults and the Helm-vs-local split, see [OCR artifacts (Helm vs local Hugging Face)](prerequisites-support-matrix.md#nemotron-ocr-v2-language-mode) in the support matrix. +When you run extraction locally with Hugging Face weights, the default OCR engine is **Nemotron OCR v2**, which operates in **multilingual** mode by default. For CLI flags and API parameters, see [Nemotron OCR v2 — language mode](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/docs/cli/README.md#nemotron-ocr-v2-language-mode). For Kubernetes deployment, see [OCR NIM configuration](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md#ocr-nim-configuration) in the Helm chart README. **Related** @@ -77,11 +77,22 @@ Image captioning generates natural-language descriptions for unstructured image **Captioning is optional** — enable it in your ingest configuration (for example, the `caption` API or pipeline flag) when you need natural-language descriptions of image content. Reasoning traces are disabled by default for captioning. +!!! important "PDF chart regions are not captioned by Omni" + + When **nemotron-page-elements-v3** classifies a PDF region as **chart**, that region is processed through layout detection and OCR—not the Omni caption stage. Enabling the caption NIM and the `caption` pipeline stage does **not** send chart-labeled figures to `/v1/chat/completions`. 
+ + The caption stage covers: + + - Unstructured content in the `images` column (standalone image files and page-element regions **not** classified as table, chart, or infographic) + - Optional infographic regions when you set `caption_infographics=True` on `CaptionParams` (the VLM caption is stored in `caption`, separate from OCR `text`) + + To validate caption traffic during ingest, inspect metadata such as `page_elements_v3_counts_by_label`. If the figure is labeled `chart`, expect no Omni chat-completions requests for that region even when captioning is enabled. + **Related** - [Multimodal embeddings (VLM)](embedding.md) - [Metadata reference](content-metadata.md) -- [Image captioning](prerequisites-support-matrix.md#image-captioning-2605) +- [Image captioning — NIM and hardware](prerequisites-support-matrix.md#image-captioning-2605) ## Metadata and content schema { #metadata-and-content-schema } diff --git a/docs/docs/extraction/prerequisites-support-matrix.md b/docs/docs/extraction/prerequisites-support-matrix.md index b8910f2ea8..9f97db16f6 100644 --- a/docs/docs/extraction/prerequisites-support-matrix.md +++ b/docs/docs/extraction/prerequisites-support-matrix.md @@ -2,7 +2,7 @@ Before you begin using [NeMo Retriever Library](overview.md), confirm your software stack, deployment hardware, and—if you use them—advanced features (audio and video, Nemotron Parse, VLM image captioning, reranking) against the guidance in this page. -## Software Requirements +## Software Requirements { #software-requirements } - Linux operating systems (Ubuntu 22.04 or later recommended) - [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) (NVIDIA Driver >= `580`, CUDA >= `13.0`) @@ -64,7 +64,7 @@ Ensure your deployment environment meets these specifications before running the The NeMo Retriever Library extraction core pipeline features run on a single A10G or better GPU. 
-### Default Helm NIMs +### Default Helm NIMs { #default-helm-nims } The production Helm chart enables these NIM microservices **by default** (for example via `nimOperator.*.enabled=true`): @@ -107,22 +107,11 @@ These NIM microservices are **optional** for the default extraction pipeline. Th For 26.05, use **`nemotron_3_nano_omni_30b_a3b_reasoning`** when you enable the caption stage (hosted model ID `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning`). The Helm key is in the [optional NIMs](#optional-helm-nims-not-auto-wired-by-default) table above. -!!! important "PDF chart regions are not captioned by Omni" - - When **nemotron-page-elements-v3** classifies a PDF region as **chart**, that region is processed through layout detection and OCR—not the Omni caption stage. Enabling the caption NIM and the `caption` pipeline stage does **not** send chart-labeled figures to `/v1/chat/completions`. - - The caption stage covers: - - - Unstructured content in the `images` column (standalone image files and page-element regions **not** classified as table, chart, or infographic) - - Optional infographic regions when you set `caption_infographics=True` on `CaptionParams` (the VLM caption is stored in `caption`, separate from OCR `text`) - - To validate caption traffic during ingest, inspect metadata such as `page_elements_v3_counts_by_label`. If the figure is labeled `chart`, expect no Omni chat-completions requests for that region even when captioning is enabled. - Optional features listed in the table above require additional GPU support, disk space, and feature-specific system dependencies beyond the four default NIMs. For published NIM model IDs and deployment-specific constraints, use the product support matrices linked under [Related Topics](#related-topics) below. -## Model Hardware Requirements +## Model Hardware Requirements { #model-hardware-requirements } NeMo Retriever Library supports the following GPU hardware given system constraints in the table. 
From 9ae44ac7e93788ef3ccc78e364f1bab8588f7375 Mon Sep 17 00:00:00 2001 From: Kurt Heiss Date: Fri, 5 Jun 2026 10:52:05 -0700 Subject: [PATCH 2/2] =?UTF-8?q?docs(extraction):=20address=20PR=20#2203=20?= =?UTF-8?q?review=20=E2=80=94=20drop=20custom-metadata=20page,=20remove=20?= =?UTF-8?q?chart=20admonition?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Remove custom-metadata.md in favor of vdbs.md#metadata-and-filtering and the metadata filtering notebook. Drop the PDF chart caption admonition from multimodal-extraction.md per review feedback. --- docs/docs/extraction/custom-metadata.md | 136 ------------------ ...egrations-langchain-llamaindex-haystack.md | 2 +- docs/docs/extraction/multimodal-extraction.md | 11 -- docs/docs/extraction/vdbs.md | 11 +- .../extraction/workflow-agentic-retrieval.md | 2 +- docs/mkdocs.yml | 15 +- .../tests/test_src_documentation_snippets.py | 1 - 7 files changed, 15 insertions(+), 163 deletions(-) delete mode 100644 docs/docs/extraction/custom-metadata.md diff --git a/docs/docs/extraction/custom-metadata.md b/docs/docs/extraction/custom-metadata.md deleted file mode 100644 index 02097bd14f..0000000000 --- a/docs/docs/extraction/custom-metadata.md +++ /dev/null @@ -1,136 +0,0 @@ -# Custom metadata and filtering - -Use this documentation to attach per-document metadata during ingestion and to narrow [LanceDB](vdbs.md) search results in [NeMo Retriever Library](overview.md). Implementation details live in the package [Vector DB operators and LanceDB](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering) README. 
- -## On this page { #on-this-page } - -- [Attach metadata at ingestion](#attach-metadata-at-ingestion) -- [How metadata is stored](#how-metadata-is-stored) -- [Filter results at query time](#filter-results-at-query-time) -- [Writing `where` predicates](#writing-where-predicates) -- [Server-side vs client-side filters](#server-side-vs-client-side-filters) -- [Inspect hit metadata](#inspect-hit-metadata) -- [Limitations](#limitations) -- [Related content](#related-content) - -## Attach metadata at ingestion { #attach-metadata-at-ingestion } - -Pass a **sidecar metadata table** on `vdb_upload` so selected columns are merged into each chunk's `content_metadata` before LanceDB upload. All three parameters must be set together: - -| Parameter | Purpose | -|-----------|---------| -| `meta_dataframe` | Path to CSV, JSON, or Parquet, or an in-memory `pandas.DataFrame` | -| `meta_source_field` | Column that identifies each document (must match ingest paths or basenames per `meta_join_key`) | -| `meta_fields` | Non-empty list of column names to copy into `content_metadata` | - -Optional `meta_join_key` controls how rows are matched to documents: `auto` (try full path then basename), `source_id` (full path), or `source_name` (basename only). 
- -```python -import pandas as pd -from nemo_retriever import create_ingestor - -meta_df = pd.DataFrame( - { - "source": ["data/woods_frost.pdf", "data/multimodal_test.pdf"], - "meta_a": ["alpha", "bravo"], - "meta_b": [10, 20], - } -) - -hostname = "localhost" -table_name = "nemo_retriever_collection" -lancedb_uri = "./lancedb_data" - -ingestor = ( - create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670") - .files(["data/woods_frost.pdf", "data/multimodal_test.pdf"]) - .extract( - extract_text=True, - extract_tables=True, - extract_charts=True, - extract_images=True, - text_depth="page" - ) - .embed() - .vdb_upload( - vdb_op="lancedb", - uri=lancedb_uri, - table_name=table_name, - ) -) -results = ingestor.ingest_async().result() -``` - -Merge values from `meta_df` (or `file_path`) into each document's `content_metadata` before `vdb_upload`, or follow the step-by-step pattern in [metadata_and_filtered_search.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb), so category, department, and timestamp are present on the chunks LanceDB indexes. - -## How metadata is stored { #how-metadata-is-stored } - -During ingestion, each chunk's `content_metadata` is serialized as a **compact JSON string** (no spaces after `:` or `,`) in the LanceDB table's `metadata` column. Sidecar columns from `meta_dataframe`, `meta_source_field`, and `meta_fields` are merged into that JSON object before upload, so custom keys live in the same string—not separate columns. That is why `Retriever.query` filters often use `metadata LIKE '%\"key\":\"value\"%'`. For operator behavior and predicate examples, see [Vector DB operators and LanceDB — Metadata filtering](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering). - -## Best Practices - -The following are the best practices when you work with custom metadata: - -- Plan metadata structure before ingestion. 
-- Test filter expressions with small datasets first. -- Consider performance implications of complex filters. -- Validate metadata during ingestion. -- Handle missing metadata fields gracefully. -- Log invalid filter expressions. - - - -## Use Custom Metadata to Filter Results During Retrieval - -You can use custom metadata to filter documents during retrieval operations. -For **predicate pushdown**, pass a `where` SQL predicate through [`Retriever.query`](nemo-retriever-api-reference.md) (see [Vector databases](vdbs.md)) or chain `.where(...)` on a native LanceDB `table.search(...)` query. Application-side filtering on returned hits does not change what the database evaluates—raise `top_k` if matches might sit outside the first neighbors. - - -### Example filter ideas - -Typical keys to filter on include `category`, `department`, `priority`, and `timestamp` (use comparable ISO-8601 strings for time ranges). Encode predicates in LanceDB SQL against your table columns (often the serialized `metadata` string), or inspect parsed hit metadata after search as in the example below. - -### Example: Use a Filter Expression in Search - -After ingestion is complete, and documents are uploaded to LanceDB with metadata, -you can narrow results in the database with a **`where`** clause, or in Python on the returned hits. - -**Native LanceDB (SQL pushdown):** connect, embed the query yourself (same model as ingestion), then chain `.where("")` on `table.search(...)` so filtering happens before the `limit`. Exact SQL depends on how `metadata` is stored; see [LanceDB SQL](https://lancedb.github.io/lancedb/sql/). - -```python -import lancedb - -# Pseudocode sketch — replace YOUR_VECTOR and YOUR_PREDICATE with real values. 
-db = lancedb.connect("./lancedb_data") -table = db.open_table("nemo_retriever_collection") -# table.search(YOUR_VECTOR, vector_column_name="vector").where(YOUR_PREDICATE).limit(10).to_list() -``` - -**`Retriever.query` + `where`:** LanceDB applies the predicate before ranking. For post-filter logic in Python, use a wider `top_k` first. - -```python -from nemo_retriever.retriever import Retriever - -retriever = Retriever( - vdb_kwargs={"uri": "./lancedb_data", "table_name": "nemo_retriever_collection"}, - embed_kwargs={ - "model_name": "nvidia/llama-nemotron-embed-1b-v2", - "embed_model_name": "nvidia/llama-nemotron-embed-1b-v2", - }, -) - -hits = retriever.query( - "this is expensive", - top_k=16, - vdb_kwargs={"where": "metadata LIKE '%\"department\":\"Engineering\"%'"}, -) -``` - -For a runnable end-to-end flow (ingest, `Retriever.query`, and both filter modes), see [nemo_retriever_retriever_query_metadata_filter.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb). - -When you ingest through the **retriever service**, upload the sidecar with [`POST /v1/ingest/sidecar`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/routers/ingest.py#L1040-L1129) (multipart file; response [`SidecarUploadResponse`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/responses.py#L60-L68)), then pass the returned `sidecar_id` as `meta_dataframe_id` with `meta_source_field` and `meta_fields` in `pipeline.vdb_upload_params` on [`POST /v1/ingest`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/requests.py#L15-L32) ([`PipelineSpec`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/pipeline_spec.py#L55-L78)). 
Request and response shapes, form fields, and auth headers are in the service OpenAPI UI at `/docs` (or `/openapi.json`) on your retriever base URL (for example `http://localhost:7670/docs` after `retriever service start`). Do not send a raw local path as `meta_dataframe` on the service spec. - -## Related content { #related-content } - -- [Vector databases](vdbs.md) — canonical LanceDB upload and retrieval guide -- [metadata_and_filtered_search.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb) — CLI and graph ingest with sidecar metadata diff --git a/docs/docs/extraction/integrations-langchain-llamaindex-haystack.md b/docs/docs/extraction/integrations-langchain-llamaindex-haystack.md index 7ee0dda650..a71d24c0dd 100644 --- a/docs/docs/extraction/integrations-langchain-llamaindex-haystack.md +++ b/docs/docs/extraction/integrations-langchain-llamaindex-haystack.md @@ -19,4 +19,4 @@ Haystack-related extraction modes may appear in API tables as **deprecated** in - [Use the Python API](nemo-retriever-api-reference.md) - [Use the CLI](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/docs/cli) -- [Chunking](concepts.md#chunking), [Upload data](vdbs.md), [Filter search](custom-metadata.md) +- [Chunking](concepts.md#chunking), [Upload data](vdbs.md), [Filter search](vdbs.md#metadata-and-filtering) diff --git a/docs/docs/extraction/multimodal-extraction.md b/docs/docs/extraction/multimodal-extraction.md index 3b75d53bb2..0f73ce70d2 100644 --- a/docs/docs/extraction/multimodal-extraction.md +++ b/docs/docs/extraction/multimodal-extraction.md @@ -77,17 +77,6 @@ Image captioning generates natural-language descriptions for unstructured image **Captioning is optional** — enable it in your ingest configuration (for example, the `caption` API or pipeline flag) when you need natural-language descriptions of image content. Reasoning traces are disabled by default for captioning. -!!! 
important "PDF chart regions are not captioned by Omni" - - When **nemotron-page-elements-v3** classifies a PDF region as **chart**, that region is processed through layout detection and OCR—not the Omni caption stage. Enabling the caption NIM and the `caption` pipeline stage does **not** send chart-labeled figures to `/v1/chat/completions`. - - The caption stage covers: - - - Unstructured content in the `images` column (standalone image files and page-element regions **not** classified as table, chart, or infographic) - - Optional infographic regions when you set `caption_infographics=True` on `CaptionParams` (the VLM caption is stored in `caption`, separate from OCR `text`) - - To validate caption traffic during ingest, inspect metadata such as `page_elements_v3_counts_by_label`. If the figure is labeled `chart`, expect no Omni chat-completions requests for that region even when captioning is enabled. - **Related** - [Multimodal embeddings (VLM)](embedding.md) diff --git a/docs/docs/extraction/vdbs.md b/docs/docs/extraction/vdbs.md index be91716953..bcf4280af1 100644 --- a/docs/docs/extraction/vdbs.md +++ b/docs/docs/extraction/vdbs.md @@ -89,10 +89,11 @@ Semantic retrieval uses dense embeddings to find content that is similar in mean ## Metadata and filtering { #metadata-and-filtering } -This page covers LanceDB upload and retrieval. **Metadata is not duplicated here.** +Attach sidecar metadata at ingest with `meta_dataframe`, `meta_source_field`, and `meta_fields` on `vdb_upload`. Values merge into each chunk's `content_metadata` and are stored as compact JSON in LanceDB. Narrow results at query time with server-side `where` on [`Retriever.query`](nemo-retriever-api-reference.md) or client-side `filter_hits_by_content_metadata`. -- **Published guide** — [Custom metadata and filtering](custom-metadata.md) (sidecar `meta_*` on `vdb_upload`, compact JSON in LanceDB, server-side `where` on `Retriever.query`, and client-side `filter_hits_by_content_metadata`). 
-- **Canonical reference** — [Vector DB operators and LanceDB — Metadata filtering](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering) in `nemo_retriever/src/nemo_retriever/vdb/README.md` (operator behavior and examples).
+- [Metadata filtering notebook](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb) — end-to-end ingest, `Retriever.query`, and both filter modes
+- [Sidecar metadata ingest](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb) — CLI and graph workflow
+- [VDB README (metadata filtering)](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering) — operator behavior, SQL predicates, and examples
 
 ## LanceDB deployment characteristics { #lancedb-deployment-characteristics }
 
@@ -142,7 +143,7 @@ Testing and release cadence for these integrations follow the owning project (RA
 
 ### More information (embeddings & custom `VDB`) { #vector-database-partners-more-info }
 
-- [Custom metadata and filtering](custom-metadata.md) and the package [VDB README (metadata filtering)](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering)
+- [Metadata filtering notebook](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb) and the package [VDB README (metadata filtering)](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering)
 - [Multimodal embeddings (VLM)](embedding.md)
 - [NeMo Retriever Text Embedding NIM](https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/overview.html)
 - [NVIDIA NIM catalog](https://build.nvidia.com/) for embedding and retrieval-related NIMs
@@ -155,7 +156,7 @@ To implement a custom operator, follow the `VDB` abstract interface described in
 
 ## Related Topics { #related-topics }
 
-- [Custom metadata and filtering](custom-metadata.md)
+- [Metadata filtering: add sidecar metadata and filter searches](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb)
 - [Vector DB operators and LanceDB (source)](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb)
 - [Use the NeMo Retriever Library Python API](nemo-retriever-api-reference.md)
 - [Store Extracted Images](nemo-retriever-api-reference.md)
diff --git a/docs/docs/extraction/workflow-agentic-retrieval.md b/docs/docs/extraction/workflow-agentic-retrieval.md
index 78b48cdc47..ea8a83c17d 100644
--- a/docs/docs/extraction/workflow-agentic-retrieval.md
+++ b/docs/docs/extraction/workflow-agentic-retrieval.md
@@ -8,7 +8,7 @@ NeMo Retriever Library provides ingestion, embedding, storage, and retrieval bui
 
 Use these pages together with your orchestration layer:
 
-- [Semantic retrieval](vdbs.md#semantic-retrieval), [Custom metadata and filtering](custom-metadata.md), and [Evaluate on your data](evaluate-on-your-data.md) for retrieval quality and reranking notes
+- [Semantic retrieval](vdbs.md#semantic-retrieval), [Metadata and filtering](vdbs.md#metadata-and-filtering), and [Evaluate on your data](evaluate-on-your-data.md) for retrieval quality and reranking notes
 - [Agentic retrieval (concept)](agentic-retrieval-concept.md)
 - [Evaluate on your data](evaluate-on-your-data.md), which includes retrieval evaluation guidance
 - [Release notes](releasenotes.md), which may mention agentic retrieval updates
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
index 6300e6c41a..3f72ea764f 100644
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -95,26 +95,24 @@ nav:
     # Single vector-DB page (vdbs.md). Deep links: in-page "On this page" TOC and redirects
     # (for example extraction/vector-db-partners.md → vdbs.md#vector-database-partners).
     - "Vector databases": extraction/vdbs.md
-  - "7. Retrieval & ranking":
-    - "Custom metadata and filtering": extraction/custom-metadata.md
-  - "8. Deployment & operations":
+  - "7. Deployment & operations":
     - "Ray and distributed ingest": extraction/ray-logging.md
-  - "9. Customize & extend":
+  - "8. Customize & extend":
     - Extending/Customizing NeMo Retriever Library with custom code: https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/graph#nemo-retriever-graph
     - "NimClient and custom NIM endpoints": extraction/nimclient.md
-  - "10. Integrations & ecosystem":
+  - "9. Integrations & ecosystem":
     - "Framework integrations": extraction/integrations-langchain-llamaindex-haystack.md
     - "Starter kits": extraction/notebooks/index.md
-  - "11. Evaluation & benchmarks":
+  - "10. Evaluation & benchmarks":
     - "Evaluate on your own documents": extraction/evaluate-on-your-data.md
-  - "12. Reference":
+  - "11. Reference":
     - "API guide": extraction/nemo-retriever-api-reference.md
     # TODO: after nv-ingest code removal, update this link when CLI docs are relocated.
     - "CLI reference": https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/docs/cli
     - "Quickstart: retriever CLI": reference/retriever-cli-quickstart.md
     - Environment variables: extraction/environment-config.md
     - "Metadata reference": extraction/content-metadata.md
-  - "13. Support & community":
+  - "12. Support & community":
     - Troubleshooting: extraction/troubleshoot.md
     - FAQ: extraction/faq.md
     - Contributing: extraction/contributing.md
@@ -161,6 +159,7 @@ plugins:
         extraction/ngc-api-key.md: extraction/api-keys.md
         extraction/notebooks.md: extraction/notebooks/index.md
         extraction/data-store.md: extraction/vdbs.md
+        extraction/custom-metadata.md: extraction/vdbs.md#metadata-and-filtering
         extraction/nemoretriever-parse.md: extraction/multimodal-extraction.md#text-and-layout-extraction
         extraction/supported-file-types.md: extraction/multimodal-extraction.md#supported-file-types-and-formats
         extraction/text-layout-extraction.md: extraction/multimodal-extraction.md#text-and-layout-extraction
diff --git a/nemo_retriever/tests/test_src_documentation_snippets.py b/nemo_retriever/tests/test_src_documentation_snippets.py
index dd73a713e3..0b92337a6d 100644
--- a/nemo_retriever/tests/test_src_documentation_snippets.py
+++ b/nemo_retriever/tests/test_src_documentation_snippets.py
@@ -52,7 +52,6 @@ def _iter_markdown_python_blocks() -> list[tuple[str, str]]:
 _MD_BLOCKS = _iter_markdown_python_blocks()
 _PUBLIC_RETRIEVER_DOCS = (
     "README.md",
-    "docs/docs/extraction/custom-metadata.md",
     "examples/nemo_retriever_retriever_query_metadata_filter.ipynb",
     "nemo_retriever/README.md",
     "nemo_retriever/docs/cli/README.md",