diff --git a/docs/docs/extraction/agentic-retrieval-concept.md b/docs/docs/extraction/agentic-retrieval-concept.md
index a06431a359..4a577dc2c3 100644
--- a/docs/docs/extraction/agentic-retrieval-concept.md
+++ b/docs/docs/extraction/agentic-retrieval-concept.md
@@ -7,4 +7,4 @@ NeMo Retriever Library focuses on document ingestion, embeddings, vector stores,
 **Related**
 
 - [Semantic retrieval](vdbs.md#semantic-retrieval)
--- Framework examples: [LangChain, LlamaIndex, Haystack](integrations-langchain-llamaindex-haystack.md)
+- Framework examples: [Jupyter Notebooks](notebooks/index.md)
diff --git a/docs/docs/extraction/custom-metadata.md b/docs/docs/extraction/custom-metadata.md
deleted file mode 100644
index bf47dcd1c5..0000000000
--- a/docs/docs/extraction/custom-metadata.md
+++ /dev/null
@@ -1,128 +0,0 @@
-# Custom metadata and filtering
-
-Use this documentation to attach per-document metadata during ingestion and to narrow [LanceDB](vdbs.md) search results in [NeMo Retriever Library](overview.md). Implementation details live in the package [Vector DB operators and LanceDB](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering) README.
-
-## On this page { #on-this-page }
-
-- [Attach metadata at ingestion](#attach-metadata-at-ingestion)
-- [How metadata is stored](#how-metadata-is-stored)
-- [Filter results at query time](#filter-results-at-query-time)
-- [Writing `where` predicates](#writing-where-predicates)
-- [Server-side vs client-side filters](#server-side-vs-client-side-filters)
-- [Inspect hit metadata](#inspect-hit-metadata)
-- [Limitations](#limitations)
-- [Related content](#related-content)
-
-## Attach metadata at ingestion { #attach-metadata-at-ingestion }
-
-Pass a **sidecar metadata table** on `vdb_upload` so selected columns are merged into each chunk's `content_metadata` before LanceDB upload. All three parameters must be set together:
-
-| Parameter | Purpose |
-|-----------|---------|
-| `meta_dataframe` | Path to CSV, JSON, or Parquet, or an in-memory `pandas.DataFrame` |
-| `meta_source_field` | Column that identifies each document (must match ingest paths or basenames per `meta_join_key`) |
-| `meta_fields` | Non-empty list of column names to copy into `content_metadata` |
-
-Optional `meta_join_key` controls how rows are matched to documents: `auto` (try full path then basename), `source_id` (full path), or `source_name` (basename only).
-
-```python
-import pandas as pd
-from nemo_retriever import create_ingestor
-
-meta_df = pd.DataFrame(
-    {
-        "source": ["data/woods_frost.pdf", "data/multimodal_test.pdf"],
-        "meta_a": ["alpha", "bravo"],
-        "meta_b": [10, 20],
-    }
-)
-
-ingestor = (
-    create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670")
-    .files(["data/woods_frost.pdf", "data/multimodal_test.pdf"])
-    .extract(
-        extract_text=True,
-        extract_tables=True,
-        extract_charts=True,
-        extract_images=True,
-        text_depth="page"
-    )
-    .embed()
-    .vdb_upload(
-        vdb_op="lancedb",
-        uri=lancedb_uri,
-        table_name=table_name,
-    )
-)
-results = ingestor.ingest_async().result()
-```
-
-Merge values from `meta_df` (or `file_path`) into each document's `content_metadata` before `vdb_upload`, or follow the step-by-step pattern in [metadata_and_filtered_search.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb), so category, department, and timestamp are present on the chunks LanceDB indexes.
-
-## Best Practices
-
-The following are the best practices when you work with custom metadata:
-
-- Plan metadata structure before ingestion.
-- Test filter expressions with small datasets first.
-- Consider performance implications of complex filters.
-- Validate metadata during ingestion.
-- Handle missing metadata fields gracefully.
-- Log invalid filter expressions.
-
-
-
-## Use Custom Metadata to Filter Results During Retrieval
-
-You can use custom metadata to filter documents during retrieval operations.
-For **predicate pushdown**, pass a `where` SQL predicate through [`Retriever.query`](nemo-retriever-api-reference.md) (see [Vector databases](vdbs.md)) or chain `.where(...)` on a native LanceDB `table.search(...)` query. Application-side filtering on returned hits does not change what the database evaluates—raise `top_k` if matches might sit outside the first neighbors.
-
-
-### Example filter ideas
-
-Typical keys to filter on include `category`, `department`, `priority`, and `timestamp` (use comparable ISO-8601 strings for time ranges). Encode predicates in LanceDB SQL against your table columns (often the serialized `metadata` string), or inspect parsed hit metadata after search as in the example below.
-
-### Example: Use a Filter Expression in Search
-
-After ingestion is complete, and documents are uploaded to LanceDB with metadata,
-you can narrow results in the database with a **`where`** clause, or in Python on the returned hits.
-
-**Native LanceDB (SQL pushdown):** connect, embed the query yourself (same model as ingestion), then chain `.where("")` on `table.search(...)` so filtering happens before the `limit`. Exact SQL depends on how `metadata` is stored; see [LanceDB SQL](https://lancedb.github.io/lancedb/sql/).
-
-```python
-import lancedb
-
-# Pseudocode sketch — replace YOUR_VECTOR and YOUR_PREDICATE with real values.
-db = lancedb.connect("./lancedb_data")
-table = db.open_table("nemo_retriever_collection")
-# table.search(YOUR_VECTOR, vector_column_name="vector").where(YOUR_PREDICATE).limit(10).to_list()
-```
-
-**`Retriever.query` + `where`:** LanceDB applies the predicate before ranking. For post-filter logic in Python, use a wider `top_k` first.
-
-```python
-from nemo_retriever.retriever import Retriever
-
-retriever = Retriever(
-    vdb_kwargs={"uri": "./lancedb_data", "table_name": "nemo_retriever_collection"},
-    embed_kwargs={
-        "model_name": "nvidia/llama-nemotron-embed-1b-v2",
-        "embed_model_name": "nvidia/llama-nemotron-embed-1b-v2",
-    },
-)
-
-hits = retriever.query(
-    "this is expensive",
-    top_k=16,
-    vdb_kwargs={"where": "metadata LIKE '%\"department\":\"Engineering\"%'"},
-)
-```
-
-For a runnable end-to-end flow (ingest, `Retriever.query`, and both filter modes), see [nemo_retriever_retriever_query_metadata_filter.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb).
-
-When you ingest through the **retriever service**, upload the sidecar with [`POST /v1/ingest/sidecar`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/routers/ingest.py#L1040-L1129) (multipart file; response [`SidecarUploadResponse`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/responses.py#L60-L68)), then pass the returned `sidecar_id` as `meta_dataframe_id` with `meta_source_field` and `meta_fields` in `pipeline.vdb_upload_params` on [`POST /v1/ingest`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/requests.py#L15-L32) ([`PipelineSpec`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/pipeline_spec.py#L55-L78)). Request and response shapes, form fields, and auth headers are in the service OpenAPI UI at `/docs` (or `/openapi.json`) on your retriever base URL (for example `http://localhost:7670/docs` after `retriever service start`). Do not send a raw local path as `meta_dataframe` on the service spec.
-
-## How metadata is stored { #how-metadata-is-stored }
-
-- [Vector databases](vdbs.md) — canonical LanceDB upload and retrieval guide
-- [metadata_and_filtered_search.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb) — CLI and graph ingest with sidecar metadata
diff --git a/docs/docs/extraction/deployment-options.md b/docs/docs/extraction/deployment-options.md
index 71f0b75109..b1d0a36142 100644
--- a/docs/docs/extraction/deployment-options.md
+++ b/docs/docs/extraction/deployment-options.md
@@ -36,7 +36,6 @@ environments), use a custom service image that already contains `ffmpeg` and
 ### I want examples and notebooks
 
 1. [Jupyter Notebooks](notebooks/index.md)
-2. [Integrate with LangChain, LlamaIndex, Haystack](integrations-langchain-llamaindex-haystack.md)
 
 ### I need API details and keys
 
diff --git a/docs/docs/extraction/integrations-langchain-llamaindex-haystack.md b/docs/docs/extraction/integrations-langchain-llamaindex-haystack.md
deleted file mode 100644
index 7ee0dda650..0000000000
--- a/docs/docs/extraction/integrations-langchain-llamaindex-haystack.md
+++ /dev/null
@@ -1,22 +0,0 @@
-# Integrate with LangChain, LlamaIndex, and Haystack
-
-NeMo Retriever Library is commonly used **behind** retrieval-augmented generation (RAG) apps built with popular orchestration frameworks.
-
-## Jupyter examples (LangChain and LlamaIndex)
-
-The repository includes notebooks that demonstrate multimodal RAG patterns:
-
-- [Multimodal RAG with LangChain](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/langchain_multimodal_rag.ipynb)
-- [Multimodal RAG with LlamaIndex](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/llama_index_multimodal_rag.ipynb)
-
-These are also linked from [Jupyter Notebooks](notebooks/index.md) and the [FAQ](faq.md).
-
-## Haystack
-
-Haystack-related extraction modes may appear in API tables as **deprecated** in favor of current pipeline options. For up-to-date integration patterns, prefer the Python API and CLI docs, and check [Release notes](releasenotes.md) for migration notes.
-
-## Related
-
-- [Use the Python API](nemo-retriever-api-reference.md)
-- [Use the CLI](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/docs/cli)
-- [Chunking](concepts.md#chunking), [Upload data](vdbs.md), [Filter search](custom-metadata.md)
diff --git a/docs/docs/extraction/notebooks/index.md b/docs/docs/extraction/notebooks/index.md
index 54a9787503..e1bf7fb328 100644
--- a/docs/docs/extraction/notebooks/index.md
+++ b/docs/docs/extraction/notebooks/index.md
@@ -12,7 +12,7 @@ To get started with the basics, try one of the following notebooks:
 
 - [CLI Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/cli_client_usage.ipynb)
 - [Python Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb)
-- [How to add metadata to your documents and filter searches](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb)
+- [Metadata filtering: add sidecar metadata and filter searches](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb)
 - [How to reindex a collection](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/reindex_example.ipynb)
 
 
diff --git a/docs/docs/extraction/overview.md b/docs/docs/extraction/overview.md
index f6266aa0d2..98acf286c9 100644
--- a/docs/docs/extraction/overview.md
+++ b/docs/docs/extraction/overview.md
@@ -50,4 +50,4 @@ NeMo Retriever Library supports the following file types:
 - [Deploy on Kubernetes with Helm](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/helm/README.md)
 - [NeMo Retriever Library — prerequisites / deployment](https://docs.nvidia.com/nemo/retriever/latest/extraction/overview/) (supported Helm charts)
 - [Notebooks](notebooks/index.md)
-- [NVIDIA AI Blueprints catalog](https://build.nvidia.com/explore/discover) — solution cards, enterprise RAG blueprints, and end-to-end patterns (including [Enterprise RAG — multimodal PDF data extraction](https://build.nvidia.com/nvidia/multimodal-pdf-data-extraction-for-enterprise-rag)); for integration pathways, refer to [Integrations](integrations-langchain-llamaindex-haystack.md).
+- [NVIDIA AI Blueprints catalog](https://build.nvidia.com/explore/discover) — solution cards, enterprise RAG blueprints, and end-to-end patterns (including [Enterprise RAG — multimodal PDF data extraction](https://build.nvidia.com/nvidia/multimodal-pdf-data-extraction-for-enterprise-rag)); for integration pathways, refer to [Jupyter Notebooks](notebooks/index.md).
diff --git a/docs/docs/extraction/vdbs.md b/docs/docs/extraction/vdbs.md
index be91716953..0da68f28fc 100644
--- a/docs/docs/extraction/vdbs.md
+++ b/docs/docs/extraction/vdbs.md
@@ -89,10 +89,15 @@ Semantic retrieval uses dense embeddings to find content that is similar in mean
 
 ## Metadata and filtering { #metadata-and-filtering }
 
-This page covers LanceDB upload and retrieval. **Metadata is not duplicated here.**
+Attach per-document metadata during ingestion and narrow LanceDB results at query time.
 
-- **Published guide** — [Custom metadata and filtering](custom-metadata.md) (sidecar `meta_*` on `vdb_upload`, compact JSON in LanceDB, server-side `where` on `Retriever.query`, and client-side `filter_hits_by_content_metadata`).
-- **Canonical reference** — [Vector DB operators and LanceDB — Metadata filtering](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering) in `nemo_retriever/src/nemo_retriever/vdb/README.md` (operator behavior and examples).
+Pass a **sidecar metadata table** on `vdb_upload` with `meta_dataframe`, `meta_source_field`, and `meta_fields` (all three required). Selected columns merge into each chunk's `content_metadata` before upload. During upload, that object is serialized as **compact JSON** in the LanceDB `metadata` column. Filter with server-side `where` on [`Retriever.query`](nemo-retriever-api-reference.md) or client-side `filter_hits_by_content_metadata`.
+
+**Worked example** — [nemo_retriever_retriever_query_metadata_filter.ipynb](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb) — sidecar metadata at ingest, `Retriever.query` with `where`, and client-side filters.
+
+**Retriever service** — Upload the sidecar file with [`POST /v1/ingest/sidecar`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/routers/ingest.py) (multipart upload; refer to [`SidecarUploadResponse`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/responses.py#L60-L68)), then pass the returned `sidecar_id` as `meta_dataframe_id` with `meta_source_field` and `meta_fields` in `pipeline.vdb_upload_params` on [`POST /v1/ingest`](https://github.com/NVIDIA/NeMo-Retriever/blob/main/nemo_retriever/src/nemo_retriever/service/models/requests.py). Do not pass a local filesystem path as `meta_dataframe` in the service spec. Request shapes and form fields are in the OpenAPI UI at `/docs` on your retriever base URL (for example `http://localhost:7670/docs` after `retriever service start`).
+
+For parameter tables, SQL predicate patterns, and operator behavior, refer to [Vector DB operators and LanceDB — Metadata filtering](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering) in `nemo_retriever/src/nemo_retriever/vdb/README.md`. The worked example notebook is also listed on [Notebooks for NeMo Retriever Library](notebooks/index.md).
 
 ## LanceDB deployment characteristics { #lancedb-deployment-characteristics }
 
@@ -142,7 +147,7 @@ Testing and release cadence for these integrations follow the owning project (RA
 
 ### More information (embeddings & custom `VDB`) { #vector-database-partners-more-info }
 
-- [Custom metadata and filtering](custom-metadata.md) and the package [VDB README (metadata filtering)](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering)
+- [Metadata filtering notebook](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb) and the package [VDB README (metadata filtering)](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb#metadata-filtering)
 - [Multimodal embeddings (VLM)](embedding.md)
 - [NeMo Retriever Text Embedding NIM](https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/overview.html)
 - [NVIDIA NIM catalog](https://build.nvidia.com/) for embedding and retrieval-related NIMs
@@ -155,7 +160,7 @@ To implement a custom operator, follow the `VDB` abstract interface described in
 
 ## Related Topics { #related-topics }
 
-- [Custom metadata and filtering](custom-metadata.md)
+- [Metadata filtering: add sidecar metadata and filter searches](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/nemo_retriever_retriever_query_metadata_filter.ipynb)
 - [Vector DB operators and LanceDB (source)](https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/vdb)
 - [Use the NeMo Retriever Library Python API](nemo-retriever-api-reference.md)
 - [Store Extracted Images](nemo-retriever-api-reference.md)
diff --git a/docs/docs/extraction/workflow-agentic-retrieval.md b/docs/docs/extraction/workflow-agentic-retrieval.md
index 78b48cdc47..ea8a83c17d 100644
--- a/docs/docs/extraction/workflow-agentic-retrieval.md
+++ b/docs/docs/extraction/workflow-agentic-retrieval.md
@@ -8,7 +8,7 @@ NeMo Retriever Library provides ingestion, embedding, storage, and retrieval bui
 
 Use these pages together with your orchestration layer:
 
-- [Semantic retrieval](vdbs.md#semantic-retrieval), [Custom metadata and filtering](custom-metadata.md), and [Evaluate on your data](evaluate-on-your-data.md) for retrieval quality and reranking notes
+- [Semantic retrieval](vdbs.md#semantic-retrieval), [Metadata and filtering](vdbs.md#metadata-and-filtering), and [Evaluate on your data](evaluate-on-your-data.md) for retrieval quality and reranking notes
 - [Agentic retrieval (concept)](agentic-retrieval-concept.md)
 - [Evaluate on your data](evaluate-on-your-data.md), which includes retrieval evaluation guidance
 - [Release notes](releasenotes.md), which may mention agentic retrieval updates
diff --git a/docs/docs/extraction/workflow-e2e-blueprints.md b/docs/docs/extraction/workflow-e2e-blueprints.md
index 16aa4bb3d6..7e86f5a80d 100644
--- a/docs/docs/extraction/workflow-e2e-blueprints.md
+++ b/docs/docs/extraction/workflow-e2e-blueprints.md
@@ -5,4 +5,4 @@ Use these external resources for end-to-end RAG implementations with NeMo Retrie
 - [Enterprise RAG - multimodal PDF data extraction](https://build.nvidia.com/nvidia/multimodal-pdf-data-extraction-for-enterprise-rag)
 - [NVIDIA AI Blueprints catalog](https://build.nvidia.com/explore/discover)
 
-For framework-specific integration patterns, see [Framework integrations](integrations-langchain-llamaindex-haystack.md).
+For framework-specific integration patterns, see [Jupyter Notebooks](notebooks/index.md).
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
index 6300e6c41a..7d7f82904d 100644
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -95,26 +95,23 @@ nav:
     # Single vector-DB page (vdbs.md). Deep links: in-page "On this page" TOC and redirects
     # (for example extraction/vector-db-partners.md → vdbs.md#vector-database-partners).
     - "Vector databases": extraction/vdbs.md
-  - "7. Retrieval & ranking":
-    - "Custom metadata and filtering": extraction/custom-metadata.md
-  - "8. Deployment & operations":
+  - "7. Deployment & operations":
     - "Ray and distributed ingest": extraction/ray-logging.md
-  - "9. Customize & extend":
+  - "8. Customize & extend":
     - Extending/Customizing NeMo Retriever Library with custom code: https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/src/nemo_retriever/graph#nemo-retriever-graph
     - "NimClient and custom NIM endpoints": extraction/nimclient.md
-  - "10. Integrations & ecosystem":
-    - "Framework integrations": extraction/integrations-langchain-llamaindex-haystack.md
+  - "9. Integrations & ecosystem":
     - "Starter kits": extraction/notebooks/index.md
-  - "11. Evaluation & benchmarks":
+  - "10. Evaluation & benchmarks":
     - "Evaluate on your own documents": extraction/evaluate-on-your-data.md
-  - "12. Reference":
+  - "11. Reference":
     - "API guide": extraction/nemo-retriever-api-reference.md
     # TODO: after nv-ingest code removal, update this link when CLI docs are relocated.
     - "CLI reference": https://github.com/NVIDIA/NeMo-Retriever/tree/main/nemo_retriever/docs/cli
     - "Quickstart: retriever CLI": reference/retriever-cli-quickstart.md
     - Environment variables: extraction/environment-config.md
     - "Metadata reference": extraction/content-metadata.md
-  - "13. Support & community":
+  - "12. Support & community":
     - Troubleshooting: extraction/troubleshoot.md
     - FAQ: extraction/faq.md
     - Contributing: extraction/contributing.md
@@ -161,6 +158,8 @@ plugins:
         extraction/ngc-api-key.md: extraction/api-keys.md
         extraction/notebooks.md: extraction/notebooks/index.md
         extraction/data-store.md: extraction/vdbs.md
+        extraction/custom-metadata.md: extraction/vdbs.md#metadata-and-filtering
+        extraction/integrations-langchain-llamaindex-haystack.md: extraction/notebooks/index.md
         extraction/nemoretriever-parse.md: extraction/multimodal-extraction.md#text-and-layout-extraction
         extraction/supported-file-types.md: extraction/multimodal-extraction.md#supported-file-types-and-formats
         extraction/text-layout-extraction.md: extraction/multimodal-extraction.md#text-and-layout-extraction
diff --git a/nemo_retriever/tests/test_src_documentation_snippets.py b/nemo_retriever/tests/test_src_documentation_snippets.py
index dd73a713e3..924a45bdb8 100644
--- a/nemo_retriever/tests/test_src_documentation_snippets.py
+++ b/nemo_retriever/tests/test_src_documentation_snippets.py
@@ -52,7 +52,7 @@ def _iter_markdown_python_blocks() -> list[tuple[str, str]]:
 _MD_BLOCKS = _iter_markdown_python_blocks()
 _PUBLIC_RETRIEVER_DOCS = (
     "README.md",
-    "docs/docs/extraction/custom-metadata.md",
+    "docs/docs/extraction/vdbs.md",
     "examples/nemo_retriever_retriever_query_metadata_filter.ipynb",
     "nemo_retriever/README.md",
     "nemo_retriever/docs/cli/README.md",