Adding Agentic Retrieval as a new retrieveral mode by mahikaw · Pull Request #2018 · NVIDIA/NeMo-Retriever

mahikaw · 2026-05-12T00:12:54Z

Description

Agentic retrieval mode + BEIR / query-CSV evaluation

Summary

Adds an LLM-driven agentic retrieval strategy as an alternative to the single dense-retrieval pass, plus first-class evaluation for it (BEIR-style datasets and ad-hoc query CSVs). Additive — the standard retrieval path and outputs are unchanged; agentic mode reuses the existing Retriever/vector DB and is opt-in via --retrieval-mode agentic.

What's new

Agentic retrieval — ReActAgentOperator runs a per-query ReAct loop (issues retrieval sub-queries, accumulates candidates across steps, decides when to stop) → RRFAggregatorOperator fuses across steps (RRF, k=60) → SelectionAgentOperator does a final LLM selection, with a source-priority fallback chain (final_results → RRF → selection → candidate_ranking).
--evaluation-mode beir — score against a registered benchmark: vidore_hf (needs datasets) plus CSV/JSON loaders; recall@k / ndcg@k.
--evaluation-mode recall — score agentic retrieval against a query CSV (query + golden_answer), no dataset loader required (agentic-only; pdf_page/pdf_only).
CLI flags — --retrieval-mode, --agentic-llm-model, --agentic-invoke-url, --agentic-react-max-steps (50), --agentic-backend-top-k (20), --agentic-text-truncation (0 = none), --agentic-reasoning-effort (high), --agentic-num-concurrent (1), and --beir-loader/-dataset-name/-doc-id-field/-split/-query-language.
Docs & tests — README "Agentic retrieval evaluation" section; agentic/README.md; test_agentic_eval.py + test_agentic_operators.py.
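
The cross-step fusion named above (RRF, k=60) can be illustrated with a minimal sketch. This is not the code of RRFAggregatorOperator; the function name `rrf_fuse` and the tie-break by doc ID are assumptions made for the example, and only the standard reciprocal-rank-fusion formula `1 / (k + rank)` is relied on.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked doc-id lists with Reciprocal Rank Fusion.

    Each list is one retrieval step; a document's score is the sum of
    1 / (k + rank) over every list it appears in (rank starts at 1).
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by doc id so the order is deterministic
    # (the tie-break rule is an assumption of this sketch).
    return sorted(scores, key=lambda d: (-scores[d], d))


# Three hypothetical ReAct steps, each returning a ranked list of doc ids.
steps = [["a", "b", "c"], ["b", "a", "d"], ["b", "e"]]
fused = rrf_fuse(steps)
```

Documents that recur across steps (here `b` and `a`) rise to the top even when no single step ranks them first, which is why RRF suits a loop that issues overlapping sub-queries.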

Results — ViDoRe v3

Benchmarked against the reference agentic pipeline (retrieval-bench) under an
identical, controlled setup so the comparison isolates the retrieval
framework: same page-level image+text index (llama-nemotron-embed-vl-1b-v2
embedder), same agent LLM (llama-3.3-nemotron-super-49b-v1.5), same agent
settings (reasoning_effort=high, retriever pool depth 20, target top-k 10,
max 50 ReAct steps), full query sets. The retrieval substrate is shared, so the
numbers reflect the agent framework only.

Domain	recall@10 (ref / this PR)	nDCG@10 (ref / this PR)
computer_science	0.7431 / 0.7234	0.7396 / 0.7182
energy	0.6975 / 0.6612	0.6369 / 0.6274
finance_en	0.6406 / 0.6134	0.6109 / 0.5951
finance_fr	0.4750 / 0.4491	0.4182 / 0.4008
hr	0.5775 / 0.5583	0.5631 / 0.5523
industrial	0.4695 / 0.4636	0.4543 / 0.4615
pharmaceuticals	0.6724 / 0.6711	0.6449 / 0.6439
physics	0.4560 / 0.4353	0.4373 / 0.4133
Macro avg	0.5914 / 0.5719	0.5632 / 0.5516

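The graph-operator implementation tracks the reference pipeline across all eight
domains on a shared substrate: macro-average recall@10 is 0.0195 lower and
nDCG@10 is 0.0116 lower. The largest per-domain recall@10 gap is on energy
(0.0363), and nDCG@10 is slightly higher than the reference on industrial
(0.4615 vs 0.4543).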
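
For readers who want to sanity-check the columns above, here is a minimal sketch of recall@k, nDCG@k, and the macro average. It assumes one common convention (graded qrels, linear gain, log2 discount) and unique doc IDs per ranking; the helper names are illustrative, and the evaluator used in this PR may differ in detail.

```python
import math
from typing import Dict, List


def recall_at_k(ranked: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """Fraction of relevant documents that appear in the top k."""
    positives = {doc for doc, rel in qrels.items() if rel > 0}
    if not positives:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in positives)
    return hits / len(positives)


def ndcg_at_k(ranked: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(qrels.get(doc, 0) / math.log2(i + 2) for i, doc in enumerate(ranked[:k]))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: two relevant pages, one retrieved at rank 1 and one at rank 3.
ranked = ["d1", "d3", "d2"]
qrels = {"d1": 1, "d2": 1}
r2 = recall_at_k(ranked, qrels, k=2)
n3 = ndcg_at_k(ranked, qrels, k=3)

# Macro average = unweighted mean over domains; these are the "this PR"
# recall@10 values copied from the table above.
pr_recall = [0.7234, 0.6612, 0.6134, 0.4491, 0.5583, 0.4636, 0.6711, 0.4353]
macro_recall = sum(pr_recall) / len(pr_recall)
```

The macro average weights every domain equally regardless of how many queries it contains, so a small domain moves the headline number as much as a large one.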

Scope

No changes to the standard retrieval path or shared modules; opt-in.
Metric/log format follows existing pipeline conventions.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

greptile-apps · 2026-06-09T20:02:28Z

Greptile Summary

This PR adds an opt-in LLM-driven agentic retrieval mode (--retrieval-mode agentic) alongside first-class BEIR and CSV recall evaluation, without touching the existing standard retrieval path. The core pipeline is ReActAgentOperator (per-query ReAct loop) → RRFAggregatorOperator (cross-step RRF fusion) → SelectionAgentOperator (priority-based selection: final_results → RRF → selection_agent → candidate_ranking); notably, the LLM selection step is intentionally bypassed whenever RRF scores are present, which is the normal path.

nemo_retriever/src/nemo_retriever/agentic/retrieval.py — new AgenticRetrievalConfig + AgenticRetriever; wires the graph operators onto the existing Retriever; provides run_agentic_recall_evaluation and run_agentic_beir_evaluation entry points.
graph/react_agent_operator.py, rrf_aggregator_operator.py, selection_agent_operator.py — extended with has_valid_final_results/is_final_result columns, backend-depth cap, final-results validation, and a deterministic ordering fix for concurrent query batches.
pipeline/__main__.py — ingest stage refactored into structured IngestPlanRequest dataclasses; _build_ingestor narrowed to service-only; new --retrieval-mode / agentic CLI flags added; standard evaluation path unchanged.

Confidence Score: 5/5

The agentic retrieval path is entirely additive and opt-in; the standard retrieval, ingest, and VDB paths are unchanged.

The new operators are well-tested (happy path, timeout/no-final-results, rejection, RRF priority, concurrency ordering). The has_valid_final_results sentinel is correctly initialised as None rather than [], so the RRF fallback activates correctly when the agent exhausts steps without calling final_results. The two findings are logging style concerns that do not affect correctness.

nemo_retriever/src/nemo_retriever/pipeline/__main__.py — the unbounded INFO log of full qrels/run dicts in _run_agentic_evaluation is worth a quick look before running against large BEIR splits.

Important Files Changed

Filename	Overview
nemo_retriever/src/nemo_retriever/agentic/retrieval.py	New module introducing AgenticRetrievalConfig (frozen dataclass), AgenticRetriever, and evaluation entry points; clean structure with good input validation and ID-mapping logic.
nemo_retriever/src/nemo_retriever/graph/react_agent_operator.py	Adds final_results validation, backend_top_k cap, seen-doc replay, deterministic concurrent ordering, and reasoning_effort forwarding. New has_valid_final_results/is_final_result columns correctly initialized from Optional[List] sentinel.
nemo_retriever/src/nemo_retriever/graph/selection_agent_operator.py	Adds _preferred_doc_ids priority chain (final_results → RRF → selection_agent → candidate_ranking); reasoning_effort forwarded; result_source column added. Priority chain intentionally bypasses LLM selection when RRF scores are present (well-tested).
nemo_retriever/src/nemo_retriever/graph/rrf_aggregator_operator.py	Passes has_valid_final_results and react_final_rank columns downstream; empty DataFrame updated to include new columns; logic is straightforward and correct.
nemo_retriever/src/nemo_retriever/pipeline/__main__.py	Significant ingest refactoring (flat args → IngestPlanRequest dataclasses) plus new agentic evaluation path. _run_agentic_evaluation logs full qrels/run dicts at INFO, which can produce very large messages for production BEIR runs.
nemo_retriever/tests/test_agentic_eval.py	New test file covering AgenticRetrievalConfig validation, BEIR/recall evaluation paths, CLI flag wiring, and rejection of invalid mode combinations; mocks placed at the boundary covering both happy and error paths.
nemo_retriever/tests/test_agentic_operators.py	Expanded with tests for backend_top_k cap, final-results validation/rejection, RRF-vs-selection priority, no-final-results fallback, and result_source tracking; good coverage of new operator behaviors.
nemo_retriever/src/nemo_retriever/agentic/__init__.py	Exports AgenticRetrievalConfig, AgenticRetriever, and recall helpers; run_agentic_beir_evaluation is omitted from __all__ and the import block (prior thread).

Sequence Diagram

sequenceDiagram
    participant CLI as retriever pipeline run
    participant AR as AgenticRetriever
    participant RAO as ReActAgentOperator
    participant LLM as LLM (OpenAI-compat)
    participant VDB as Retriever / VDB
    participant RRF as RRFAggregatorOperator
    participant SAO as SelectionAgentOperator

    CLI->>AR: retrieve(query_ids, query_texts)
    AR->>RAO: process(query_df)

    loop Per query (up to num_concurrent in parallel)
        RAO->>VDB: "_call_retriever(query, fetch_k <= backend_top_k)"
        VDB-->>RAO: hits (with seen-doc stubs)
        loop "ReAct steps (<= max_steps)"
            RAO->>LLM: chat(messages, tools)
            LLM-->>RAO: tool_call
            alt "tool == retrieve"
                RAO->>VDB: _call_retriever(sub_query)
                VDB-->>RAO: new hits
            else "tool == final_results (validated)"
                RAO-->>RAO: "set final_doc_ids, loop_done=True"
            else "tool == think"
                RAO-->>RAO: log thought
            end
        end
        RAO-->>RRF: rows with has_valid_final_results / is_final_result
    end

    RRF-->>SAO: RRF-ranked df with react_final_rank column

    SAO->>SAO: _preferred_doc_ids() priority chain
    alt has react_final_rank entries
        SAO-->>CLI: "result_source=final_results"
    else rrf_score present (normal path)
        SAO-->>CLI: "result_source=rrf"
    else no rrf_score
        SAO->>LLM: _select_documents
        LLM-->>SAO: selected doc_ids
        SAO-->>CLI: "result_source=selection_agent or candidate_ranking"
    end
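
The selection branches at the end of the diagram can be read as a simple fallback chain. The sketch below is an illustration only: `preferred_doc_ids`, its parameters, and the `llm_select` callback are hypothetical stand-ins, not the signature of `_preferred_doc_ids` in SelectionAgentOperator, and treating an empty list the same as a missing one is an assumption of this example.

```python
from typing import Callable, Dict, List, Optional, Tuple


def preferred_doc_ids(
    final_doc_ids: Optional[List[str]],
    rrf_scores: Optional[Dict[str, float]],
    candidates: List[str],
    llm_select: Callable[[List[str]], Optional[List[str]]],
    top_k: int = 10,
) -> Tuple[List[str], str]:
    """Return (doc_ids, result_source) using the first source that has data."""
    # 1. Validated final_results from the ReAct agent, if it produced any.
    if final_doc_ids:
        return final_doc_ids[:top_k], "final_results"
    # 2. RRF scores (the normal path): rank by score, no LLM call needed.
    if rrf_scores:
        ranked = sorted(rrf_scores, key=lambda d: (-rrf_scores[d], d))
        return ranked[:top_k], "rrf"
    # 3. Ask the LLM to pick from the candidates.
    selected = llm_select(candidates)
    if selected:
        return selected[:top_k], "selection_agent"
    # 4. Last resort: keep the candidates in their existing order.
    return candidates[:top_k], "candidate_ranking"


# The agent ran out of steps (no final_results), so RRF decides the order.
docs, source = preferred_doc_ids(
    final_doc_ids=None,
    rrf_scores={"a": 0.016, "b": 0.049},
    candidates=["a", "b"],
    llm_select=lambda c: None,  # stub; never reached on this path
)
```

Returning the source label alongside the IDs mirrors the `result_source` column in the diagram and makes it easy to see how often each fallback fires.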

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nemo_retriever/src/nemo_retriever/pipeline/__main__.py:600-601
**Unbounded INFO log messages for full qrels/run data**

`_qrels` and `_run` are serialised into a single `logger.info` call with no per-message size cap. On a standard BEIR split with thousands of queries (each entry joined into one Python dict repr), this produces a single log record that can easily exceed 1 MB — beyond the hard limits of common log aggregators (Elasticsearch ~256 KB, CloudWatch ~256 KB) and therefore silently dropped. The document-processing security rule also advises against logging document identifiers at INFO level. Both lines should be moved to `logger.debug`, which carries no production overhead.
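
One possible shape for that fix is sketched below. The function name `log_eval_inputs` and the logger name are hypothetical, and the INFO-level count summary is an addition of this example rather than part of the review suggestion; only standard `logging` calls are used.

```python
import logging

logger = logging.getLogger("agentic_eval_example")  # hypothetical logger name


def log_eval_inputs(qrels: dict, run: dict) -> None:
    """Log a bounded summary at INFO and the full payload only at DEBUG."""
    # INFO record stays small and contains no document identifiers.
    logger.info(
        "agentic eval inputs: %d qrels queries, %d run queries",
        len(qrels),
        len(run),
    )
    # %-style arguments are only formatted if a DEBUG record is actually
    # emitted, so the large repr is not built when DEBUG is disabled.
    logger.debug("qrels=%r", qrels)
    logger.debug("run=%r", run)
```

With the logger at its usual INFO level, a call such as `log_eval_inputs(qrels, run)` emits a single short line; setting the level to DEBUG restores the full dumps for local troubleshooting.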

### Issue 2 of 2
nemo_retriever/src/nemo_retriever/graph/react_agent_operator.py:497-498
**Empty-output DataFrame omits the new schema columns**

When `_run_single_query` produces no rows for every query in the batch, `process()` returns an early-exit DataFrame with six columns. The two columns added in this PR — `has_valid_final_results` and `is_final_result` — are absent from that schema. Downstream operators guard against this with `if "has_valid_final_results" in qgroup.columns` checks, so nothing breaks today, but the output schema is inconsistent with the non-empty path and any future operator that relies on those columns without a guard would silently misbehave.

```suggestion
        if not rows:
            return pd.DataFrame(
                columns=["query_id", "query_text", "step_idx", "doc_id", "text", "rank", "has_valid_final_results", "is_final_result"]
            )
```
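
A complementary way to keep the two paths from drifting apart is to define the column list once and build both the empty and non-empty frames from it. The sketch below is an assumption about how that could look; `REACT_OUTPUT_COLUMNS` and `rows_to_frame` are hypothetical names, not identifiers from the PR.

```python
import pandas as pd

# Single source of truth for the operator's output schema (hypothetical name).
REACT_OUTPUT_COLUMNS = [
    "query_id",
    "query_text",
    "step_idx",
    "doc_id",
    "text",
    "rank",
    "has_valid_final_results",
    "is_final_result",
]


def rows_to_frame(rows: list) -> pd.DataFrame:
    """Build the output frame; an empty `rows` list yields the same columns."""
    # Passing columns= in both cases fixes the column set and order, and any
    # key missing from a row dict shows up as NaN rather than a missing column.
    return pd.DataFrame(rows, columns=REACT_OUTPUT_COLUMNS)


empty = rows_to_frame([])
one = rows_to_frame([{"query_id": "q1", "doc_id": "d1", "rank": 1}])
```

Because both frames come from the same list, a column added later appears in the early-exit path automatically.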

Reviews (2): Last reviewed commit: "cleanup"

Signed-off-by: Mahika Wason <mwason@nvidia.com>

mahikaw changed the title ~~Agentic Retrieval integration into retriever pipeline~~ Adding Agentic Retrieval as a new retrieveral mode May 12, 2026

mahikaw force-pushed the dev/mahikaw/agentic_retrieval branch from 44daf00 to 4faa3c6 Compare June 9, 2026 16:34

mahikaw marked this pull request as ready for review June 9, 2026 19:47

mahikaw requested review from a team as code owners June 9, 2026 19:47

mahikaw requested a review from ChrisJar June 9, 2026 19:47

mahikaw force-pushed the dev/mahikaw/agentic_retrieval branch from ed278c7 to 054256a Compare June 9, 2026 19:56

greptile-apps Bot reviewed Jun 9, 2026

mahikaw force-pushed the dev/mahikaw/agentic_retrieval branch from 054256a to ce71d17 Compare June 9, 2026 20:02

mahikaw added 6 commits June 9, 2026 20:26

agentic retrieval init

59a1c59

Signed-off-by: Mahika Wason <mwason@nvidia.com>

param defaults updated

b915c46

Signed-off-by: Mahika Wason <mwason@nvidia.com>

adding Beir evaluation wiring and pinning down defaults

bf7fd58

Signed-off-by: Mahika Wason <mwason@nvidia.com>

cleanup

9aaca6a

Signed-off-by: Mahika Wason <mwason@nvidia.com>

cleanup

9e12581

Signed-off-by: Mahika Wason <mwason@nvidia.com>

added review fixes

8c0af28

Signed-off-by: Mahika Wason <mwason@nvidia.com>

mahikaw force-pushed the dev/mahikaw/agentic_retrieval branch from ce71d17 to 8c0af28 Compare June 9, 2026 20:29

mahikaw wants to merge 6 commits into main from dev/mahikaw/agentic_retrieval
