Adding Agentic Retrieval as a new retrieval mode #2018
Greptile Summary
This PR adds an opt-in LLM-driven agentic retrieval mode.

| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/agentic/retrieval.py | New module introducing AgenticRetrievalConfig (frozen dataclass), AgenticRetriever, and evaluation entry points; clean structure with good input validation and ID-mapping logic. |
| nemo_retriever/src/nemo_retriever/graph/react_agent_operator.py | Adds final_results validation, backend_top_k cap, seen-doc replay, deterministic concurrent ordering, and reasoning_effort forwarding. New has_valid_final_results/is_final_result columns correctly initialized from Optional[List] sentinel. |
| nemo_retriever/src/nemo_retriever/graph/selection_agent_operator.py | Adds _preferred_doc_ids priority chain (final_results → RRF → selection_agent → candidate_ranking); reasoning_effort forwarded; result_source column added. Priority chain intentionally bypasses LLM selection when RRF scores are present (well-tested). |
| nemo_retriever/src/nemo_retriever/graph/rrf_aggregator_operator.py | Passes has_valid_final_results and react_final_rank columns downstream; empty DataFrame updated to include new columns; logic is straightforward and correct. |
| `nemo_retriever/src/nemo_retriever/pipeline/__main__.py` | Significant ingest refactoring (flat args → IngestPlanRequest dataclasses) plus new agentic evaluation path. _run_agentic_evaluation logs full qrels/run dicts at INFO, which can produce very large messages for production BEIR runs. |
| nemo_retriever/tests/test_agentic_eval.py | New test file covering AgenticRetrievalConfig validation, BEIR/recall evaluation paths, CLI flag wiring, and rejection of invalid mode combinations; mocks placed at the boundary covering both happy and error paths. |
| nemo_retriever/tests/test_agentic_operators.py | Expanded with tests for backend_top_k cap, final-results validation/rejection, RRF-vs-selection priority, no-final-results fallback, and result_source tracking; good coverage of new operator behaviors. |
| `nemo_retriever/src/nemo_retriever/agentic/__init__.py` | Exports AgenticRetrievalConfig, AgenticRetriever, and recall helpers; run_agentic_beir_evaluation is omitted from `__all__` and the import block (prior thread). |
Sequence Diagram
sequenceDiagram
participant CLI as retriever pipeline run
participant AR as AgenticRetriever
participant RAO as ReActAgentOperator
participant LLM as LLM (OpenAI-compat)
participant VDB as Retriever / VDB
participant RRF as RRFAggregatorOperator
participant SAO as SelectionAgentOperator
CLI->>AR: retrieve(query_ids, query_texts)
AR->>RAO: process(query_df)
loop Per query (up to num_concurrent in parallel)
RAO->>VDB: "_call_retriever(query, fetch_k <= backend_top_k)"
VDB-->>RAO: hits (with seen-doc stubs)
loop "ReAct steps (<= max_steps)"
RAO->>LLM: chat(messages, tools)
LLM-->>RAO: tool_call
alt "tool == retrieve"
RAO->>VDB: _call_retriever(sub_query)
VDB-->>RAO: new hits
else "tool == final_results (validated)"
RAO-->>RAO: "set final_doc_ids, loop_done=True"
else "tool == think"
RAO-->>RAO: log thought
end
end
RAO-->>RRF: rows with has_valid_final_results / is_final_result
end
RRF-->>SAO: RRF-ranked df with react_final_rank column
SAO->>SAO: _preferred_doc_ids() priority chain
alt has react_final_rank entries
SAO-->>CLI: "result_source=final_results"
else rrf_score present (normal path)
SAO-->>CLI: "result_source=rrf"
else no rrf_score
SAO->>LLM: _select_documents
LLM-->>SAO: selected doc_ids
SAO-->>CLI: "result_source=selection_agent or candidate_ranking"
end
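The priority chain in the diagram can be sketched as a small pure function. This is a minimal illustration, not the PR's `_preferred_doc_ids` implementation: the function name, its parameters, the `select_fn` callable standing in for the LLM selection call, and the `top_k` default are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def preferred_doc_ids(
    candidates: Sequence[str],
    final_results: Optional[Sequence[str]] = None,
    rrf_scores: Optional[Dict[str, float]] = None,
    select_fn: Optional[Callable[[Sequence[str]], Sequence[str]]] = None,
    top_k: int = 10,
) -> Tuple[List[str], str]:
    """Return (doc_ids, result_source) from the first source that has data.

    Hypothetical helper mirroring the chain
    final_results -> RRF -> selection_agent -> candidate_ranking.
    """
    # 1. Validated final_results from the ReAct loop win outright.
    if final_results:
        return list(final_results)[:top_k], "final_results"

    # 2. Otherwise rank by RRF score (descending), ties broken by doc id.
    if rrf_scores:
        ranked = sorted(rrf_scores, key=lambda d: (-rrf_scores[d], d))
        return ranked[:top_k], "rrf"

    # 3. Only without RRF scores is the (assumed) LLM selector consulted.
    if select_fn is not None:
        picked = list(select_fn(candidates))
        if picked:
            return picked[:top_k], "selection_agent"

    # 4. Last resort: keep the candidate order as retrieved.
    return list(candidates)[:top_k], "candidate_ranking"
```

Returning the source label alongside the ids keeps the fallback decision auditable, which is the role the `result_source` column plays in the diagram.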
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
nemo_retriever/src/nemo_retriever/pipeline/__main__.py:600-601
**Unbounded INFO log messages for full qrels/run data**
`_qrels` and `_run` are serialised into a single `logger.info` call with no per-message size cap. On a standard BEIR split with thousands of queries (each entry joined into one Python dict repr), this produces a single log record that can easily exceed 1 MB — beyond the hard limits of common log aggregators (Elasticsearch ~256 KB, CloudWatch ~256 KB) and therefore silently dropped. The document-processing security rule also advises against logging document identifiers at INFO level. Both lines should be moved to `logger.debug`, which carries no production overhead.
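One possible shape for the fix, as a sketch only: log a bounded summary at INFO and keep the full dictionaries behind a DEBUG guard. The helper name `log_eval_inputs` and the logger name are hypothetical and do not come from the PR.

```python
import logging
from typing import Dict

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("agentic_eval_sketch")


def log_eval_inputs(
    qrels: Dict[str, Dict[str, int]],
    run: Dict[str, Dict[str, float]],
) -> None:
    """Log a small summary at INFO and the full payloads only at DEBUG."""
    # Constant-size message regardless of how many queries are present.
    logger.info("qrels: %d queries, run: %d queries", len(qrels), len(run))

    # Skip building the large repr strings unless DEBUG is enabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("qrels=%r", qrels)
        logger.debug("run=%r", run)
```

The `isEnabledFor` check is optional with `%`-style arguments, since formatting is deferred, but it makes the intent explicit.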
### Issue 2 of 2
nemo_retriever/src/nemo_retriever/graph/react_agent_operator.py:497-498
**Empty-output DataFrame omits the new schema columns**
When `_run_single_query` produces no rows for every query in the batch, `process()` returns an early-exit DataFrame with six columns. The two columns added in this PR — `has_valid_final_results` and `is_final_result` — are absent from that schema. Downstream operators guard against this with `if "has_valid_final_results" in qgroup.columns` checks, so nothing breaks today, but the output schema is inconsistent with the non-empty path and any future operator that relies on those columns without a guard would silently misbehave.
```suggestion
if not rows:
    return pd.DataFrame(
        columns=[
            "query_id", "query_text", "step_idx", "doc_id", "text", "rank",
            "has_valid_final_results", "is_final_result",
        ]
    )
```
Signed-off-by: Mahika Wason <mwason@nvidia.com>
Description
Agentic retrieval mode + BEIR / query-CSV evaluation
Summary
Adds an LLM-driven agentic retrieval strategy as an alternative to the single dense-retrieval pass, plus first-class evaluation for it (BEIR-style datasets and ad-hoc query CSVs). The change is additive: the standard retrieval path and outputs are unchanged; agentic mode reuses the existing `Retriever`/vector DB and is opt-in via `--retrieval-mode agentic`.
What's new
- `ReActAgentOperator` runs a per-query ReAct loop (issues retrieval sub-queries, accumulates candidates across steps, decides when to stop) → `RRFAggregatorOperator` fuses across steps (RRF, k=60) → `SelectionAgentOperator` does a final LLM selection, with a source-priority fallback chain (final_results → RRF → selection → candidate_ranking).
- `--evaluation-mode beir`: score against a registered benchmark: `vidore_hf` (needs `datasets`) plus CSV/JSON loaders; `recall@k` / `ndcg@k`.
- `--evaluation-mode recall`: score agentic retrieval against a query CSV (`query` + `golden_answer`), no dataset loader required (agentic-only; `pdf_page` / `pdf_only`).
- `--retrieval-mode`, `--agentic-llm-model`, `--agentic-invoke-url`, `--agentic-react-max-steps` (50), `--agentic-backend-top-k` (20), `--agentic-text-truncation` (0 = none), `--agentic-reasoning-effort` (high), `--agentic-num-concurrent` (1), and `--beir-loader/-dataset-name/-doc-id-field/-split/-query-language`.
- `agentic/README.md`; `test_agentic_eval.py` + `test_agentic_operators.py`.
Results: ViDoRe v3
Benchmarked against the reference agentic pipeline (`retrieval-bench`) under an identical, controlled setup so the comparison isolates the retrieval framework: same page-level image+text index (`llama-nemotron-embed-vl-1b-v2` embedder), same agent LLM (`llama-3.3-nemotron-super-49b-v1.5`), same agent settings (`reasoning_effort=high`, retriever pool depth 20, target top-k 10, max 50 ReAct steps), full query sets. The retrieval substrate is shared, so the numbers reflect the agent framework only.
The graph-operator implementation tracks the reference pipeline across all eight
domains on a shared substrate.
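For reference, the cross-step fusion named under What's new (RRF, k=60) can be sketched as below. This is a generic reciprocal rank fusion example under the usual definition, score(d) = Σ 1 / (k + rank), and is not taken from `RRFAggregatorOperator`; the function name `rrf_fuse` and its input shape (one ranked list of doc ids per ReAct step) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of doc ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document, with ranks starting
    at 1. Returns (doc_id, score) pairs sorted by descending score, ties
    broken by doc id so the output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Three hypothetical ReAct steps returning overlapping hits.
fused = rrf_fuse([["a", "b", "c"], ["b", "a"], ["b", "d"]])
top_ids = [doc_id for doc_id, _ in fused]
```

Documents that recur across steps accumulate score, so a page retrieved by several sub-queries outranks one that appears once at a similar position.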
Scope
Checklist