15 changes: 7 additions & 8 deletions skills/nemo-retriever/BENCHMARK.md
@@ -9,7 +9,7 @@ This benchmark summarizes 3-Tier Evaluation from NVSkills-Eval results for the s
- Skill: `nemo-retriever`
- Evaluation date: 2026-05-29
- NVSkills-Eval profile: `external`
- Overall verdict: FAIL
- Overall verdict: PASS
- Tier 3 live agent evaluation: not available in this report

## Agents Used
@@ -40,7 +40,7 @@ Tier 3 dimension rollup was not available in this report.

## Tier 1: Static Validation Summary

Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 20 total findings.
Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 21 total findings.

Top findings:

@@ -52,14 +52,13 @@ Top findings:

## Tier 2: Deduplication Summary

Tier 2 validation reported findings. NVSkills-Eval ran 2 checks and found 1 total findings.
Tier 2 validation passed. NVSkills-Eval ran 2 checks and found 0 total findings.

Top findings:
Notable observations:

- HIGH DUPLICATE/duplicate: Duplicate content found across references/cli/query.md and references/pitfalls.md:
"## Common failure modes" in references/cli/query.md (lines 78-90)
vs "## Failure modes (expected, not errors)" in references/pitfalls.md (lines 18-26) (`references/cli/query.md:78`)
- Context Deduplication: Collected 9 file(s)
- Inter-Skill Deduplication: Parsed skill 'nemo-retriever': 276 char description

## Publication Recommendation

The skill should be reviewed before NVSkills-Eval publication. Skill owners should address the findings above and rerun NVSkills-Eval to refresh this benchmark.
The skill is suitable to proceed toward NVSkills-Eval publication based on this benchmark. Skill owners should keep this file with the skill and refresh it when the evaluation dataset, skill behavior, or target agents materially change.
6 changes: 3 additions & 3 deletions skills/nemo-retriever/SKILL.md
@@ -26,16 +26,16 @@ If `command -v retriever` returns nothing, follow `references/install.md` to ins
| Turn type | Read this once | Then execute |
| :--- | :--- | :--- |
| **Setup turn** (first turn — `./lancedb/nv-ingest.lance` doesn't exist) | `references/setup.md` | Build the index |
| **Query turn** (every subsequent turn — user asks a question) | `references/query.md` | One `retriever query` call, then `Write` `./output.json` |
| **Query turn** (every subsequent turn — user asks a question) | `references/query.md` | One `retriever query` call, then synthesize the answer |
| Anything errored or returned empty | `references/pitfalls.md` | Apply the named recovery; do not improvise |

For the full `retriever ingest` / `retriever query` CLI specs, see `references/cli/ingest.md` and `references/cli/query.md`. You do not need these for routine turns — `<RETRIEVER_VENV>/bin/retriever <subcommand> --help` is faster.

## Hard limits (apply to every turn)

- **Setup turn**: build the index in one shell command (see `references/setup.md`). STOP after the index lands.
- **Query turn**: at most **2 Bash calls** — 1 `retriever query`, +1 optional targeted text-extract per `references/query.md`. Then `Write` `./output.json` and STOP.
- **No narration between tool calls.** Tokens you emit between calls become input + cached input for every later turn — quadratic cost. Go straight from reading the summary to writing the JSON file.
- **Query turn**: at most **2 Bash calls** — 1 `retriever query`, +1 optional targeted text-extract per `references/query.md`. Then STOP.
- **No narration between tool calls.** Tokens you emit between calls become input + cached input for every later turn — quadratic cost. Go straight from reading the summary to producing the answer.
- **Banned**: `TodoWrite`, Glob, Grep, `Read` of whole PDFs, re-running setup, spawning subagents, speculative "confirmation" calls.

Long query turns (5+ tool calls, 1M+ cache-read tokens) cost ~5× a disciplined turn and almost always still produce the wrong answer. **Answering partially beats timing out.**
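
The quadratic-cost claim in the hard limits can be checked with simple arithmetic. The sketch below is an illustrative model, not a measurement: it assumes each tool call re-reads the entire accumulated context, and the parameters `base_context` and `tokens_per_step` are made-up values chosen only to show the shape of the growth.

```python
def total_input_tokens(calls: int, base_context: int, tokens_per_step: int) -> int:
    """Total tokens read across `calls` tool calls when each call re-reads
    the full context and then appends `tokens_per_step` new tokens."""
    total = 0
    context = base_context
    for _ in range(calls):
        total += context            # the whole context so far is read again
        context += tokens_per_step  # narration and tool output grow the context
    return total


# Hypothetical figures: a 20k-token starting context, 5k tokens added per call.
disciplined = total_input_tokens(calls=2, base_context=20_000, tokens_per_step=5_000)
long_turn = total_input_tokens(calls=8, base_context=20_000, tokens_per_step=5_000)
print(disciplined, long_turn)
```

Under this model the total is `calls * base_context + tokens_per_step * calls * (calls - 1) / 2`, so the second term grows with the square of the call count.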
9 changes: 4 additions & 5 deletions skills/nemo-retriever/references/pitfalls.md
@@ -17,12 +17,11 @@ For an unlisted subcommand: `<RETRIEVER_VENV>/bin/retriever <subcommand> --help`

## Failure modes (expected, not errors)

- **First `ingest` takes ~60s+** — vLLM warmup. Expected.
- **First `query` takes ~10–15s** — embedder cold-start. Expected.
- **Empty result** — ingest didn't run. Use the fallback above.
- **Empty result** — ingest didn't run. **Use the fallback above** (don't re-ingest).
- **`Clamping num_partitions ...`** — informational on tiny corpora, not an error.
- **Low-relevance top hit on tiny corpus** — look at `_distance` *gaps* between hits, not absolute values.
- **Page-element-detection warnings during ingest** — non-fatal as long as the embedding step itself succeeds (and they're silenced by `--quiet` on a successful run).
- **Page-element-detection warnings during ingest** — non-fatal; silenced by `--quiet`.

For cold-start latencies, `Table not found` errors, and low-relevance diagnostics see `references/cli/query.md` and `references/cli/ingest.md`.
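
The advice to read `_distance` gaps rather than absolute values can be sketched as follows. This is a hypothetical helper, not part of the skill: it assumes hits arrive as a list of dicts with a numeric `_distance` key, already sorted ascending, and the `min_gap` threshold is an arbitrary illustration rather than a recommended value.

```python
def distance_gaps(hits: list[dict]) -> list[float]:
    """Differences between consecutive `_distance` values (hits sorted ascending)."""
    distances = [float(h["_distance"]) for h in hits]
    return [later - earlier for earlier, later in zip(distances, distances[1:])]


def top_hit_stands_out(hits: list[dict], min_gap: float = 0.1) -> bool:
    """True when the best hit is separated from the runner-up by at least `min_gap`.
    With fewer than two hits there is no gap to judge, so return False."""
    gaps = distance_gaps(hits)
    return bool(gaps) and gaps[0] >= min_gap


# Hypothetical results: the first hit is clearly closer than the rest.
clear = [{"_distance": 0.42}, {"_distance": 0.71}, {"_distance": 0.73}]
# Hypothetical results: all hits are about equally far, so rank order means little.
flat = [{"_distance": 0.70}, {"_distance": 0.71}, {"_distance": 0.73}]
print(top_hit_stands_out(clear), top_hit_stands_out(flat))
```

On a tiny corpus every hit may look far away in absolute terms, which is why the comparison here is relative.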

## You ran more than 2 Bash calls on a query turn

2 changes: 1 addition & 1 deletion skills/nemo-retriever/references/query.md
@@ -2,7 +2,7 @@

## Filename fast path — try BEFORE `retriever query`

If the user's question literally contains a PDF basename from `./pdfs/` (stem ≥6 chars, with or without `.pdf`, case-insensitive), skip semantic search. Direct pdfium extraction on the named file is faster and avoids semantic-search misses — the right doc is given, and pages rank by query-token overlap.
If the user's question literally contains a PDF basename from `./pdfs/` **including the `.pdf` extension** (stem ≥6 chars, case-insensitive), skip semantic search. Direct pdfium extraction on the named file is faster and avoids semantic-search misses — the right doc is given, and pages rank by query-token overlap.
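
The revised matching rule can be illustrated with a standalone restatement. The function name `match_full_basenames`, the sample filenames, and the sample question are hypothetical; the threshold of 6 follows the "stem ≥6 chars" wording above.

```python
import os

MIN_STEM_LEN = 6  # taken from the "stem >= 6 chars" rule; assumed to be the real constant


def match_full_basenames(query: str, basenames: list[str]) -> list[str]:
    """Return basenames whose full name, `.pdf` included, appears in the query
    (case-insensitive). A bare stem in the query does not count as a match."""
    query_lower = query.lower()
    matches = []
    for name in basenames:
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".pdf" or len(stem) < MIN_STEM_LEN:
            continue
        if name.lower() in query_lower:
            matches.append(name)
    return matches


# Hypothetical corpus and question.
pdfs = ["report.pdf", "Q3-financials.pdf", "notes.pdf"]
question = "Summarize the market report and compare it with q3-financials.pdf"
print(match_full_basenames(question, pdfs))
```

Here the word "report" in the question no longer triggers the fast path for `report.pdf`, while the explicitly named `q3-financials.pdf` still does; `notes.pdf` is skipped because its stem is shorter than six characters.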

```bash
<RETRIEVER_VENV>/bin/python <skill_dir>/scripts/filename_fast_path.py "<the user's question>"
69 changes: 44 additions & 25 deletions skills/nemo-retriever/scripts/filename_fast_path.py
@@ -1,8 +1,8 @@
"""Query-turn filename fast path for the nemo-retriever skill.

Reads `./pdfs/` from the current working directory. If the query string
literally contains any PDF basename (with or without the `.pdf` extension,
stem ≥6 chars, case-insensitive), runs `retriever pdf stage page-elements`
literally contains any PDF basename **including the `.pdf` extension**
(stem ≥6 chars, case-insensitive), runs `retriever pdf stage page-elements`
on each matched file via pdfium, ranks pages by query-token frequency,
and emits a top-10 ranking + the top page's raw text.

@@ -21,8 +21,10 @@
pages, up to 10), followed by the top-
ranked page's raw text (first 4000 chars).

Exit code is 0 in all three success outcomes; non-zero only on hard errors
(missing ./pdfs, page-elements subprocess failure, malformed sidecar JSON).
Exit code is 0 in all three success outcomes; non-zero only when `./pdfs/` is
missing or unreadable. Per-file errors (extraction subprocess failure, malformed
sidecar JSON) log a warning to stderr and are skipped — if every match is bad,
the script falls through to `NO_TEXT`.
"""

from __future__ import annotations
@@ -47,43 +49,51 @@


def find_matches(query_lower: str, basenames: list[str]) -> list[str]:
"""Return PDF basenames whose name (with or without .pdf) appears verbatim
in the lowercased query. Skip stems shorter than MIN_STEM_LEN."""
"""Return PDF basenames whose full name (including the `.pdf` extension)
appears verbatim in the lowercased query. Skip stems shorter than MIN_STEM_LEN.
Requiring the extension avoids false positives on common English words that
happen to appear as PDF stems (e.g. `report.pdf`, `market.pdf`)."""
matches = []
for name in basenames:
stem, ext = os.path.splitext(name)
if ext.lower() != ".pdf" or len(stem) < MIN_STEM_LEN:
continue
if name.lower() in query_lower or stem.lower() in query_lower:
if name.lower() in query_lower:
matches.append(name)
return matches


def extract_pages(retriever_bin: str, matches: list[str]) -> None:
"""Extract each matched PDF; log per-file failures and continue so a single
bad PDF doesn't block remaining matches."""
os.makedirs(EXTRACT_OUT, exist_ok=True)
for m in matches:
subprocess.run(
[
retriever_bin,
"pdf",
"stage",
"page-elements",
f"{PDF_DIR}/{m}",
"--method",
"pdfium",
"--json-output-dir",
EXTRACT_OUT,
"--compact-json",
],
check=True,
)
try:
subprocess.run(
[
retriever_bin,
"pdf",
"stage",
"page-elements",
f"{PDF_DIR}/{m}",
"--method",
"pdfium",
"--json-output-dir",
EXTRACT_OUT,
"--compact-json",
],
check=True,
stdout=subprocess.DEVNULL,
)
except subprocess.CalledProcessError as exc:
print(f"WARN: page-elements failed on {m}: exit {exc.returncode}", file=sys.stderr)


def sidecar_path(pdf_name: str) -> str | None:
stem = os.path.splitext(pdf_name)[0]
candidates = (
f"{EXTRACT_OUT}/{pdf_name}.pdf_extraction.json",
f"{EXTRACT_OUT}/{stem}.pdf.pdf_extraction.json",
f"{EXTRACT_OUT}/{stem}.pdf_extraction.json",
)
for c in candidates:
if os.path.exists(c):
@@ -92,7 +102,12 @@ def sidecar_path(pdf_name: str) -> str | None:


def page_records(sidecar: str) -> list[dict]:
data = json.load(open(sidecar))
try:
with open(sidecar) as fh:
data = json.load(fh)
except json.JSONDecodeError as exc:
print(f"ERROR: malformed JSON in sidecar {sidecar!r}: {exc}", file=sys.stderr)
return []
if isinstance(data, list):
return data
if isinstance(data, dict):
@@ -138,7 +153,11 @@ def main() -> int:
ql = query.lower()
retriever_bin = os.path.join(os.path.dirname(sys.executable), "retriever")

basenames = sorted(p for p in os.listdir(PDF_DIR) if p.lower().endswith(".pdf"))
try:
basenames = sorted(p for p in os.listdir(PDF_DIR) if p.lower().endswith(".pdf"))
except (FileNotFoundError, PermissionError) as exc:
print(f"ERROR: cannot list {PDF_DIR}: {exc}", file=sys.stderr)
return 1
matches = find_matches(ql, basenames)
if not matches:
print("NO_MATCH")
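
The module docstring says pages are ranked by query-token frequency, but the ranking code is outside the hunks shown in this diff. The sketch below is one way that step could look, not the script's actual implementation: the regex tokenizer, the `page` and `text` record keys, and the choice to drop zero-score pages are all assumptions made for illustration.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+")  # assumed tokenizer: lowercase alphanumeric runs


def rank_pages(query: str, pages: list[dict], top_k: int = 10) -> list[tuple[int, int]]:
    """Score each page by how often the query's distinct tokens occur in its text.
    Returns up to `top_k` (page_number, score) pairs, best first; ties go to the
    lower page number. Pages with no matching token are omitted."""
    query_tokens = set(TOKEN_RE.findall(query.lower()))
    scored = []
    for record in pages:
        counts = Counter(TOKEN_RE.findall(record["text"].lower()))
        score = sum(counts[token] for token in query_tokens)
        if score > 0:
            scored.append((record["page"], score))
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


# Hypothetical page records; the real sidecar schema may differ.
sample_pages = [
    {"page": 1, "text": "Revenue grew. Revenue doubled."},
    {"page": 2, "text": "Unrelated appendix"},
    {"page": 3, "text": "Q3 revenue summary"},
]
print(rank_pages("q3 revenue", sample_pages))
```

Counting raw token frequency favors long pages that repeat a term, so a caller might prefer to normalize by page length; the sketch keeps the simpler rule named in the docstring.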
24 changes: 12 additions & 12 deletions skills/nemo-retriever/skill-card.md
@@ -9,7 +9,7 @@ NVIDIA <br>
### License/Terms of Use: <br>
Apache 2.0 <br>
## Use Case: <br>
Developers and engineers who need to search, index, or answer questions across PDF and document collections using RAG and vector search via the retriever CLI. <br>
Developers and engineers who need to search, index, or answer questions over collections of PDFs and documents using a local RAG/vector-search pipeline powered by the retriever CLI. <br>

### Deployment Geography for Use: <br>
Global <br>
@@ -20,22 +20,22 @@ Mitigation: Review and scan skill before deployment. <br>

## Reference(s): <br>
- [NeMo Retriever Library Documentation](https://docs.nvidia.com/nemo/retriever/latest/extraction/overview/) <br>
- [Install Guide](references/install.md) <br>
- [Setup Guide](references/setup.md) <br>
- [Query Workflow](references/query.md) <br>
- [Pitfalls and Recovery](references/pitfalls.md) <br>
- [CLI: ingest](references/cli/ingest.md) <br>
- [CLI: query](references/cli/query.md) <br>
- [CLI reference: retriever ingest](references/cli/ingest.md) <br>
- [CLI reference: retriever query](references/cli/query.md) <br>
- [Installation guide](references/install.md) <br>
- [Query workflow](references/query.md) <br>
- [Setup guide](references/setup.md) <br>
- [Pitfalls and recovery](references/pitfalls.md) <br>


## Skill Output: <br>
**Output Type(s):** [Shell commands, JSON] <br>
**Output Format:** [JSON] <br>
**Output Type(s):** [Shell commands, JSON, Synthesized answers] <br>
**Output Format:** [Markdown with inline bash code blocks and JSON query results] <br>
**Output Parameters:** [1D] <br>
**Other Properties Related to Output:** [None] <br>
**Other Properties Related to Output:** [Query results are JSON arrays sorted by vector distance; final answers are synthesized from retrieved context] <br>

## Evaluation Tasks: <br>
NVSkills-Eval 3-Tier evaluation (external profile); Tier 1 static validation (9 checks, 20 findings), Tier 2 deduplication (2 checks, 1 finding). Tier 3 live agent evaluation not available in this report. <br>
Evaluated through NVSkills-Eval 3-Tier framework (profile: external). Tier 1: 9 static validation checks (21 findings, passed with observations). Tier 2: 2 deduplication checks (0 findings, passed). Overall verdict: PASS. <br>

## Evaluation Metrics Used: <br>
Reported benchmark dimensions: <br>
@@ -48,7 +48,7 @@ Reported benchmark dimensions: <br>


## Skill Version(s): <br>
3fa00d94 (source: git SHA, committed 2026-05-28) <br>
25.3.0-1014-gb7fdbb45 (source: git describe) <br>

## Ethical Considerations: <br>
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal team to ensure this skill meets requirements for the relevant industry and use case and addresses unforeseen product misuse. <br>