32 changes: 30 additions & 2 deletions skills/nemo-retriever/references/query.md
@@ -1,19 +1,47 @@
# Query turn — the WHOLE workflow

## Filename fast path — try BEFORE `retriever query`

If the user's question literally contains a PDF basename from `./pdfs/` **including the `.pdf` extension** (stem ≥6 chars, case-insensitive), skip semantic search. Direct pdfium extraction on the named file is faster and avoids semantic-search misses — the right doc is given, and pages rank by query-token overlap.

```bash
<RETRIEVER_VENV>/bin/python <skill_dir>/scripts/filename_fast_path.py "<the user's question>"
```

`<skill_dir>` is the "Base directory for this skill" announced at load time. Stdout is one of:

- `NO_MATCH` — no literal basename in the query. Fall through to the standard `retriever query` workflow below.
- `NO_TEXT` — matched file is image-only / pdfium got no text. Also fall through.
- A JSON object with `"ranking"` followed by `---TOP_PAGE_TEXT---` and the top page's raw text — that's the fast-path hit. Write `./output.json` directly: copy the `"ranking"` entries verbatim into `ranked_retrieved`, synthesize `final_answer` from the printed `TOP_PAGE_TEXT` (exact number/name/date; one paragraph; honest "not in the retrieved pages" if the fact genuinely isn't there; no chart hedging needed — pdfium extracts text only). Then STOP. Fast-path total: 2 tool calls (this Bash + Write). Do NOT also call `retriever query` — it's mutually exclusive.
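
The three stdout shapes above can be told apart with a few lines of Python. This is a minimal sketch, not part of the skill: `parse_fast_path_stdout` and `build_output` are hypothetical helper names, and the sample document and text are invented for illustration. Only the two sentinels, the `---TOP_PAGE_TEXT---` separator, and the `ranking` / `ranked_retrieved` / `final_answer` keys come from the protocol described here; any other fields `./output.json` may need are not covered.

```python
import json

SEPARATOR = "---TOP_PAGE_TEXT---"


def parse_fast_path_stdout(stdout: str):
    """Return (status, ranking, top_page_text) for one run of the script.

    status is "NO_MATCH", "NO_TEXT", or "HIT"; ranking and top_page_text
    are None unless status is "HIT".
    """
    stripped = stdout.strip()
    if stripped in ("NO_MATCH", "NO_TEXT"):
        return stripped, None, None
    # A hit is one JSON line, the separator line, then the raw page text.
    head, sep, text = stdout.partition("\n" + SEPARATOR + "\n")
    if not sep:
        raise ValueError("unrecognized fast-path output")
    return "HIT", json.loads(head)["ranking"], text.rstrip("\n")


def build_output(ranking, final_answer: str) -> dict:
    """Assemble an output payload; ranking entries are copied verbatim."""
    return {"ranked_retrieved": list(ranking), "final_answer": final_answer}


# Hypothetical sample output, shaped like the fast-path hit described above.
sample = (
    '{"ranking": [{"doc_id": "annual_report", "page_number": 3, "rank": 1}]}\n'
    "---TOP_PAGE_TEXT---\n"
    "Revenue was 42 million.\n"
)
status, ranking, text = parse_fast_path_stdout(sample)
print(status, ranking[0]["page_number"], text)
```

Splitting on the separator line rather than on the first newline keeps the parse robust if the page text itself contains JSON-looking lines.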

## Standard path: `retriever query`

```bash
<RETRIEVER_VENV>/bin/retriever query "<the user's question>" --top-k 10 --embed-model-name nvidia/llama-nemotron-embed-1b-v2 --rerank \
| tee /tmp/hits.json \
| <RETRIEVER_VENV>/bin/python -c "import json,sys; [print(f'rank={h.get(\"rank\",0)} page={h[\"page_number\"]} pdf={h[\"pdf_basename\"]} type={h.get(\"metadata\",{}).get(\"type\",\"?\")}') for h in json.load(sys.stdin)]"
```

Run that **exactly** as a single pipeline. Do not split it into `HITS=$(...)` + `echo "$HITS" | <RETRIEVER_VENV>/bin/python -c ...` (the assignment swallows stdout, the pipe sees nothing, you waste 3 bash calls recovering). Stdout is clean JSON (model-init logs are silenced at the CLI layer); leave stderr unredirected so real errors surface on the first call. The summary above lists only rank/page/pdf/type; to read hit text for synthesizing `final_answer`, parse `/tmp/hits.json` directly. The top hit's text is a one-liner away: `<RETRIEVER_VENV>/bin/python -c "import json; print(json.load(open('/tmp/hits.json'))[0]['text'])"` (or `[i]` for the rank-(i+1) hit). Fetch only what you need: pulling all 10 hits' text into context inflates cached prompt size on every subsequent turn.

That's your FIRST tool call on every query turn. Do not Read, Glob, Grep, or list PDFs before this — those duplicate what `retriever query` already did.

**No narration between tool calls.** Do not write "Let me search…", "I'll now analyze…", "The retriever returned…", or any other commentary. Every assistant token you emit between the `retriever query` Bash call and the `Write` of `./output.json` becomes input tokens (and cached input tokens) for every subsequent turn in this session — quadratic cost. Go straight from reading the summary to writing the JSON file. The only assistant text in a query turn should be the tool calls themselves.
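
A rough back-of-the-envelope model of that quadratic growth, under a simplifying assumption that is not measured here: every turn emits the same number of narration tokens, and each later turn re-reads all earlier narration. `narration_overhead` is a hypothetical helper used only for this illustration.

```python
def narration_overhead(tokens_per_turn: int, turns: int) -> int:
    """Total extra input tokens caused by narration over a session.

    Turn k re-reads the narration of the (k - 1) turns before it, so the
    total is tokens_per_turn * turns * (turns - 1) / 2.
    """
    return sum((k - 1) * tokens_per_turn for k in range(1, turns + 1))


# With 50 narration tokens per turn, doubling the session length roughly
# quadruples the overhead.
for turns in (10, 20, 40):
    print(turns, narration_overhead(50, turns))
```

Under this model a 40-turn session pays for 39,000 re-read narration tokens against 2,250 for a 10-turn session.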

Each hit has: `text`, `pdf_basename`, `page_number` (int, **1-indexed**: the first page of a PDF is page `1`), `pdf_page` (string composite key `"<basename>_<page_number>"` — not a number, don't use it as one), `_distance`, and `metadata` (JSON with `type` ∈ `text|table|chart|image`).
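
The field list above can be exercised with a short sketch. `summarize_hit` and `split_pdf_page` are hypothetical helpers, and the sample hit is invented; the sketch also assumes, defensively, that `metadata` might arrive either as a dict or as a JSON string.

```python
from __future__ import annotations

import json


def summarize_hit(hit: dict) -> str:
    """One-line summary of a hit that leaves its text out of context."""
    meta = hit.get("metadata") or {}
    if isinstance(meta, str):  # assumption: metadata may be a JSON string
        try:
            meta = json.loads(meta)
        except json.JSONDecodeError:
            meta = {}
    return f"page={hit['page_number']} pdf={hit['pdf_basename']} type={meta.get('type', '?')}"


def split_pdf_page(pdf_page: str) -> tuple[str, int]:
    """Split the composite key "<basename>_<page_number>" on its LAST
    underscore, because the basename itself may contain underscores."""
    basename, _, page = pdf_page.rpartition("_")
    return basename, int(page)


# Hypothetical hit, shaped like the fields listed above.
hit = {
    "text": "Total revenue rose 12%.",
    "pdf_basename": "q3_results.pdf",
    "page_number": 1,  # 1-indexed: first page of the PDF
    "pdf_page": "q3_results.pdf_1",
    "_distance": 0.21,
    "metadata": {"type": "table"},
}
print(summarize_hit(hit))
print(split_pdf_page(hit["pdf_page"]))
```

In practice `page_number` is already available as an int, so splitting `pdf_page` is only needed when the composite key is all you have.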

## Keyword/regex search across the corpus

If you need exact text matches that semantic `retriever query` may have skipped — e.g. "find every mention of 'mRNA-1273' across all PDFs" — use:

```bash
<RETRIEVER_VENV>/bin/python <skill_dir>/scripts/grep_corpus.py "<regex>" [--max-hits 50]
```

It scans the LanceDB table the retriever already built — no PDF re-extraction. Output is `<pdf>:p<page>:<type>: ...<snippet>...` per hit; `NO_MATCH` if nothing. Counts against the same "one optional follow-up call" budget as the targeted text-extract (mutually exclusive — pick one).
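
If the per-hit lines need to be consumed programmatically, a sketch like the following could parse them. `parse_grep_output` is a hypothetical helper and the sample line is invented; the pattern mirrors the documented `<pdf>:p<page>:<type>: ...<snippet>...` shape and skips anything else, including `NO_MATCH`, the truncation notice, and any row whose page is not a number.

```python
import re

# Mirrors the documented shape "<pdf>:p<page>:<type>: ...<snippet>...".
HIT_LINE = re.compile(
    r"^(?P<pdf>.+?):p(?P<page>\d+):(?P<type>[^:]*): \.\.\.(?P<snippet>.*)\.\.\.$"
)


def parse_grep_output(stdout: str) -> list:
    """Parse grep_corpus.py stdout into dicts; returns [] for NO_MATCH."""
    hits = []
    for line in stdout.split("\n"):
        m = HIT_LINE.match(line)
        if m:
            hits.append(
                {
                    "pdf": m["pdf"],
                    "page": int(m["page"]),
                    "type": m["type"],
                    "snippet": m["snippet"],
                }
            )
    return hits


# Hypothetical output: one hit followed by the truncation notice.
sample = (
    "trial.pdf:p4:text: ...dose of mRNA-1273 was...\n"
    "... (2 more matches truncated; raise --max-hits to see them)\n"
)
print(parse_grep_output(sample))
```

Because the search pattern is a regex, passing a string through `re.escape` first is one way to get a strictly literal match.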

Don't reach for `pdftotext`, `pdftohtml`, or `pdfgrep` — they're system tools that aren't guaranteed installed on the user's machine. The retriever venv bundles pdfium and `lancedb`; `grep_corpus.py` and `retriever pdf stage page-elements --method pdfium` cover the same use cases without that dependency.

## Write `./output.json` directly from the hits

- `final_answer`: synthesize from the top hits' `text`. Include the exact number / name / date / row / column the question asks for, plus the source PDF and 0-indexed page. One paragraph. No restating the question, no hedging caveats. If the chunks talk *around* the fact but don't state it, run ONE `<RETRIEVER_VENV>/bin/retriever pdf stage page-elements ./pdfs --method pdfium --json-output-dir /tmp/pdf_text --compact-json` and `Read` `/tmp/pdf_text/<top_pdf>.pdf.pdf_extraction.json` for the rank-1 page (or rank-2 if rank-1 is metadata) — that almost always surfaces the exact figure. Then synthesize. **If after both calls the asked-for fact still isn't in the evidence, write `final_answer` that says so explicitly** — e.g. "The retrieved pages do not state [X] for [entity]; the closest content is [Y]." Do NOT invent, extrapolate, or generate plausible-sounding content from adjacent material. A confidently-wrong answer scores worse than an honest "not in the retrieved pages".
173 changes: 173 additions & 0 deletions skills/nemo-retriever/scripts/filename_fast_path.py
@@ -0,0 +1,173 @@
"""Query-turn filename fast path for the nemo-retriever skill.

Reads `./pdfs/` from the current working directory. If the query string
literally contains any PDF basename **including the `.pdf` extension**
(stem ≥6 chars, case-insensitive), runs `retriever pdf stage page-elements`
on each matched file via pdfium, ranks pages by query-token frequency,
and emits a top-10 ranking + the top page's raw text.

Invoked from SKILL.md as:
<RETRIEVER_VENV>/bin/python <skill_dir>/scripts/filename_fast_path.py "$QUERY"

The retriever binary is resolved from sys.executable's directory, so the
script is portable across venvs.

Stdout protocol (exactly one of):
  - `NO_MATCH\n`: no PDF basename in the query.
  - `NO_TEXT\n`: matches found but extraction produced no text on any page
    (image-only PDFs).
  - `<JSON>\n---TOP_PAGE_TEXT---\n<text>`: JSON with a "ranking" list of
    {doc_id, page_number, rank} (1-indexed pages, up to 10), followed by the
    top-ranked page's raw text (first 4000 chars).

Exit code is 0 in all three success outcomes. A page-elements failure on a
single file is logged to stderr and that file is skipped; the exit code is
non-zero only on hard errors (bad usage, missing ./pdfs, malformed sidecar JSON).
"""

from __future__ import annotations

import json
import os
import re
import subprocess
import sys

PDF_DIR = "./pdfs"
EXTRACT_OUT = "/tmp/pdf_text"
MIN_STEM_LEN = 6
TOP_K = 10
TOP_PAGE_TEXT_CHARS = 4000

STOPWORDS = frozenset(
    "the a an of in on for to and or is are was were what which how when "
    "where who why this that these those with by from as at be it its do "
    "does did please could would should tell me you i we us our my".split()
)


def find_matches(query_lower: str, basenames: list[str]) -> list[str]:
    """Return PDF basenames whose full name (including the `.pdf` extension)
    appears verbatim in the lowercased query. Skip stems shorter than MIN_STEM_LEN.
    Requiring the extension avoids false positives on common English words that
    happen to appear as PDF stems (e.g. `report.pdf`, `market.pdf`)."""
    matches = []
    for name in basenames:
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".pdf" or len(stem) < MIN_STEM_LEN:
            continue
        if name.lower() in query_lower:
            matches.append(name)
    return matches


def extract_pages(retriever_bin: str, matches: list[str]) -> None:
    """Extract each matched PDF; log per-file failures and continue so a single
    bad PDF doesn't block remaining matches."""
    os.makedirs(EXTRACT_OUT, exist_ok=True)
    for m in matches:
        try:
            subprocess.run(
                [
                    retriever_bin,
                    "pdf",
                    "stage",
                    "page-elements",
                    f"{PDF_DIR}/{m}",
                    "--method",
                    "pdfium",
                    "--json-output-dir",
                    EXTRACT_OUT,
                    "--compact-json",
                ],
                check=True,
            )
        except subprocess.CalledProcessError as exc:
            print(f"WARN: page-elements failed on {m}: exit {exc.returncode}", file=sys.stderr)


def sidecar_path(pdf_name: str) -> str | None:
    stem = os.path.splitext(pdf_name)[0]
    candidates = (
        f"{EXTRACT_OUT}/{pdf_name}.pdf_extraction.json",
        f"{EXTRACT_OUT}/{stem}.pdf_extraction.json",
    )
    for c in candidates:
        if os.path.exists(c):
            return c
    return None


def page_records(sidecar: str) -> list[dict]:
    with open(sidecar) as fh:
        data = json.load(fh)
    if isinstance(data, list):
        return data
    if isinstance(data, dict):
        return data.get("pages") or data.get("documents") or []
    return []


def page_text(rec: dict) -> str:
    txt = rec.get("text") or rec.get("content") or ""
    if not txt and isinstance(rec.get("primitives"), list):
        txt = " ".join(p.get("text", "") for p in rec["primitives"] if isinstance(p, dict))
    return txt or ""


def tokenize(query: str) -> list[str]:
    return [t for t in re.split(r"[^a-z0-9]+", query.lower()) if t and t not in STOPWORDS and len(t) > 2]


def rank_pages(matches: list[str], toks: list[str]) -> list[tuple[int, int, str, str]]:
    """Return list of (score, page_number, doc_stem, text) sorted by
    descending score, ascending page number."""
    scored = []
    for m in matches:
        sidecar = sidecar_path(m)
        if sidecar is None:
            continue
        stem = os.path.splitext(m)[0]
        for rec in page_records(sidecar):
            pn = rec.get("page_number") or rec.get("page") or 0
            txt = page_text(rec)
            score = sum(txt.lower().count(t) for t in toks)
            if score > 0:
                scored.append((score, pn, stem, txt))
    scored.sort(key=lambda r: (-r[0], r[1]))
    return scored


def main() -> int:
    if len(sys.argv) != 2:
        print(f"usage: {sys.argv[0]} <query>", file=sys.stderr)
        return 2
    query = sys.argv[1]
    ql = query.lower()
    retriever_bin = os.path.join(os.path.dirname(sys.executable), "retriever")

    try:
        basenames = sorted(p for p in os.listdir(PDF_DIR) if p.lower().endswith(".pdf"))
    except (FileNotFoundError, PermissionError) as exc:
        print(f"ERROR: cannot list {PDF_DIR}: {exc}", file=sys.stderr)
        return 1
    matches = find_matches(ql, basenames)
    if not matches:
        print("NO_MATCH")
        return 0

    extract_pages(retriever_bin, matches)
    scored = rank_pages(matches, tokenize(ql))
    if not scored:
        print("NO_TEXT")
        return 0

    ranking = [{"doc_id": s[2], "page_number": s[1], "rank": i + 1} for i, s in enumerate(scored[:TOP_K])]
    print(json.dumps({"ranking": ranking}))
    print("---TOP_PAGE_TEXT---")
    print(scored[0][3][:TOP_PAGE_TEXT_CHARS])
    return 0


if __name__ == "__main__":
    sys.exit(main())
99 changes: 99 additions & 0 deletions skills/nemo-retriever/scripts/grep_corpus.py
@@ -0,0 +1,99 @@
"""Case-insensitive keyword/regex search over the corpus via the LanceDB index.

This script scans the already-built LanceDB table, so it returns matches
across every chunk `retriever ingest` indexed (text, table, chart, image
transcriptions where present) without re-reading any PDF.

Usage:
    <RETRIEVER_VENV>/bin/python <skill_dir>/scripts/grep_corpus.py <pattern> \\
        [--max-hits 50] [--lancedb-uri ./lancedb] [--table-name nemo-retriever]

`pattern` is a Python regex, case-insensitive. For a literal-string search,
just write the string: letters, digits, `-`, and `_` match themselves. Escape
regex metacharacters (`.`, `(`, `|`, `*`, `+`, `?`, `[`, `]`, `\\`, `^`, `$`)
with a backslash when you mean them literally; an unescaped `.` matches any
character, so `v1.2` also matches `v1x2`.

Output (one line per hit; sorted by pdf_basename then page_number):
    <pdf_basename>:p<page_number>:<type>: ...<snippet around match>...

Prints `NO_MATCH` on zero hits. Caps at `--max-hits` to keep the turn output
bounded; raise it if you really want more.
"""

from __future__ import annotations

import argparse
import json
import re
import sys


def main() -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("pattern", help="Python regex (case-insensitive)")
    ap.add_argument("--max-hits", type=int, default=50)
    ap.add_argument("--snippet-pad", type=int, default=60)
    ap.add_argument("--lancedb-uri", default="./lancedb")
    ap.add_argument("--table-name", default="nemo-retriever")
    args = ap.parse_args()

    try:
        import lancedb
    except ImportError:
        print("ERROR: lancedb not importable. Run with <RETRIEVER_VENV>/bin/python.", file=sys.stderr)
        return 1

    try:
        pat = re.compile(args.pattern, re.IGNORECASE)
    except re.error as e:
        print(f"ERROR: bad regex {args.pattern!r}: {e}", file=sys.stderr)
        return 2

    try:
        db = lancedb.connect(args.lancedb_uri)
        tbl = db.open_table(args.table_name)
    except Exception as e:
        print(f"ERROR: can't open lancedb table {args.table_name!r} at " f"{args.lancedb_uri!r}: {e}", file=sys.stderr)
        return 1

    rows = tbl.to_pandas()
    if "text" not in rows.columns:
        print(f"ERROR: lancedb table has no 'text' column. columns={list(rows.columns)}", file=sys.stderr)
        return 1

    hits = []
    for row in rows.itertuples(index=False):
        text = getattr(row, "text", "") or ""
        m = pat.search(text)
        if not m:
            continue
        pdf = getattr(row, "pdf_basename", "?")
        page = getattr(row, "page_number", "?")
        meta_raw = getattr(row, "metadata", "") or ""
        if isinstance(meta_raw, str):
            try:
                meta = json.loads(meta_raw) if meta_raw else {}
            except json.JSONDecodeError:
                meta = {}
        elif isinstance(meta_raw, dict):
            meta = meta_raw
        else:
            meta = {}
        type_ = meta.get("type", "?")
        start = max(0, m.start() - args.snippet_pad)
        end = min(len(text), m.end() + args.snippet_pad)
        snippet = text[start:end].replace("\n", " ")
        hits.append((pdf, page, type_, snippet))

    hits.sort(key=lambda h: (str(h[0]), int(h[1]) if isinstance(h[1], (int, float)) else 0))
    for pdf, page, type_, snippet in hits[: args.max_hits]:
        print(f"{pdf}:p{page}:{type_}: ...{snippet}...")
    if not hits:
        print("NO_MATCH")
    elif len(hits) > args.max_hits:
        print(f"... ({len(hits) - args.max_hits} more matches truncated; " f"raise --max-hits to see them)")
    return 0


if __name__ == "__main__":
    sys.exit(main())