diff --git a/skills/nemo-retriever/references/query.md b/skills/nemo-retriever/references/query.md index eeed6a25a6..669c5da4f3 100644 --- a/skills/nemo-retriever/references/query.md +++ b/skills/nemo-retriever/references/query.md @@ -1,12 +1,28 @@ # Query turn — the WHOLE workflow +## Filename fast path — try BEFORE `retriever query` + +If the user's question literally contains a PDF basename from `./pdfs/` **including the `.pdf` extension** (stem ≥6 chars, case-insensitive), skip semantic search. Direct pdfium extraction on the named file is faster and avoids semantic-search misses — the right doc is given, and pages rank by query-token overlap. + +```bash +/bin/python /scripts/filename_fast_path.py "" +``` + +`` is the "Base directory for this skill" announced at load time. Stdout is one of: + +- `NO_MATCH` — no literal basename in the query. Fall through to the standard `retriever query` workflow below. +- `NO_TEXT` — matched file is image-only / pdfium got no text. Also fall through. +- A JSON object with `"ranking"` followed by `---TOP_PAGE_TEXT---` and the top page's raw text — that's the fast-path hit. Write `./output.json` directly: copy the `"ranking"` entries verbatim into `ranked_retrieved`, synthesize `final_answer` from the printed `TOP_PAGE_TEXT` (exact number/name/date; one paragraph; honest "not in the retrieved pages" if the fact genuinely isn't there; no chart hedging needed — pdfium extracts text only). Then STOP. Fast-path total: 2 tool calls (this Bash + Write). Do NOT also call `retriever query` — it's mutually exclusive. 
+ +## Standard path: `retriever query` + ```bash /bin/retriever query "" --top-k 10 --embed-model-name nvidia/llama-nemotron-embed-1b-v2 --rerank \ | tee /tmp/hits.json \ - | /bin/python -c "import json,sys; [print(f'rank={h.get(\"rank\",0)} page={h[\"page_number\"]} pdf={h[\"pdf_basename\"]} type={h.get(\"metadata\",{}).get(\"type\",\"?\")} text={h[\"text\"][:200]}') for h in json.load(sys.stdin)]" + | /bin/python -c "import json,sys; [print(f'rank={h.get(\"rank\",0)} page={h[\"page_number\"]} pdf={h[\"pdf_basename\"]} type={h.get(\"metadata\",{}).get(\"type\",\"?\")}') for h in json.load(sys.stdin)]" ``` -Run that **exactly** as a single pipeline — do not split it into `HITS=$(...)` + `echo "$HITS" | /bin/python -c ...` (the assignment swallows stdout, the pipe sees nothing, you waste 3 bash calls recovering). Stdout is clean JSON (model-init logs are silenced at the CLI layer); leave stderr unredirected so real errors surface on the first call. The full JSON sits at `/tmp/hits.json` if you need to re-parse it (`/bin/python -c "import json; print(json.load(open('/tmp/hits.json'))[6])"` for the rank-7 hit), but in the common case the summary above is all you need. +Run that **exactly** as a single pipeline — do not split it into `HITS=$(...)` + `echo "$HITS" | /bin/python -c ...` (the assignment swallows stdout, the pipe sees nothing, you waste 3 bash calls recovering). Stdout is clean JSON (model-init logs are silenced at the CLI layer); leave stderr unredirected so real errors surface on the first call. The summary above lists only rank/page/pdf/type — to read hit text for synthesizing `final_answer`, parse `/tmp/hits.json` directly. The top hit's text is one one-liner away: `/bin/python -c "import json; print(json.load(open('/tmp/hits.json'))[0]['text'])"` (or `[i]` for the rank-(i+1) hit). Fetch only what you need — pulling all 10 hits' text into context inflates cached prompt size on every subsequent turn. That's your FIRST tool call on every query turn. 
Do not Read, Glob, Grep, or list PDFs before this — those duplicate what `retriever query` already did. @@ -14,6 +30,18 @@ That's your FIRST tool call on every query turn. Do not Read, Glob, Grep, or lis Each hit has: `text`, `pdf_basename`, `page_number` (int, **1-indexed**: the first page of a PDF is page `1`), `pdf_page` (string composite key `"_"` — not a number, don't use it as one), `_distance`, and `metadata` (JSON with `type` ∈ `text|table|chart|image`). +## Keyword/regex search across the corpus + +If you need exact text matches that semantic `retriever query` may have skipped — e.g. "find every mention of 'mRNA-1273' across all PDFs" — use: + +```bash +/bin/python /scripts/grep_corpus.py "" [--max-hits 50] +``` + +It scans the LanceDB table the retriever already built — no PDF re-extraction. Output is `:p:: ......` per hit; `NO_MATCH` if nothing. Counts against the same "one optional follow-up call" budget as the targeted text-extract (mutually exclusive — pick one). + +Don't reach for `pdftotext`, `pdftohtml`, or `pdfgrep` — they're system tools that aren't guaranteed installed on the user's machine. The retriever venv bundles pdfium and `lancedb`; `grep_corpus.py` and `retriever pdf stage page-elements --method pdfium` cover the same use cases without that dependency. + ## Write `./output.json` directly from the hits - `final_answer`: synthesize from the top hits' `text`. Include the exact number / name / date / row / column the question asks for, plus the source PDF and 0-indexed page. One paragraph. No restating the question, no hedging caveats. If the chunks talk *around* the fact but don't state it, run ONE `/bin/retriever pdf stage page-elements ./pdfs --method pdfium --json-output-dir /tmp/pdf_text --compact-json` and `Read` `/tmp/pdf_text/.pdf.pdf_extraction.json` for the rank-1 page (or rank-2 if rank-1 is metadata) — that almost always surfaces the exact figure. Then synthesize. 
**If after both calls the asked-for fact still isn't in the evidence, write `final_answer` that says so explicitly** — e.g. "The retrieved pages do not state [X] for [entity]; the closest content is [Y]." Do NOT invent, extrapolate, or generate plausible-sounding content from adjacent material. A confidently-wrong answer scores worse than an honest "not in the retrieved pages". diff --git a/skills/nemo-retriever/scripts/filename_fast_path.py b/skills/nemo-retriever/scripts/filename_fast_path.py new file mode 100644 index 0000000000..33243912b3 --- /dev/null +++ b/skills/nemo-retriever/scripts/filename_fast_path.py @@ -0,0 +1,173 @@ +"""Query-turn filename fast path for the nemo-retriever skill. + +Reads `./pdfs/` from the current working directory. If the query string +literally contains any PDF basename **including the `.pdf` extension** +(stem ≥6 chars, case-insensitive), runs `retriever pdf stage page-elements` +on each matched file via pdfium, ranks pages by query-token frequency, +and emits a top-10 ranking + the top page's raw text. + +Invoked from SKILL.md as: + /bin/python /scripts/filename_fast_path.py "$QUERY" + +The retriever binary is resolved from sys.executable's directory, so the +script is portable across venvs. + +Stdout protocol (exactly one of): +- `NO_MATCH\n` — no PDF basename in the query. +- `NO_TEXT\n` — matches found but extraction produced no + text on any page (image-only PDFs). +- `\n---TOP_PAGE_TEXT---\n` — JSON with a "ranking" list of + {doc_id, page_number, rank} (1-indexed + pages, up to 10), followed by the top- + ranked page's raw text (first 4000 chars). + +Exit code is 0 in all three success outcomes; non-zero only on hard errors +(missing ./pdfs, page-elements subprocess failure, malformed sidecar JSON). 
+""" + +from __future__ import annotations + +import json +import os +import re +import subprocess +import sys + +PDF_DIR = "./pdfs" +EXTRACT_OUT = "/tmp/pdf_text" +MIN_STEM_LEN = 6 +TOP_K = 10 +TOP_PAGE_TEXT_CHARS = 4000 + +STOPWORDS = frozenset( + "the a an of in on for to and or is are was were what which how when " + "where who why this that these those with by from as at be it its do " + "does did please could would should tell me you i we us our my".split() +) + + +def find_matches(query_lower: str, basenames: list[str]) -> list[str]: + """Return PDF basenames whose full name (including the `.pdf` extension) + appears verbatim in the lowercased query. Skip stems shorter than MIN_STEM_LEN. + Requiring the extension avoids false positives on common English words that + happen to appear as PDF stems (e.g. `report.pdf`, `market.pdf`).""" + matches = [] + for name in basenames: + stem, ext = os.path.splitext(name) + if ext.lower() != ".pdf" or len(stem) < MIN_STEM_LEN: + continue + if name.lower() in query_lower: + matches.append(name) + return matches + + +def extract_pages(retriever_bin: str, matches: list[str]) -> None: + """Extract each matched PDF; log per-file failures and continue so a single + bad PDF doesn't block remaining matches.""" + os.makedirs(EXTRACT_OUT, exist_ok=True) + for m in matches: + try: + subprocess.run( + [ + retriever_bin, + "pdf", + "stage", + "page-elements", + f"{PDF_DIR}/{m}", + "--method", + "pdfium", + "--json-output-dir", + EXTRACT_OUT, + "--compact-json", + ], + check=True, + ) + except subprocess.CalledProcessError as exc: + print(f"WARN: page-elements failed on {m}: exit {exc.returncode}", file=sys.stderr) + + +def sidecar_path(pdf_name: str) -> str | None: + stem = os.path.splitext(pdf_name)[0] + candidates = ( + f"{EXTRACT_OUT}/{pdf_name}.pdf_extraction.json", + f"{EXTRACT_OUT}/{stem}.pdf_extraction.json", + ) + for c in candidates: + if os.path.exists(c): + return c + return None + + +def page_records(sidecar: str) -> 
list[dict]: + with open(sidecar) as fh: + data = json.load(fh) + if isinstance(data, list): + return data + if isinstance(data, dict): + return data.get("pages") or data.get("documents") or [] + return [] + + +def page_text(rec: dict) -> str: + txt = rec.get("text") or rec.get("content") or "" + if not txt and isinstance(rec.get("primitives"), list): + txt = " ".join(p.get("text", "") for p in rec["primitives"] if isinstance(p, dict)) + return txt or "" + + +def tokenize(query: str) -> list[str]: + return [t for t in re.split(r"[^a-z0-9]+", query.lower()) if t and t not in STOPWORDS and len(t) > 2] + + +def rank_pages(matches: list[str], toks: list[str]) -> list[tuple[int, int, str, str]]: + """Return list of (score, page_number, doc_stem, text) sorted by + descending score, ascending page number.""" + scored = [] + for m in matches: + sidecar = sidecar_path(m) + if sidecar is None: + continue + stem = os.path.splitext(m)[0] + for rec in page_records(sidecar): + pn = rec.get("page_number") or rec.get("page") or 0 + txt = page_text(rec) + score = sum(txt.lower().count(t) for t in toks) + if score > 0: + scored.append((score, pn, stem, txt)) + scored.sort(key=lambda r: (-r[0], r[1])) + return scored + + +def main() -> int: + if len(sys.argv) != 2: + print(f"usage: {sys.argv[0]} ", file=sys.stderr) + return 2 + query = sys.argv[1] + ql = query.lower() + retriever_bin = os.path.join(os.path.dirname(sys.executable), "retriever") + + try: + basenames = sorted(p for p in os.listdir(PDF_DIR) if p.lower().endswith(".pdf")) + except (FileNotFoundError, PermissionError) as exc: + print(f"ERROR: cannot list {PDF_DIR}: {exc}", file=sys.stderr) + return 1 + matches = find_matches(ql, basenames) + if not matches: + print("NO_MATCH") + return 0 + + extract_pages(retriever_bin, matches) + scored = rank_pages(matches, tokenize(ql)) + if not scored: + print("NO_TEXT") + return 0 + + ranking = [{"doc_id": s[2], "page_number": s[1], "rank": i + 1} for i, s in enumerate(scored[:TOP_K])] 
+ print(json.dumps({"ranking": ranking})) + print("---TOP_PAGE_TEXT---") + print(scored[0][3][:TOP_PAGE_TEXT_CHARS]) + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/skills/nemo-retriever/scripts/grep_corpus.py b/skills/nemo-retriever/scripts/grep_corpus.py new file mode 100644 index 0000000000..1471b6e4c0 --- /dev/null +++ b/skills/nemo-retriever/scripts/grep_corpus.py @@ -0,0 +1,99 @@ +"""Case-insensitive keyword/regex search over the corpus via the LanceDB index. + +This script scans the already-built LanceDB table, so it returns matches +across every chunk `retriever ingest` indexed (text, table, chart, image +transcriptions where present) without re-reading any PDF. + +Usage: + /bin/python /scripts/grep_corpus.py \\ + [--max-hits 50] [--lancedb-uri ./lancedb] [--table-name nemo-retriever] + +`pattern` is a Python regex, case-insensitive. For a literal-string search, +just write the string — most identifier characters (`.`, `-`, `_`, digits, +letters) are unambiguous unless you include regex metacharacters +(`(`, `|`, `*`, `?`, `[`, `]`, `\\`, `^`, `$`). + +Output (one line per hit; sorted by pdf_basename then page_number): + :p:: ...... + +Prints `NO_MATCH` on zero hits. Caps at `--max-hits` to keep the turn output +bounded; raise it if you really want more. +""" + +from __future__ import annotations + +import argparse +import json +import re +import sys + + +def main() -> int: + ap = argparse.ArgumentParser() + ap.add_argument("pattern", help="Python regex (case-insensitive)") + ap.add_argument("--max-hits", type=int, default=50) + ap.add_argument("--snippet-pad", type=int, default=60) + ap.add_argument("--lancedb-uri", default="./lancedb") + ap.add_argument("--table-name", default="nemo-retriever") + args = ap.parse_args() + + try: + import lancedb + except ImportError: + print("ERROR: lancedb not importable. 
Run with /bin/python.", file=sys.stderr) + return 1 + + try: + pat = re.compile(args.pattern, re.IGNORECASE) + except re.error as e: + print(f"ERROR: bad regex {args.pattern!r}: {e}", file=sys.stderr) + return 2 + + try: + db = lancedb.connect(args.lancedb_uri) + tbl = db.open_table(args.table_name) + except Exception as e: + print(f"ERROR: can't open lancedb table {args.table_name!r} at " f"{args.lancedb_uri!r}: {e}", file=sys.stderr) + return 1 + + rows = tbl.to_pandas() + if "text" not in rows.columns: + print(f"ERROR: lancedb table has no 'text' column. columns={list(rows.columns)}", file=sys.stderr) + return 1 + + hits = [] + for row in rows.itertuples(index=False): + text = getattr(row, "text", "") or "" + m = pat.search(text) + if not m: + continue + pdf = getattr(row, "pdf_basename", "?") + page = getattr(row, "page_number", "?") + meta_raw = getattr(row, "metadata", "") or "" + if isinstance(meta_raw, str): + try: + meta = json.loads(meta_raw) if meta_raw else {} + except json.JSONDecodeError: + meta = {} + elif isinstance(meta_raw, dict): + meta = meta_raw + else: + meta = {} + type_ = meta.get("type", "?") + start = max(0, m.start() - args.snippet_pad) + end = min(len(text), m.end() + args.snippet_pad) + snippet = text[start:end].replace("\n", " ") + hits.append((pdf, page, type_, snippet)) + + hits.sort(key=lambda h: (str(h[0]), int(h[1]) if isinstance(h[1], (int, float)) else 0)) + for pdf, page, type_, snippet in hits[: args.max_hits]: + print(f"{pdf}:p{page}:{type_}: ...{snippet}...") + if not hits: + print("NO_MATCH") + elif len(hits) > args.max_hits: + print(f"... ({len(hits) - args.max_hits} more matches truncated; " f"raise --max-hits to see them)") + return 0 + + +if __name__ == "__main__": + sys.exit(main())
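
Reviewer note: the match-and-snippet logic in `grep_corpus.py` can be tried without a LanceDB table. Below is a minimal sketch, assuming the rows are plain in-memory dicts instead of a `to_pandas()` frame. The function name `grep_rows` and the flat `type` key are hypothetical simplifications (the script itself reads `type` out of the `metadata` column). The padding, newline flattening, sort order, and output format follow the script as written in this diff.

```python
import re


def grep_rows(rows, pattern, snippet_pad=60, max_hits=50):
    """Return formatted hit lines for rows whose `text` matches `pattern`.

    `rows` is a list of dicts (an assumption for this sketch; the real
    script iterates over a LanceDB table loaded into pandas).
    """
    pat = re.compile(pattern, re.IGNORECASE)
    hits = []
    for row in rows:
        text = row.get("text") or ""
        m = pat.search(text)
        if not m:
            continue
        # Keep `snippet_pad` characters of context on each side of the match.
        start = max(0, m.start() - snippet_pad)
        end = min(len(text), m.end() + snippet_pad)
        snippet = text[start:end].replace("\n", " ")
        hits.append(
            (row.get("pdf_basename", "?"), row.get("page_number", "?"), row.get("type", "?"), snippet)
        )
    # Sort by PDF name, then page number; non-integer pages sort first.
    hits.sort(key=lambda h: (str(h[0]), h[1] if isinstance(h[1], int) else 0))
    return [f"{pdf}:p{page}:{type_}: ...{snippet}..." for pdf, page, type_, snippet in hits[:max_hits]]


if __name__ == "__main__":
    demo_rows = [
        {"text": "Dose of mRNA-1273 was given", "pdf_basename": "b.pdf", "page_number": 2, "type": "text"},
        {"text": "MRNA-1273\nfollow-up", "pdf_basename": "a.pdf", "page_number": 1, "type": "table"},
    ]
    for line in grep_rows(demo_rows, "mrna-1273", snippet_pad=5):
        print(line)
```

Only the first match per chunk is reported, as in the script, so a chunk that mentions the pattern several times still produces one output line.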