Semantic Document Search

A lightweight semantic search pipeline for querying technical documentation using natural language. Demonstrated with the OpenRTB 2.6 programmatic advertising specification.

Built with Python · Milvus Lite · sentence-transformers · all-MiniLM-L6-v2 · pypdf

Why This Matters

Dense technical documentation is everywhere (protocol specs, compliance policies, API references, onboarding guides), and keyword search often fails on it: the right answer can sit in a paragraph that shares no words with your query.

This pipeline lets anyone ask questions in plain English and get back the exact passages that answer them:

Solutions Engineers answering client integration questions without reading 100-page specs
Trust & Safety teams searching policy documentation for specific scenarios or edge cases
AdTech teams navigating protocol specs like OpenRTB, VAST, or SKAN
Any team with internal documentation too dense to Ctrl+F effectively

No external API, no server, no infrastructure. Runs entirely on a laptop.

How It Works

PDF → chunk (300 chars, 50 overlap) → embed (all-MiniLM-L6-v2) → Milvus Lite → cosine search

Ingest – pypdf extracts text page by page; text is split into overlapping passages.
Embed – pymilvus's built-in SentenceTransformerEmbeddingFunction encodes every passage as a dense vector (no API key needed; model downloads once).
Store – Vectors and metadata are written to a local Milvus Lite .db file.
Query – Your question is embedded with the same model and the top-K nearest passages are returned by cosine similarity.
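The chunking step can be illustrated with a small sliding-window splitter. This is a minimal sketch, not the code in src/ingest.py: the function name chunk_text and its parameters are hypothetical, and it assumes chunk size and overlap are measured in characters, as the "300 chars, 50 overlap" default suggests.

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` chars.

    Hypothetical helper for illustration; the real ingest step may differ.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # each window starts `step` chars after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip windows that contain only whitespace
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


page_text = "x" * 700  # stand-in for text extracted from one PDF page
print([len(p) for p in chunk_text(page_text)])  # → [300, 300, 200]
```

With the defaults, each passage starts 250 characters after the previous one, so the last 50 characters of one passage reappear at the start of the next. The overlap keeps a sentence that straddles a boundary fully visible in at least one passage.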

Example Queries

Querying the OpenRTB 2.6 spec:

$ python -m src.search "What is a bid floor?"

--- Result 1 | score=0.6100 | OpenRTB-2-6_FINAL.pdf p.29 ---
bidfloor: Minimum bid for this impression expressed in CPM. Default 0.
...

$ python -m src.search "What fields are required in a bid request?"

--- Result 1 | score=0.6900 | OpenRTB-2-6_FINAL.pdf p.44 ---
The following attributes are required and must be present in every bid request...

$ python -m src.search "What is a private marketplace deal?"

--- Result 1 | score=0.6400 | OpenRTB-2-6_FINAL.pdf p.29 ---
Pmp object: A container for any private marketplace (PMP) deals applicable to
this impression for a programmatic guaranteed or private auction...

Each query returns a relevant top result with a similarity score above 0.60, without the user needing to know the exact field name or page number upfront.
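The score printed with each result is a cosine similarity between the query vector and the passage vector. Milvus Lite performs this ranking internally; the brute-force sketch below only illustrates the arithmetic on toy 2-dimensional vectors (the real all-MiniLM-L6-v2 embeddings have 384 dimensions). The function names cosine_similarity and top_k are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], passage_vecs: list[list[float]], k: int = 5):
    """Return up to k (score, passage_index) pairs, highest similarity first."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: passage 0 points the same way as the query, passage 1 is orthogonal.
query = [1.0, 0.0]
passages = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for score, index in top_k(query, passages, k=2):
    print(f"passage {index}: score={score:.4f}")
```

A score of 1.0 means the two vectors point in the same direction, and 0.0 means they are orthogonal, so values around 0.6 indicate a fairly strong but not exact semantic match.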

Evaluation & Tuning

An eval harness (src/eval.py) scores retrieval accuracy across 10 test queries with expected-keyword matching. Running it against four chunking configurations on the OpenRTB 2.6 spec gives:

Config	Chunks	Accuracy	Avg Score
300 chars / 50 overlap	843	100%	0.6260
500 chars / 100 overlap	530	100%	0.5931
750 chars / 150 overlap	364	100%	0.5633
1000 chars / 200 overlap	272	90%	0.5372

Smaller chunks produced higher average similarity scores, while accuracy stayed at 100% for every configuration up to 750 chars. The 1000-char config was the only one to miss a query ("What is a bid floor?") because the relevant keywords were diluted in a larger passage. The pipeline defaults (300 chars, 50 overlap) reflect the configuration with the highest average score.

Reproduce with:

python -m src.eval ~/Downloads/OpenRTB-2-6_FINAL.pdf
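Expected-keyword matching can be sketched in a few lines. This is an illustration, not the contents of src/eval.py: the names keyword_hit and evaluate are hypothetical, and it assumes a query counts as a hit when the top result contains at least one expected keyword (the real harness may use a stricter or looser rule). The retriever is passed in as a function so the sketch runs without an index.

```python
from typing import Callable


def keyword_hit(passage_text: str, expected_keywords: list[str]) -> bool:
    """True if any expected keyword occurs in the passage (case-insensitive)."""
    lowered = passage_text.lower()
    return any(keyword.lower() in lowered for keyword in expected_keywords)


def evaluate(cases: list[tuple[str, list[str]]], retrieve: Callable[[str], str]) -> float:
    """Fraction of queries whose top retrieved passage contains an expected keyword."""
    if not cases:
        return 0.0
    hits = sum(1 for query, keywords in cases if keyword_hit(retrieve(query), keywords))
    return hits / len(cases)


# Stub retriever standing in for the real search call (assumption for this demo).
canned_results = {
    "What is a bid floor?": "bidfloor: Minimum bid for this impression expressed in CPM.",
    "What is a seat ID?": "An unrelated passage about video placement types.",
}
test_cases = [
    ("What is a bid floor?", ["bidfloor", "minimum bid"]),
    ("What is a seat ID?", ["seat"]),
]
print(f"accuracy = {evaluate(test_cases, canned_results.__getitem__):.0%}")  # → accuracy = 50%
```

Under this scoring rule, the 90% figure in the table corresponds to 9 hits out of the 10 test queries.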

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Usage

1. Build the index

python -m src.pipeline ~/Downloads/OpenRTB-2-6_FINAL.pdf

This creates milvus_demo.db in the current directory. Run with --force to rebuild from scratch.

2. Search

python -m src.search "What is the difference between a bid request and a bid response?"
python -m src.search "How are native ads represented in OpenRTB?"
python -m src.search "What fields are required in an impression object?"

3. Use from Python

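import os

from src.pipeline import run_pipeline
from src.search import search, print_results

# Build index (only once). Python does not expand "~" in string paths,
# so expand it explicitly before passing the path in.
run_pipeline(os.path.expanduser("~/Downloads/OpenRTB-2-6_FINAL.pdf"))

# Query: embed the question and fetch the 5 nearest passages
results = search("What is a seat ID?", top_k=5)
print_results(results)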

Adapt to Your Documentation

The demo uses OpenRTB 2.6, but the pipeline is document-agnostic. Point it at any PDF:

Compliance and policy documents
Internal API or SDK references
Product specs or onboarding guides
Legal contracts or regulatory filings

One command to index, one command to search.

What's Next

Hybrid search with BM25
Multi-document indexing
LLM answer generation with citations
Semantic chunking experiments

Project Layout

src/
  ingest.py    # PDF → list[Passage]
  embed.py     # list[Passage] → list[vector]
  search.py    # query string → list[SearchResult]
  pipeline.py  # orchestrates ingest → embed → store
  eval.py      # retrieval quality evaluation harness
requirements.txt
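The layout refers to Passage and SearchResult types without showing them. A minimal sketch of what such records could look like is below; the field names (text, source, page, score) are assumptions inferred from the result headers in the example output, not the actual definitions in src/, and format_header is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    """One chunk of extracted text plus where it came from (assumed fields)."""
    text: str
    source: str  # PDF file name
    page: int    # page number the chunk was extracted from


@dataclass
class SearchResult:
    """A passage paired with its cosine similarity to the query (assumed fields)."""
    passage: Passage
    score: float


def format_header(rank: int, result: SearchResult) -> str:
    """Render a result header in the style shown in the example output."""
    p = result.passage
    return f"--- Result {rank} | score={result.score:.4f} | {p.source} p.{p.page} ---"


hit = SearchResult(
    passage=Passage(
        text="bidfloor: Minimum bid for this impression expressed in CPM. Default 0.",
        source="OpenRTB-2-6_FINAL.pdf",
        page=29,
    ),
    score=0.61,
)
print(format_header(1, hit))
print(hit.passage.text)
```

Keeping the source file and page number on every passage is what lets each search result cite where in the PDF the text came from.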

Notes

The first run downloads the all-MiniLM-L6-v2 model (~90 MB) from Hugging Face via sentence-transformers.
Milvus Lite stores everything in a single .db file; no server process is needed.
The PDF is not stored in this repository. Point the pipeline at your own copy.
