A lightweight semantic search pipeline for querying technical documentation using natural language. Demonstrated with the OpenRTB 2.6 programmatic advertising specification.
Built with Python · Milvus Lite · sentence-transformers · all-MiniLM-L6-v2 · pypdf
Dense technical documentation is everywhere (protocol specs, compliance policies, API references, onboarding guides) and keyword search consistently fails it. The right answer is buried in a paragraph that doesn't share a single word with your query.
This pipeline lets anyone ask questions in plain English and get back the exact passages that answer them:
- Solutions Engineers answering client integration questions without reading 100-page specs
- Trust & Safety teams searching policy documentation for specific scenarios or edge cases
- AdTech teams navigating protocol specs like OpenRTB, VAST, or SKAN
- Any team with internal documentation too dense to Ctrl+F effectively
No external API, no server, no infrastructure. Runs entirely on a laptop.
PDF → chunk (300 chars, 50 overlap) → embed (all-MiniLM-L6-v2) → Milvus Lite → cosine search

- Ingest – `pypdf` extracts text page by page; the text is split into overlapping passages.
- Embed – pymilvus's built-in `SentenceTransformerEmbeddingFunction` encodes every passage as a dense vector (no API key needed; the model downloads once).
- Store – Vectors and metadata are written to a local Milvus Lite `.db` file.
- Query – Your question is embedded with the same model, and the top-K nearest passages are returned by cosine similarity.
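The chunking step can be sketched as a fixed-size sliding window where consecutive passages share an overlap. This is a simplified stand-in for the actual logic in `src/ingest.py`; the function name and signature here are illustrative:

```python
def chunk_text(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into fixed-size passages; consecutive passages share `overlap` chars."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk.strip():  # skip whitespace-only tails
            chunks.append(chunk)
        if start + size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary intact in at least one passage, which is why the eval results below favor it over disjoint splits.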
Querying the OpenRTB 2.6 spec:
```
$ python -m src.search "What is a bid floor?"
--- Result 1 | score=0.6100 | OpenRTB-2-6_FINAL.pdf p.29 ---
bidfloor: Minimum bid for this impression expressed in CPM. Default 0.
...

$ python -m src.search "What fields are required in a bid request?"
--- Result 1 | score=0.6900 | OpenRTB-2-6_FINAL.pdf p.44 ---
The following attributes are required and must be present in every bid request...

$ python -m src.search "What is a private marketplace deal?"
--- Result 1 | score=0.6400 | OpenRTB-2-6_FINAL.pdf p.29 ---
Pmp object: A container for any private marketplace (PMP) deals applicable to
this impression for a programmatic guaranteed or private auction...
```

All three queries return relevant results at scores above 0.60, without knowing the exact field name or page number upfront.
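The score in each result is the cosine similarity between the query vector and a passage vector. A minimal illustration of the ranking math, using tiny hand-made vectors in place of real all-MiniLM-L6-v2 embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product of the vectors divided by their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-dimensional "embeddings" (real ones are 384-dimensional)
query = [0.2, 0.9, 0.1]
passages = {"bidfloor": [0.25, 0.85, 0.05], "pmp": [0.9, 0.1, 0.4]}
ranked = sorted(passages, key=lambda p: cosine(query, passages[p]), reverse=True)
```

In practice Milvus Lite performs this nearest-neighbor ranking internally; the sketch only shows why passages that point in the same direction as the query score near 1.0.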
An eval harness (`src/eval.py`) scores retrieval accuracy across 10 test queries with expected-keyword matching. Running it against four chunking configurations on the OpenRTB 2.6 spec:
| Config | Chunks | Accuracy | Avg Score |
|---|---|---|---|
| 300 chars / 50 overlap | 843 | 100% | 0.6260 |
| 500 chars / 100 overlap | 530 | 100% | 0.5931 |
| 750 chars / 150 overlap | 364 | 100% | 0.5633 |
| 1000 chars / 200 overlap | 272 | 90% | 0.5372 |
Smaller chunks produced higher similarity scores and better accuracy. The 1000-char config was the only one to miss a query ("What is a bid floor?") because the relevant keywords were diluted in a larger passage. The pipeline defaults (300 chars, 50 overlap) reflect the best-performing configuration.
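The expected-keyword scoring can be approximated like this. A sketch only: the real harness embeds each query and searches Milvus, and the names below are illustrative:

```python
def hit(results: list[str], expected_keywords: list[str]) -> bool:
    """A query counts as correct if any retrieved passage contains every expected keyword."""
    return any(
        all(kw.lower() in passage.lower() for kw in expected_keywords)
        for passage in results
    )

# One test query with its expected keywords (hypothetical data)
queries = {"What is a bid floor?": ["bidfloor", "CPM"]}
fake_results = ["bidfloor: Minimum bid for this impression expressed in CPM."]
accuracy = sum(hit(fake_results, kws) for kws in queries.values()) / len(queries)
```

Keyword matching is a coarse proxy for relevance, but it is deterministic and cheap, which makes it good enough to compare chunking configurations against each other.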
Reproduce with:
```
python -m src.eval ~/Downloads/OpenRTB-2-6_FINAL.pdf
```

To set up the environment and build the index:

```
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python -m src.pipeline ~/Downloads/openrtb2.6.pdf
```

This creates `milvus_demo.db` in the current directory. Run with `--force` to rebuild from scratch.
Then search:

```
python -m src.search "What is the difference between a bid request and a bid response?"
python -m src.search "How are native ads represented in OpenRTB?"
python -m src.search "What fields are required in an impression object?"
```

The same pipeline is usable from Python:

```python
from src.pipeline import run_pipeline
from src.search import search, print_results

# Build the index (only once)
run_pipeline("~/Downloads/openrtb2.6.pdf")

# Query
results = search("What is a seat ID?", top_k=5)
print_results(results)
```

The demo uses OpenRTB 2.6, but the pipeline is document-agnostic. Point it at any PDF:
- Compliance and policy documents
- Internal API or SDK references
- Product specs or onboarding guides
- Legal contracts or regulatory filings
One command to index, one command to search.

On the roadmap:
- Hybrid search with BM25
- Multi-document indexing
- LLM answer generation with citations
- Semantic chunking experiments
```
src/
  ingest.py     # PDF → list[Passage]
  embed.py      # list[Passage] → list[vector]
  search.py     # query string → list[SearchResult]
  pipeline.py   # orchestrates ingest → embed → store
  eval.py       # retrieval quality evaluation harness
requirements.txt
```
- The first run downloads the `all-MiniLM-L6-v2` model (~90 MB) from Hugging Face via `sentence-transformers`.
- Milvus Lite stores everything in a single `.db` file; no server process is needed.
- The PDF is not stored in this repository. Point the pipeline at your own copy.