Talk to your customer research interviews. Agentic search with citations, eval-driven from day one.
Ingests a folder of customer-research interview transcripts and lets you ask questions across all of them — getting answers backed by verbatim quotes with citations.
> What's the top reason trial users are churning?
Top reasons trial users churn are too many onboarding setup steps
(Sarah, Marcus, Tom), pricing felt steep for small teams (Marcus,
Yuki), and a poor mobile app experience (Priya).
Citations:
• [02_marcus] "I sat down on a Tuesday night to get everyone set up
and an hour later I was still in settings..."
• [03_priya] "Half the things I can do on web I can't do on mobile..."
• [01_sarah] "By the time I got to the third or fourth screen I kind
of just wanted to send a link to someone..."
The problem. Companies run hundreds of hours of customer-research interviews. Those transcripts live in Drive folders. Nobody re-reads them. The insights die there.
The product. Verbatim is an agentic-search tool that turns that graveyard into a conversation. Ask it anything about the corpus — "who churned because of pricing," "compare what power users love vs trial users hated," "find me three quotes about the mobile app" — and it returns sourced answers with verbatim citations.
The shape. Four layers:
- Chunker — speaker-turn chunking with surrounding-turn context windows
- Vector store — Voyage embeddings in a local Chroma index
- Agent — Claude Sonnet with four tools (
search_interviews,get_full_transcript,find_quotes,list_interviews) - Eval harness — twelve ground-truth questions, scored on precision, recall, F1, and LLM-as-judge faithfulness
It's also exposed as an MCP server, so Claude Desktop can connect directly and query the corpus.
A few choices worth flagging up front, since they're the things a reviewer should ask about:
Most AI demos are built first and maybe evaluated later. This one was evaluated first. Before any retrieval pipeline existed, the eval harness was built and verified against a deliberately-broken stub. The stub had planted recall misses and precision misses; the harness caught them in exactly the right way. Then I built the real pipelines, with the harness already in place to measure each iteration honestly.
This is the unintuitive but important move. Without evals, "the new version feels better" is the best you can do. With evals, every change to chunking, embedding, prompts, or tools is a measurable delta.
Real customer-research transcripts at scale aren't publicly available. The few academic datasets (Princeton gig-worker corpus, MediaSum) are either tiny or wrong-domain. So the corpus is synthetic.
This is better, not worse: because I generated it, I know the ground truth for every claim. Each interview has planted facts (e.g. "F5: the Salesforce integration is broken, mentioned by interviews 06/08/09"). The eval set tests against those exact expected answers. Trustworthy evals beat real data with mushy ground truth.
I built naive RAG first as a baseline — retrieve top-K chunks, stuff into a prompt, generate. The agent isn't just a fancier version; it's a deliberate response to naive RAG's known failure modes:
- Naive RAG is bounded by K. If a multi-interview question has five right answers but the top-K chunks come from three interviews, it misses the other two.
- The agent can do follow-up searches when a first pass is incomplete.
- It has a
list_interviewstool that gives it a corpus map, so it can pick targets instead of blind-searching.
Whether the agent actually wins is an empirical question, which the eval
harness answers. (See evals/runs/ for the numbers.)
Most RAG tutorials chunk by fixed character count. That splits insights mid-thought and dilutes retrieval. Interview transcripts have a natural unit — the speaker turn. Each turn becomes one chunk, and each chunk includes the previous and next turn as context. This matters because interviewer questions often contain keywords ("Why did you cancel?") that the participant doesn't repeat in the answer.
The agent was evaluated against a 12-question ground-truth set covering single-fact retrieval, multi-interview synthesis, and quote-finding. Each question was scored on precision, recall, F1, and (where applicable) LLM-as-judge faithfulness. Naive RAG was run as a baseline on the same questions.
| Pipeline | Precision | Recall | F1 | Faithfulness |
|---|---|---|---|---|
| naive_rag | 84.5% | 89.3% | 85.5% | 100% (4/4) |
| agent_v1 | 87.6% | 95.1% | 90.6% | 100% (4/4) |
| Δ | +3.1pp | +5.8pp | +5.1pp | tie |
The agent beats naive RAG by 5.1 points on F1, with no faithfulness regressions.
Per-question breakdown (F1):
| Question type | naive_rag | agent_v1 | Δ |
|---|---|---|---|
| Single-fact retrieval (Q5–Q9) | 100% | 100% | tie |
| Multi-interview synthesis (Q10, Q11) | 45.8% avg | 90% avg | +44pp |
| Comparison questions (Q2, Q3) | 82.8% avg | 100% | +17pp |
| Open-ended (Q1, Q4, Q12) | 89.6% avg | 69.0% avg | -20pp |
The agent's gains concentrate where naive RAG's weakness is structural: multi-interview retrieval bounded by K. On single-fact questions both pipelines tie — there's only one relevant passage, so a single retrieval finds it. On open-ended questions the agent occasionally goes too broad, trading precision for recall. That's a known failure mode worth addressing (see "What's next").
Full per-question results in evals/runs/agent_v1_20260522_175027.json and
evals/runs/naive_rag_20260522_175602.json. To reproduce:
make eval-compare# 1. Create and activate a venv
python3 -m venv .venv
source .venv/bin/activate
# 2. Install deps
pip install -e .
# 3. Configure API keys
cp .env.example .env
# Then edit .env and paste in your ANTHROPIC_API_KEY and VOYAGE_API_KEY.
# 4. Build the vector index (embeds all 10 interviews; takes ~5 seconds)
make ingestYou need two API keys:
- Anthropic (https://console.anthropic.com) — Claude is the LLM that answers questions and judges faithfulness. Add $5 in credits at Settings → Billing. Hard ceiling, no auto-charge. This project uses ~$3 total.
- Voyage AI (https://www.voyageai.com) — embeddings. Free tier covers this easily.
make app # streamlit run src/app.pyOpen the URL Streamlit prints and ask away. The sidebar has example questions and a pipeline selector so you can compare naive RAG vs the agent live.
make mcpThen add this to your ~/Library/Application Support/Claude/claude_desktop_config.json:
{
"mcpServers": {
"verbatim": {
"command": "python",
"args": ["-m", "src.mcp_server"],
"cwd": "/absolute/path/to/verbatim"
}
}
}Restart Claude Desktop and the four Verbatim tools will appear. Ask "what are users saying about pricing" and Claude will call the tools.
make eval-stub # no API keys needed — proves the harness works
make eval-naive # naive RAG baseline
make eval-agent # the agent
make eval-compare # A/B them side-by-sideResults are persisted to evals/runs/<pipeline>_<timestamp>.json so you
can diff runs and find regressions.
verbatim/
├── docs/dataset_design.md # why synthetic, what facts are planted where
├── evals/
│ ├── ground_truth.json # 12 customer-research eval questions
│ ├── job_ground_truth.json # 6 questions for the job-interview corpus
│ └── runs/ # eval results, persisted for diffing
├── src/
│ ├── types.py # Pipeline Protocol, Citation, PipelineAnswer
│ ├── chunker.py # speaker-turn chunking
│ ├── vector_store.py # Voyage + Chroma wrapper
│ ├── ingest.py # one-shot CLI to build the index
│ ├── stub_pipeline.py # rule-based stub; verifies the eval harness
│ ├── naive_rag.py # retrieve-once baseline
│ ├── agent.py # tool-using LLM with 4 tools
│ ├── mcp_server.py # exposes the agent's tools via MCP
│ ├── judge.py # LLM-as-judge faithfulness scoring (Haiku)
│ ├── app.py # Streamlit chat UI
│ ├── eval.py # the eval harness
│ ├── test_eval.py # unit tests for scoring math
│ └── test_chunker.py # unit tests for chunking
├── transcripts/ # 10 customer-research interviews
├── job_transcripts/ # 6 mock job interviews (the extension)
├── pyproject.toml
├── Makefile
├── .env.example
└── README.md
The system is domain-agnostic. To prove it, the same backbone runs over a
second corpus — six mock job-interview transcripts I recorded for my own
interview prep. Same chunker, same vector store, same agent, same eval
harness. Different ground truth (evals/job_ground_truth.json), different
collection (jobs).
make ingest-jobs # build a separate index
python -m src.eval --pipeline agent_v1 \
--ground-truth evals/job_ground_truth.json # run job-corpus evalsThis was the small extension that demonstrated the architecture's flexibility. The same code that helps a product manager grok customer interviews helps a candidate grok their own mock interviews and spot weak patterns. ("In which interviews did I self-identify a weakness? What was my stock answer for 'tell me about a failure' and what was the better one?")
A weekend project, not a product. Real next steps:
- Prompt versioning. I changed prompts during development and lost track of which version produced which score. The version of every prompt should be tagged so an eval run knows the prompt revision.
- Reranking. Pull more candidates with vector search, then rerank with a cross-encoder before stuffing into context. Likely bumps precision a meaningful amount.
- Fan-out for synthesis questions. "Top three themes" should run multiple sub-queries explicitly rather than relying on a single retrieval. The agent does some of this already; making it first-class would help.
- Per-user access controls. Customer interviews are sensitive; any production version needs row-level access.
- Incremental ingestion. Currently a rebuild is full. New interviews should update the index without re-embedding everything.
- Real-corpus validation. Synthetic data validated the architecture; running this on a partner's real corpus is the honest next test.
Built as a portfolio project to learn agentic search, MCP tooling, and eval-driven development. The synthetic corpus, fictional product "Linkup," and all interview participants are fabricated.