Retrieval-augmented generation chatbot running on local infrastructure using Chainlit, Haystack, Ollama, and Qdrant.
Answers domain questions grounded in static company documents (txt / md) with citation-style source references. Includes a typed tool call for vacation-days lookup backed by a simple username/password simulation.
| Requirement | Version | Notes |
|---|---|---|
| Python | 3.12 | Pinned in .python-version |
| uv | latest | Sole dependency/env manager — no pip/poetry |
| Ollama | latest | Runs models locally |
| Docker-compatible runtime | latest | Used to run Qdrant (Docker Desktop, Colima+Docker CLI, Podman, Rancher Desktop) |
```bash
git clone <repo-url>
cd chatbot
uv sync
```

`uv sync` creates the virtual environment (`.venv/`), pins Python 3.12, and installs all dependencies from `uv.lock`.

```bash
cp .env.example .env
# Edit .env as needed — defaults work for a standard local setup
```

Key variables (all have sensible defaults):
| Variable | Default | Description |
|---|---|---|
| `CHAT_MODEL_PROVIDER` | `ollama` | `ollama` or `openai_compatible` |
| `CHAT_BASE_URL` | `http://localhost:11434` | Chat model endpoint — Ollama URL or OpenAI-compatible base URL (e.g. `https://api.groq.com/openai/v1`) |
| `CHAT_API_KEY` | — | API key for `openai_compatible` providers (not used for Ollama) |
| `CHAT_MODEL` | `qwen3.5:9b` | Generation model identifier — Ollama name or provider model id (e.g. `qwen3-32b`, `Qwen/Qwen3-235B-A22B`) |
| `EMBEDDING_MODEL_PROVIDER` | `ollama` | Currently only `ollama` is supported |
| `EMBEDDING_BASE_URL` | `http://localhost:11434` | Ollama server URL for embeddings |
| `EMBEDDING_MODEL` | `bge-m3` | Embedding model |
| `QDRANT_HOST` | `localhost` | Qdrant hostname |
| `QDRANT_PORT` | `6333` | Qdrant port |
| `QDRANT_COLLECTION` | `chatbot` | Collection name |
| `CORPUS_PATH` | `corpus` | Source document directory |
| `RETRIEVAL_TOP_K` | `5` | Chunks returned per query |
| `RETRIEVAL_LLM_TOP_K` | — | Max documents passed to the LLM after RRF fusion; default uses `RETRIEVAL_TOP_K` |
| `EMBEDDING_DIM` | `1024` | Embedding vector dimension — must match `EMBEDDING_MODEL` output |
| `SPLIT_LENGTH` | `200` | Words per ingestion chunk |
| `SPLIT_OVERLAP` | `20` | Word overlap between adjacent chunks |
| `VISION_INGESTION_ENABLED` | `true` | Ingest images via a vision model; set `false` to skip all image content |
| `VISION_MODEL` | `qwen2.5vl:7b` | Vision model for image description generation |
| `VISION_PROVIDER` | `ollama` | Vision model backend — currently only `ollama` |
| `VISION_BASE_URL` | `http://localhost:11434` | Ollama server URL for vision-model calls |
| `IMAGE_CACHE_DIR` | `.cache/image_descriptions` | Cache directory for vision-model descriptions (keyed by content hash) |
| `EXTRACTED_IMAGE_DIR` | `.cache/extracted_images` | Directory for PDF-extracted images (surfaced as citation images) |
| `IMAGE_MIN_DIMENSION` | `64` | Drop images whose width or height is below this value (pixels) |
| `IMAGE_MIN_DESCRIPTION_LENGTH` | `40` | Drop descriptions shorter than this (chars) — filters decorative images |
| `LOG_FORMAT` | `console` | `console` (dev) or `json` (CI/prod) |
| `OTEL_ENABLED` | `false` | Enables OpenTelemetry tracing export |
| `OTEL_SERVICE_NAME` | `chatbot` | Trace resource service name |
| `PHOENIX_PROJECT_NAME` | `chatbot-local` | Phoenix project name shown in the UI |
| `OTEL_DEPLOYMENT_ENVIRONMENT` | `development` | OTel deployment environment resource attribute |
| `OTEL_PHOENIX_OTLP_ENDPOINT` | `http://localhost:6006/v1/traces` | Phoenix OTLP/HTTP trace endpoint |
| `OTEL_EXPORT_PHOENIX` | `true` | Enables Phoenix export when `OTEL_ENABLED=true` |
| `OTEL_EXPORT_JAEGER` | `true` | Enables Jaeger export when `OTEL_ENABLED=true` |
| `OTEL_JAEGER_OTLP_ENDPOINT` | `http://localhost:4318/v1/traces` | Jaeger OTLP endpoint (HTTP default; non-`/v1/traces` endpoints use gRPC mode) |
| `OTEL_SAMPLE_RATE` | `1.0` | Root trace sampling ratio (0.0 to 1.0) |
| `OTEL_CONSOLE_EXPORT` | `false` | Also print spans to stdout |
| `OTEL_AUTO_INSTRUMENT_HAYSTACK` | `true` | Enable OpenInference auto-instrumentation for Haystack |
| `MODEL_TEMPERATURE` | `0.0` | Chat model temperature (0.0 = deterministic, 1.0 = creative) |
| `MODEL_SEED` | `42` | Chat model seed for reproducible generations |
| `EVAL_ENVIRONMENT` | `local` | Evaluation environment label attached to all traces |
| `EVAL_NAME` | — | Evaluation cycle name (e.g. `rag-prompt-tuning-2026-04`) |
| `EVAL_RUN_ID` | — | Run identifier; auto-generated per process if unset |
| `EVAL_CANDIDATE_ID` | — | Candidate ID for comparing prompt/model variants |
| `EVAL_PROMPT_VERSION_ANSWER` | — | Version label for the answer-generation prompt |
| `EVAL_PROMPT_VERSION_CITATION` | — | Version label for the citation-pass prompt |
| `EVAL_RETRIEVAL_VERSION` | — | Version label for retrieval config; auto-derived from `RETRIEVAL_TOP_K` if unset |
| `EVAL_CORPUS_VERSION` | — | Corpus snapshot identifier |
| `EVAL_DATASET_VERSION` | — | Dataset snapshot identifier |
| `EVAL_JUDGE_MODEL` | `llama3.1:8b` | LLM judge model for eval evaluators |
| `EVAL_JUDGE_PROVIDER` | `ollama` | LLM judge backend: `ollama` or `openai_compatible` |
| `EVAL_JUDGE_BASE_URL` | — | Judge provider URL; defaults to `http://localhost:11434` for Ollama |
| `EVAL_JUDGE_API_KEY` | — | API key for `openai_compatible` judge provider |
| `EVAL_JUDGE_INITIAL_PER_SECOND_REQUEST_RATE` | `1.5` | Initial judge request rate (lower to avoid rate-limit errors) |
```bash
# Runtime default chat model
ollama pull qwen3.5:9b

# Embeddings
ollama pull bge-m3

# Vision model (for multi-modal ingestion — can be skipped if VISION_INGESTION_ENABLED=false)
ollama pull qwen2.5vl:7b

# Test chat model (used by pytest via tests/conftest.py)
ollama pull llama3.2
```

The embedding and vision models always run locally via Ollama. The chat model can alternatively run via a cloud provider — see Using a cloud/remote chat model below.

For tests, the suite pins `CHAT_MODEL=llama3.2` to keep test runs fast and reproducible even when the runtime default model changes.
```bash
docker run -d --name qdrant -p 6333:6333 qdrant/qdrant
```

Qdrant dashboard: http://localhost:6333/dashboard

macOS quick option (recommended): Colima + Docker CLI

```bash
brew install colima docker
colima start
docker run -d --name qdrant -p 6333:6333 qdrant/qdrant
```

Docker Desktop is not always required. For many local-dev setups, Colima is the simplest free alternative.
- Keep one embedding model (and vector dimension) per collection.
- Use deterministic point IDs (`doc_id` + `chunk_index`) for idempotent re-indexing (see the sketch after this list).
- Persist data with a volume in real usage (not only ephemeral container state).
- Treat retrieval params (`top_k`, score threshold, chunk size) as evaluation-controlled settings.
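As a minimal sketch of the deterministic-ID idea (the helper below is hypothetical, not the project's ingestion code): hashing `doc_id` and `chunk_index` into a stable UUID means re-indexing the same chunk overwrites the existing point instead of creating a duplicate.

```python
# Hypothetical helper, not the project's ingestion code: derive a stable
# Qdrant point ID from doc_id + chunk_index so re-indexing is idempotent.
import uuid


def deterministic_point_id(doc_id: str, chunk_index: int) -> str:
    """Same (doc_id, chunk_index) always yields the same UUID string."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_id}:{chunk_index}"))


print(deterministic_point_id("handbook.md", 3))  # identical on every run
```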
Set CHAT_MODEL_PROVIDER=openai_compatible to replace the local Ollama chat model with any OpenAI-API-compatible endpoint. Embeddings always remain local.
Groq (free tier, fast):

```bash
CHAT_MODEL_PROVIDER=openai_compatible
CHAT_BASE_URL=https://api.groq.com/openai/v1
CHAT_API_KEY=gsk_...
CHAT_MODEL=qwen3-32b
```

Together AI (pay-as-you-go, larger models):

```bash
CHAT_MODEL_PROVIDER=openai_compatible
CHAT_BASE_URL=https://api.together.xyz/v1
CHAT_API_KEY=...
CHAT_MODEL=Qwen/Qwen3-235B-A22B
```

Any other provider with an OpenAI-compatible `/chat/completions` endpoint works the same way.
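If you want to sanity-check such an endpoint before wiring it into the chatbot, here is a small sketch assuming the `requests` package and the standard OpenAI-style `/chat/completions` contract; it reads the same variables you set in `.env`.

```python
# Quick smoke test for an OpenAI-compatible chat endpoint; adjust to your provider.
# Assumes CHAT_BASE_URL, CHAT_API_KEY, and CHAT_MODEL are exported in the shell.
import os

import requests

resp = requests.post(
    f"{os.environ['CHAT_BASE_URL']}/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['CHAT_API_KEY']}"},
    json={
        "model": os.environ["CHAT_MODEL"],
        "messages": [{"role": "user", "content": "Reply with OK."}],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```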
```bash
uv run chainlit run src/chatbot/ui/app.py
```

Opens a chat interface at http://localhost:8000 by default.
This project can emit OpenTelemetry traces for the full chat pipeline (UI turn, orchestrator rounds, model call, retrieval tool, and Qdrant retrieval).
```bash
uv tool run --from arize-phoenix python -m phoenix.server.main serve
```

Phoenix UI: http://localhost:6006
```bash
docker run --rm --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -e COLLECTOR_OTLP_GRPC_HOST_PORT=0.0.0.0:4317 \
  -e COLLECTOR_OTLP_HTTP_HOST_PORT=0.0.0.0:4318 \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:1.59
```

Jaeger UI: http://localhost:16686
For a standard local setup, only this is required:
```bash
OTEL_ENABLED=true
```

All other tracing variables already have sensible defaults. In this project, Jaeger defaults to OTLP/HTTP (`/v1/traces`). Set them only when you want to override targets or behavior.

Optional overrides:

```bash
OTEL_SERVICE_NAME=chatbot
OTEL_PHOENIX_OTLP_ENDPOINT=http://localhost:6006/v1/traces
OTEL_EXPORT_PHOENIX=true
OTEL_EXPORT_JAEGER=true
OTEL_JAEGER_OTLP_ENDPOINT=http://localhost:4318/v1/traces
OTEL_SAMPLE_RATE=1.0
OTEL_CONSOLE_EXPORT=false
```

Optional Jaeger OTLP gRPC mode:

```bash
OTEL_JAEGER_OTLP_ENDPOINT=localhost:4317
```

Use gRPC mode only when port 4317 is reachable from the app process.
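To illustrate the HTTP-vs-gRPC distinction, here is a sketch of the general OpenTelemetry SDK pattern (not this project's actual tracing bootstrap): an endpoint ending in `/v1/traces` pairs with the OTLP/HTTP exporter, while a bare `host:port` pairs with the OTLP/gRPC exporter.

```python
# Illustrative sketch of OTLP exporter selection; not the project's bootstrap code.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def make_otlp_exporter(endpoint: str):
    if endpoint.endswith("/v1/traces"):
        # OTLP over HTTP, e.g. http://localhost:4318/v1/traces
        from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    else:
        # OTLP over gRPC, e.g. localhost:4317
        from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    return OTLPSpanExporter(endpoint=endpoint)


provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(make_otlp_exporter("http://localhost:4318/v1/traces")))
```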
With `OTEL_ENABLED=true`, both exports are enabled by default. You can disable either backend independently:

```bash
# Phoenix only
OTEL_EXPORT_PHOENIX=true
OTEL_EXPORT_JAEGER=false

# Jaeger only
OTEL_EXPORT_PHOENIX=false
OTEL_EXPORT_JAEGER=true
```

To verify tracing end to end, start the app:

```bash
./chatbot.sh
```

Ask one or two questions in Chainlit, then open http://localhost:6006 and inspect traces for the `chatbot` service.
You can inspect the same traces in Phoenix (http://localhost:6006) and Jaeger (http://localhost:16686) when both exporters are enabled.
- Root UI span for each message (`chat.ui.on_message`)
- Orchestrator spans (`chat.orchestrator.process_message`, `chat.orchestrator.round`)
- Model call span (`chat.model.ollama.stream` or `chat.model.openai_compatible.stream`) with request/response previews
- Tool spans (`chat.tool.search_documents`) with tool input and result summary
- Retriever span (`chat.retriever.qdrant.retrieve`) with top-k result preview
Tracing follows a strict ownership model so each span level contributes one readable view only:
- UI span (`chat.ui.on_message`): user input preview + final emitted assistant response preview.
- Orchestrator spans (`chat.orchestrator.*`): control-flow only (round state, tool dispatch summaries, citation-pass diagnostics).
- Model span (`chat.model.ollama.stream`): compact request message summary + model output preview.
- Tool spans (`chat.tool.*`): tool input plus tool-specific result summary.
- Retriever span (`chat.retriever.qdrant.retrieve`): retrieval parameters + compact chunk/content previews.
Canonical span names are centralized in src/chatbot/observability/schema.py and should be reused instead of hard-coded string literals.
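A sketch of that pattern follows; the constant and attribute names below are illustrative only, and the real definitions live in `schema.py`.

```python
# Illustrative only: reuse centralized span-name constants instead of literals.
from opentelemetry import trace

# In the project these constants come from src/chatbot/observability/schema.py.
SPAN_RETRIEVER_RETRIEVE = "chat.retriever.qdrant.retrieve"

tracer = trace.get_tracer(__name__)


def retrieve(query: str) -> list[str]:
    with tracer.start_as_current_span(SPAN_RETRIEVER_RETRIEVE) as span:
        # Attribute name is hypothetical; keep previews compact per the ownership model.
        span.set_attribute("retrieval.query_preview", query[:80])
        return []
```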
- No traces in Phoenix:
  - Verify Phoenix is running on http://localhost:6006.
  - Verify `OTEL_ENABLED=true`, `OTEL_EXPORT_PHOENIX=true`, and that the endpoint matches http://localhost:6006/v1/traces.
- No traces in Jaeger:
  - Verify Jaeger is running with OTLP enabled (`COLLECTOR_OTLP_ENABLED=true`).
  - Verify Jaeger OTLP receivers are bound to all interfaces (`0.0.0.0`) when running in Docker.
  - For default OTLP/HTTP mode, verify the endpoint is http://localhost:4318/v1/traces.
  - For gRPC mode, verify the port mapping includes `-p 4317:4317` and the endpoint is `localhost:4317`.
  - Verify `OTEL_ENABLED=true` and `OTEL_EXPORT_JAEGER=true`.
  - Set `OTEL_CONSOLE_EXPORT=true` temporarily to confirm spans are produced.
- Too many traces:
  - Lower `OTEL_SAMPLE_RATE`, e.g. `0.2`.
Place source documents (.txt, .md) in the directory configured by CORPUS_PATH (default: corpus/), or use the example documents already provided in corpus/.
```bash
# Index (or re-index) all documents in CORPUS_PATH
uv run python -m src.ingest.cli reindex

# Wipe the Qdrant collection and re-index from scratch
uv run python -m src.ingest.cli reset

# Wipe the collection only, without re-indexing
uv run python -m src.ingest.cli reset --wipe-only
```

`reindex` uses an OVERWRITE duplicate policy, so running it again on unchanged files is safe — it replaces stale vectors rather than creating duplicates.
```bash
# Unit tests (no external services needed)
uv run pytest tests/unit/

# Integration tests (requires Qdrant + Ollama running; pytest pins CHAT_MODEL=llama3.2)
INTEGRATION_TESTS=1 uv run pytest tests/integration/

# All tests
uv run pytest
```

Integration tests are skipped automatically unless `INTEGRATION_TESTS=1` is set. To run them, start Qdrant and Ollama first, then run the command above.
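The gating typically comes down to a `skipif` marker keyed on the environment variable; here is a sketch of the mechanism, not necessarily `tests/conftest.py` verbatim.

```python
# Sketch of how integration tests can be gated on INTEGRATION_TESTS=1.
import os

import pytest

requires_live_services = pytest.mark.skipif(
    os.environ.get("INTEGRATION_TESTS") != "1",
    reason="set INTEGRATION_TESTS=1 and start Qdrant + Ollama to run integration tests",
)


@requires_live_services
def test_end_to_end_answer() -> None:
    ...  # talks to the real Qdrant + Ollama instances
```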
```bash
uv run ruff check .           # lint
uv run ruff format .          # format (in-place)
uv run ruff format --check .  # format check only (CI mode)
uv run pyright                # strict type check
```

All four commands must pass cleanly before merging any branch. CI enforces this automatically on every pull request.
```bash
# Run experiment against the default dataset (eval/datasets/rag_questions.jsonl)
uv run --group eval python eval/run_experiment.py

# Custom dataset / experiment name
uv run --group eval python eval/run_experiment.py \
  --dataset-file eval/datasets/rag_questions.jsonl \
  --experiment-name "retrieval-top-k-5"

# Quick sanity check (1 example, no Phoenix upload)
uv run --group eval python eval/run_experiment.py --dry-run
```

Drives the ChatOrchestrator directly (bypasses Chainlit) and records results in Arize Phoenix. Requires `OTEL_ENABLED=true` and a running Phoenix instance. See eval/README.md for a full feature overview and dataset management guide.
Evaluation metadata (run ID, prompt versions, candidate ID, …) is controlled via EVAL_* env vars — see the configuration table above.
```text
src/
shared/ Cross-cutting infrastructure shared by chatbot and ingest
settings/ Settings + get_settings (pydantic-settings, single source of truth)
observability/ Logging (structlog) and OpenTelemetry tracing bootstrap
qdrant/ Shared Qdrant document-store factory and sparse embedder
chatbot/ Chatbot application
app/ Orchestrator, credential store, prompts
citation/ CitationModel, citation parser, citeable-tool messages
contracts/ Protocol interfaces: ChatModel, Retriever, CredentialStore, …
ui/ Chainlit UI layer, session lifecycle, composition root
infrastructure/
chat/ Ollama and OpenAI-compatible chat model adapters
embeddings_text/ Query-time text embedder (Ollama)
retrieval/ Hybrid Qdrant vector+sparse retriever
tools/
retrieval/ Document retrieval tool (CiteableTool)
vacation_days/ Vacation-days lookup tool
build_from_settings.py Settings → chatbot component factory
ingest/ Ingestion pipeline
app/ Ingestion pipeline logic (chunking, embedding, indexing)
contracts/ Protocol interfaces: FormatHandler, DocumentConverter, …
cli/ Developer CLI (reindex / reset / wipe-only)
infrastructure/
converters/ Format converters (PDF, image)
embeddings_document/ Document embedder (Ollama)
image_cache/ Content-hash cache for vision-model descriptions
image_description.py Image description service and filter config
vision/ Vision model adapter (Ollama)
build_from_settings.py Settings → ingest component factory
eval/
run_experiment.py Offline evaluation runner (Phoenix Experiments)
datasets/ Curated evaluation datasets (JSONL)
results/ Stored experiment result files
tests/
unit/ Fast, self-contained tests for business logic
integration/ End-to-end tests requiring live services
doc/ Architecture and requirement specifications
corpus/ (git-ignored) Source documents for ingestion
```
Module boundaries use Protocol-based interfaces — orchestration code never imports infrastructure (Qdrant client, HTTP clients) directly.
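As a rough illustration of that boundary (the signature below is hypothetical; the real interfaces live in the `contracts/` packages):

```python
# Hypothetical shape of a contracts-style Protocol; not the project's actual interface.
from typing import Protocol


class Retriever(Protocol):
    """What orchestration code depends on; it never imports the Qdrant client."""

    def retrieve(self, query: str, top_k: int) -> list[str]: ...


class QdrantRetriever:
    """Infrastructure adapter; satisfies Retriever structurally, no inheritance needed."""

    def retrieve(self, query: str, top_k: int) -> list[str]:
        return []  # a real adapter would query Qdrant here


def answer(question: str, retriever: Retriever) -> str:
    chunks = retriever.retrieve(question, top_k=5)
    return f"answer grounded in {len(chunks)} retrieved chunks"
```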
- Single-user only: Chainlit's WebSocket-based session model is inherently stateful and not designed for horizontal scaling. See `doc/05-delivery-plan.md` → Multi-User and Stateless Architecture for the production path.
- Local model quality variance: Answer quality depends on the model chosen. Model pinning and benchmark-based acceptance gates mitigate regressions.
- PDF extraction quality: Complex layouts (tables, columns) may degrade extraction fidelity, which directly affects retrieval quality.
- No production IAM: Auth is simulated via username/password at tool-call time. Session-scoped only; no OAuth2/OIDC integration.
- Prompt injection: Partially mitigated via source-grounded answering and instruction policy. Full red-teaming is post-MVP.
- No dynamic document upload: Corpus changes require a CLI re-index. End-user upload is out of MVP scope.