32 commits
1760025  hybrid search  (May 15, 2026)
babe5e2  Back to openai  (May 15, 2026)
3a3630c  Semantic only mode  (May 15, 2026)
9891a04  incremental indexing  (May 15, 2026)
37a17b5  cleanup  (May 15, 2026)
7b4a970  Add multi-index semantic search architecture with testbed UI  (May 19, 2026)
6f3a8f9  testing scripts, results, reports and documentation  (May 25, 2026)
c6b22c8  prod notes  (May 25, 2026)
d343c1b  more query analyses  (May 25, 2026)
4f2b863  mxbai toggle for prod PR  (May 26, 2026)
37b2ad3  readme update  (May 26, 2026)
57fc307  PR cleanup  (May 26, 2026)
18fc59c  more cleanup for prod  (May 27, 2026)
1e9cd71  Revert dev port to 5000 and remove test results directory  (May 28, 2026)
11e36b6  minor tweaks  (May 28, 2026)
a76776e  revert gitignore chagnes, minor readme changes  (May 28, 2026)
7a711e2  Offload index-time embedding to HF Dedicated Endpoint  (May 28, 2026)
ff6f4f1  Harden remote-embedding retry path and lift /index harakiri  (May 28, 2026)
3ffc643  Cleanup remote-embed pipeline for clarity and a few wasted copies  (May 28, 2026)
b97cde4  DX: SEMANTIC_ENABLED top-level switch, README cleanup  (May 28, 2026)
a1eadc7  Pin DEFAULT_SEMANTIC_MODEL explicitly instead of using dict iteration…  (May 28, 2026)
2197e70  Split bulk timeout into explicit lexical / semantic constants  (May 28, 2026)
9f8960e  Collapse SEARCH_MODES / SEMANTIC_MODES into two string constants  (May 28, 2026)
6eb4910  Use a SearchMode enum instead of bare string constants  (May 28, 2026)
1581ef3  Consolidate /search into one mode branch with early-return  (May 28, 2026)
3b98336  Fix /index?model=<garbage> silently building all semantic indexes  (May 28, 2026)
18f6e04  /index: switch ?model= to ?targets= (comma-separated subset)  (May 28, 2026)
84e02da  README audit: fix stale Adding-a-model and Batch-eval sections  (May 28, 2026)
ad1e124  README: correct the /search default-model description  (May 28, 2026)
79808a5  Trim verbose comments around remote-embed tuning knobs  (May 28, 2026)
6eabebb  formatting  (May 28, 2026)
b87db4e  cleanup  (May 28, 2026)
13 changes: 12 additions & 1 deletion .env.sample
@@ -11,10 +11,21 @@ AWS_SECRET=secret

ELASTIC_PASSWORD=123
ES_PORT=9200
ES_STACK_VERSION=8.9.0
ES_STACK_VERSION=8.16.0

INDEXING_PASSWORD=index123

# Ollama must be running on the host with mxbai-embed-large pulled.
# On Linux set OLLAMA_URL explicitly — see README.
# OLLAMA_URL=http://host.docker.internal:11434
SEMANTIC_ENABLED=true
# Index-time embedding via your own HF Dedicated Inference Endpoint (TEI-backed).
# Query-time embedding always stays on local Ollama. Unset either var → ES
# embeds via Ollama at index time too (slower; fine for local dev).
HUGGING_FACE_KEY=key
# HF Inference Endpoint base URL (no trailing slash, no /v1/embeddings — we append).
HF_DEDICATED_URL=https://some-endpoint

# Grafana Cloud — logs (Loki)
# Get from: grafana.com → your stack → Loki card → "Send Logs"
# Token: My Account → Access Policies → token with logs:write scope
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,4 +1,4 @@
db/01-hadithTable.sql
db/01-hadithdb.sql
.env
__pycache__
data/
233 changes: 226 additions & 7 deletions README.md
@@ -1,10 +1,229 @@
# Search
To run:
docker-compose up --build
# sunnah.com Search API

Then visit:
Flask + Elasticsearch search service for sunnah.com. Supports lexical (BM25) and semantic search.

---

## Architecture

```
Browser / PHP website
Flask API (this repo) ──► Elasticsearch
                    ┌───────────┴───────────┐
                    │ english-lexical       │  BM25, no embeddings
                    │ english-mxbai         │  mxbai-embed-large vectors
                    └───────────────────────┘

Ollama (host, port 11434) — embeds search queries
HF Dedicated Endpoint (optional) — embeds documents at index time
```

Each index name in ES is an **alias** (e.g. `english-mxbai`) pointing to a timestamped backing index. Reindexing builds a new backing index and atomically swaps the alias — the live index keeps serving traffic during the rebuild.
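
The alias swap described above can be sketched as follows. This is a minimal illustration, not the code in this repo: the helper names and the timestamp naming scheme are assumptions. The one thing taken from Elasticsearch itself is that all actions in a single `POST /_aliases` request are applied atomically.

```python
from datetime import datetime, timezone
from typing import Optional


def backing_index_name(alias: str, now: datetime) -> str:
    """Return a timestamped backing-index name (the naming scheme is an assumption)."""
    return f"{alias}-{now.strftime('%Y%m%d%H%M%S')}"


def alias_swap_actions(alias: str, old_index: Optional[str], new_index: str) -> dict:
    """Build the body for Elasticsearch's POST /_aliases.

    Both actions travel in one request, so the alias never points at
    zero indexes and searches keep working during the switch.
    """
    actions = []
    if old_index is not None:
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


if __name__ == "__main__":
    new_index = backing_index_name("english-mxbai", datetime.now(timezone.utc))
    print(alias_swap_actions("english-mxbai", "english-mxbai-20260515000000", new_index))
```

On the first build there is no previous backing index, so `old_index` is `None` and the body contains only the `add` action.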

---

## Local development setup

### Prerequisites

- Docker + Docker Compose
- [Ollama](https://ollama.com) installed and running on your machine

### 1. Configure environment

```bash
cp .env.sample .env
```

Semantic search is on by default (`SEMANTIC_ENABLED=true`). Set it to `false` for lexical-only search without running Ollama. `OLLAMA_URL` defaults to `http://host.docker.internal:11434`, which resolves on Docker Desktop (Mac/Windows), so leave it unset there. On Linux, set `OLLAMA_URL` explicitly to an address where the container can reach Ollama on the host.

To offload index-time embedding to a HuggingFace Dedicated Inference Endpoint (recommended for prod — orders of magnitude faster on a small GPU than Ollama on a CPU instance), also set `HUGGING_FACE_KEY` and `HF_DEDICATED_URL` in `.env`. The endpoint must run [TEI](https://github.com/huggingface/text-embeddings-inference) with `mixedbread-ai/mxbai-embed-large-v1`. Leaving either var unset falls back to embedding via Ollama at index time too.

### 2. Pull the model

```bash
ollama pull mxbai-embed-large
```

### 3. Start the stack

```bash
docker compose up --build
```

Flask is exposed on **port 5000**.

### 4. Build the indexes

```
http://localhost:5000/index?password=index123
```

This reads all hadiths from MySQL and builds **both** the lexical and semantic indexes by default — that's almost always what you want. Embedding ~48k English hadiths takes ~9 min via the HF Dedicated Endpoint (or considerably longer through Ollama if no remote endpoint is configured).

To build a subset, pass `targets=` (comma-separated):
```
http://localhost:5000/index?password=index123&targets=lexical # lexical only
http://localhost:5000/index?password=index123&targets=mxbai # one semantic model
http://localhost:5000/index?password=index123&targets=lexical,mxbai # both (same as default)
```
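
One way to parse and validate the `targets=` value is sketched below. It is a hedged illustration, not the handler in `main.py`: `KNOWN_TARGETS` and `parse_targets` are hypothetical names. It rejects unknown keys instead of silently falling back to building everything.

```python
# Hypothetical set of valid targets, mirroring the examples above.
KNOWN_TARGETS = ("lexical", "mxbai")


def parse_targets(raw):
    """Turn the raw ?targets= string into a validated list of target keys.

    A missing or empty value means "build everything". Unknown keys raise
    ValueError so a typo cannot trigger a full build by accident.
    """
    requested = [t.strip().lower() for t in (raw or "").split(",") if t.strip()]
    if not requested:
        return list(KNOWN_TARGETS)
    unknown = [t for t in requested if t not in KNOWN_TARGETS]
    if unknown:
        raise ValueError("unknown targets: " + ", ".join(unknown))
    # Drop duplicates while keeping the order the caller asked for.
    return list(dict.fromkeys(requested))


if __name__ == "__main__":
    print(parse_targets(None))
    print(parse_targets("mxbai, lexical"))
```

In a Flask view, the `ValueError` would typically be turned into a 400 response.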

To force a full rebuild instead of incremental:
```
http://localhost:5000/index?password=index123&rebuild=true
```

Check index status (doc counts):
```
http://localhost:5000/index/status
```
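
For scripted rebuilds, the `/index` URLs shown above can be assembled with the standard library. This is a small sketch: `index_url` is a hypothetical helper and the base URL and password are the local-dev sample values.

```python
from urllib.parse import urlencode


def index_url(base, password, targets=None, rebuild=False):
    """Build an /index URL with the optional targets= and rebuild= parameters."""
    params = {"password": password}
    if targets:
        params["targets"] = ",".join(targets)
    if rebuild:
        params["rebuild"] = "true"
    # safe="," keeps the comma-separated targets list readable in the URL.
    return f"{base}/index?{urlencode(params, safe=',')}"


if __name__ == "__main__":
    print(index_url("http://localhost:5000", "index123", targets=["lexical", "mxbai"]))
    print(index_url("http://localhost:5000", "index123", rebuild=True))
```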

---

## Production deployment

Production uses `docker-compose.prod.yml` directly. Key differences from local:
- **No MySQL service** — connects to the existing external DB via env vars
- **uwsgi** instead of Flask dev server, exposed on **port 7650**
- **Persistent ES data** in a named Docker volume (`es-data`)
- **Explicit ES JVM memory limits** (`-Xms600m -Xmx1g`)

### 1. Configure environment

```bash
cp .env.sample .env
```

Fill in production values — at minimum:

```env
MYSQL_HOST=<prod db host>
MYSQL_USER=<user>
MYSQL_PASSWORD=<password>
MYSQL_DATABASE=hadithdb

ELASTIC_PASSWORD=<strong password>
INDEXING_PASSWORD=<strong password>

SEMANTIC_ENABLED=true
```

### 2. Ollama on Linux

Install [Ollama](https://ollama.com) on the host and pull the model before starting the stack:

```bash
ollama pull mxbai-embed-large
```

`host.docker.internal` resolves out of the box only on Docker Desktop (Mac/Windows), not on Linux. The prod compose file maps it to `host-gateway` via `extra_hosts`, so the hostname resolves on Linux too and the default `OLLAMA_URL` works without any extra `.env` changes.

### 3. Start the stack

```bash
docker compose -f docker-compose.prod.yml up -d --build
```

### 4. Build the indexes

The prod stack is exposed on **port 7650**. Builds both lexical and semantic by default:

```
http://<server>:7650/index?password=<INDEXING_PASSWORD>
```

Add `&targets=lexical` or `&targets=mxbai` to build a subset.

Check index status:
```
http://<server>:7650/index/status
```

---

## Embedding model

| Key | Model | Query-time | Index-time | Dimensions |
|---|---|---|---|---|
| `mxbai` | mxbai-embed-large | Ollama (host) | HF Dedicated Endpoint (optional) → else Ollama | 1024 |

Queries are always embedded via **Ollama on the host machine** (not inside Docker). The container reaches it at `http://host.docker.internal:11434` via ES 8.16's OpenAI-compatible inference endpoint. Index-time embedding is offloaded to a remote TEI endpoint when both `HUGGING_FACE_KEY` and `HF_DEDICATED_URL` are set: the indexer fetches vectors over HTTP and ships them inline with the bulk payload (ES's `semantic_text` accepts pre-populated chunks and skips its own inference call). Vectors from TEI and Ollama for the same model are near-identical rather than bit-identical (cosine similarity ≈ 0.9999), so queries can match docs embedded by either side.
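
A compatibility check of the kind behind the cosine figure above could look like this. It is a sketch under assumptions: `same_embedding_space` and its `0.999` threshold are illustrative, and the two vectors would come from embedding the same text through each backend.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


def same_embedding_space(a, b, threshold=0.999):
    """True when two embeddings of the same text agree closely enough.

    The threshold is an assumption chosen for illustration.
    """
    return cosine_similarity(a, b) >= threshold


if __name__ == "__main__":
    # Toy 3-dimensional stand-ins for two 1024-dimensional embeddings.
    print(same_embedding_space([1.0, 2.0, 3.0], [1.001, 2.0, 2.999]))
```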

Per-run tuning via env vars: `HF_DEDICATED_CONCURRENCY` (default 4), `HF_DEDICATED_BATCH_SIZE` (default 16, must keep `batch × max_input_length ≤ TEI's max_batch_tokens`), `HF_DEDICATED_RPM` (default -1, disabled).
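
The batch-size constraint above is simple arithmetic and can be checked before a run. In this sketch the function names are hypothetical, and the token limits in the usage lines are placeholder values, not settings read from any endpoint.

```python
def batch_fits(batch_size, max_input_length, max_batch_tokens):
    """True when a full batch of maximum-length inputs stays within the token budget."""
    return batch_size * max_input_length <= max_batch_tokens


def max_safe_batch_size(max_input_length, max_batch_tokens):
    """Largest batch size that still satisfies the constraint."""
    if max_input_length <= 0 or max_batch_tokens <= 0:
        raise ValueError("limits must be positive")
    return max_batch_tokens // max_input_length


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    # Placeholder limits: substitute the values your endpoint is configured with.
    print(batch_fits(16, 512, 16384))
    print(max_safe_batch_size(512, 16384))
```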

### Adding a model

1. Add an entry to `EMBEDDING_MODELS` in `main.py` — copy the mxbai entry as a template (~10 lines).
2. Pull the model on the Ollama host: `ollama pull your-model-name`.
3. Hit `/index?password=...&targets=newkey` to build its index. (`/index` with no `targets=` will pick it up too, alongside lexical and the other semantic models.)
4. Add the alias name to `SEMANTIC_INDEXES` in `tests/batch_search.py`.
5. If it should be the default for `/search?mode=semantic` without a `&model=` param, point `DEFAULT_SEMANTIC_MODEL` at the new key.

`SEMANTIC_ENABLED` is a single global toggle — you don't add a per-model env var.

---

## Search modes

| Mode | What it does |
|---|---|
| `lexical` | BM25 full-text search with collection boosts. Fast, exact keyword matching. Default. |
| `semantic` | Embedding similarity via HNSW approximate nearest-neighbor. Finds conceptually related hadiths even without keyword overlap. |

Mode is passed as a query parameter:
```
/english/search?q=prayer&mode=semantic
/english/search?q=prayer&mode=lexical
```

`mode=semantic` uses the model named in `DEFAULT_SEMANTIC_MODEL` (currently `mxbai`) when no `&model=` is supplied. Pass `&model=<key>` to pick a different enabled model.
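
The mode and model resolution described above might be modeled like this. It is a hedged sketch only: the commit list mentions a `SearchMode` enum and `DEFAULT_SEMANTIC_MODEL`, but the dictionary layout and `resolve_index` are assumptions made for illustration.

```python
from enum import Enum


class SearchMode(str, Enum):
    LEXICAL = "lexical"
    SEMANTIC = "semantic"


# Hypothetical registry: model key -> ES alias.
ENABLED_MODELS = {"mxbai": "english-mxbai"}
DEFAULT_SEMANTIC_MODEL = "mxbai"
LEXICAL_ALIAS = "english-lexical"


def resolve_index(mode=None, model=None):
    """Pick the ES alias to query for a given ?mode= and optional &model=."""
    # SearchMode(...) raises ValueError for an unrecognized mode string.
    search_mode = SearchMode(mode) if mode else SearchMode.LEXICAL
    if search_mode is SearchMode.LEXICAL:
        return LEXICAL_ALIAS
    key = model or DEFAULT_SEMANTIC_MODEL
    if key not in ENABLED_MODELS:
        raise ValueError(f"unknown or disabled model: {key}")
    return ENABLED_MODELS[key]


if __name__ == "__main__":
    print(resolve_index())
    print(resolve_index("semantic"))
```

Returning early for the lexical case keeps the semantic branch free of mode checks.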

---

## API endpoints

| Endpoint | Description |
|---|---|
| `GET /<language>/search?q=...` | Main search endpoint (consumed by PHP website) |
| `GET /index?password=...` | Build/rebuild ES indexes from MySQL |
| `GET /index/status` | Doc counts for all indexes |

---

## Docker Compose files

| File | When to use |
|---|---|
| `docker-compose.yml` | Local development. `docker compose up --build`. |
| `docker-compose.prod.yml` | Production. Run with `-f docker-compose.prod.yml`. Uses uwsgi, persistent ES data volume, explicit JVM memory limits, no MySQL service. |

**Why Elasticsearch has a fixed IP** (`172.31.250.10`): at high request rates, Docker's embedded DNS resolver becomes a bottleneck and throws `EAI_AGAIN` errors. Hardcoding the IP in `/etc/hosts` via `extra_hosts` makes every lookup instant.

**Observability services** (`es-exporter`, `alloy`) ship ES metrics and logs to Grafana Cloud. They require Grafana Cloud credentials in `.env` — if you don't have them, these services will fail to connect but won't break the rest of the stack.

---

## Batch evaluation

`tests/batch_search.py` runs a fixed set of queries across lexical and semantic and produces a CSV and markdown report for side-by-side comparison.

```bash
docker exec search-web-1 python3 /code/tests/batch_search.py
```

Outputs (`batch_results.csv`, `batch_report.md`) land in the repo root — the dev compose mounts `./:/code`, so files the script writes to `/code/` inside the container show up on the host immediately. No `docker cp` needed.

The script runs inside the container because ES is not exposed to the host — it's only reachable at `http://elasticsearch:9200` from within the Docker network.

Edit `QUERIES` in `tests/batch_search.py` to change which queries are tested.

**Note:** always use commas between query strings in the list. Python silently concatenates adjacent string literals without a comma, producing wrong queries with no error.
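
The pitfall in the note above is easy to reproduce. The query strings here are made up for illustration and `check_query_count` is a hypothetical guard, not part of `tests/batch_search.py`.

```python
QUERIES_BUGGY = [
    "prayer times",
    "fasting in Ramadan"  # missing comma: this literal merges with the next one
    "charity",
]

QUERIES_FIXED = [
    "prayer times",
    "fasting in Ramadan",
    "charity",
]


def check_query_count(queries, expected):
    """Fail loudly if the list has fewer entries than intended."""
    if len(queries) != expected:
        raise ValueError(f"expected {expected} queries, got {len(queries)}")


if __name__ == "__main__":
    print(QUERIES_BUGGY)
    check_query_count(QUERIES_FIXED, 3)
```

Keeping a trailing comma after every entry, including the last, makes the mistake harder to introduce when lines are reordered.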

---

## Formatting

To run:
docker-compose up --build
docker-compose -f docker-compose.prod.yml -d up --build
Format Python code with `uv format` before committing.
5 changes: 5 additions & 0 deletions docker-compose.prod.yml
@@ -18,6 +18,11 @@ services:
    # consulted before DNS, so every lookup is instant.
    extra_hosts:
      - "elasticsearch:172.31.250.10"
      # host.docker.internal resolves automatically on Docker Desktop (Mac/Windows)
      # but not on Linux. host-gateway is Docker's built-in alias for the host's
      # IP on the bridge network, making host.docker.internal work on Linux too.
      # This is what lets the container reach Ollama running on the host.
      - "host.docker.internal:host-gateway"
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:${ES_STACK_VERSION}
    container_name: elasticsearch
1 change: 0 additions & 1 deletion docker-compose.yml
@@ -72,7 +72,6 @@ services:
      - GC_PROM_USER=${GC_PROM_USER}
      - GC_PROM_PASSWORD=${GC_PROM_PASSWORD}
      - DEPLOY_ENV=${DEPLOY_ENV:-local}

networks:
  default:
    driver: bridge