End-to-end book recommender on the UCSD Goodreads public datasets (228M user-book interactions, 2M+ books). Implements three approaches — collaborative KNN, content-based KNN, and a hybrid — with a reproducible offline evaluation showing pure collaborative filtering beats the textbook hybrid by ~50% on NDCG@10 on this dataset.
- Polars-streamed ingest turns 4GB of CSV plus 2GB of gzipped JSON into a 50k × 770k sparse interaction matrix in roughly a minute (sketched after this list).
- Three recommenders share infrastructure: an item-based KNN over the sparse interaction matrix, a TF-IDF + author + numeric content KNN, and a hybrid that linearly combines the two.
- Offline evaluation framework (precision / recall / NDCG / MAP / hit-rate at K, catalog coverage, intra-list diversity) with deterministic leave-one-out sampling. Generated report: docs/EVALUATION.md.
- Streamlit app with fuzzy title search, score breakdowns, genre distributions, and a metrics dashboard. Deployable via Docker.
- Tests + CI: 68 unit tests, GitHub Actions across Python 3.10 and 3.12, ruff and black on every commit.
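The ingest sketched here is two steps: a lazy Polars scan that streams the raw CSV to parquet, then a CSR build over integer-coded IDs. A minimal sketch of the idea (file paths and column names are assumptions; the real logic lives in src/serialize_data.py and prepare_data.py):

```python
import numpy as np
import polars as pl
from scipy.sparse import csr_matrix, save_npz

# Stream the raw interactions CSV to parquet without loading it into RAM.
# Path and column names are illustrative, not the project's actual schema.
(
    pl.scan_csv("data/goodreads_interactions.csv")  # lazy: nothing is read yet
    .filter(pl.col("is_read") == 1)                 # keep finished reads only
    .select("user_id", "book_id", "rating")
    .sink_parquet("data/interactions.parquet")      # streaming write
)

# Re-code raw IDs to dense 0..n-1 indices and build a books × users CSR matrix.
df = pl.read_parquet("data/interactions.parquet")
book_ids, book_idx = np.unique(df["book_id"].to_numpy(), return_inverse=True)
user_ids, user_idx = np.unique(df["user_id"].to_numpy(), return_inverse=True)
matrix = csr_matrix(
    (np.ones(len(df), dtype=np.float32), (book_idx, user_idx)),
    shape=(len(book_ids), len(user_ids)),
)
save_npz("data/interactions.npz", matrix)
```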
| Recommender | NDCG@10 | Hit-Rate@10 | Coverage |
|---|---|---|---|
| Collaborative KNN | 0.333 | 0.758 | 0.114 |
| Hybrid (0.7 collab / 0.3 content) | 0.271 | 0.634 | 0.129 |
| Hybrid (0.6 / 0.4) — production default | 0.228 | 0.564 | 0.133 |
| Content KNN | 0.010 | 0.070 | 0.068 |
The content signal hurts ranking quality at every weight setting; the
production hybrid weights are due for re-tuning. Full discussion in
MODEL.md.
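For reference, the hybrid blend is a plain weighted sum of the two models' normalised score vectors. A minimal sketch (names and the normalisation choice are illustrative; the real implementation is src/hybrid_recommender.py):

```python
import numpy as np

def hybrid_scores(
    collab: np.ndarray, content: np.ndarray, collab_weight: float = 0.7
) -> np.ndarray:
    """Blend two per-item score vectors with a linear weight.

    Min-max normalising first (an illustrative choice) keeps one signal
    from dominating purely because of its scale.
    """
    def norm(s: np.ndarray) -> np.ndarray:
        span = s.max() - s.min()
        return (s - s.min()) / span if span > 0 else np.zeros_like(s)

    return collab_weight * norm(collab) + (1.0 - collab_weight) * norm(content)

# collab_weight=1.0 recovers the pure collaborative model, which is what
# the benchmark above suggests this dataset rewards.
```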
Requires uv:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh  # macOS / Linux
# Windows PowerShell:
# powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
```

Then:

```bash
# 1. Install dependencies (project + dev + app groups)
uv sync --all-groups
# 2. Download UCSD Goodreads data (~6GB; one-time)
uv run python src/downloader.py
# 3. JSON / CSV → parquet (polars streaming)
uv run python src/serialize_data.py --batch_size=500_000
# 4. Build the sparse interaction matrix + ID mappings
uv run python prepare_data.py
# Smaller smoke-test sample:
# uv run python prepare_data.py --min_reads=50 --top_books=5000 --top_users=100000
# 5. (Optional) Build content features for the hybrid recommender
uv run python src/extract_metadata.py
uv run python src/create_content_features.py
# 6. Run the Streamlit app
uv run --group app streamlit run app/app_enhanced.py
# → http://localhost:8501
```

The pipeline end to end:

```
┌─────────────────────────┐
│ UCSD Goodreads dataset │
│ (~6GB CSV / NDJSON) │
└────────────┬────────────┘
│ src/downloader.py
▼
┌──────────────────────────────┐
│ Polars streaming ingest │ src/serialize_data.py
│ CSV/NDJSON → parquet │
└────────────┬─────────────────┘
│
▼
┌────────────────────────────┐
│ Sparse books × users CSR │ prepare_data.py
│ + ID mappings (parquet) │
└─────┬───────────────┬──────┘
│ │
│ │ src/extract_metadata.py
│ │ src/create_content_features.py
▼ ▼
┌──────────────────────┐ ┌────────────────────────┐
│ Collaborative KNN │ │ Content KNN │
│ (sparse cosine) │ │ (TF-IDF + authors + │
│ │ │ scaled numerics) │
└─────────┬────────────┘ └──────────┬─────────────┘
└──────────────┬───────────┘
▼
┌─────────────────────────┐
│ Hybrid (linear combine) │ src/hybrid_recommender.py
│ + Streamlit app │ app/app_enhanced.py
│ + offline evaluation │ evaluate.py
└─────────────────────────┘
```
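The collaborative model is cosine KNN over rows of the interaction matrix. A sketch of the core query, assuming the matrix saved by prepare_data.py (paths and names illustrative; see src/knn_recommender_sparse.py for the real thing):

```python
import numpy as np
from scipy.sparse import load_npz
from sklearn.neighbors import NearestNeighbors

interactions = load_npz("data/interactions.npz")  # books × users CSR (assumed path)

# Brute-force cosine KNN works directly on sparse input and is fast enough
# at ~50k items; no dense conversion needed.
knn = NearestNeighbors(metric="cosine", algorithm="brute")
knn.fit(interactions)

def similar_books(book_row: int, k: int = 10) -> tuple[np.ndarray, np.ndarray]:
    """Indices and cosine similarities of the k books nearest to book_row."""
    distances, indices = knn.kneighbors(interactions[book_row], n_neighbors=k + 1)
    # Drop the query book itself (distance 0); similarity = 1 - cosine distance.
    return indices[0][1:], 1.0 - distances[0][1:]
```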
Repository layout:

```
.
├── prepare_data.py # Sparse interaction matrix + ID mappings
├── evaluate.py # End-to-end recommender benchmark
├── src/
│ ├── downloader.py # UCSD data download
│ ├── serialize_data.py # CSV/NDJSON → parquet (polars streaming)
│ ├── extract_metadata.py # Per-book metadata extraction
│ ├── create_content_features.py # TF-IDF + authors + numeric features
│ ├── knn_recommender_sparse.py # Collaborative KNN on sparse matrix
│ ├── hybrid_recommender.py # Collab + content linear combination
│ ├── train_spotlight.py # ALS matrix factorization (implicit lib)
│ └── evaluation.py # Ranking metrics + holdout sampler
├── app/
│ ├── app.py # Basic KNN UI
│ └── app_enhanced.py # Hybrid UI (recommended)
├── tests/ # Unit tests (run with `uv run pytest`)
├── docs/EVALUATION.md # Generated benchmark report
├── notebooks/ # EDA + training viz
├── Dockerfile, docker-compose.yml
└── pyproject.toml, uv.lock
```
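The content side (src/create_content_features.py) stacks three feature blocks into one sparse matrix that the content KNN then indexes with the same cosine machinery. A hedged sketch, with paths and column names assumed rather than taken from the project:

```python
import polars as pl
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler

books = pl.read_parquet("data/book_metadata.parquet")  # assumed path and schema

# 1. TF-IDF over book descriptions (sparse).
tfidf = TfidfVectorizer(max_features=20_000, stop_words="english")
text_block = tfidf.fit_transform(books["description"].fill_null("").to_list())

# 2. One-hot author indicators; assumes `authors` is a list-of-strings column.
authors_block = MultiLabelBinarizer(sparse_output=True).fit_transform(
    books["authors"].to_list()
)

# 3. Standard-scaled numeric features (illustrative column names).
numeric = books.select("average_rating", "num_pages").fill_null(0).to_numpy()
numeric_block = csr_matrix(StandardScaler().fit_transform(numeric))

# One wide sparse matrix, ready for the same cosine-KNN index as above.
features = hstack([text_block, authors_block, numeric_block]).tocsr()
```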
After running the data preparation steps above:

```bash
uv run python evaluate.py --n-users 500 --seed 42
```

Writes docs/EVALUATION.md and data/eval_results.json. Adjust `--n-users`, `--k`, and `--hybrid-weights` to explore the cost / precision frontier.
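With deterministic leave-one-out sampling each evaluated user has exactly one held-out positive, so NDCG@K reduces to a single log-discount term. A sketch of that computation for intuition (the project's metrics live in src/evaluation.py; this is the standard single-relevant-item formulation, not a copy of that code):

```python
import numpy as np

def ndcg_at_k(recommended: list[int], held_out: int, k: int = 10) -> float:
    """NDCG@k when exactly one item is relevant (leave-one-out).

    The ideal DCG is 1, so the metric collapses to 1 / log2(rank + 1)
    when the held-out item appears at `rank` (1-indexed) in the top k,
    and 0 otherwise.
    """
    top_k = recommended[:k]
    if held_out not in top_k:
        return 0.0
    rank = top_k.index(held_out) + 1
    return 1.0 / float(np.log2(rank + 1))

# Averaged over the sampled users this gives the NDCG@10 column above;
# hit-rate@10 is the same check with the log discount dropped.
```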
```bash
# Run the full test suite (currently 77 tests)
uv run pytest

# Lint, format, and type checks
uv run ruff check .
uv run black --check .
uv run mypy
```

mypy is configured in pyproject.toml to strict-check the modules
that ship with full type annotations (src/evaluation.py,
src/matching.py, evaluate.py). The rest of the tree is excluded for
now and is being brought up to standard incrementally.
CI runs the same matrix on every push and pull request — see
.github/workflows/ci.yml.
For a deeper development walkthrough, see CONTRIBUTING.md.
For an at-a-glance pipeline reference, see QUICK_START.md.
```bash
# Build and run the hybrid Streamlit app on http://localhost:8501.
docker-compose up --build

# docker-compose.yml mounts ./data:/app/data:ro so the prepared
# parquet/npz files are visible to the container without baking
# them into the image.
```

If src/downloader.py fails (Google Drive throttling, etc.), download these files manually and save them to data/:
- UCSD Book Graph — Mengting Wan and Julian McAuley, Item Recommendation on Monotonic Behavior Chains, RecSys 2018.
- Reference reading: Collaborative Filtering Deep Dive (Jeremy Howard).