Biblios is a high-performance book recommendation system that transcends traditional keyword-based search. By leveraging Large Language Models (LLMs) and Vector Databases, it understands the deeper context, themes, and emotional resonance of a narrative to provide truly relevant discoveries.
- Language: Python 3.13
- AI/LLM Frameworks: LangChain, Hugging Face Transformers
- Embedding Model: `sentence-transformers/all-MiniLM-L6-v2` (local execution)
- Vector Database: ChromaDB (via `langchain-chroma`)
- Data Science: Pandas, NumPy, Matplotlib, Seaborn
- UI Framework: Gradio
- Secrets Management: `python-dotenv`
The project is structured into five distinct engineering phases:
- Exploratory Data Analysis (EDA): Ran a Spearman correlation analysis, visualized as heatmaps, to check that missing values were not systematically tied to other fields.
- Data Cleaning: Engineered features like `age_of_book` and `title_and_subtitle` while filtering for semantic density (25-word minimum description threshold).
- Tagging: Prepended unique ISBN identifiers to descriptions to allow for precise metadata retrieval after vector matching.
- Semantic Embeddings: Swapped OpenAI for a locally hosted Hugging Face model (`all-MiniLM-L6-v2`) to generate high-dimensional vectors for 5,000+ books.
- Indexing: Utilized ChromaDB for efficient K-Nearest Neighbor (KNN) retrieval.
- Query Logic: Implemented a similarity search that embeds a natural-language query (e.g., "a story about redemption in the Arctic") as a vector and returns the books whose description vectors lie closest to it.
- Zero-Shot Learning: Implemented `facebook/bart-large-mnli` to classify books into categories (Fiction vs. Non-Fiction) without the need for pre-labeled training data.
- Data Augmentation: Used model inference to fill gaps in the original dataset's labeling.
- Emotion Mapping: Leveraged a fine-tuned RoBERTa model to detect sentence-level sentiment across seven categories (Joy, Fear, Sadness, etc.).
- Scoring: Stored maximum probability scores to allow users to sort recommendations by their desired "vibe."
- Interactive UI: A "Glass" themed dashboard that provides a polished gallery view of book covers and metadata.
- Dynamic Controls: Users can combine semantic queries with category filters and emotional tone sorting.
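
The Scoring step above can be sketched roughly as follows. This is a minimal illustration, assuming the emotion classifier returns one label-to-probability mapping per sentence. The function names `max_emotion_scores` and `sort_by_emotion`, the input shape, and the sample numbers are hypothetical and not taken from the project code.

```python
from collections import defaultdict


def max_emotion_scores(sentence_scores):
    """Keep, for each emotion label, the highest probability seen in any sentence.

    sentence_scores: list of dicts, one per sentence, mapping label -> probability.
    (Hypothetical input shape; the real classifier output may differ.)
    """
    best = defaultdict(float)
    for scores in sentence_scores:
        for label, prob in scores.items():
            best[label] = max(best[label], prob)
    return dict(best)


def sort_by_emotion(books, emotion):
    """Sort book records so the strongest match for `emotion` comes first."""
    return sorted(books, key=lambda b: b["emotions"].get(emotion, 0.0), reverse=True)


# Illustrative data only: scores for two sentences of one description.
per_sentence = [
    {"joy": 0.10, "fear": 0.70, "sadness": 0.20},
    {"joy": 0.60, "fear": 0.05, "sadness": 0.35},
]
book_scores = max_emotion_scores(per_sentence)
```

Taking the maximum rather than the mean means a single strongly emotional sentence is enough to surface a book under that tone, which fits sorting by a desired "vibe".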
Building Biblios involved solving several real-world engineering obstacles:
- The Persistence Problem: Encountered duplicate entries in ChromaDB during re-indexing. Solved by wiping and resetting the collection manually before each indexing run.
- Data Parsing Issues: Fixed a `ValueError` during ISBN retrieval where CSV export quotes were interfering with integer conversion.
- Version Mismatches: Resolved `langchain` and `pandas` deprecation warnings (specifically the `sep` argument in `to_csv` and `chunk_size` constraints in `CharacterTextSplitter`) by upgrading to modern syntax.
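
The ISBN parsing issue above can be illustrated with a short sketch. It assumes each stored description begins with the ISBN followed by a space, and that a CSV round trip may leave the text wrapped in double quotes. The helper name `extract_isbn` and the sample string are hypothetical, not the project's actual code or data.

```python
def extract_isbn(tagged_description: str) -> int:
    """Return the leading ISBN of a tagged description as an integer.

    Assumes the format "<isbn> <description text>". Surrounding quotes left
    over from CSV export are stripped first, because calling int() on a
    token that still starts with a quote character raises ValueError.
    """
    cleaned = tagged_description.strip().strip('"').strip()
    first_token = cleaned.split(maxsplit=1)[0]
    return int(first_token)


# Illustrative value only; not a record from the project's dataset.
sample = '"9780000000002 A crew is stranded on the polar ice."'
isbn = extract_isbn(sample)
```

The returned integer can then be matched against the ISBN column of the metadata table to recover title, cover, and other fields for display.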
- Clone the Repo:

      git clone https://github.com/yourusername/biblios.git
      cd biblios

- Install Dependencies:

      pip install pandas seaborn matplotlib langchain-huggingface langchain-chroma sentence-transformers python-dotenv gradio

- Environment Variables: Create a `.env` file (see `.env.example` for the template). Note: This project is configured to run Hugging Face embeddings locally, reducing API dependency.
Note: This project was developed as a deep dive into Semantic Search and Agentic AI workflows.