This project implements a Retrieval-Augmented Generation (RAG) pipeline for efficient document processing and knowledge retrieval. It extracts text and tables from PDFs using the Unstructured library, stores raw PDFs in Redis, and indexes extracted embeddings in PGVector for semantic search. The system leverages MultiVector Retriever for context retrieval before querying an LLM (Gemini model).
Live Site: 🌐 smart-doc-assistant.streamlit.app
- Unstructured Document Processing: Extracts text and tables from PDFs.
- Redis for Raw Storage: Stores and retrieves raw PDFs efficiently, to implement persistent storage.
- PGVector for Vector Storage: Indexes and retrieves high-dimensional embeddings for similarity search.
- MultiVector Retriever: Optimized for retrieving contextual information from multiple sources.
- LLM Integration: Uses a Gemini model to generate responses based on retrieved context.
- Python
unstructuredpgvectorredislangchaingemini-flash
- Redis: For raw PDF storage
- PostgreSQL + PGVector: For embeddings storage
- Gemini-Flash (via Gemini API Key)
