A minimal retrieval-augmented question answering system built with sentence embeddings, FAISS vector search, and FLAN-T5 generation.
Large language models often answer questions using only the knowledge stored in their parameters, which can be incomplete or out of date.
Retrieval-augmented generation (RAG) addresses this limitation by retrieving relevant external information and grounding the answer in that evidence.
This project implements a small end-to-end RAG pipeline that:
- loads a collection of documents
- splits documents into smaller chunks
- embeds chunks using a sentence embedding model
- stores embeddings in a FAISS vector index
- retrieves the most relevant chunks for a query
- generates an answer using FLAN-T5 conditioned on the retrieved context
The goal of the project was to understand how retrieval and generation interact in modern NLP systems.
The system loads a small text corpus from documents.txt.
Documents are split into smaller chunks so retrieval can focus on relevant pieces of information rather than entire documents.
Chunking improves retrieval specificity but introduces trade-offs when useful context spans multiple chunks.
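One common way to soften the trade-off described above is to let neighbouring chunks overlap, so that a sentence near a boundary appears in both. The following is a minimal sketch of word-based chunking with overlap. The function name `chunk_words` and the `chunk_size` and `overlap` parameters are illustrative assumptions, not necessarily how main.py splits documents.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `chunk_size` words.

    Consecutive windows share `overlap` words. Hypothetical helper:
    the sizes are illustrative, not the values used in main.py.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


example_chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
print(example_chunks)
```

Larger overlap reduces the chance of splitting useful context, at the cost of more chunks to embed and more near-duplicate retrieval results.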
Each chunk is embedded using:
sentence-transformers/all-MiniLM-L6-v2
This model produces 384-dimensional dense vectors that place semantically similar text near each other in embedding space.
Embeddings are L2 normalized and stored in a FAISS index using:
faiss.IndexFlatIP
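Because the vectors have unit length, their inner product equals their cosine similarity, and IndexFlatIP performs an exact (brute-force) search, so the index returns the chunks with the highest cosine similarity to the query.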
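The relationship between L2 normalization, inner product, and cosine similarity can be checked on a toy example. This sketch uses plain Python lists rather than FAISS or NumPy so it runs without dependencies. The helper names and the two 2-D vectors are made up for illustration and are not real embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two unnormalized vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


# Toy vectors standing in for a query embedding and a chunk embedding.
query = [3.0, 4.0]
chunk = [4.0, 3.0]

ip_on_normalized = inner_product(l2_normalize(query), l2_normalize(chunk))
cos = cosine_similarity(query, chunk)

# Both values agree up to floating-point rounding.
print(math.isclose(ip_on_normalized, cos))
```

An inner-product index scores each stored vector against the query in this way, which is why normalizing before indexing turns it into a cosine-similarity search.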
Given a query:
- The query is embedded
- The FAISS index retrieves the top-k most similar chunks
- Retrieved chunks are combined into a context block
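The retrieval steps above can be sketched without FAISS as a brute-force top-k search over inner products. This is a simplified stand-in for what a flat inner-product index computes, not the project's actual code. The names `retrieve_top_k` and `build_context`, the placeholder chunk texts, and the hand-written 2-D vectors are assumptions for illustration.

```python
import heapq


def retrieve_top_k(query_vec, chunk_vecs, chunks, k=3):
    """Return the k (score, chunk) pairs with the highest inner product, best first."""
    scored = (
        (sum(q * c for q, c in zip(query_vec, vec)), text)
        for vec, text in zip(chunk_vecs, chunks)
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def build_context(results):
    """Join the retrieved chunk texts into one context block, best match first."""
    return "\n".join(text for _, text in results)


# Toy data: placeholder texts with unit-length 2-D vectors instead of real embeddings.
chunks = ["chunk about FAISS", "chunk about mountains", "chunk about rivers"]
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query_vec = [1.0, 0.0]

results = retrieve_top_k(query_vec, chunk_vecs, chunks, k=2)
context = build_context(results)
print(context)
```

With real embeddings the scoring loop would be replaced by a search call on the FAISS index, but the shape of the result (ranked chunks joined into a context block) stays the same.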
The retrieved context is passed to:
google/flan-t5-small
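The model generates an answer conditioned on the retrieved evidence, which encourages responses grounded in the document corpus. Grounding is not guaranteed: a small model can still ignore or misread the context it is given.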
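How the retrieved context reaches the generator depends on the prompt format. The sketch below shows one plausible way to assemble a prompt string from the retrieved chunks and the question. The template wording and the function name `build_prompt` are assumptions, and the actual prompt used in main.py may differ. Tokenization and the call to the model are left out.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble an instruction-style prompt from retrieved chunks and a question.

    Hypothetical template: the wording is illustrative, not taken from main.py.
    """
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What is FAISS used for?",
    [
        "FAISS is a library for efficient similarity search over dense vectors",
        "It is often used in retrieval-augmented generation systems",
    ],
)
print(prompt)
```

Keeping the instruction explicit and the context short tends to matter more for a small model, since it has limited capacity to pick the relevant sentence out of a long input.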
Query: What is FAISS used for?
Retrieved chunks:
- FAISS is a library for efficient similarity search over dense vectors
- It is often used in retrieval-augmented generation systems
Generated answer:
similarity search
Query: Where is Mount Everest?
Generated answer:
Himalayas
Query: What is Python used for?
Generated answer:
machine learning, web development, and automation
Query: Which river flows into the Mediterranean?
Generated answer:
The Nile
Smaller chunks improve retrieval precision but can separate useful context across multiple segments.
Even when the correct information exists in the corpus, the most relevant chunk is not always ranked first.
Retrieving multiple chunks helps mitigate ranking errors and increases the likelihood that the correct evidence appears in context.
The answer generator can only produce accurate responses if relevant information is retrieved first.
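The observation that retrieving several chunks mitigates ranking errors can be made measurable with a hit rate at k: the fraction of queries whose relevant chunk appears among the top k results. The sketch below is illustrative only. The function name `hit_rate_at_k` is hypothetical, and the rankings are made-up toy data, not measurements from this project.

```python
def hit_rate_at_k(ranked_ids_per_query, relevant_id_per_query, k):
    """Fraction of queries whose relevant chunk id is among the top-k ranked ids."""
    if not relevant_id_per_query:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_id_per_query)


# Toy data: for each of four queries, chunk ids ordered by retrieval score,
# and the id of the single chunk that actually contains the answer.
rankings = [[2, 0, 1], [1, 2, 0], [0, 1, 2], [2, 1, 0]]
relevant = [0, 1, 2, 1]

for k in (1, 2, 3):
    print(k, hit_rate_at_k(rankings, relevant, k))
```

On this toy data the hit rate rises as k grows, which mirrors the trade-off in practice: a larger k makes it more likely the evidence is in context, but also passes more irrelevant text to the generator.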
- Python
- SentenceTransformers
- FAISS
- NumPy
- PyTorch
- HuggingFace Transformers
- FLAN-T5-small
- Google Colab GPU
- main.py: end-to-end RAG pipeline including chunking, embedding, retrieval, and generation
- documents.txt: example document corpus
- requirements.txt: Python dependencies
Possible extensions include:
- larger document collections
- improved chunking strategies
- saving and loading FAISS indexes
- evaluation metrics for retrieval quality
- reranking methods for improved retrieval accuracy
- interactive QA interfaces