asmitamitra/Traditional-RAG


Traditional RAG

A Python implementation of Retrieval-Augmented Generation (RAG) built with LangChain, ChromaDB, and language models served through the Groq API. The project shows how to index documents, retrieve the passages relevant to a query, and generate a response grounded in them.

Features

  • 📄 Multi-format Document Loading: Support for PDF and text files
  • 🔍 Vector Search: Efficient similarity search using FAISS and ChromaDB
  • 🧠 LLM Integration: Powered by Groq's fast language models
  • 📝 Embeddings: Sentence-transformers for high-quality document embeddings
  • 🔗 LangChain Integration: Built with LangChain for flexible RAG pipelines
  • 📊 Jupyter Notebooks: Ready-to-use notebooks for PDF loading and document querying

Project Structure

traditional-rag/
├── main.py                     # Main application entry point
├── pyproject.toml              # Project metadata and dependencies
├── requirements.txt            # pip dependencies
├── README.md                   # This file
├── CONTRIBUTING.md             # Contribution guidelines
├── LICENSE                     # MIT License
├── .gitignore                  # Git ignore rules
├── .env.example                # Environment variables template
│
├── data/                       # Data directory
│   ├── pdf_files/              # PDF documents for processing
│   ├── text_files/             # Text documents
│   │   ├── sample1.txt
│   │   └── sample2.txt
│   └── vector_store/           # ChromaDB vector store
│
└── notebook/                   # Jupyter notebooks
    ├── pdf_loader.ipynb        # PDF loading examples
    └── document.ipynb          # Document querying examples

Prerequisites

Installation

Using uv (Recommended)

# Clone the repository
git clone https://github.com/yourusername/traditional-rag.git
cd traditional-rag

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .

Using pip

git clone https://github.com/yourusername/traditional-rag.git
cd traditional-rag

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Configuration

  1. Create environment file:

    cp .env.example .env
  2. Add your Groq API key:

    GROQ_API_KEY=your_api_key_here

The .env file will be automatically loaded by python-dotenv.
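
To catch a missing key before the first API call, a startup check along these lines may help. This is a minimal sketch: `require_env` and `load_groq_key` are hypothetical helper names, not part of this repository, and the sketch assumes the application reads `GROQ_API_KEY` from the process environment.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not part of this repository.
    """
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Copy .env.example to .env and add your key."
        )
    return value


def load_groq_key():
    """Load .env if python-dotenv is installed, then return GROQ_API_KEY."""
    try:
        # The import is guarded so the sketch still runs without python-dotenv.
        from dotenv import load_dotenv

        load_dotenv()  # looks for a .env file and loads it into os.environ
    except ImportError:
        pass
    return require_env("GROQ_API_KEY")
```

Calling `load_groq_key()` once at the top of `main.py` or a notebook would surface a configuration problem as a readable error rather than an authentication failure later on.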

Usage

Running the Main Application

python main.py

Using Jupyter Notebooks

  1. PDF Loader Notebook: Learn how to load and process PDF files

    jupyter notebook notebook/pdf_loader.ipynb
  2. Document Query Notebook: Explore RAG querying with your documents

    jupyter notebook notebook/document.ipynb

Basic RAG Pipeline Example

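from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_groq import ChatGroq
from langchain_text_splitters import CharacterTextSplitter

load_dotenv()  # makes GROQ_API_KEY from .env available to ChatGroq

# Load documents
loader = PyPDFLoader("path/to/document.pdf")
documents = loader.load()

# Split documents into overlapping chunks
splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = splitter.split_documents(documents)

# Embed chunks with a sentence-transformers model (this model name is an example)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create vector store
vectorstore = Chroma.from_documents(docs, embedding=embeddings)

# Retrieve the chunks most similar to the question
question = "Your question here"
retrieved = vectorstore.similarity_search(question, k=4)
context = "\n\n".join(doc.page_content for doc in retrieved)

# Query with RAG: send the retrieved context to the LLM together with the question
# Groq model IDs change over time; substitute one that is currently available
llm = ChatGroq(model="mixtral-8x7b-32768")
response = llm.invoke(f"Answer using only this context:\n\n{context}\n\nQuestion: {question}")
print(response.content)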
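
To make the `chunk_size=1000, chunk_overlap=100` settings concrete, here is a simplified sliding-window sketch in plain Python. It is not LangChain's algorithm: `CharacterTextSplitter` first splits on a separator and then merges the pieces, so its chunk boundaries will differ. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# A 2500-character sample: windows start at offsets 0, 900, and 1800.
sample = "".join(str(i % 10) for i in range(2500))
pieces = chunk_text(sample)
print([len(p) for p in pieces])  # [1000, 1000, 700]
```

The overlap means the last 100 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen whole by at least one chunk.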

Key Dependencies

  • langchain & langchain-community: RAG framework and integrations
  • chromadb: Vector database for storing embeddings
  • faiss-cpu: Efficient similarity search
  • sentence-transformers: Creating embeddings from text
  • groq: Fast language model API
  • pymupdf & pypdf: PDF document loading
  • python-dotenv: Environment variable management
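
As a rough picture of what similarity search over embeddings does, the sketch below ranks documents by cosine similarity using brute force. It is illustrative only: FAISS and ChromaDB use their own index structures, the distance metric they apply depends on configuration, and the three-dimensional vectors here are made up (real sentence-transformers embeddings have hundreds of dimensions). The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Toy vectors standing in for document embeddings.
doc_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query_vector = [1.0, 0.0, 0.0]
results = top_k(query_vector, doc_vectors, k=2)
print([index for index, _ in results])  # [0, 1]
```

Brute force compares the query against every document, which is fine for a handful of files; a dedicated vector index is what keeps this fast as the collection grows.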

Environment Variables

Create a .env file in the root directory:

GROQ_API_KEY=your_groq_api_key_here

Development

Setting up for development:

# Install in editable mode with dev dependencies
uv pip install -e ".[dev]"

# Run tests
pytest tests/

# Format code
black .

# Lint
pylint src/

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Roadmap

  • Add more embedding model options
  • Implement query result caching
  • Add support for more document formats (Word, Excel, etc.)
  • Implement document metadata filtering
  • Add comprehensive test suite
  • Create CLI tool for document ingestion

Troubleshooting

API Key Issues

Vector Store Issues

  • Delete the data/vector_store/ directory to reset ChromaDB
  • Ensure you have write permissions to the data directory
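
Both steps can be combined in a small script. This is a sketch under assumptions: `reset_vector_store` is a hypothetical helper, and the store is assumed to live at `data/vector_store` relative to the project root, as in the project structure above. Deleting the directory discards all stored embeddings, so documents need to be indexed again afterwards.

```python
import os
import shutil
import tempfile
from pathlib import Path


def reset_vector_store(store_dir):
    """Delete and recreate the directory, then confirm it is writable."""
    path = Path(store_dir)
    if path.exists():
        shutil.rmtree(path)  # removes the directory and everything inside it
    path.mkdir(parents=True, exist_ok=True)
    if not os.access(path, os.W_OK):
        raise PermissionError(f"No write permission for {path}")
    return path


# Demonstration on a throwaway directory instead of the real store.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "vector_store"
    store.mkdir()
    (store / "stale.bin").write_text("old index data")
    reset_vector_store(store)
    print(list(store.iterdir()))  # []
```

For the real store, the call would be `reset_vector_store("data/vector_store")`, run from the project root.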

Dependency Issues

  • Update all dependencies: uv pip install --upgrade -r requirements.txt
  • Recreate the virtual environment: rm -rf .venv, then repeat the installation steps

Resources

Support

For issues and questions:


Made with ❤️ for the RAG community
