🤖 RAG-Powered PDF Question Answering Bot

Ask questions from any PDF — powered by LangChain · IBM Watsonx · ChromaDB


Upload any PDF. Ask questions in plain English. Get precise, context-grounded answers, powered by a production-grade RAG pipeline built on an IBM Watsonx enterprise LLM.


🎯 What This Project Does

LLMs tend to hallucinate when asked about documents they have not seen. This bot reduces that risk by grounding every answer in content retrieved from your actual PDF, so responses are tied to the source text rather than to guesswork.

PDF Upload → Chunking → Embedding (Watsonx slate-30m) → ChromaDB
                                                              │
User Query → Semantic Retrieval → RetrievalQA → IBM Watsonx LLM → Answer

✨ Features

  • 📄 Upload any PDF — no preprocessing required
  • ✂️ Automatic chunking — splits documents into semantically meaningful segments
  • 🔢 Enterprise embeddings — IBM Watsonx slate-30m for high-quality vector representations
  • 🗄️ ChromaDB vector store — fast, local semantic search over document chunks
  • 🧠 RetrievalQA pipeline — LangChain orchestrates retrieval → generation end-to-end
  • 🖥️ Gradio UI — clean, real-time Q&A interface, zero frontend code

🏗️ Architecture

┌─────────────────────────────────────────────────────────┐
│                    INDEXING PIPELINE                     │
│                                                          │
│  PDF File → PyPDF Loader → Text Splitter → Chunks       │
│                                    │                     │
│                          Watsonx Embeddings              │
│                                    │                     │
│                             ChromaDB Store               │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                    QUERY PIPELINE                        │
│                                                          │
│  User Query → Embed Query → Semantic Search (Chroma)    │
│                                    │                     │
│                          Relevant Chunks                 │
│                                    │                     │
│                    IBM Watsonx LLM (RetrievalQA)        │
│                                    │                     │
│                         Grounded Answer                  │
└─────────────────────────────────────────────────────────┘
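
The two pipelines above can be sketched end to end with the standard library alone. Everything here is an assumption made for illustration: `toy_embed` (plain word counts) stands in for the Watsonx embedding model, a Python list stands in for ChromaDB, and `build_prompt` only assembles the text that would be sent to the LLM, with the LLM call itself left out.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

Index = List[Tuple[str, Counter]]  # (chunk text, toy vector) pairs


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(chunks: List[str]) -> Index:
    """Indexing pipeline: embed every chunk and store it next to its text."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(index: Index, query: str, k: int = 2) -> List[str]:
    """Query pipeline: embed the query and return the k most similar chunks."""
    query_vec = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(context_chunks: List[str], question: str) -> str:
    """Place the retrieved chunks and the question into a single prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


chunks = [
    "Refunds are issued within 30 days.",
    "Shipping takes 5 business days.",
    "Support is available by email.",
]
index = build_index(chunks)
question = "When are refunds issued?"
top = retrieve(index, question, k=1)
print(build_prompt(top, question))
```

Word counts only reward exact word overlap, whereas a real embedding model also matches paraphrases; the ranking and prompt-assembly steps have the same shape either way.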

⚙️ Tech Stack

| Layer | Technology |
| --- | --- |
| RAG Orchestration | LangChain |
| LLM | IBM Watsonx (enterprise) |
| Embeddings | IBM Watsonx slate-30m |
| Vector Store | ChromaDB |
| UI | Gradio |
| Runtime | Python 3.8+ |
| Integration | HuggingFace Hub |

🚀 Getting Started

1. Clone the repository

git clone https://github.com/amarskdev/RAG-based-PDF-Question-Answering-Bot.git
cd RAG-based-PDF-Question-Answering-Bot

2. Create and activate a virtual environment

python -m venv venv

# macOS / Linux
source venv/bin/activate

# Windows
venv\Scripts\activate

3. Install dependencies

pip install -r requirements.txt

4. Configure IBM Watsonx credentials

Copy .env.example to .env and add your credentials:

WATSONX_API_KEY=your_api_key_here
WATSONX_PROJECT_ID=your_project_id_here
WATSONX_URL=https://us-south.ml.cloud.ibm.com
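
Before launching the app, it can help to confirm that all three variables are present. The check below is a standard-library sketch under stated assumptions: `parse_env`, `missing_keys`, and `load_env_file` are hypothetical helpers, and the project itself may load the file differently (for example with a dotenv package).

```python
from typing import Dict, Iterable, List

# Variable names taken from the .env example above.
REQUIRED_KEYS = ("WATSONX_API_KEY", "WATSONX_PROJECT_ID", "WATSONX_URL")


def parse_env(lines: Iterable[str]) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first "=" only
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str], required=REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Read and parse an env file from disk."""
    with open(path, encoding="utf-8") as handle:
        return parse_env(handle)


example = parse_env([
    "WATSONX_API_KEY=your_api_key_here",
    "# WATSONX_PROJECT_ID is deliberately left out of this example",
    "WATSONX_URL=https://us-south.ml.cloud.ibm.com",
])
print(missing_keys(example))  # ['WATSONX_PROJECT_ID']
```

Calling `missing_keys(load_env_file())` from the project root and getting an empty list back suggests the file is complete; it does not verify that the credentials themselves are valid.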

5. Run the app

python app.py

Open your browser at http://127.0.0.1:7860, upload a PDF, and start asking questions.


📁 Project Structure

RAG-based-PDF-Question-Answering-Bot/
│
├── app.py                  # Main Gradio app + RAG pipeline
├── requirements.txt        # Python dependencies
├── .env.example            # Credential template (safe to commit)
├── .gitignore              # Excludes .env, venv, __pycache__
└── README.md

🌍 Real-World Use Cases

  • 📚 Research assistance — query academic papers, reports, whitepapers
  • ⚖️ Legal document review — extract clauses and obligations from contracts
  • 🏢 Enterprise knowledge base — make internal docs instantly searchable
  • 📋 Compliance workflows — audit documents against policies
  • 🎓 Study tool — ask questions from textbooks and lecture notes

🔭 Roadmap

  • Multi-PDF support (query across multiple documents)
  • Conversational memory (multi-turn Q&A)
  • Source citation with page numbers
  • REST API via FastAPI + Docker deployment
  • Fine-tuning embeddings on domain-specific corpora

Built with enterprise-grade AI (IBM Watsonx) — not just another OpenAI wrapper.


🤝 Connect With Me

👤 About the Author

Amar Kumar
Senior Backend Engineer · IBM Certified AI Engineer


If you found this project useful, consider giving it a ⭐ — it means a lot!

About

This project is a PDF-based QA bot built with LangChain, IBM Watsonx LLM and embeddings, and Chroma for vector storage. It uses a Retrieval-Augmented Generation (RAG) pipeline to answer questions based on the content of uploaded PDF documents: users upload a PDF, ask a question, and the bot returns an answer drawn from that document.
