A comprehensive suite of Generative AI applications built with LangChain, leveraging multiple LLM providers and vector databases for various use cases including RAG systems, chatbots, and data retrieval applications.
Author: @dctn
Version: 0.1.0
- Project Overview
- Key Features
- Project Structure
- Installation
- Configuration
- Applications
- Notebooks
- Dependencies
- File & Folder Details
- Usage
- Contributing
This project is a comprehensive exploration of Generative AI technologies, focusing on:
- Retrieval-Augmented Generation (RAG) - Systems that retrieve and augment responses using external data sources
- LLM Integration - Seamless integration with multiple LLM providers (Groq, Google Generative AI, OpenRouter)
- Vector Databases - Efficient storage and retrieval of embeddings using ChromaDB and FAISS
- Web Scraping & Data Processing - Processing of unstructured data from URLs and documents
- Exam Question Retrieval - TNPSC (Tamil Nadu Public Service Commission) exam question matching system
- Real-time Monitoring - Amazon product price monitoring and notification system
- π€ Multi-Provider LLM Support - Groq, Google Generative AI, OpenRouter, LangChain OpenAI
- π RAG System - Retrieval-Augmented Generation for context-aware responses
- ποΈ Vector Database Integration - ChromaDB and FAISS for efficient similarity search
- π Web Interface - Streamlit-based interactive applications
- π Document Processing - PDF, text, and web content loading and chunking
- π URL Web Scraping - Dynamic content extraction from websites using BeautifulSoup
- πΎ Multiple Data Sources - Hugging Face datasets, ChromaDB, FAISS
- π Similarity Search - Find semantically similar questions and documents
- βοΈ HuggingFace Embeddings - sentence-transformers for semantic similarity
GEN AI/
βββ π README.md # Project documentation (this file)
βββ π main.py # Main entry point
βββ π pyproject.toml # Project configuration and dependencies
βββ π controller_price_alter.py # Amazon price monitoring tool
βββ π starter.ipynb # Starter notebook for experimentation
βββ π .python-version # Python version specification
βββ π¦ .venv/ # Virtual environment
βββ π .env # Environment variables (API keys)
β
βββ π apps/ # Streamlit applications
β βββ simple_llm.py # URL Q&A chatbot application
β βββ TNPSC_retrevial_system.py # TNPSC exam question retrieval system
β
βββ π notebook/ # Jupyter notebooks for experimentation
β βββ starter.ipynb # Initial setup and exploration
β βββ simple__llm_app.ipynb # Simple LLM app development
β βββ chatbot_history.ipynb # Chatbot with conversation history
β βββ tnpsc_RAG.ipynb # TNPSC RAG system notebook
β βββ π chroma_exam_db/ # ChromaDB storage for exam questions
β β βββ chroma.sqlite3
β β βββ 7ecb70f5-a47b-43fd-97f2-22aa8a94057e/
β βββ π chromadb_text_docs/ # ChromaDB storage for text documents
β β βββ chroma.sqlite3
β β βββ ac7dcb6c-ecd4-4126-909f-f7d5914ba11d/
β βββ π faiss/ # FAISS vector index
β βββ faiss_index.faiss
β
βββ π files/ # Data and resource files
β βββ artificial_intelligence.txt # AI reference text
β βββ index.html # HTML index file
β βββ index_raw.txt # Raw index text
β βββ openapi.json # OpenAPI specification
β βββ test.txt # Test data
β βββ gigr.pdf # PDF document
β βββ π tnpsc/ # TNPSC dataset
β βββ π train/ # Training dataset
β βββ data-00000-of-00001.arrow # Arrow format data
β βββ dataset_info.json # Dataset metadata
β βββ state.json # Dataset state
β βββ dataset_dict.json # Dataset dictionary
β
βββ π chromadb_text_docs/ # ChromaDB storage (root level)
β βββ chroma.sqlite3 # SQLite database
β βββ 757c8160-e859-466d-b81a-5f900849f744/ # Collection storage
β
βββ π artifacts/ # Generated artifacts
β βββ simple_llm.txt # Output from simple LLM
β
βββ π perfect_Resume_i.pdf # Resume document
βββ π perfect_Resume_2.pdf # Resume document
βββ π resume.pdf # Resume document
βββ π a_my_resume.pdf # Resume document
βββ π full_stack.pdf # Full-stack documentation
βββ π .env # Environment variables (API keys)
βββ π .gitignore # Git ignore rules
βββ π pyproject.toml # Python project manifest
βββ π¦ uv.lock # Dependency lock file
βββ π .idea/ # PyCharm IDE configuration
- Python 3.12 or higher
- pip or uv package manager
- Virtual environment (recommended)
- Clone the repository:
git clone <repository-url>
cd "GEN AI"- Create virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate- Install dependencies:
pip install -e .
# Or using uv:
uv sync- Configure environment variables:
Create a
.envfile in the root directory with the following:
HUGGING_FACE_TOKEN=your_hugging_face_token
GOOGLE_API_KEY=your_google_api_key
GROQ_KEY=your_groq_api_key
OPEN_ROUTER=your_open_router_keyThis project supports multiple LLM providers. Set up the following APIs as needed:
| Provider | Environment Variable | Purpose |
|---|---|---|
| Groq | GROQ_KEY |
Fast inference with open models |
| Google Generative AI | GOOGLE_API_KEY |
Gemini model access |
| OpenRouter | OPEN_ROUTER |
Multi-model routing |
| HuggingFace | HUGGING_FACE_TOKEN |
Model and dataset access |
This project requires Python 3.12 or higher. Check .python-version file for version specification.
Purpose: Web-based Q&A system that scrapes URLs and answers questions based on the retrieved content.
Features:
- URL content scraping using Selenium/BeautifulSoup
- Recursive text chunking (500 chars, 100 overlap)
- ChromaDB vector storage
- HuggingFace embeddings (sentence-transformers/all-MiniLM-L6-v2)
- Streamlit web interface
- Session state management for efficient retrieval
How it works:
- User inputs a website URL
- Application scrapes and processes the content
- Text is split into chunks and converted to embeddings
- Chunks are stored in ChromaDB vector database
- User asks questions β retrieved relevant chunks β LLM generates answer
Run:
streamlit run apps/simple_llm.pyStack:
- LangChain with Groq LLM
- WebBaseLoader for URL content
- RecursiveCharacterTextSplitter for chunking
- ChromaDB vector database
- HuggingFace embeddings
Purpose: Semantic search system for Tamil Nadu Public Service Commission (TNPSC) exam questions.
Features:
- Loads TNPSC exam dataset from Hugging Face
- Feature engineering to extract answers from multiple choice options
- Semantic similarity search using embeddings
- Configurable number of similar questions (3-10)
- Similarity score filtering (threshold: 0.9)
- Metadata display: options, answers, exam paper info, download links
How it works:
- Load TNPSC exam questions dataset from Hugging Face
- Extract answers from multiple-choice options
- Convert questions to vector embeddings
- Store in ChromaDB vector database
- User enters a question β finds semantically similar exam questions
- Displays top N similar questions with full metadata
Dataset: snegha24/Tamil_tnpscExam from Hugging Face
Run:
streamlit run apps/TNPSC_retrevial_system.pyStack:
- Hugging Face datasets API
- Pandas for data manipulation
- HuggingFace embeddings
- ChromaDB vector storage
- Streamlit UI
Metadata tracked:
- Options
- Correct answer
- Original question number
- Source link
- Exam paper filename
Purpose: Monitors Amazon product prices and sends email notifications when prices change.
Features:
- Web scraping with BeautifulSoup
- Price tracking and comparison
- Email notifications via Resend API
- Handles multiple CSS selectors for price elements
- Persistent price storage
How it works:
- Fetches product page from Amazon URL
- Extracts current price using CSS selectors
- Compares with last known price
- Sends email if price changes
- Updates stored price
Configuration:
- Update
urlvariable with Amazon product link - Set
TO_EMAILfor notifications - Configure Resend API key in
.env
Run:
python controller_price_alter.pyTechnologies:
- BeautifulSoup for HTML parsing
- Requests for HTTP requests
- Resend for email service
- Lxml for HTML processing
Initial exploration notebook featuring:
- PDF loading with LangChain
- Resume parsing
- Job description processing
- LLM usage tracking and cost analysis
- Integration with HuggingFace embeddings
Use case: Resume-JD matching and analysis
Development and testing notebook for the Simple LLM application with iterative refinements.
Builds a conversational chatbot with:
- Conversation history management
- Context maintenance across turns
- LLM integration with memory
Development and testing of TNPSC exam question retrieval system with RAG capabilities.
| Package | Version | Purpose |
|---|---|---|
| langchain | >=1.2.7 | LLM framework |
| langchain-groq | >=1.1.1 | Groq integration |
| langchain-google-genai | >=4.2.0 | Google Generative AI |
| langchain-openai | >=1.1.7 | OpenAI integration |
| langchain-huggingface | >=1.2.0 | HuggingFace models |
| langchain-chroma | >=1.1.0 | ChromaDB integration |
| chromadb | >=1.4.1 | Vector database |
| faiss-cpu | >=1.13.2 | FAISS vector index |
| sentence-transformers | >=5.2.2 | Embeddings model |
| streamlit | >=1.54.0 | Web UI framework |
| jupyter | >=1.1.1 | Notebook environment |
| pandas | (via datasets) | Data manipulation |
| datasets | >=4.6.1 | Hugging Face datasets |
| bs4 | >=0.0.2 | Web scraping |
| pypdf | >=6.6.1 | PDF processing |
| requests | >=2.32.5 | HTTP library |
| python-dotenv | >=1.2.1 | Environment variables |
| lxml | >=6.0.2 | XML/HTML processing |
See pyproject.toml for complete dependency list.
| File | Purpose |
|---|---|
| main.py | Entry point script with basic hello world |
| controller_price_alter.py | Amazon price monitoring utility |
| pyproject.toml | Project metadata and dependencies |
| README.md | This documentation |
| .env | Environment variables (API keys) - Do NOT commit |
| .python-version | Python version specification |
| .gitignore | Git ignore rules |
| uv.lock | Dependency lock file for reproducibility |
| starter.ipynb | Initial exploration notebook |
Streamlit applications ready for deployment:
- simple_llm.py: URL-based Q&A chatbot (50 lines)
- TNPSC_retrevial_system.py: Exam question retrieval system (90+ lines)
Jupyter notebooks for development and experimentation:
- Development notebooks for each major feature
- Experiment tracking and prototyping
- Contains embedded ChromaDB and FAISS indexes
Data and resource files:
- tnpsc/: TNPSC dataset (Arrow format, ~1MB)
- Text files for testing and reference
- HTML and JSON specifications
- PDF documents (full_stack.pdf, gigr.pdf)
Generated outputs from applications:
- simple_llm.txt: Output from simple LLM processing
ChromaDB vector store at root level for persistent text embeddings.
Multiple resume files for testing and matching:
- perfect_Resume_i.pdf
- perfect_Resume_2.pdf
- resume.pdf
- a_my_resume.pdf
# Terminal 1: Start the app
streamlit run apps/simple_llm.py
# In browser:
# 1. Enter URL: https://example.com
# 2. Wait for scraping
# 3. Ask question: "What is the main topic?"
# 4. Get AI-generated answer based on page contentstreamlit run apps/TNPSC_retrevial_system.py
# In browser:
# 1. Enter question: "What is capital of India?"
# 2. Select number of similar questions (3-10)
# 3. View similar exam questions with:
# - Full question text
# - All options
# - Correct answer
# - Similarity score
# - Original exam paper link# Edit controller_price_alter.py with:
# - Product URL
# - Email address
# - API keys in .env
python controller_price_alter.pyfrom langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import WebBaseLoader
# Load and embed URL content
loader = WebBaseLoader("https://example.com")
docs = loader.load()
# Create embeddings
embeddings = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-MiniLM-L6-v2"
)
# Store in ChromaDB
db = Chroma.from_documents(docs, embeddings)
# Search
results = db.similarity_search("your question")- Create new
.pyfile in/appsdirectory - Use Streamlit for UI:
import streamlit as st - Integrate with LangChain components
- Follow existing patterns for LLM and vector DB setup
- Create
.ipynbfile in/notebookdirectory - Use same patterns as existing notebooks
- Document purpose and use case clearly
Use the provided notebooks to test:
- Embedding quality
- Retrieval performance
- Answer generation quality
- Cost tracking with
llm_usage_tracker
- Never commit
.envfile with real API keys - Keep API keys in
.envfile locally only - Use environment variables for deployment
- Regenerate keys if accidentally exposed
- Use restricted API key scopes when possible
- Default:
sentence-transformers/all-MiniLM-L6-v2 - Dimension: 384
- Use: Semantic similarity search
- Groq:
openai/gpt-oss-20b(fast, cost-effective) - Google: Gemini (via langchain-google-genai)
- OpenAI: GPT-3.5, GPT-4 (via OpenRouter)
- ChromaDB: Main vector store for embeddings
- FAISS: Alternative high-performance index
This project includes LLM usage tracking via llm_usage_tracker package:
from llm_usage_tracker import setup_patch, print_usage_costs
setup_patch() # Enable tracking
# ... run your code ...
print_usage_costs() # View costsContributions are welcome! To contribute:
- Fork the repository
- Create feature branch:
git checkout -b feature/new-feature - Commit changes:
git commit -am 'Add new feature' - Push to branch:
git push origin feature/new-feature - Submit pull request
This project is created and maintained by @dctn.
- Clear
.sqlite3files if schema conflicts occur - Ensure embeddings dimension matches stored vectors
- Verify
.envfile is in root directory - Check
load_dotenv()is called before accessing keys - Ensure keys have required permissions
- Download embedding model manually if internet issues
- Check HuggingFace token validity
- Verify sufficient disk space for model files
- Clear cache:
streamlit cache clear - Use
--logger.level=debugfor debugging - Check port 8501 availability
- LangChain Documentation
- ChromaDB Documentation
- Streamlit Documentation
- HuggingFace Transformers
- FAISS Documentation
For issues, questions, or suggestions:
- Create an issue on GitHub
- Contact: @dctn
Last Updated: June 28, 2024
Status: Active Development π