DocRevAI is an AI-powered document analysis and question-answering system that enables users to upload PDF documents and ask questions about their content. The application extracts text from PDFs, processes the content, retrieves relevant information using TF-IDF similarity search, and generates context-aware responses using local Large Language Models (LLMs) through Ollama.
- PDF document ingestion
- Text extraction and preprocessing
- Intelligent text chunking
- TF-IDF based document retrieval
- Context-aware question answering
- Local AI inference using Ollama
- Modular and scalable architecture
- Comprehensive logging and error handling
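
The preprocessing and chunking features listed above could be sketched roughly as follows. This is a minimal illustration, not the code in clean_text.py or create_chunks.py: the function names, the word-based windowing, and the default chunk_size and overlap values are all assumptions made for the example.

```python
import re


def clean_text(raw: str) -> str:
    """Normalize text extracted from a PDF (hypothetical helper)."""
    # Re-join words that were hyphenated across a line break, e.g. "docu-\nment".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Drop non-printable characters, keeping newlines and tabs for now.
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    # Collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


def create_chunks(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of chunk_size words that share overlap words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Overlapping windows are one common way to avoid cutting an answer in half at a chunk boundary; sentence-based or paragraph-based splitting would be equally valid choices.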
PDF Upload
↓
Text Extraction
↓
Text Cleaning
↓
Chunk Creation
↓
TF-IDF Vectorization
↓
Similarity Search
↓
Context Retrieval
↓
Ollama LLM
↓
Generated Response
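
The vectorization, similarity search, and context retrieval stages of the pipeline above can be illustrated with a small dependency-free sketch. The project lists scikit-learn for this step; the code below is a simplified stand-in written with the standard library only, so that the arithmetic is visible. All function names are hypothetical, and the tokenizer is deliberately naive.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Build an L2-normalized TF-IDF vector, ignoring terms missing from idf."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(chunks: list[str]):
    """Compute smoothed IDF weights and one TF-IDF vector per chunk."""
    tokenized = [tokenize(c) for c in chunks]
    n = len(chunks)
    # Document frequency: number of chunks containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    return idf, [vectorize(tokens, idf) for tokens in tokenized]


def top_chunks(question: str, chunks: list[str], k: int = 2):
    """Return the k (score, chunk) pairs most similar to the question."""
    idf, vectors = build_index(chunks)
    query = vectorize(tokenize(question), idf)
    scored = []
    for chunk, vec in zip(chunks, vectors):
        # Vectors are unit length, so the dot product is the cosine similarity.
        score = sum(w * vec.get(t, 0.0) for t, w in query.items())
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The highest-scoring chunks would then be passed to the LLM as context. Because TF-IDF matches exact tokens, a question that uses "cover" will not match a chunk that only says "covers" unless stemming or a different tokenizer is added.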
- Python 3.12+
- Ollama
- Scikit-learn
- PyPDF2 / PDF Processing Libraries
- Logging Module
- TF-IDF Vectorization
- Cosine Similarity Search
- Local Language Models via Ollama
- uv
- Git
- GitHub
- Jira
- Pytest
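
As an illustration of how the local Ollama models in the stack above might be called, the sketch below builds a context-grounded prompt and posts it to Ollama's local REST endpoint. It assumes Ollama is running on its default port (11434); the model name "llama3", the prompt wording, and every function name are placeholders chosen for the example, not the project's actual code.

```python
import json
import urllib.request

# Default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into one prompt."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Create the POST request; the model name is a placeholder."""
    # stream=False asks Ollama for one JSON object instead of a token stream.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(question: str, context_chunks: list[str],
               model: str = "llama3", timeout: float = 120.0) -> str:
    """Send the prompt to Ollama and return the generated text."""
    request = build_request(build_prompt(question, context_chunks), model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling ask_ollama requires a running Ollama server with the chosen model already pulled; build_prompt and build_request can be exercised without one.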
This project uses uv for package management and virtual environments. uv generally resolves and installs dependencies faster than a traditional pip-based workflow.
DocRevAI/
│
├── docrevai/
│ ├── scripts/
│ │ ├── clean_text.py
│ │ ├── create_chunks.py
│ │ ├── similarity_finder.py
│ │ ├── tf_idf.py
│ │ └── ...
│ │
│ ├── logging/
│ │ └── logger.py
│ │
│ └── ...
│
├── tests/
│
├── logs/
│
├── pyproject.toml
│
├── uv.lock
│
└── README.md
git clone https://github.com/Hanan-Nawaz/DocRevAI
cd DocRevAI

curl -LsSf https://astral.sh/uv/install.sh | sh
or
pip3 install uv

uv venv

Activate environment:
macOS/Linux:
source .venv/bin/activate
Windows:
.venv\Scripts\activate

uv sync

uv run main.py

Run all tests:
uv run pytest

Run tests with coverage:
uv run pytest --cov=docrevai

DocRevAI includes centralized logging for:
- Error tracking
- Debugging
- System monitoring
- Runtime diagnostics
Logs are written to the logs/ directory at the project root.
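
A centralized logger of this kind is often a small factory function built on Python's standard logging module. The sketch below is one possible shape, not the contents of logger.py: the function name get_logger, the file name docrevai.log, and the message format are assumptions made for illustration.

```python
import logging
from pathlib import Path


def get_logger(name: str, log_dir: str = "logs",
               level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes to both a log file and the console."""
    logger = logging.getLogger(name)
    # Reuse the logger if it was configured earlier, to avoid duplicate lines.
    if logger.handlers:
        return logger

    Path(log_dir).mkdir(parents=True, exist_ok=True)
    formatter = logging.Formatter(
        "%(asctime)s | %(levelname)s | %(name)s | %(message)s"
    )

    # File name is a placeholder for this example.
    file_handler = logging.FileHandler(
        Path(log_dir) / "docrevai.log", encoding="utf-8"
    )
    file_handler.setFormatter(formatter)

    console_handler = logging.StreamHandler()
    console_handler.setFormatter(formatter)

    logger.setLevel(level)
    logger.addHandler(file_handler)
    logger.addHandler(console_handler)
    # Keep messages from also being emitted by the root logger.
    logger.propagate = False
    return logger
```

A module would then call something like get_logger(__name__) once at import time and use logger.info, logger.warning, or logger.exception as needed.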
The project follows Agile development practices and uses Jira for:
- Sprint planning
- Task management
- Issue tracking
- Feature development tracking
- Semantic search using embeddings
- Vector databases (FAISS / ChromaDB)
- Multi-document support
- Web-based user interface
- Conversation memory
- Document summarization
- Citation and source highlighting
- Hybrid retrieval (TF-IDF + Embeddings)
Abdul Hanan Nawaz
This project is intended for educational and portfolio purposes.