This project implements a Retrieval-Augmented Generation (RAG) pipeline to automatically summarize user logs collected within a one-hour window.
The system analyzes log entries for each IP address, produces a concise summary, and classifies user behavior into:
- No Harm
- Harm
RAG enhances contextual understanding by combining relevant document retrieval with large language model (LLM) generation.
- Input: Hourly log data grouped by IP.
- Preprocessing: Metadata extraction and OCR (via Docling) for PDF or text-based logs.
- Retriever: Hybrid search combining:
  - Metadata filtering (topic, MITRE IDs)
  - Keyword search (BM25 via Whoosh)
  - Semantic vector search (FAISS + SentenceTransformer embeddings)
  - Fusion with Reciprocal Rank Fusion (RRF)
  - De-duplication with Maximal Marginal Relevance (MMR)
- LLM Summarization: LangChain-compatible LLM (e.g., Ollama backend) summarizes logs and predicts Harm/No Harm.
- Evaluation: Precision, Recall, and F1-score are computed by comparing the model's Harm/No Harm outputs against ChatGPT (GPT-5) results, which serve as the reference standard.
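
The fusion and de-duplication steps of the retriever can be illustrated with a minimal, dependency-free sketch. It is not the project's actual code: the function names, the `k=60` constant, the `lam` trade-off weight, and the toy document IDs and vectors are all assumptions made for illustration. In the real pipeline the ranked lists would come from Whoosh and FAISS, and the vectors from SentenceTransformer embeddings.

```python
from math import sqrt


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs: score = sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, candidates, top_n=3, lam=0.5):
    """Greedy MMR: prefer documents relevant to the query but
    dissimilar to the documents already selected."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_n:
        def mmr_score(doc_id):
            relevance = cosine(query_vec, doc_vecs[doc_id])
            redundancy = max(
                (cosine(doc_vecs[doc_id], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical ranked results from the keyword and vector retrievers.
bm25_hits = ["log_a", "log_b", "log_c"]
vector_hits = ["log_b", "log_d", "log_a"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])

# Toy 2-D "embeddings"; log_a and log_b are near-duplicates.
doc_vecs = {
    "log_a": [1.0, 0.9],
    "log_b": [1.0, 0.92],
    "log_c": [1.0, 0.0],
    "log_d": [0.2, 1.0],
}
query_vec = [1.0, 1.0]
diverse = mmr(query_vec, doc_vecs, fused, top_n=2, lam=0.5)
print(fused)
print(diverse)
```

In this toy setup, `log_b` ranks first after fusion because both retrievers place it near the top, and MMR then skips its near-duplicate `log_a` in favor of the less similar `log_d`.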
| Metric | Without RAG (qwen3:8b) | With RAG |
|---|---|---|
| Precision | ~0.86 | ~0.98 |
| Recall | ~0.55 | ~0.55 (similar) |
| F1-score | ~0.68 | ~0.70 |
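
As a rough sketch of how such metrics can be computed, the snippet below compares predicted labels with reference labels. Treating "Harm" as the positive class is an assumption (the source does not say which class is positive), and the label lists are made-up toy data, not the project's results.

```python
def precision_recall_f1(predicted, reference, positive="Harm"):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(predicted, reference))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels for illustration only.
reference = ["Harm", "Harm", "Harm", "Harm", "No Harm", "No Harm"]
predicted = ["Harm", "Harm", "No Harm", "No Harm", "No Harm", "Harm"]
precision, recall, f1 = precision_recall_f1(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, the rounded With-RAG values in the table are internally consistent: 2 · 0.98 · 0.55 / (0.98 + 0.55) ≈ 0.70.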
- Recall is unchanged and F1 improves only slightly, but precision rises from ~0.86 to ~0.98. Qualitative analysis shows:
  - Non-RAG tends to produce more “Undetermined” or “No Harm” responses.
  - RAG more accurately detects “Harm” cases when logs contain domain-specific or context-heavy information.
- If logs include technical or domain-specific content with complex structure → RAG provides real benefit through better context retrieval.
- If logs are short, patterned, or repetitive → non-RAG is sufficient and more efficient in latency and cost.

Install the dependencies with `pip install -r requirements.txt`.

Each output includes:
- One-hour log summary.
- Label: No Harm or Harm.
- Metadata references for transparency.
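
One possible shape for such an output record is sketched below. The class name, field names, and example values are hypothetical and only illustrate the three parts listed above. The project's actual output format may differ.

```python
import json
from dataclasses import dataclass, field, asdict

ALLOWED_LABELS = ("No Harm", "Harm")


@dataclass
class HourlySummary:
    """Hypothetical record for one IP address and one one-hour window."""
    ip: str
    summary: str
    label: str
    references: list = field(default_factory=list)

    def __post_init__(self):
        # Reject labels outside the two classes used by the pipeline.
        if self.label not in ALLOWED_LABELS:
            raise ValueError(f"label must be one of {ALLOWED_LABELS}")


# Made-up example values for illustration only.
record = HourlySummary(
    ip="192.0.2.10",
    summary="Repeated failed logins followed by a successful login.",
    label="Harm",
    references=[{"topic": "authentication", "mitre_id": "T1110"}],
)
as_json = json.dumps(asdict(record), indent=2)
print(as_json)
```

Validating the label at construction time keeps any downstream evaluation limited to the two supported classes.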