Auto Logging Summarization with RAG

Overview

This project implements a Retrieval-Augmented Generation (RAG) pipeline to automatically summarize user logs collected within a one-hour window.
The system analyzes log entries for each IP address, produces a concise summary, and classifies user behavior into:

- No Harm
- Harm

RAG enhances contextual understanding by combining relevant document retrieval with large language model (LLM) generation.
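
The per-IP, one-hour grouping described above could look roughly like the sketch below. The entry layout `(timestamp, ip, message)`, the function name `group_by_ip_and_hour`, and the sample entries are assumptions made for illustration; they are not taken from the project's code or data.

```python
from collections import defaultdict
from datetime import datetime


def group_by_ip_and_hour(entries):
    """Group (timestamp, ip, message) tuples into {(ip, hour_start): [messages]}."""
    groups = defaultdict(list)
    for timestamp, ip, message in entries:
        # Truncate the timestamp to the start of its hour to form the window key.
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        groups[(ip, hour_start)].append(message)
    return dict(groups)


# Made-up sample entries, for illustration only.
entries = [
    (datetime(2024, 1, 1, 10, 5), "10.0.0.1", "Failed password for root"),
    (datetime(2024, 1, 1, 10, 40), "10.0.0.1", "Failed password for admin"),
    (datetime(2024, 1, 1, 11, 2), "10.0.0.1", "Accepted password for admin"),
    (datetime(2024, 1, 1, 10, 15), "10.0.0.2", "Connection closed"),
]

windows = group_by_ip_and_hour(entries)
for (ip, hour_start), messages in sorted(windows.items()):
    print(ip, hour_start.isoformat(), len(messages))
```

Each `(ip, hour_start)` group would then be the unit that is summarized and labeled.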

Architecture

Input: Hourly log data grouped by IP.
Preprocessing: Metadata extraction and OCR (via Docling) for PDF or text-based logs.
Retriever: Hybrid search combining
- Metadata filtering (topic, MITRE IDs)
- Keyword search (BM25 via Whoosh)
- Semantic vector search (FAISS + SentenceTransformer embeddings)
- Fusion with Reciprocal Rank Fusion (RRF)
- De-duplication with Maximal Marginal Relevance (MMR)
LLM Summarization: LangChain-compatible LLM (e.g., Ollama backend) summarizes logs and predicts Harm/No Harm.
Evaluation: Precision, recall, and F1-score are computed by comparing the model outputs against ChatGPT (GPT-5) results, which serve as the reference standard.
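
The fusion and de-duplication steps of the retriever can be sketched in plain Python as below. This is a minimal illustration, not the project's implementation: the constant `k=60` is a commonly used RRF default rather than a value confirmed by this repository, the `lam` weight and the two-dimensional vectors are toy assumptions, and the real pipeline would take its rankings from Whoosh and FAISS and its vectors from SentenceTransformer embeddings.

```python
import math


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs: score(d) = sum of 1 / (k + rank) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, candidates, vectors, top_n=3, lam=0.5):
    """Greedy MMR: balance relevance to the query against redundancy with picked docs."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_n:
        def mmr_score(doc_id):
            relevance = cosine(query_vec, vectors[doc_id])
            redundancy = max(
                (cosine(vectors[doc_id], vectors[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical rankings from the keyword and vector retrievers.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])

# Toy embeddings: doc_a is a near-duplicate of doc_b.
vectors = {
    "doc_a": [1.0, 0.8],
    "doc_b": [1.0, 0.9],
    "doc_c": [1.0, 0.0],
    "doc_d": [0.2, 1.0],
}
diverse = mmr([1.0, 1.0], fused, vectors, top_n=2, lam=0.5)
print(fused)
print(diverse)
```

In this toy run, `doc_b` ranks first after fusion because both retrievers place it near the top, and MMR then prefers `doc_d` over the near-duplicate `doc_a` as the second pick.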

Results

| Metric    | Without RAG (qwen3:8b) | With RAG        |
|-----------|------------------------|-----------------|
| Precision | ~0.86                  | ~0.98           |
| Recall    | ~0.55                  | ~0.55 (similar) |
| F1-score  | ~0.68                  | ~0.70           |

Recall and F1-score are close overall, while precision is higher with RAG. Qualitative analysis also shows:
- Non-RAG tends to produce more “Undetermined” or “No Harm” responses.
- RAG more accurately detects “Harm” cases when logs contain domain-specific or context-heavy information.
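
For reference, the metrics reported in this section can be computed along the lines of the sketch below. It assumes that "Harm" is treated as the positive class and that predictions and reference labels are aligned lists; both are assumptions, and the label lists shown are toy values, not project data.

```python
def precision_recall_f1(predicted, reference, positive="Harm"):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(predicted, reference))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy labels, for illustration only.
predicted = ["Harm", "No Harm", "Harm", "No Harm"]
reference = ["Harm", "Harm", "No Harm", "No Harm"]
print(precision_recall_f1(predicted, reference))

# Sanity check against the rounded precision and recall reported above.
print(round(f1_from(0.86, 0.55), 2))  # without RAG
print(round(f1_from(0.98, 0.55), 2))  # with RAG
```

Plugging in the rounded precision and recall gives F1 values of about 0.67 and 0.70, in line with the reported ~0.68 and ~0.70 once rounding of the inputs is taken into account.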

Engineering Insight

If logs include technical or domain-specific content with complex structure → RAG provides real benefit through better context retrieval.
If logs are short, patterned, or repetitive → non-RAG is sufficient and more efficient in latency and cost.

Requirements

pip install -r requirements.txt

Output

Each output includes:

- One-hour log summary.
- Label: No Harm or Harm.
- Metadata references for transparency.
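
One way to represent such an output record is sketched below. The class name `HourlySummary`, its field names, and the sample values are hypothetical; the project's actual output schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field

ALLOWED_LABELS = ("No Harm", "Harm")


@dataclass
class HourlySummary:
    """Hypothetical container for one per-IP, one-hour result."""
    ip: str
    window_start: str  # ISO 8601 start of the one-hour window
    summary: str
    label: str  # "No Harm" or "Harm"
    references: list = field(default_factory=list)  # metadata of retrieved documents

    def __post_init__(self):
        # Reject labels outside the two classes described in the Overview.
        if self.label not in ALLOWED_LABELS:
            raise ValueError(f"label must be one of {ALLOWED_LABELS}, got {self.label!r}")


# Made-up sample record, for illustration only.
record = HourlySummary(
    ip="10.0.0.1",
    window_start="2024-01-01T10:00:00",
    summary="Repeated failed SSH logins for several accounts.",
    label="Harm",
    references=[{"topic": "brute force"}],
)
print(json.dumps(asdict(record), indent=2))
```

Validating the label at construction time keeps malformed LLM responses from propagating into downstream evaluation.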
