Git Archaeologist

title	Git Archaeologist
emoji	🚀
colorFrom	blue
colorTo	purple
sdk	docker
app_port	7860
pinned	false

Git Archaeologist

A RAG-based tool that lets you query the history of any Git repository in plain English — not the current code, but why it became what it is.

Live Demo

Hugging Face Space:
https://huggingface.co/spaces/kartik0905/git-archaeologist

Architecture

graph TD
    A[User / Browser] -->|Paste repo URL + question| B[Streamlit UI - app.py]
    B -->|run_indexing| C[indexer.py]

    subgraph Indexing Pipeline
        C -->|inject PAT via urlparse| D[Git Repo - public or private]
        D -->|raw commits + diffs| E[miner.py - generator]
        E -->|hunk-chunked commit docs| F[(ChromaDB - Vector Store)]
    end

    B -->|ask query + history + filters| G[qa.py]

    subgraph Query Pipeline
        G -->|rewrite follow-up with history| H[Groq LLM - Query Rewriter]
        H -->|standalone search query| I[ChromaDB - top 10 by similarity]
        I -->|author or timestamp filter| I
        I -->|candidate commits| J[CrossEncoder - ms-marco-MiniLM-L-6-v2]
        J -->|top 3 reranked commits| K[Groq LLM - llama-3.3-70b-versatile]
    end

    K -->|streamed answer with citations| B
    B -->|render response| A
    B -->|on request| L[PDF Audit Report]

Walkthrough

Screen.Recording.2026-06-04.at.1.12.56.PM.mov

Key engineering decisions

Query rewriting
Follow-up questions like "who worked on it?" are rewritten into standalone search queries before hitting ChromaDB. The LLM uses conversation history to resolve vague references — so retrieval is context-aware, not just keyword-based.

Two-stage retrieval (ChromaDB + reranker)
Pure vector similarity returns commits that look related. The cross-encoder (ms-marco-MiniLM-L-6-v2) re-scores each result against your exact question and keeps only the top 3. Precision over recall.

Hunk-based diff chunking
Diffs are split at @@ boundaries, not truncated at a character limit. The LLM sees complete, meaningful code hunks — so it can say "the value changed from 5 to 10" rather than a vague summary.

Timestamp-based metadata filtering
Author and date range filters are pushed down to ChromaDB where clauses using Unix timestamps for accurate numeric comparison. If a filter returns no commits, the tool says so explicitly instead of silently falling back to unfiltered results.

Multi-turn conversation
Last 5 turns of chat history are injected into every LLM call. Follow-up questions work without repeating context.

Private repo support
GitHub PAT is injected into the clone URL at runtime via urlparse — never written to disk, wiped on reset.

O(1) memory mining
The commit iterator is a Python generator. Shallow cloning (--depth) keeps fetches fast for recent history queries.

Stack

Layer	Technology
UI	Streamlit
LLM	Groq — `llama-3.3-70b-versatile`
Vector DB	ChromaDB
Embeddings	`all-MiniLM-L6-v2`
Reranker	`cross-encoder/ms-marco-MiniLM-L-6-v2`
Git parsing	GitPython
Package manager	uv
Deployment	Hugging Face Spaces (Docker)

Project structure

├── app.py          # Streamlit UI
├── indexer.py      # Clone + batch-index into ChromaDB
├── qa.py           # Query rewriting → Retrieval → reranking → LLM call
├── miner.py        # Generator-based diff parser
├── utils.py        # PDF export, cleanup, token injection
├── tests/          # 49 tests, no external services required
├── Dockerfile
├── requirements.txt
└── README.md

Installation

git clone https://github.com/kartik0905/git-archaeologist.git
cd git-archaeologist
uv pip install -r requirements.txt
cp .env.example .env   # add your Groq API key
streamlit run app.py

The first query downloads the reranker model (~90MB) and caches it locally. All subsequent runs load from cache.

Deployment

The application is deployed on Hugging Face Spaces using Docker.

Live Demo:
https://huggingface.co/spaces/kartik0905/git-archaeologist

Known limitations

Shallow clones may miss commits outside the selected depth window.
Author filtering requires an exact Git config name match.
Reranker runs on CPU — adds ~1–2 seconds per query on large result sets.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Git Archaeologist

Live Demo

Architecture

Walkthrough

Key engineering decisions

Stack

Project structure

Installation

Deployment

Known limitations

Built by Kartik Garg

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 56 Commits
tests		tests
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
app.py		app.py
indexer.py		indexer.py
miner.py		miner.py
pyproject.toml		pyproject.toml
qa.py		qa.py
requirements.txt		requirements.txt
utils.py		utils.py
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

Git Archaeologist

Live Demo

Architecture

Walkthrough

Key engineering decisions

Stack

Project structure

Installation

Deployment

Known limitations

Built by Kartik Garg

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages