Semantic search and spatial navigation for Git repositories — so AI coding agents orient in one turn and retrieve exactly what they need.
git-semantic parses every tracked file with tree-sitter, generates vector embeddings per chunk, and stores them on a dedicated orphan Git branch. At index time it also builds a spatial map of the codebase — grouping files into semantically coherent subsystems using Leiden community detection, labeling them by their key functions, and tracking cross-file call edges.
Search is hybrid: BM25 (SQLite FTS5) + semantic embeddings + graph proximity, merged via Reciprocal Rank Fusion. Exact identifier matches score higher when they appear in both the BM25 and semantic ranked lists; files connected via call edges to top results are boosted automatically.
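To make the fusion step concrete, here is a minimal Python sketch of Reciprocal Rank Fusion over three ranked lists. It is an illustration, not git-semantic's implementation: the function name `rrf_merge`, the constant `k=60` (a conventional default for RRF), and the idea of treating graph proximity as a third ranked list are all assumptions.

```python
from collections import defaultdict

def rrf_merge(ranked_lists, k=60):
    """Fuse several ranked lists of ids with Reciprocal Rank Fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is a conventional default (assumption).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical rankings from the three signals.
bm25 = ["src/db.rs", "src/main.rs", "src/embed.rs"]
semantic = ["src/embed.rs", "src/db.rs"]
graph = ["src/db.rs"]

fused = rrf_merge([bm25, semantic, graph])
for path, score in fused:
    print(f"{score:.4f}  {path}")
```

A file that ranks well in several lists accumulates score from each, which is why an identifier found by both BM25 and the semantic signal rises to the top.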
Good agents don't need to explore — they need to know where to look and how much to read.
git-semantic gives agents a spatial model of the codebase. Instead of searching and accumulating, an agent can orient with map, read a file's structure with get --mode outline (~96% token reduction), pull the declaration with --mode signatures (~86% reduction), and fetch the exact body with get file:start-end only when it needs to. A well-structured session stays flat because the agent fetches surgically from the start rather than accumulating everything that matched.
The index lives on a Git branch. One person indexes, the whole team benefits — no re-embedding, no API keys per developer. The map is shared state: every agent session starts with the same orientation, not a cold rediscovery of the codebase.
git-semantic benchmark measures this concretely on your own repo: token savings per language, read mode comparison, and a navigation replay that shows grep precision vs map+outline+get precision across sampled subsystems.
See BENCHMARKS.md for results on real codebases.
main branch               semantic branch (orphan)
──────────────────        ──────────────────────────
src/main.rs           →   src/main.rs          ← [{start_line, end_line, text, embedding}, ...]
src/db.rs             →   src/db.rs            ← [{...}, ...]
src/chunking/mod.rs   →   src/chunking/mod.rs
                          .semantic-map.json   ← subsystems + edges
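The per-file chunk records in the diagram above can be pictured with a small Python sketch. The four field names come from the diagram; the `Chunk` class, the helper names, and the choice of a plain JSON array as the on-branch encoding are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class Chunk:
    start_line: int
    end_line: int
    text: str
    embedding: List[float]

def dump_chunks(chunks):
    """Serialize one file's chunks as a JSON array (assumed layout)."""
    return json.dumps([asdict(c) for c in chunks])

def load_chunks(payload):
    """Parse the JSON array back into Chunk records."""
    return [Chunk(**record) for record in json.loads(payload)]

# Hypothetical chunk with a tiny embedding for readability.
chunks = [Chunk(1, 3, "fn main() {}", [0.1, 0.2, 0.3])]
roundtrip = load_chunks(dump_chunks(chunks))
print(roundtrip[0].start_line, roundtrip[0].end_line)
```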
- git-semantic index parses all tracked files, embeds each chunk, clusters files into subsystems using Leiden community detection, builds the spatial map, and commits everything to the semantic orphan branch.
- git push origin semantic shares the embeddings and map with the team.
- Everyone else runs git fetch origin semantic + git-semantic hydrate to populate their local SQLite search index (vector + FTS5), with no re-embedding needed.
- Agents use map to orient, get --mode outline to read cheaply, get file:start-end to retrieve exactly, and grep only when the map is insufficient.
→ Quickstart: install, index, and share in under 5 minutes
→ Navigation guide: map / get / grep workflow with examples
→ Repo health: reading the heatmap, drilling into communities
→ CI setup: keep the index fresh automatically
→ MCP setup: connect to Claude Code, Cursor, Windsurf
cargo install gitsem
Prerequisites: Rust 1.65+, Git 2.0+
Parses and embeds all tracked files, builds the spatial map, and commits to the semantic branch.
- First run: full index
- Subsequent runs: incremental — only changed files are re-embedded
- Respects .gitignore
- Skips binary files
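One way to picture incremental indexing is to compare each file's Git blob id against the id recorded at the previous run. The sketch below is an assumption about how change detection could work, not a description of git-semantic's internals; `files_to_reembed` is a hypothetical helper, and the blob-id formula applies to SHA-1 repositories.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def files_to_reembed(current, indexed):
    """Paths whose blob id is new or differs from the previously indexed one.

    current: {path: file bytes}; indexed: {path: blob id from the last run}.
    """
    return sorted(
        path for path, content in current.items()
        if indexed.get(path) != git_blob_id(content)
    )

# Hypothetical state: db.rs is unchanged, main.rs is new.
indexed = {"src/db.rs": git_blob_id(b"fn a() {}\n")}
current = {"src/db.rs": b"fn a() {}\n", "src/main.rs": b"fn main() {}\n"}
changed = files_to_reembed(current, indexed)
print(changed)
```

Only the paths returned here would need fresh embeddings; everything else can reuse the vectors already stored.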
Reads the semantic branch and populates the local .git/semantic.db index. Fetches origin/semantic first, falls back to local.
Show the spatial map of the codebase, or find the subsystem relevant to a task. Subsystems are built by Leiden community detection — files are grouped by embedding similarity, not filesystem location, so semantically related files cluster together even in flat repos.
git-semantic map
# → lists all subsystems with key functions and entry points
git-semantic map "where does embedding dispatch happen"
# → returns the most relevant subsystem with file locations and call edges
Output:
## embeddings — gemma: GemmaProvider, EmbeddingConfig, cache_dir, TextEmbedding
entry points:
src/embed.rs (via create_provider, EmbeddingConfig)
src/main.rs (via EmbeddingConfig, load_or_default)
src/embeddings/gemma.rs:1-45
src/embeddings/config.rs:0-47
...
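The idea of grouping files by embedding similarity rather than by directory can be sketched in a few lines of Python. This is deliberately not Leiden: as a simple stand-in, it links files whose cosine similarity clears a threshold and returns the connected components. The threshold, the two-dimensional vectors, and the function names are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def group_files(file_vectors, threshold=0.8):
    """Group files whose embeddings are similar.

    Stand-in for Leiden: links files with cosine similarity >= threshold
    and returns the connected components of that graph.
    """
    paths = sorted(file_vectors)
    parent = {p: p for p in paths}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for i, a in enumerate(paths):
        for b in paths[i + 1:]:
            if cosine(file_vectors[a], file_vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for p in paths:
        groups.setdefault(find(p), []).append(p)
    return sorted(groups.values())

# Hypothetical per-file vectors; real embeddings have hundreds of dimensions.
vectors = {
    "src/embed.rs": [1.0, 0.1],
    "src/embeddings/gemma.rs": [0.9, 0.2],
    "src/db.rs": [0.0, 1.0],
}
groups = group_files(vectors)
print(groups)
```

Note that src/embed.rs and src/embeddings/gemma.rs land in the same group even though they live in different directories, which is the property the map relies on.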
Retrieve a file by path or a specific chunk by line range.
File-level retrieval (three modes powered by tree-sitter):
git-semantic get src/db.rs --mode outline # name + line range per chunk — cheapest
git-semantic get src/db.rs --mode signatures # full declaration, no body
git-semantic get src/db.rs # full content of all chunks
Output includes callers, the files outside this one that reference it via edges:
// src/db.rs
// callers:
// src/main.rs (via hydrate_from_branch, grep_semantic)
L1-126 init_with_dimension
L128-140 clear
L142-161 insert_subsystem
L463-497 search_hybrid
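An agent can turn outline output straight into a follow-up get call. The Python sketch below parses the sample output shown above and builds the file:start-end argument; it assumes the outline format is exactly `L<start>-<end> <name>` per chunk, and both helper names are hypothetical.

```python
import re

OUTLINE_LINE = re.compile(r"^L(\d+)-(\d+)\s+(\S+)$")

def parse_outline(text):
    """Map each chunk name in outline output to its (start, end) line range."""
    ranges = {}
    for line in text.splitlines():
        match = OUTLINE_LINE.match(line.strip())
        if match:
            start, end, name = match.groups()
            ranges[name] = (int(start), int(end))
    return ranges

def get_argument(path, ranges, name):
    """Build the file:start-end argument for a follow-up get call."""
    start, end = ranges[name]
    return f"{path}:{start}-{end}"

outline = """\
// src/db.rs
// callers:
// src/main.rs (via hydrate_from_branch, grep_semantic)
L1-126 init_with_dimension
L128-140 clear
L142-161 insert_subsystem
L463-497 search_hybrid
"""
ranges = parse_outline(outline)
target = get_argument("src/db.rs", ranges, "search_hybrid")
print("git-semantic get", target)
```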
Chunk-level retrieval (exact or overlapping range):
git-semantic get src/embed.rs:9-17
git-semantic get src/embeddings/config.rs:0-100 # returns all overlapping chunks merged

| mode | mechanism | typical savings vs raw |
|---|---|---|
| outline | tree-sitter extracts identifier name only | ~96% |
| signatures | tree-sitter cuts at body node, keeps full declaration | ~86% |
| full (default) | all chunks concatenated | ~4% |
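Chunk-level retrieval by line range amounts to an interval-overlap query followed by a merge. The Python sketch below shows that logic under two assumptions: ranges are inclusive on both ends, and merging means concatenating chunk text in line order. The dictionary keys mirror the chunk fields; the function names are hypothetical.

```python
def overlapping_chunks(chunks, start, end):
    """Return chunks whose [start_line, end_line] range intersects [start, end]."""
    return [c for c in chunks if c["start_line"] <= end and c["end_line"] >= start]

def merge_text(chunks):
    """Concatenate chunk texts in line order."""
    ordered = sorted(chunks, key=lambda c: c["start_line"])
    return "\n".join(c["text"] for c in ordered)

# Hypothetical chunks for one file.
chunks = [
    {"start_line": 49, "end_line": 80, "text": "B"},
    {"start_line": 0, "end_line": 47, "text": "A"},
    {"start_line": 120, "end_line": 150, "text": "C"},
]
hits = overlapping_chunks(chunks, 0, 100)
merged = merge_text(hits)
print(merged)
```

A request for lines 0-100 touches the first two chunks only, so the third is never sent to the agent.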
Search code using three-signal hybrid search: BM25 (FTS5) + semantic embeddings + graph proximity, merged via Reciprocal Rank Fusion. Files connected via call edges to top semantic results are boosted automatically. Higher score = more relevant; a result scoring 2x the next is an unambiguous match.
git-semantic grep "how incoming requests are validated"
git-semantic grep "error propagation across async boundaries" -n 5
git-semantic grep "ExactIdentifierName"
Show a cohesion/coupling heatmap of all semantic communities. Use --community to drill into a specific one; it shows that community's files, top dependents, and top dependencies.
git-semantic health
git-semantic health --community "database"
Measure token savings across read modes for every indexed file, and replay actual navigation queries to compare grep vs map+get strategies.
git-semantic benchmark
git-semantic benchmark --json
Output includes:
- Token savings by language (outline / signatures vs raw)
- Read mode comparison table
- Session cost simulation
- Navigation comparison: grep precision vs map+outline+get precision across sampled subsystems
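The token-savings figures are ratios of the form 1 - reduced / raw. The Python sketch below shows that arithmetic with a crude whitespace tokenizer; how the benchmark actually counts tokens is not stated here, so treat `count_tokens` and the sample inputs as assumptions.

```python
def count_tokens(text):
    """Crude token count: whitespace-separated words (assumption)."""
    return len(text.split())

def savings(raw, reduced):
    """Fraction of tokens saved by sending `reduced` instead of `raw`."""
    raw_tokens = count_tokens(raw)
    if raw_tokens == 0:
        return 0.0
    return 1.0 - count_tokens(reduced) / raw_tokens

# Hypothetical 50-token function body versus its 2-token outline entry.
raw = " ".join(["tok"] * 50)
outline = "L1-3 add"
print(f"{savings(raw, outline):.0%}")
```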
Starts the MCP server (JSON-RPC over stdio). Exposes map, get, grep, and health as tools to any MCP-compatible client — Claude Code, Cursor, Codex, Windsurf, and others.
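For readers unfamiliar with what travels over stdio, the Python sketch below builds a single MCP `tools/call` request as one newline-terminated JSON-RPC 2.0 message. The envelope follows the MCP convention; the argument key `query` is a guess at the map tool's input schema, and a real client would first perform the MCP initialize handshake and discover schemas with `tools/list`.

```python
import json

def tool_call(request_id, tool, arguments):
    """Encode an MCP tools/call request as one newline-terminated JSON line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message) + "\n"

# "query" is a hypothetical argument name for the map tool.
line = tool_call(1, "map", {"query": "where does embedding dispatch happen"})
print(line, end="")
```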
git-semantic mcp
Register it in your client's config:
Claude Code (.claude/settings.json):
{
"mcpServers": {
"git-semantic": {
"command": "git-semantic",
"args": ["mcp"]
}
}
}
Cursor (.cursor/mcp.json):
{
"mcpServers": {
"git-semantic": {
"command": "git-semantic",
"args": ["mcp"]
}
}
}
Configure the embedding provider. Stored in .git/config, per-repository.
git-semantic config --list
git-semantic config provider openai
git-semantic config provider gemma
Once registered as an MCP server, any client can call the tools directly. The intended workflow:
Step 1 — orient
git-semantic map "natural language description of the task"
Read the output. If it names the function or file needed, go to step 2 immediately.
Step 2 — read cheaply
git-semantic get src/file.rs --mode outline # names + line ranges, ~96% token reduction
git-semantic get src/file.rs --mode signatures # declarations only, ~86% token reduction
Start with outline. If the declaration alone is enough, stop. If you need the body, go to step 3.
Step 3 — retrieve exactly
git-semantic get src/file.rs:start-end
Use the line ranges from the outline output directly. Maximum 3 calls per task.
Step 4 — search (last resort)
git-semantic grep "natural language query"
git-semantic grep "ExactIdentifierName"
Use when the map was genuinely insufficient. Search is hybrid (BM25 + semantic + graph proximity). For exact identifier lookups prefer grep over map; BM25 will find it precisely.
Orient once, read cheaply, retrieve exactly, never re-search what the map already answered.
Indexing only needs to happen once. Push the semantic branch and the whole team benefits — no API keys, no re-embedding.
# Once, by whoever has an API key
git-semantic index
git push origin semantic
# Everyone else
git fetch origin semantic
git-semantic hydrate
name: Semantic Index
on:
  push:
    branches: [main]
jobs:
  index:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Install git-semantic
        run: cargo install gitsem
      - name: Index codebase
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: git-semantic index
      - name: Push semantic branch
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git push origin semantic
git-semantic config provider gemma
Model files are cached at ~/.cache/fastembed by default. Override with FASTEMBED_CACHE_DIR.
export OPENAI_API_KEY="sk-..."
git-semantic config provider openai

| Key | Default | Description |
|---|---|---|
| provider | gemma | Embedding provider: gemma or openai |
| openai.model | text-embedding-3-small | OpenAI model |
| gemma.embeddingDim | 768 | Gemma embedding dimension |
Rust, Python, JavaScript, TypeScript, Java, C, C++, Go
git-semantic/
├── src/
│ ├── main.rs # CLI and command handlers
│ ├── map.rs # Subsystem and edge data types
│ ├── clustering.rs # Leiden community detection and edge extraction
│ ├── models.rs # CodeChunk data structure
│ ├── db.rs # SQLite + sqlite-vec + FTS5 hybrid search index
│ ├── embed.rs # Embedding dispatch
│ ├── semantic_branch.rs # Orphan branch read/write via git worktree
│ ├── embeddings/ # OpenAI, ONNX, and Gemma provider implementations
│ └── chunking/ # tree-sitter parsing and language detection
└── Cargo.toml
MIT OR Apache-2.0