Evaluate using Git CLI instead of LibGit2Sharp for high-performance history reading #7

@rian-be

Problem

ChangeTrace currently relies on LibGit2Sharp to reconstruct repository history and produce TraceEvent streams. The architecture and domain models are solid and maintainable, but profiling shows that some operations in the history reader become significant performance bottlenecks on large repositories (for example, 100k–200k commits).

The main performance issues are:

  • Repeated commit graph traversal for branch mapping (BuildCommitToBranchMap)
  • Object allocation for each Commit
  • Per-commit TreeDiff operations
  • Use of Task.Run in synchronous logic, which does not provide true asynchronous benefits

Meanwhile, Git CLI (git log) is highly optimized for streaming commit history directly from packfiles and can perform many of these operations far more efficiently.


Goal

The goal of this investigation is to determine which parts of the history reconstruction pipeline would benefit from using Git CLI rather than LibGit2Sharp. The intent is not to remove LibGit2Sharp entirely (it remains valuable for repository manipulation, branch operations, and detailed diff inspection) but to find the optimal split between:

  • Git CLI – for fast history extraction and commit metadata streaming
  • LibGit2Sharp – for operations requiring in-depth tree inspection or diff calculation

A potential hybrid pipeline could look like this:

git log  
   │  
   ▼  
Commit metadata stream  
   │  
   ▼  
TraceEvent reconstruction  
   │  
   ▼  
(optional) LibGit2Sharp diff operations
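
The first two stages of this pipeline can be sketched as follows. This is a rough illustration in Python, chosen only to stay independent of ChangeTrace's C# code; it is not the project's implementation. The `CommitMeta` type, the `LOG_FORMAT` field layout, and the function names are all hypothetical. The `git log` placeholders used (`%H`, `%P`, `%an`, `%aI`, `%s`, `%x1e`, `%x1f`) are standard pretty-format specifiers. Which revisions to walk (`--all` here) is also an assumption.

```python
import subprocess
from dataclasses import dataclass
from typing import Iterable, Iterator

# Hypothetical layout: 0x1e starts each commit record, 0x1f separates fields.
LOG_FORMAT = "%x1e%H%x1f%P%x1f%an%x1f%aI%x1f%s"


@dataclass(frozen=True)
class CommitMeta:
    sha: str
    parents: tuple[str, ...]
    author: str
    authored_at: str  # ISO 8601 string, as printed by %aI
    subject: str


def _parse_record(record: str) -> CommitMeta:
    # maxsplit=4 keeps a subject intact even if it contains the separator.
    sha, parents, author, authored_at, subject = record.strip("\n").split("\x1f", 4)
    return CommitMeta(sha, tuple(parents.split()), author, authored_at, subject)


def parse_log(chunks: Iterable[str]) -> Iterator[CommitMeta]:
    """Yield commits one at a time from arbitrarily sized text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last separator is a complete record.
        *complete, buffer = buffer.split("\x1e")
        for record in complete:
            if record.strip():
                yield _parse_record(record)
    if buffer.strip():
        yield _parse_record(buffer)


def stream_commits(repo_path: str) -> Iterator[CommitMeta]:
    """Run `git log` and stream its output without buffering the whole history."""
    proc = subprocess.Popen(
        ["git", "-C", repo_path, "log", "--all", f"--pretty=format:{LOG_FORMAT}"],
        stdout=subprocess.PIPE,
        text=True,
        encoding="utf-8",
        errors="replace",
    )
    assert proc.stdout is not None
    try:
        yield from parse_log(iter(lambda: proc.stdout.read(65536), ""))
    finally:
        proc.stdout.close()
        proc.wait()
```

Because `parse_log` is a generator, the TraceEvent reconstruction stage could consume commits as they arrive instead of materializing one object per commit up front. Whether that actually beats the current LibGit2Sharp traversal is exactly what needs to be measured.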

Areas to Investigate

  1. History Extraction – Measure performance of git log --pretty=... and git log --name-status compared to the current LibGit2Sharp commit traversal. Identify which approach gives the fastest timeline reconstruction.
  2. Branch Attribution – Explore alternatives to BuildCommitToBranchMap, such as git log --decorate or git branch --contains. Determine whether the current branch mapping is strictly necessary or can be simplified.
  3. Diff Strategy – Assess whether all commits truly require TreeDiff, or if some information can be derived directly from CLI output, reducing allocations and computation.
  4. Async Model – Check whether wrapping synchronous operations in Task.Run actually provides any benefit, or whether purely synchronous execution inside the history reader would be simpler and faster.
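
For items 2 and 3, the following sketch shows what could be derived from CLI output alone. It is written in Python purely for illustration; `FileChange`, `parse_name_status`, and `parse_decorations` are hypothetical names, not ChangeTrace APIs. It assumes the tab-separated lines emitted by `git log --name-status` (for example `M<TAB>path`, or `R100<TAB>old<TAB>new` for renames) and the comma-separated ref list produced by the `%D` placeholder. It also assumes paths without special characters; unusual paths are quoted by Git unless `-z` is used.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class FileChange:
    status: str                     # single letter: A, M, D, R, C, T, ...
    path: str                       # path after the change
    old_path: Optional[str] = None  # set only for renames and copies


def parse_name_status(lines: Iterable[str]) -> list[FileChange]:
    """Parse the per-commit file lines printed by `git log --name-status`."""
    changes = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank separator lines
        parts = line.rstrip("\n").split("\t")
        code = parts[0][:1]  # drop the similarity score, e.g. R100 -> R
        if code in ("R", "C") and len(parts) == 3:
            changes.append(FileChange(code, parts[2], parts[1]))
        elif len(parts) == 2:
            changes.append(FileChange(code, parts[1]))
    return changes


def parse_decorations(decoration: str) -> list[str]:
    """Extract non-tag ref names from a %D string such as 'HEAD -> main, tag: v1.0'."""
    names = []
    for ref in decoration.split(", "):
        ref = ref.strip()
        if ref.startswith("HEAD -> "):
            ref = ref[len("HEAD -> "):]
        if not ref or ref == "HEAD" or ref.startswith("tag: "):
            continue
        names.append(ref)
    return names
```

One caveat for the branch attribution question: decorations appear only on commits that a ref points at directly, so `--decorate` on its own identifies branch tips, not every commit reachable from a branch. Whether tips plus parent links are enough to replace BuildCommitToBranchMap is part of what this item should answer.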

Expected Outcome

By the end of this evaluation, we should clearly understand:

  • Which parts of the pipeline benefit most from Git CLI streaming
  • Which operations should remain handled by LibGit2Sharp
  • How a hybrid approach could dramatically improve performance without sacrificing correctness or maintainability
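
To make these comparisons concrete, each candidate reader could be run through the same small timing harness. The sketch below is one possible shape for it, in Python for illustration only; `measure` and its result keys are hypothetical. It drains whatever iterable a reader produces and reports the best wall-clock time over a few runs. It measures throughput only, so allocation behaviour would need a separate profiler.

```python
import time
from typing import Callable, Iterable


def measure(label: str, make_stream: Callable[[], Iterable[object]], repeats: int = 3) -> dict:
    """Drain the stream `repeats` times and keep the fastest run."""
    best = float("inf")
    count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        count = sum(1 for _ in make_stream())  # consume every item
        best = min(best, time.perf_counter() - start)
    rate = count / best if best > 0 else float("inf")
    return {"label": label, "items": count, "best_seconds": best, "items_per_second": rate}


if __name__ == "__main__":
    # Stand-in workload; a real run would pass a callable that reads commit history.
    print(measure("range", lambda: range(100_000)))
```

Taking the best of several runs reduces noise from cold file-system caches, which matters when the two readers touch packfiles in different ways.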

Benefits

Implementing a hybrid approach could significantly speed up timeline reconstruction for large repositories, reduce memory allocations, minimize redundant commit graph traversals, and improve ChangeTrace scalability, all while keeping the architecture and domain models clean and testable.
