
smriti-kumar/distributed-compiler-cache


distc — Distributed Compiler Cache

A peer-to-peer distributed build cache for C, C++, and Rust projects. When a file is compiled with distc, the resulting artifact (an object file or a binary) is stored across a cluster of local storage nodes. The next time anyone on the network compiles the same file with the same flags, compilation is skipped entirely and the pre-built artifact is pulled from the cache, turning, for example, a 300 ms build into a 5 ms download. distc is built from scratch as a systems engineering project, implementing content-addressable storage, consistent hashing, and fault-tolerant replication without any third-party caching infrastructure.

Performance examples

1. Single File Compilation

(base) $ distc build demo/single_files/hello.rs
Cache miss, compiling locally
Compilation succeeded. Binary is 442864 bytes.
Stored binary in cache
Build time: 744.9912109375 ms

(base) $ distc build demo/single_files/hello.rs
Cache hit, downloading binary
Build time: 5.7978515625 ms

2. Multi-File Project Build

(base) $ distc build-project demo/cpp_project
hello.cpp: Cache miss, compiling to object
Compilation succeeded. Object file is 10504 bytes.
Stored binary in cache
main.cpp: Cache miss, compiling to object
Compilation succeeded. Object file is 10432 bytes.
Stored binary in cache
Linking succeeded. Binary is 39120 bytes.
Successfully generated executable target
Compiled 2/2 files (0 cache hits) in 587.31103515625 ms

(base) $ distc build-project demo/cpp_project
hello.cpp: Cache hit, downloading object
main.cpp: Cache hit, downloading object
Linking succeeded. Binary is 39120 bytes.
Successfully generated executable target
Compiled 0/2 files (2 cache hits) in 36.925048828125 ms

How it works

Before invoking the compiler, distc hashes the source file contents and compiler flags into a SHA-256 fingerprint. This fingerprint is the key used to query a cluster of HTTP storage nodes. On a cache hit, the artifact is downloaded directly. On a miss, the real compiler runs and the result is stored on multiple nodes for redundancy; the number of copies is the replication factor set in the cluster configuration.
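The fingerprinting step can be sketched as follows. This is a minimal illustration and not the code in distc itself: the function name `fingerprint`, the NUL separator between fields, and the choice to hash raw bytes are all assumptions made for the example.

```python
import hashlib


def fingerprint(source: bytes, flags: list[str]) -> str:
    """Return a SHA-256 hex digest over source bytes and compiler flags.

    Hypothetical helper: the real key layout used by distc may differ.
    """
    hasher = hashlib.sha256()
    hasher.update(source)
    # A separator byte keeps the boundary between fields unambiguous,
    # so ["-O2", "-Wall"] and ["-O2-Wall"] hash differently.
    hasher.update(b"\x00")
    for flag in flags:
        hasher.update(flag.encode("utf-8"))
        hasher.update(b"\x00")
    return hasher.hexdigest()


if __name__ == "__main__":
    key = fingerprint(b"int main() { return 0; }", ["-O2", "-Wall"])
    print(key)
```

Identical source and flags always give the same key, while changing a single byte of the source or adding a flag gives a different one.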

Requests are routed to nodes using a consistent hash ring, so each artifact is always stored on and retrieved from the same deterministic nodes without broadcasting to the entire cluster. If a node goes offline, distc falls back to a replica or compiles locally, so the build never fails because of cache infrastructure.
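A consistent hash ring with virtual nodes can be sketched roughly like this. The class name `HashRing`, the `vnodes` count of 100, and the `node#i` labelling scheme are assumptions for illustration; the actual code in `distc/consistent_hash.py` may be organised differently.

```python
import bisect
import hashlib


class HashRing:
    """Illustrative consistent hash ring with virtual nodes (hypothetical API)."""

    def __init__(self, nodes: list[str], vnodes: int = 100) -> None:
        # Each physical node appears at `vnodes` positions on the ring.
        self._ring = sorted(
            (self._position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _position(key: str) -> int:
        # Use the first 8 bytes of SHA-256 as a 64-bit ring position.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def nodes_for(self, key: str, replicas: int = 2) -> list[str]:
        """Walk clockwise from the key and collect distinct physical nodes."""
        if not self._ring:
            return []
        start = bisect.bisect_right(self._positions, self._position(key))
        chosen: list[str] = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == replicas:
                    break
        return chosen


if __name__ == "__main__":
    ring = HashRing(["localhost:5001", "localhost:5002", "localhost:5003"])
    print(ring.nodes_for("example-fingerprint"))
```

The first node returned acts as the primary and the rest as replicas. Because the lookup depends only on the key and the node list, every client with the same configuration computes the same placement.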

Installation

Requires Python 3.10+ and whichever of gcc, g++, and rustc match the languages you build with distc.

git clone https://github.com/smriti-kumar/distributed-compiler-cache
cd distributed-compiler-cache
pip install -e .

Start the storage cluster (default is three nodes on localhost at ports 5001, 5002, and 5003):

bash scripts/run_nodes.sh

Stop the cluster:

bash scripts/stop_nodes.sh

Usage

# Compile a single file
distc build hello.cpp

# Compile with flags
distc build main.cpp -- -O2 -Wall

# Compile a multi-file C/C++ project with per-file object caching and linking
distc build-project src/

# Compile a Rust crate with whole-crate fingerprinting
distc build-project my_crate/

# Print a file's cache fingerprint and the node it would map to, without building
distc inspect main.cpp

# Check cluster health
distc status

# Clear all cached artifacts from all nodes
distc flush

Features

Content-addressable storage: artifacts are keyed by the SHA-256 hash of source contents and compiler flags. Changing one character produces a new key, and identical inputs always resolve to the same artifact.

Consistent hashing: a virtual node hash ring distributes artifacts across the cluster deterministically. Reads and writes always route to the same node without cluster-wide broadcasts.

Replication: each artifact is stored on multiple nodes, as determined by the replication factor set in the cluster configuration (default: 2). If the primary node is unavailable, a replica serves the request transparently.

Fault tolerance: if a node is unreachable during a build, it is marked offline for the remainder of that session and all operations route around it. If all nodes are offline, distc falls back to local compilation. The build never fails due to cache infrastructure.

Multi-file project builds: for C/C++ projects, each source file is compiled independently to an object file and cached separately. Only files whose contents or flags have changed are recompiled; unchanged files are downloaded from the cache. For Rust, the entire crate is fingerprinted as a unit, consistent with rustc's crate compilation model.

Language support: C (.c via gcc), C++ (.cpp/.cc via g++), Rust (.rs via rustc).
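The fault-tolerance behaviour described above (mark a failed node offline for the session, try the next replica, fall back to local compilation) might look something like the sketch below. The names `HealthTracker` and `fetch_artifact`, and the injected `fetch` callable, are hypothetical and chosen only to keep the example free of network code; they are not taken from the distc source.

```python
from typing import Callable, Iterable, Optional


class HealthTracker:
    """Remembers which nodes failed during the current session (hypothetical)."""

    def __init__(self) -> None:
        self._offline: set[str] = set()

    def is_online(self, node: str) -> bool:
        return node not in self._offline

    def mark_offline(self, node: str) -> None:
        self._offline.add(node)


def fetch_artifact(
    key: str,
    candidates: Iterable[str],
    fetch: Callable[[str, str], Optional[bytes]],
    health: HealthTracker,
) -> Optional[bytes]:
    """Try the primary, then each replica; return None if nothing is found."""
    for node in candidates:
        if not health.is_online(node):
            continue  # skip nodes that already failed this session
        try:
            artifact = fetch(node, key)
        except OSError:  # ConnectionError and socket timeouts are OSError subclasses
            health.mark_offline(node)
            continue
        if artifact is not None:
            return artifact
    return None  # the caller falls back to compiling locally


if __name__ == "__main__":
    def fake_fetch(node: str, key: str) -> Optional[bytes]:
        if node == "localhost:5001":
            raise ConnectionError("node down")
        return b"object-bytes"

    tracker = HealthTracker()
    print(fetch_artifact("abc", ["localhost:5001", "localhost:5002"], fake_fetch, tracker))
```

Returning `None` instead of raising keeps cache failures from ever surfacing as build failures: the caller treats it the same as a cache miss.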

Sharing distc across machines

By default, all nodes run on localhost. To share a cache across multiple machines on the same network, update config/cluster_config.json with the LAN IP addresses of the machines running node/node_server.py.

For machines on different networks, Tailscale can create a private mesh network between them. Replace localhost with each machine's Tailscale IP in the config. No code changes are required.

Project structure

distributed-compiler-cache/ 
├── distc/
│ ├── cli.py 
│ ├── cache_client.py 
│ ├── compiler_wrapper.py 
│ ├── consistent_hash.py 
│ └── health_tracker.py 
├── node/ 
│ └── node_server.py 
├── scripts/ 
│ ├── run_nodes.sh 
│ └── stop_nodes.sh 
├── config/ 
│ └── cluster_config.json 
├── demo/ 
│ ├── README.md
│ └── ... 
└── pyproject.toml
