A zero-copy shared-memory IPC framework for local AI workloads. C++17 core, lock-free SPSC data path, crash-resilient, with a frozen C ABI driven from Python and Rust.
In a loom, the shuttle carries the thread back and forth across the warp. Here it carries 50 MB tensors between processes in microseconds.
50 MB payload, end-to-end (producer commit → consumer holds the payload):
transport      median    vs Shuttle
─────────────────────────────────────────
Shuttle        5 µs      (baseline)
Unix socket    9.3 ms    1,857× slower
HTTP (raw)     8.5 ms    1,699× slower
(native Apple M-series, macOS — dev figures; see "Benchmark honesty" below)
Local AI stacks are polyglot: a Rust/Tauri frontend, Python sidecars, a C++ inference engine — all on one machine, shoveling large binary payloads (audio frames, embeddings, LLM context windows) between processes over localhost HTTP. On that path a 50 MB tensor is copied into a kernel socket buffer, through the loopback stack, into the receiver's socket buffer, and framed/deframed by HTTP — several full copies plus protocol overhead, per message.
Shuttle replaces that path for same-host communication: one region of physical RAM is mapped into both processes via POSIX shared memory. The producer writes a payload once; the consumer reads it in place. Measured consumer-side cost of receiving 2 GB over the borrow path: 0.22 ms of CPU — 0.03% of what the same bytes cost over a Unix socket.
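The mapping step described above can be sketched with the standard POSIX calls. This is a minimal illustration, not Shuttle's lifecycle code: the helper name map_segment and the segment name used in the usage note are hypothetical, and a real segment would also carry a validated header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fcntl.h>     // O_CREAT, O_RDWR
#include <sys/mman.h>  // shm_open, shm_unlink, mmap, munmap
#include <sys/stat.h>  // mode constants
#include <unistd.h>    // ftruncate, close

// Hypothetical helper: map `size` bytes of a named POSIX shared-memory object.
// With create=true the object is created if needed and sized first.
// Returns nullptr on any failure.
void* map_segment(const char* name, std::size_t size, bool create) {
    const int flags = create ? (O_CREAT | O_RDWR) : O_RDWR;
    const int fd = shm_open(name, flags, 0600);
    if (fd < 0) return nullptr;
    if (create && ftruncate(fd, static_cast<off_t>(size)) != 0) {
        close(fd);
        return nullptr;
    }
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return p == MAP_FAILED ? nullptr : p;
}
```

Calling map_segment("/shm-sketch", 4096, true) in one process and map_segment("/shm-sketch", 4096, false) in another yields two virtual addresses backed by the same physical pages, so bytes written through the first pointer are visible through the second without a copy. The same holds for two mappings inside one process, which is a convenient way to try it out.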
- Strictly SPSC, point-to-point. One writer, one reader per channel; multi-process stacks compose pairwise channels. This is what makes the lock-free data path sound: every shared cursor has exactly one writer.
- Bipartite buffer (BipBuffer), not a plain ring. Every reserved block is physically contiguous: a payload never straddles the wrap point, so the zero-copy pointer handoff is always valid. The cursor model is three absolute offsets (read/write/watermark, bbqueue-style), each strictly single-writer.
- Lock-free hot path. Cursors are atomics published with release stores and observed with acquire loads. The full happens-before argument for every shared atomic is written inline in include/shuttle/spsc.hpp.
- Parking, not polling. A blocked peer sleeps (idle cost measured at 0.05% CPU) and wakes in microseconds. The park decision uses a seq_cst Dekker protocol to close the classic store→load race; every wait is a bounded timedwait, so nothing can sleep forever.
- Backpressure, never drops. A full buffer blocks the producer; data integrity is non-negotiable for embeddings and context windows. Oversized writes fail fast instead of blocking forever (validated at channel creation).
- Crash resilience. Heartbeat liveness is the primary mechanism on both platforms: a peer SIGKILLed mid-transfer (even while holding the park mutex) leaves the survivor with a clean PEER_DEAD error, never a deadlock. Linux adds robust-mutex (EOWNERDEAD) recovery; macOS parks on os_sync_wait_on_address, which holds nothing a dying process could orphan.
- Frozen C ABI. Ten functions, integer error codes, no exception ever crosses the boundary (include/shuttle/shuttle_c.h). Python binds via cffi with a zero-copy memoryview that invalidates on release; the Rust wrapper makes use-after-release a compile error (E0597) via borrow lifetimes.
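The contiguity and single-writer properties in the list above can be illustrated with a small in-process sketch. It follows the publicly described bbqueue scheme (write, read and watermark cursors) and is an assumption about the general technique, not a copy of Shuttle's bipbuffer.hpp or spsc.hpp. The class name BipSketch and its methods are hypothetical, and the cursors here are positions inside the buffer rather than Shuttle's absolute offsets.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical sketch of a bbqueue-style BipBuffer.
// write_ and last_ are stored only by the producer, read_ only by the consumer.
class BipSketch {
public:
    explicit BipSketch(std::size_t capacity)
        : buf_(capacity), cap_(capacity), last_(capacity) {}

    const unsigned char* base() const { return buf_.data(); }

    // Producer: reserve n contiguous bytes, or nullptr if they do not fit now.
    unsigned char* reserve(std::size_t n) {
        const std::size_t w = write_.load(std::memory_order_relaxed);
        const std::size_t r = read_.load(std::memory_order_acquire);
        std::size_t start;
        if (w < r) {                      // already wrapped: free space is [w, r)
            if (w + n >= r) return nullptr;
            start = w;
        } else if (w + n <= cap_) {       // fits before the physical end
            start = w;
        } else if (n < r) {               // skip the tail, restart at offset 0
            start = 0;
        } else {
            return nullptr;
        }
        reserved_end_ = start + n;
        return buf_.data() + start;
    }

    // Producer: publish the block handed out by the last reserve().
    void commit() {
        const std::size_t w = write_.load(std::memory_order_relaxed);
        if (reserved_end_ < w) {          // wrapped: valid data ends at the old write
            last_.store(w, std::memory_order_release);
        } else if (reserved_end_ > last_.load(std::memory_order_relaxed)) {
            last_.store(cap_, std::memory_order_release);
        }
        write_.store(reserved_end_, std::memory_order_release);
    }

    // Producer convenience: reserve, copy once, commit.
    bool push(const void* src, std::size_t n) {
        unsigned char* dst = reserve(n);
        if (dst == nullptr) return false;
        std::memcpy(dst, src, n);
        commit();
        return true;
    }

    // Consumer: borrow the next contiguous readable span; len == 0 means empty.
    const unsigned char* peek(std::size_t& len) {
        const std::size_t w = write_.load(std::memory_order_acquire);
        const std::size_t l = last_.load(std::memory_order_acquire);
        std::size_t r = read_.load(std::memory_order_relaxed);
        if (r == l && w < r) {            // reached the watermark: follow the wrap
            r = 0;
            read_.store(0, std::memory_order_release);
        }
        len = (w < r ? l : w) - r;
        return buf_.data() + r;
    }

    // Consumer: hand n bytes of the borrowed span back to the producer.
    void release(std::size_t n) {
        read_.store(read_.load(std::memory_order_relaxed) + n,
                    std::memory_order_release);
    }

private:
    std::vector<unsigned char> buf_;
    const std::size_t cap_;
    std::size_t reserved_end_ = 0;        // producer-private scratch
    std::atomic<std::size_t> write_{0};
    std::atomic<std::size_t> read_{0};
    std::atomic<std::size_t> last_;       // watermark: end of valid data before a wrap
};
```

With an 8-byte buffer, pushing 6 bytes, releasing them, and then pushing 4 more places the second block at offset 0 instead of splitting it 2 + 2 across the wrap point. The producer gives up the unused tail in exchange for a pointer the consumer can always read in one piece.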
Requirements: macOS (Apple silicon) with Xcode CLT + CMake, and Docker Desktop for the Linux leg. The FFI tests additionally use python3 + cffi and rustc (both preinstalled in the provided container image).
make test-mac # native build + full test suite under ASan/UBSan
make test-linux # the same, inside a glibc arm64 container (--shm-size=512m)
make tsan-mac # ThreadSanitizer legs (separate build trees)
make tsan-linux

Minimal producer/consumer over the C ABI:
#include <shuttle/shuttle_c.h>
/* producer process */
int err;
shuttle_channel* ch = shuttle_create("/my-chan", 128u << 20, 64u << 20, &err);
void* span;
shuttle_acquire_write(ch, &span, payload_len, 0); /* contiguous, in-segment */
fill_tensor(span, payload_len); /* write the payload ONCE */
shuttle_commit_write(ch, payload_len);
/* consumer process */
shuttle_channel* ch = shuttle_open("/my-chan", &err);
const void* p; size_t len;
shuttle_acquire_read(ch, &p, &len, 0); /* zero-copy borrow */
run_inference(p, len); /* read in place */
shuttle_release_read(ch);

The benchmark harness (shuttle_bench, built unsanitized at -O2) runs all three transports over identical workloads and prints the table above, labeling container runs as virtualized.
The build was driven gate-by-gate through an 8-phase plan (docs/Shuttle_Implementation_Plan.md) with one rule: one new variable per phase — data-structure logic proven before concurrency, concurrency before IPC, ordering before wake mechanics, wake before crash recovery. All 27 gates passed on both platforms; the complete ledger with per-gate evidence, dated decisions, and the failures encountered along the way is in PROGRESS.md.
Highlights of what the suite (28 tests, ASan + TSan clean on both legs) actually proves:
- 200k-pair randomized property test of the BipBuffer with invariants checked after every operation (19k+ wraps in the tight configuration).
- ≥1 GiB two-process byte-exact FIFO stress; asymmetric-speed stress with the spin paths proven engaged; a wrap-heavy stress that fires the delicate A→B handoff 57k times.
- 100k trickle park/wake cycles with zero lost wakeups; hot path verified to take zero locks when the peer isn't parked.
- SIGKILL crash tests at both kill points (mid-transfer, and while holding the park mutex), on both platforms, including proof that the test can fail (a deliberately buggy recovery leaves the mutex ENOTRECOVERABLE).
- Cross-language byte-exact runs (C++→Python, C++→Rust) over the borrow path, and an induced-error sweep showing every failure surfaces as the right integer in all three languages.
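The two properties the park/wake test checks (no lost wakeups, and no lock taken while the peer is not parked) follow from a store-then-load handshake on both sides. The sketch below shows that pattern inside one process using std::mutex and std::condition_variable. It is an assumption about the general technique only: the type ParkSketch and its members are hypothetical, and Shuttle's real wait primitives differ per platform.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical in-process sketch of a Dekker-style park/wake handshake.
struct ParkSketch {
    std::atomic<std::uint64_t> published{0};  // stored only by the producer
    std::atomic<bool> parked{false};          // stored only by the consumer
    std::mutex m;
    std::condition_variable cv;
    std::uint64_t slow_wakes = 0;             // producer-private: wakes that took the lock

    // Producer: publish one item (store), then check whether the peer parked (load).
    void publish() {
        published.fetch_add(1, std::memory_order_seq_cst);
        if (parked.load(std::memory_order_seq_cst)) {
            { std::lock_guard<std::mutex> g(m); }  // order the wake after the waiter's check
            cv.notify_one();
            ++slow_wakes;
        }
    }

    // Consumer: wait until more than `seen` items exist, for at most `timeout`.
    bool wait_past(std::uint64_t seen, std::chrono::milliseconds timeout) {
        if (published.load(std::memory_order_seq_cst) > seen) return true;  // no lock
        parked.store(true, std::memory_order_seq_cst);   // announce intent (store) ...
        std::unique_lock<std::mutex> lk(m);
        const bool ok = cv.wait_for(lk, timeout, [&] {   // ... then re-check (load)
            return published.load(std::memory_order_seq_cst) > seen;
        });
        parked.store(false, std::memory_order_seq_cst);
        return ok;
    }
};
```

Because both sides store their own flag before loading the other's, and all four accesses are seq_cst, at least one side must observe the other: either the producer sees parked and takes the slow path, or the consumer's re-check sees the new item. The bounded wait_for is only a backstop, so a missed notification could delay the consumer but never strand it.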
Benchmark honesty:
- Numbers above are from a native Apple M-series host (macOS) and are development figures. Container figures (Docker on the same host) are 24 µs median for the 50 MB blob, still 482× and 541× faster than UDS and HTTP respectively, but they are labeled virtualized, not headline.
- The production target is Linux; the headline claim is provisional until the harness runs on bare-metal Linux (make test-linux on any glibc box, or run shuttle_bench directly).
- The HTTP baseline is deliberately fair: raw uncompressed body, keep-alive, TCP_NODELAY, 4 MB socket buffers. That is HTTP doing the least wasteful thing it can. A Unix-domain-socket baseline is included as the stronger comparator.
- "Zero serialization" applies to payloads already in flat binary layout (PCM, float32 tensors, blobs). Application-level structuring costs exist on every transport and are not what Shuttle removes.
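For readers who want to check the ratios themselves, the arithmetic is a median per transport followed by a division. The helper names below are hypothetical and are not taken from shuttle_bench.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a non-empty sample set (mean of the two middle values for even sizes).
double median(std::vector<double> v) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return n % 2 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// How many times slower a baseline median is than the Shuttle median (same unit).
double slowdown(double baseline_us, double shuttle_us) {
    return baseline_us / shuttle_us;
}
```

Using the rounded native figures, slowdown(9300, 5) is 1860 and slowdown(8500, 5) is 1700. Both are close to the published 1,857× and 1,699×, which suggests the table was computed from unrounded medians; that is an inference from the numbers, not something the source states.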
Same-host, single-producer/single-consumer, one-way channels. Cross-machine transport, Windows, MPMC/pub-sub, and payload schemas are explicitly out of scope (see docs/Shuttle_SRS.md). macOS crash recovery is best-effort by design (no robust mutexes exist there); Linux is the hard-guarantee platform.
include/shuttle/   header.hpp (segment layout), bipbuffer.hpp (core logic),
                   spsc.hpp (lock-free path + parking), platform.hpp (the ONLY
                   file allowed to #ifdef on platform), shuttle_c.h (C ABI v1)
src/               lifecycle (shm_open/mmap/validate) + C ABI implementation
tests/             28 gate tests; tests/ffi/ holds the Python + Rust bindings
bench/             three-transport benchmark harness
docs/              SRS, implementation plan, build directive
PROGRESS.md        the complete build ledger: every gate, decision, and dead end