Skip to content

Fame510/SINGULARITY

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SINGULARITY: Disaggregated KV-Cache Service (DKCS)

Status: Private Alpha | Target: $1B+ Acquisition
Owner: Dante Bullock, Sovereign Logic


What It Is

SINGULARITY is a Distributed Virtual Memory Manager (DVMM) for AI compute. It decouples the Model Weights (static, loaded once) from the KV-Cache/Context (dynamic, fluid) across a GPU cluster. The result: GPU utilization jumps from 25% to 85%, and "Cold Start" latency drops from 60-120 seconds to <50ms.

The arbitrage: Companies are burning 60-70% of their GPU spend on idle overhead because Kubernetes can't handle "Flash-Compute" (dynamic, spiking agent workloads). SINGULARITY turns a GPU cluster into a single, fluid memory pool.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    SINGULARITY STACK                         │
├─────────────────────────────────────────────────────────────┤
│  Python SDK: `import singularity`                           │
│  (singularity-python/)                                       │
├─────────────────────────────────────────────────────────────┤
│  Predictive Oracle (CLOSED SOURCE): Sparse Attention Router  │
│  (singularity-oracle/)                                       │
├─────────────────────────────────────────────────────────────┤
│  Transport Layer (OPEN SOURCE): RDMA/RoCE v2 Fabric         │
│  (singularity-transport/)                                    │
├─────────────────────────────────────────────────────────────┤
│  vLLM Connector: Native KVConnector Backend                  │
│  (singularity-core/)                                         │
├─────────────────────────────────────────────────────────────┤
│  Hardware: H100/A100 GPUs + InfiniBand/RoCE v2               │
└─────────────────────────────────────────────────────────────┘

Key Innovation: Sparse Teleportation

Moving 16GB of KV-cache takes ~320ms over 400Gbps (physics). Our solution:

  1. Head-First Protocol: Identify the 12% of KV-cache blocks that contain 85% of attention weight
  2. Teleport the Head: Move 128-256MB in <5ms via RDMA
  3. Lazy Stream: Background-load the remaining 88% while generation runs
  4. Result: Perceived cold start <50ms, actual throughput = 26x faster

Business Model: Open-Core

Tier What License
Community Transport layer, basic KV Connector, daemon Apache 2.0
Enterprise Predictive Oracle, Sparse Teleportation, Dashboard Proprietary

Exit Strategy: Sell to CoreWeave/Lambda Labs (GPU clouds bleeding on idle H100s) or Microsoft Azure (win the cloud war). Show them 40% GPU waste recovery = $200M+/year OpEx savings.

Repo Structure

singularity-core/     — vLLM Native KVConnector (C++/Python, open source)
singularity-transport/ — RDMA/RoCE v2 fabric (Rust, open source)
singularity-oracle/    — Sparse Attention Router (Rust, CLOSED SOURCE)
singularity-python/    — Python SDK (pip install singularity)
scripts/               — Benchmarks, deployment
docs/                  — Architecture, API docs

The Pitch

"We decoupled Thought (Session Context) from Brain (GPU Weights). You're treating H100s like 1990s mainframes. We turn them into a Global Memory Fabric. We are the OS for the Sovereign AI Era."

About

Disaggregated KV-Cache Service — Zero-latency compute fabric for AI agents. Proprietary systems engineering.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors