CowCow

CowCow is a Rust-based priority data plane for physical AI. It helps edge devices capture, verify, rank, and synchronize high-volume data when connectivity is unreliable.

In plain terms: CowCow decides what edge AI data deserves bandwidth first.

Robots, vehicles, drones, cameras, phones, and field devices can collect more data than teams can upload right away. CowCow keeps the raw data local, breaks it into verifiable chunks, explains why some samples should move first, and resumes safely after the Wi-Fi drops or a process dies.

Why It Matters

Physical AI teams do not just need more data. They need to know which data is urgent, trustworthy, private, corrupted, repetitive, or worth reviewing.

CowCow is built for the boring but important part of that workflow:

  • ingest files without moving the originals
  • compute fast BLAKE3 hashes
  • split files into content-addressed chunks
  • keep an append-only event journal
  • run cheap local quality checks
  • assign explainable sync priority
  • hold data blocked by policy or privacy
  • sync important chunks first
  • resume after interrupted transfers
  • prove local and remote hashes match
  • generate dataset and integrity reports

CowCow does not delete raw data automatically.
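
To make "explainable sync priority" and "hold data blocked by policy" concrete, here is a minimal sketch of a rule-based scorer. It is not CowCow's actual scoring code: the `Sample` fields, the point values, and the names `SyncClass`, `Decision`, and `score` are all assumptions made for illustration. The point is only that each decision carries the reasons that produced it, and that a held sample is kept local rather than dropped.

```rust
/// Sync classes used in this sketch. `Held` means "keep local, do not upload".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyncClass {
    Urgent,
    High,
    Normal,
    Held,
}

/// Hypothetical per-sample facts; the field names are illustrative only.
struct Sample {
    qc_failed: bool,
    operator_flagged: bool,
    near_duplicate: bool,
    private: bool,
}

/// A class plus the human-readable rules that led to it.
struct Decision {
    class: SyncClass,
    reasons: Vec<&'static str>,
}

/// Applies a few invented rules and records why each one fired.
fn score(s: &Sample) -> Decision {
    let mut reasons = Vec::new();

    // A policy hold wins over every other rule. The sample stays on disk.
    if s.private {
        reasons.push("policy: private sample held locally");
        return Decision { class: SyncClass::Held, reasons };
    }

    let mut points = 0i32;
    if s.operator_flagged {
        points += 2;
        reasons.push("+2 flagged by operator");
    }
    if s.qc_failed {
        points += 1;
        reasons.push("+1 QC failure needs review");
    }
    if s.near_duplicate {
        points -= 1;
        reasons.push("-1 near-duplicate of existing data");
    }

    let class = if points >= 2 {
        SyncClass::Urgent
    } else if points == 1 {
        SyncClass::High
    } else {
        SyncClass::Normal
    };
    Decision { class, reasons }
}
```

Calling `score` on a sample with only `operator_flagged` set yields `SyncClass::Urgent` with the single reason `"+2 flagged by operator"`, which is the kind of explanation a report could print next to each sample.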

What CowCow Is Not

CowCow is not a dashboard. It is not a data labeling tool. It is not another dataset manager.

It also does not replace MCAP, Zenoh, ReductStore, rclone, restic, lakeFS, or NVIDIA Holoscan. CowCow sits around the edge data flow as the local-first priority, provenance, and integrity layer.

Get Started

Prerequisites

  • Rust 1.70 or newer
  • Git

Build and test

git clone https://github.com/deepubuntu/cowcow.git
cd cowcow
cargo test

The integration tests cover interrupted sync, deleted-remote resume, and the fixture pack pipeline.

Generate test fixtures

CowCow ships a small multimodal fixture pack for local testing (synthetic payloads, not real sensor recordings):

./scripts/generate-fixtures.sh

See examples/fixtures/README.md for layout and roles (video, LiDAR-like .bin, images, telemetry, held/private sample).

Run the demo pipeline

rm -rf fleet-demo remote

cargo run -p cowcow-cli -- init fleet-demo
cargo run -p cowcow-cli -- ingest ./examples/fixtures --project fleet-demo
cargo run -p cowcow-cli -- chunk --project fleet-demo --chunk-size 64kb
cargo run -p cowcow-cli -- qc --project fleet-demo
cargo run -p cowcow-cli -- score --project fleet-demo
cargo run -p cowcow-cli -- doctor --project fleet-demo
cargo run -p cowcow-cli -- sync ./remote --project fleet-demo --resume --priority urgent,high,normal
cargo run -p cowcow-cli -- verify --project fleet-demo
cargo run -p cowcow-cli -- report --project fleet-demo

Re-running ingest on the same files reports duplicates instead of silently doing nothing:

Ingested 0 new sample(s), skipped 9 duplicate(s), scanned 9 file(s)
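
The duplicate report above follows naturally from hash-based deduplication. The sketch below shows the idea under stated assumptions: files are represented only by their content hashes (plain strings here), and `IngestSummary`, `ingest`, and `summary_line` are hypothetical names, not CowCow's internal API.

```rust
use std::collections::HashSet;

/// Counters for one ingest run.
#[derive(Debug, Default, PartialEq)]
struct IngestSummary {
    new_samples: usize,
    duplicates: usize,
    scanned: usize,
}

/// Registers each content hash once. A hash that is already known is
/// counted as a duplicate instead of being stored a second time.
fn ingest(known: &mut HashSet<String>, hashes: &[&str]) -> IngestSummary {
    let mut summary = IngestSummary::default();
    for h in hashes {
        summary.scanned += 1;
        // `insert` returns false when the value was already present.
        if known.insert((*h).to_string()) {
            summary.new_samples += 1;
        } else {
            summary.duplicates += 1;
        }
    }
    summary
}

/// Formats the summary in the same shape as the CLI line shown above.
fn summary_line(s: &IngestSummary) -> String {
    format!(
        "Ingested {} new sample(s), skipped {} duplicate(s), scanned {} file(s)",
        s.new_samples, s.duplicates, s.scanned
    )
}
```

Running `ingest` twice over the same nine hashes with one shared `known` set gives nine new samples on the first pass and nine duplicates on the second.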

Failure demo

rm -rf remote
cargo run -p cowcow-cli -- simulate network-failure ./remote --project fleet-demo --after-chunks 1

The goal:

100GB in.
Wi-Fi dies.
Process dies.
CowCow resumes.
Important clips moved first.
Hashes match.
Manifest proves chain of custody.
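
The resume behaviour in that scenario can be sketched as a sync loop that orders chunks by priority and asks the destination what it already holds. This is an illustration, not CowCow's sync engine: the destination is modelled as an in-memory set of chunk ids, the `budget` parameter stands in for "the network died after N chunks", and `Priority`, `QueuedChunk`, and `sync_run` are hypothetical names.

```rust
use std::collections::BTreeSet;

/// Declaration order doubles as sort order: Urgent sorts before High,
/// and High before Normal, because `Ord` is derived.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Priority {
    Urgent,
    High,
    Normal,
}

#[derive(Debug, Clone)]
struct QueuedChunk {
    id: String,
    priority: Priority,
}

/// Sends chunks most-urgent-first and skips any chunk the destination
/// already has. Stops after `budget` transfers to imitate a failure.
/// Returns the ids sent during this run, in send order.
fn sync_run(
    queue: &[QueuedChunk],
    remote: &mut BTreeSet<String>,
    budget: usize,
) -> Vec<String> {
    let mut ordered: Vec<&QueuedChunk> = queue.iter().collect();
    // Stable sort: chunks with equal priority keep their queue order.
    ordered.sort_by_key(|c| c.priority);

    let mut sent = Vec::new();
    for chunk in ordered {
        if sent.len() == budget {
            break; // simulated Wi-Fi or process failure
        }
        // Resume is destination-aware: only chunks missing remotely are sent.
        if remote.insert(chunk.id.clone()) {
            sent.push(chunk.id.clone());
        }
    }
    sent
}
```

A first call with `budget` set to 1 moves only the urgent chunk. A second call with a large budget sends the rest without repeating it. If the destination set is emptied between runs, every chunk is sent again, because the loop trusts the destination's contents rather than a local "done" flag.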

Commands

Command                                                             Purpose
init <project>                                                      Create project layout and cowcow.yml
ingest <path> --project <dir>                                       Register files; reads manifest.jsonl in the ingest folder when present
chunk --project <dir> --chunk-size 64mb                             Fixed-size chunks + BLAKE3 hashes
qc --project <dir>                                                  Local quality checks (pass / warn / fail, never auto-delete)
score --project <dir>                                               Rule-based priority + sync class
doctor --project <dir>                                              Project health: samples, chunks, missing files
sync <dest> --project <dir> --resume --priority urgent,high,normal  Priority filesystem sync
verify --project <dir>                                              Local chunk + manifest integrity
report --project <dir>                                              Dataset, sync, and integrity reports
simulate network-failure ...                                        Interrupt sync, then resume
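
As a rough picture of what the chunk command's "fixed-size chunks + hashes" means, the sketch below splits a byte buffer into fixed-size pieces and names each piece by a hash of its content. Two loud assumptions: it works on in-memory bytes rather than files, and it uses the standard library's `DefaultHasher` as a stand-in because BLAKE3 is not in std. `DefaultHasher` is a 64-bit, non-cryptographic hash and is not suitable for real integrity checks. `Chunk`, `content_id`, and `chunk_bytes` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// One fixed-size slice of a file, addressed by the hash of its bytes.
#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    offset: usize,
    len: usize,
    id: String,
}

/// Stand-in content hash. NOT BLAKE3 and not cryptographic; a real
/// implementation would call a BLAKE3 library here instead.
fn content_id(bytes: &[u8]) -> String {
    let mut hasher = DefaultHasher::new();
    hasher.write(bytes);
    format!("{:016x}", hasher.finish())
}

/// Splits `data` into `chunk_size`-byte pieces. The last piece may be
/// shorter. Identical pieces get identical ids (content addressing).
fn chunk_bytes(data: &[u8], chunk_size: usize) -> Vec<Chunk> {
    assert!(chunk_size > 0, "chunk size must be non-zero");
    data.chunks(chunk_size)
        .enumerate()
        .map(|(i, part)| Chunk {
            offset: i * chunk_size,
            len: part.len(),
            id: content_id(part),
        })
        .collect()
}
```

For a 150-byte buffer and a 64-byte chunk size this produces three chunks at offsets 0, 64, and 128, with the last one 22 bytes long. Because ids depend only on content, two chunks with the same bytes share one id, which is what lets a sync skip data the destination already holds.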

Project layout:

project/
  cowcow.yml
  .cowcow/
    journal.jsonl
    chunks/
    manifests/
    queue/
  data/ metadata/ qc/ reports/ exports/
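
Given a manifest that records the expected hash of every chunk, an integrity check reduces to re-hashing what is on disk and comparing. The sketch below illustrates that comparison only. It assumes the manifest and the chunk store are in-memory maps keyed by chunk name (the real on-disk formats under .cowcow/ are not described here), and it uses std's `DefaultHasher` as a non-cryptographic stand-in for BLAKE3. `Finding`, `content_id`, and `verify` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::Hasher;

/// Stand-in content hash. NOT BLAKE3 and not cryptographic.
fn content_id(bytes: &[u8]) -> String {
    let mut hasher = DefaultHasher::new();
    hasher.write(bytes);
    format!("{:016x}", hasher.finish())
}

/// A problem found while checking one manifest entry.
#[derive(Debug, PartialEq)]
enum Finding {
    /// The manifest lists the chunk but the store does not have it.
    Missing(String),
    /// The chunk exists but its bytes no longer match the recorded hash.
    Corrupt(String),
}

/// Re-hashes every chunk named in `manifest` (name -> expected id) and
/// reports entries that are missing from `store` or hash differently.
/// An empty result means every recorded hash still matches.
fn verify(
    manifest: &BTreeMap<String, String>,
    store: &BTreeMap<String, Vec<u8>>,
) -> Vec<Finding> {
    let mut findings = Vec::new();
    for (name, expected) in manifest {
        match store.get(name) {
            None => findings.push(Finding::Missing(name.clone())),
            Some(bytes) if content_id(bytes) != *expected => {
                findings.push(Finding::Corrupt(name.clone()))
            }
            Some(_) => {} // hash matches, nothing to report
        }
    }
    findings
}
```

The same comparison works for a remote copy: run it against the destination's chunks with the local manifest, and an empty result is the evidence that local and remote hashes match.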

Phase 1 status (current)

Phase 1 is the active release line: prove correctness under failure before optimizing for speed or AI.

Shipped:

  • Rust workspace (crates/cowcow-*)
  • Multimodal ingest with hash deduplication
  • Fixture pack + manifest.jsonl metadata overlay
  • Fixed chunking, BLAKE3 verification
  • JSONL manifests, journal, sync state
  • Rule-based priority scoring and policy hold
  • Local filesystem sync with destination-aware resume
  • doctor, verify, reports
  • Integration tests

Good for: internal testing, demos, pilot scripts on edge laptops.

Not yet production-ready for fleets: cloud object storage (S3), compression at scale, real sensor decode (ffprobe / MCAP), ops dashboards.

Planned next

Priority  Item
High      S3-compatible sync for real uploads
Medium    zstd compression, FastCDC chunking
Medium    MCAP import/export, Parquet metadata
Later     Local AI adapters (Ollama, ONNX) for triage
Later     P2P / store-and-forward sync
Later     Factory-scale curation exports

Development

cargo check
cargo test
cargo fmt
cargo test -p cowcow-cli --test offline_resume
cargo test -p cowcow-cli --test fixtures_pipeline
