Embodied.cpp is an inference runtime for embodied AI models — Vision-Language-Action (VLA) and World-Action Models (WAM) that let robots perceive and act in the real world. It runs these models efficiently on heterogeneous hardware (CPU / CUDA GPU / NPU) using GGUF weights, and ships ready-to-use servers and evaluation clients.
VLA models take sensor images and language instructions as input and output robot action commands directly.
| Model | Status | What it does |
|---|---|---|
| pi0.5 | ✅ | PaliGemma-based policy — great starting point for experimentation |
| HY-VLA | ✅ | Hunyuan dual-tower vision-language model with action head; supports RoboTwin dual-arm evaluation |
| StarVLA | 🚧 | Modular backbone with swappable action heads (coming soon) |
| OpenVLA | 🚧 | 7B open VLA with Llama 2 + DINOv2/SigLIP features (coming soon) |
| Qwen-VLA | 🚧 | Qwen3.5-4B backbone with DiT flow-matching action decoder (coming soon) |
World-Action models predict future video or latent trajectories as part of planning actions — they reason about "what happens next" before deciding what to do.
| Model | Status | What it does |
|---|---|---|
| LingBot-VA | ✅ | Video-action model with VAE bridge — evaluated on LIBERO task suites |
| DreamZero | 🚧 | 14B video-diffusion world model for zero-shot policies (coming soon) |
| UnifoLM-WMA-0 | 🚧 | Multi-embodiment robot learning with world-model action head (coming soon) |
| Being-H0.7 | 🚧 | Latent world-action model with future-aware reasoning (coming soon) |
| FastWAM | 🚧 | Fast WAM that skips test-time future generation for speed (coming soon) |
```bash
# Clone the repo and fetch third-party code
git clone <repo-url> && cd embodied.cpp
./patches/init_third_party.sh
```

CPU-only:

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target vla-server lingbot-world-server -j$(nproc)
```

CUDA GPU:

```bash
cmake -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
  -DCMAKE_CUDA_ARCHITECTURES=<your-arch> \
  -DProtobuf_PROTOC_EXECUTABLE=/usr/bin/protoc
cmake --build build --target lingbot-world-server -j$(nproc)
```

```bash
# VLA server (pi0.5, HY-VLA)
./build/vla-server --model <path-to-gguf> (<path-to-mmproj>)

# LingBot world-action server
./build/lingbot-world-server --model <path-to-gguf>
```

```bash
# Install the LIBERO runtime once
bash eval/sim/libero/setup_libero.sh

# Run a test episode
eval/sim/libero/libero_uv/.venv/bin/python eval/client/run_sim_client_direct.py \
  --arch lingbot_va \
  --libero-suite object \
  --task-id 0 \
  --n-episodes 1 \
  --tokenizer /path/to/lingbot-va-tokenizer \
  --vla-addr tcp://localhost:5555
```

| Executable | What it serves |
|---|---|
| `./build/vla-server` | VLA models: takes observations and text, outputs robot action chunks |
| `./build/lingbot-world-server` | LingBot-VA world-action model: video-conditioned, future-aware planning |
| `./build/hy-vla-direct-debug` | Debugs HY-VLA in-process (no server) |
Run with --help to see all model, checkpoint, and quantization options.
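
The evaluation client reaches the server at the address passed via `--vla-addr` (`tcp://localhost:5555` in the example above), so a script that launches both may want to wait until the server is listening before starting an episode. The following is a minimal sketch and not part of the repository: `parse_tcp_addr` and `wait_for_port` are hypothetical helper names, and the sketch assumes the server binds a plain TCP port at that address. A successful connect only shows that the port is open, not that the model has finished loading.

```python
import socket
import time
from typing import Tuple
from urllib.parse import urlparse


def parse_tcp_addr(addr: str) -> Tuple[str, int]:
    """Split a 'tcp://host:port' endpoint string into (host, port)."""
    parsed = urlparse(addr)
    if parsed.scheme != "tcp" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"expected tcp://host:port, got {addr!r}")
    return parsed.hostname, parsed.port


def wait_for_port(addr: str, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll until a TCP connect to addr succeeds; return False on timeout."""
    host, port = parse_tcp_addr(addr)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # Open and immediately close a connection; no data is sent.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False
```

A launcher script could call `wait_for_port("tcp://localhost:5555")` after starting the server process and only run the client once it returns `True`.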
LIBERO tests robotic manipulation skills on four task suites: `spatial`, `object`, `goal`, and `10`. A fifth suite, `long` (90 tasks), is also available.
```text
--libero-suite spatial → libero_spatial
--libero-suite object → libero_object
--libero-suite goal → libero_goal
--libero-suite 10 → libero_10
--libero-suite long → libero_90
```

Use `--task-id 0..9` (or `0..89` for `long`) to pick individual tasks.
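
When sweeping a whole suite from a script, it can help to validate the suite name and task index before launching the client. The sketch below encodes only the mapping and task ranges listed above. `LIBERO_SUITES`, `check_task_id`, and `sweep_args` are hypothetical names introduced for illustration and are not part of the evaluation client.

```python
from typing import Dict, List, Tuple

# --libero-suite value -> (LIBERO suite name, number of tasks), as listed above.
LIBERO_SUITES: Dict[str, Tuple[str, int]] = {
    "spatial": ("libero_spatial", 10),
    "object": ("libero_object", 10),
    "goal": ("libero_goal", 10),
    "10": ("libero_10", 10),
    "long": ("libero_90", 90),
}


def check_task_id(suite: str, task_id: int) -> str:
    """Return the LIBERO suite name, or raise ValueError on bad input."""
    if suite not in LIBERO_SUITES:
        raise ValueError(f"unknown suite {suite!r}; expected one of {sorted(LIBERO_SUITES)}")
    name, n_tasks = LIBERO_SUITES[suite]
    if not 0 <= task_id < n_tasks:
        raise ValueError(f"task id {task_id} out of range 0..{n_tasks - 1} for {suite!r}")
    return name


def sweep_args(suite: str) -> List[List[str]]:
    """Build the suite/task flag pairs for every task in a suite."""
    _, n_tasks = LIBERO_SUITES[suite]
    return [["--libero-suite", suite, "--task-id", str(i)] for i in range(n_tasks)]
```

Each entry returned by `sweep_args` can be appended to the client command line shown earlier, in place of the fixed `--libero-suite` and `--task-id` values.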
RoboTwin is a dual-arm robot benchmark with real-world-style manipulation tasks. Run HY-VLA natively in C++:

```bash
bash eval/sim/robotwin/setup_robotwin.sh  # one-time setup

GGML_CUDA_DISABLE_GRAPHS=1 \
eval/sim/robotwin/robotwin_uv/.venv/bin/python \
  eval/client/run_robotwin_native_hy_vla.py \
  --model <path-to-gguf> \
  --task-name place_empty_cup \
  --episodes 1
```

See `eval/sim/robotwin/README.md` for detailed setup modes and troubleshooting.
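
To run several RoboTwin tasks back to back, the same command can be driven from a small script. This is a sketch under stated assumptions: it is launched from the repository root, it uses only the flags and the `GGML_CUDA_DISABLE_GRAPHS=1` setting shown above, and `build_command`, `build_env`, and `run_tasks` are hypothetical helpers rather than part of the repository.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Optional

PYTHON = "eval/sim/robotwin/robotwin_uv/.venv/bin/python"
SCRIPT = "eval/client/run_robotwin_native_hy_vla.py"


def build_command(model_path: str, task_name: str, episodes: int = 1) -> List[str]:
    """Assemble the native HY-VLA RoboTwin command as an argv list."""
    return [PYTHON, SCRIPT,
            "--model", model_path,
            "--task-name", task_name,
            "--episodes", str(episodes)]


def build_env(base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and set the variable only for the child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["GGML_CUDA_DISABLE_GRAPHS"] = "1"
    return env


def run_tasks(model_path: str, task_names: List[str], episodes: int = 1) -> Dict[str, int]:
    """Run each task in turn and return {task_name: process exit code}."""
    results = {}
    for name in task_names:
        proc = subprocess.run(build_command(model_path, name, episodes), env=build_env())
        results[name] = proc.returncode
    return results
```

Passing the variable through `env=` keeps it scoped to the evaluation process, which mirrors the one-line prefix form in the shell command.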
GGUF conversion scripts are in scripts/:
| Script | Converts |
|---|---|
| `convert_pi05_to_gguf.py` | pi0.5 model weights |
| `convert_pi05_mmproj_to_gguf.py` | pi0.5 multimodal projector |
| `convert_hy_vla_to_gguf.py` | HY-VLA combined vision+action |
| `convert_lingbot_va_to_gguf.py` | LingBot-VA transformer + companion GGUFs |
Quantization helpers:
| Script | Quantizes |
|---|---|
| `quantize_hy_vla_gguf.py` | HY-VLA models |
| `quantize_lingbot_wan_gguf.py` | LingBot-VA models |
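
After converting or quantizing, a quick sanity check is to read the fixed-size header at the start of the output file. The sketch below is not one of the repository's scripts. It assumes a little-endian GGUF file of version 2 or later, where the header is the 4-byte magic `GGUF`, a 32-bit version, and two 64-bit counts (tensors and metadata key/value pairs). `read_gguf_header` is a hypothetical helper name.

```python
import struct
from typing import Tuple

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (metadata KV count)


def read_gguf_header(path: str) -> Tuple[int, int, int]:
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes little-endian GGUF v2+; raises ValueError if the magic is missing.
    """
    with open(path, "rb") as f:
        raw = f.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE or raw[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:HEADER_SIZE])
    return version, tensor_count, kv_count
```

A tensor count of zero or an unexpected version in the result is a hint that the conversion did not complete as intended.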
What lives where, in plain language:
| Directory | What it contains |
|---|---|
| `models/` | C++ model implementations (pi0.5, HY-VLA, LingBot-VA) |
| `runtime/` | Model registry, architecture detection, shared utilities |
| `adapter/` | I/O boundary: translates sensor/simulator data into typed inputs the models understand |
| `serving/` | Server code (ZeroMQ/Protobuf) for VLA and LingBot APIs |
| `kernels/` | Custom CUDA kernels (used when building with GPU support) |
| `scripts/` | GGUF conversion, quantization, and evaluation helpers |
| `tools/` | Local debug utilities |
| `patches/` | Third-party code patches applied during setup |
| `eval/` | Evaluation clients and simulation setups (LIBERO, RoboTwin) |
This project is released under the Apache License 2.0. Third-party dependencies, model checkpoints, datasets, and upstream reference implementations are distributed under their own licenses.
Supported models:
Foundational projects this build depends on:
