SEU-PAISys/Embodied.cpp

embodied.cpp

Embodied.cpp is an inference runtime for embodied AI models: Vision-Language-Action (VLA) models and World-Action Models (WAM) that let robots perceive and act in the real world. It runs these models efficiently on heterogeneous hardware (CPU, CUDA GPU, NPU) from GGUF weights, and ships ready-to-use servers and evaluation clients.


Supported Models

VLA Models

VLA models take sensor images and language instructions as input and output robot action commands directly.

| Model | Status | What it does |
| --- | --- | --- |
| pi0.5 | Supported | PaliGemma-based policy; a good starting point for experimentation |
| HY-VLA | Supported | Hunyuan dual-tower vision-language model with action head; supports RoboTwin dual-arm evaluation |
| StarVLA | 🚧 Coming soon | Modular backbone with swappable action heads |
| OpenVLA | 🚧 Coming soon | 7B open VLA with Llama 2 + DINOv2/SigLIP features |
| Qwen-VLA | 🚧 Coming soon | Qwen3.5-4B backbone with DiT flow-matching action decoder |
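
As a rough picture of the contract these VLA models share (camera images and a language instruction in, a chunk of robot actions out), here is a minimal Python sketch. The names `Observation`, `ActionChunk`, and `dummy_policy` are hypothetical and are not part of the embodied.cpp API; the horizon of 4 and action dimension of 7 are illustrative assumptions only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """Hypothetical container for one VLA input: camera frames plus an instruction."""
    images: List[bytes]   # one encoded frame per camera
    instruction: str      # natural-language task description


@dataclass
class ActionChunk:
    """Hypothetical container for one VLA output: a short sequence of actions."""
    actions: List[List[float]]  # shape: [horizon][action_dim]


def dummy_policy(obs: Observation, horizon: int = 4, action_dim: int = 7) -> ActionChunk:
    """Stand-in for a real model: returns an all-zero chunk of the requested shape."""
    if not obs.images:
        raise ValueError("at least one image is required")
    return ActionChunk(actions=[[0.0] * action_dim for _ in range(horizon)])
```

A real model would replace `dummy_policy`; the point of the sketch is only the shape of the data that crosses the boundary.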

World-Action Models

World-Action models predict future video or latent trajectories as part of planning actions — they reason about "what happens next" before deciding what to do.

| Model | Status | What it does |
| --- | --- | --- |
| LingBot-VA | Supported | Video-action model with VAE bridge; evaluated on LIBERO task suites |
| DreamZero | 🚧 Coming soon | 14B video-diffusion world model for zero-shot policies |
| UnifoLM-WMA-0 | 🚧 Coming soon | Multi-embodiment robot learning with world-model action head |
| Being-H0.7 | 🚧 Coming soon | Latent world-action model with future-aware reasoning |
| FastWAM | 🚧 Coming soon | Fast WAM that skips test-time future generation for speed |

Quick Start

1. Prepare dependencies

# Clone the repo and fetch third-party code
git clone <repo-url> embodied.cpp && cd embodied.cpp
./patches/init_third_party.sh

2. Build

CPU-only:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target vla-server lingbot-world-server -j$(nproc)

CUDA GPU:

cmake -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
  -DCMAKE_CUDA_ARCHITECTURES=<your-arch> \
  -DProtobuf_PROTOC_EXECUTABLE=/usr/bin/protoc
cmake --build build --target lingbot-world-server -j$(nproc)
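
If you script the build, the two configurations above differ only in the CUDA-related flags. The helper below is a minimal sketch that assembles the configure command as an argument list; `cmake_configure_args` is a hypothetical name, the compiler and `protoc` paths are copied from the commands above and may differ on your machine, and `"86"` is just an example architecture value.

```python
from typing import List, Optional


def cmake_configure_args(cuda: bool = False, cuda_arch: Optional[str] = None,
                         build_dir: str = "build") -> List[str]:
    """Build the `cmake` configure command line for a CPU-only or CUDA build."""
    args = ["cmake", "-S", ".", "-B", build_dir, "-DCMAKE_BUILD_TYPE=Release"]
    if cuda:
        if not cuda_arch:
            raise ValueError("cuda_arch is required for a CUDA build, e.g. '86'")
        # Paths below mirror the README commands; adjust them for your system.
        args += [
            "-DGGML_CUDA=ON",
            "-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc",
            f"-DCMAKE_CUDA_ARCHITECTURES={cuda_arch}",
            "-DProtobuf_PROTOC_EXECUTABLE=/usr/bin/protoc",
        ]
    return args
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root.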

3. Start a server

# VLA server (pi0.5, HY-VLA); the mmproj path is optional
./build/vla-server --model <path-to-gguf> [<path-to-mmproj>]

# LingBot world-action server
./build/lingbot-world-server --model <path-to-gguf>
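
Loading a model can take a while, so an evaluation script may want to wait until the server is listening before it connects. The sketch below polls a TCP port using only the Python standard library. `wait_for_port` is a hypothetical helper, and port 5555 in the usage note is taken from the `tcp://localhost:5555` address used by the evaluation client, not from a documented server default. A successful connection shows only that the socket is open, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0,
                  interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For example, `wait_for_port("localhost", 5555)` could gate the start of an evaluation run.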

4. Evaluate in simulation (example: LIBERO with LingBot-VA)

# Install the LIBERO runtime once
bash eval/sim/libero/setup_libero.sh

# Run a test episode
eval/sim/libero/libero_uv/.venv/bin/python eval/client/run_sim_client_direct.py \
  --arch lingbot_va \
  --libero-suite object \
  --task-id 0 \
  --n-episodes 1 \
  --tokenizer /path/to/lingbot-va-tokenizer \
  --vla-addr tcp://localhost:5555
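
To run more than one task, the command above can be generated per task ID. This is a minimal sketch: `libero_commands` is a hypothetical helper, and every flag and path in it is copied from the command above rather than from the client's full option list.

```python
from typing import List

PYTHON = "eval/sim/libero/libero_uv/.venv/bin/python"
CLIENT = "eval/client/run_sim_client_direct.py"


def libero_commands(suite: str, task_ids: range, tokenizer: str,
                    addr: str = "tcp://localhost:5555",
                    n_episodes: int = 1) -> List[List[str]]:
    """Return one argv list per task ID, mirroring the single-episode command."""
    return [
        [PYTHON, CLIENT,
         "--arch", "lingbot_va",
         "--libero-suite", suite,
         "--task-id", str(task_id),
         "--n-episodes", str(n_episodes),
         "--tokenizer", tokenizer,
         "--vla-addr", addr]
        for task_id in task_ids
    ]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` while the server is running.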

Run a Server

| Executable | What it serves |
| --- | --- |
| ./build/vla-server | VLA models: takes observations and text, outputs robot action chunks |
| ./build/lingbot-world-server | LingBot-VA world-action model: video-conditioned, future-aware planning |
| ./build/hy-vla-direct-debug | Debug HY-VLA in-process (no server) |

Run with --help to see all model, checkpoint, and quantization options.


Evaluate in Simulation

LIBERO

LIBERO tests robotic manipulation skills on four task suites of 10 tasks each: spatial, object, goal, and 10. A fifth suite, long (90 tasks), is also available.

--libero-suite spatial  → libero_spatial
--libero-suite object   → libero_object
--libero-suite goal     → libero_goal
--libero-suite 10       → libero_10
--libero-suite long     → libero_90

Use --task-id 0..9 (or 0..89 for long) to pick individual tasks.
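
The flag-to-suite mapping and the valid task-ID ranges above can be checked before launching a run. The sketch below encodes exactly the table and ranges from this section; `SUITES` and `resolve_task` are hypothetical names and are not part of the evaluation client.

```python
# Maps the --libero-suite flag value to (LIBERO suite name, number of tasks).
SUITES = {
    "spatial": ("libero_spatial", 10),
    "object": ("libero_object", 10),
    "goal": ("libero_goal", 10),
    "10": ("libero_10", 10),
    "long": ("libero_90", 90),
}


def resolve_task(suite_flag: str, task_id: int) -> str:
    """Validate a (suite, task-id) pair and return the LIBERO suite name."""
    if suite_flag not in SUITES:
        raise ValueError(f"unknown suite {suite_flag!r}; expected one of {sorted(SUITES)}")
    name, n_tasks = SUITES[suite_flag]
    if not 0 <= task_id < n_tasks:
        raise ValueError(f"--task-id for {suite_flag!r} must be in 0..{n_tasks - 1}")
    return name
```

Calling `resolve_task("long", 89)` returns the suite name for the last task of the long suite, while `resolve_task("goal", 10)` raises a `ValueError`.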

RoboTwin

RoboTwin is a dual-arm robot benchmark with real-world-style manipulation tasks. Run HY-VLA natively in C++:

bash eval/sim/robotwin/setup_robotwin.sh   # one-time setup

GGML_CUDA_DISABLE_GRAPHS=1 \
eval/sim/robotwin/robotwin_uv/.venv/bin/python \
  eval/client/run_robotwin_native_hy_vla.py \
  --model <path-to-gguf> \
  --task-name place_empty_cup \
  --episodes 1

See eval/sim/robotwin/README.md for detailed setup modes and troubleshooting.


Convert Your Own Model

GGUF conversion scripts are in scripts/:

| Script | Converts |
| --- | --- |
| convert_pi05_to_gguf.py | pi0.5 model weights |
| convert_pi05_mmproj_to_gguf.py | pi0.5 multimodal projector |
| convert_hy_vla_to_gguf.py | HY-VLA combined vision+action |
| convert_lingbot_va_to_gguf.py | LingBot-VA transformer + companion GGUFs |

Quantization helpers:

| Script | Quantizes |
| --- | --- |
| quantize_hy_vla_gguf.py | HY-VLA models |
| quantize_lingbot_wan_gguf.py | LingBot-VA models |

Project Structure

What lives where, in plain language:

| Directory | What it contains |
| --- | --- |
| models/ | C++ model implementations (pi0.5, HY-VLA, LingBot-VA) |
| runtime/ | Model registry, architecture detection, shared utilities |
| adapter/ | I/O boundary: translates sensor/simulator data into typed inputs the models understand |
| serving/ | Server code (ZeroMQ/Protobuf) for VLA and LingBot APIs |
| kernels/ | Custom CUDA kernels (used when building with GPU support) |
| scripts/ | GGUF conversion, quantization, and evaluation helpers |
| tools/ | Local debug utilities |
| patches/ | Third-party code patches applied during setup |
| eval/ | Evaluation clients and simulation setups (LIBERO, RoboTwin) |

License

This project is released under the Apache License 2.0. Third-party dependencies, model checkpoints, datasets, and upstream reference implementations are distributed under their own licenses.


Acknowledgements

Supported models:

Foundational projects this build depends on:
