LegoNav is a modular vision-language navigation framework designed like Lego bricks: each module (S1, S2) is independently replaceable and freely composable.
- S2 (High-level reasoning): Any vision-language model — local GPU or cloud API.
- S1 (Low-level motion control): Any navigation policy — NavDP, GNM, ViNT, NoMaD, DD-PPO, iPlanner, ViPlanner, …
Plug in any combination, run on Jetson edge hardware.
- System Architecture
- Project Structure
- Requirements
- Installation
- Quick Start
- S2 — Visual Language Model
- S1 — Navigation Policy
- Extending LegoNav
- Camera Intrinsics
- Troubleshooting
LegoNav decouples high-level "where to go" (S2) from low-level "how to move" (S1).
graph TB
subgraph S2["S2 — Visual Language Reasoning"]
direction TB
VLM["VLM Backend\n─────────────\nlocal: Qwen2.5-VL / Qwen3-VL\napi: GPT-4o / Gemini / Kimi / …"]
S2Srv["S2 HTTP Server :8890\nPOST /s2_step"]
VLM --> S2Srv
end
subgraph Jetson["Jetson Edge — S1 Navigation Policy"]
direction TB
CAM["RGB-D Camera"]
ROS["ROS2 Node\nros_client.py"]
PIPE["LegoNavPipeline\npipeline.py"]
S1["S1 Client\nNavDP / GNM / ViNT / NoMaD\nDD-PPO / iPlanner / ViPlanner"]
CTRL["MPC + PID Controller"]
ROBOT["Mobile Robot"]
CAM --> ROS
ROS --> PIPE
PIPE --> S1
S1 --> CTRL
CTRL -->|"v, ω"| ROBOT
end
USER["Language Instruction\n'Go to the black chair'"]
USER --> ROS
PIPE <-->|"HTTP REST (LAN)"| S2Srv
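LegoNav uses world-coordinate goal tracking: the pixel goal returned by S2 is projected once into a 3D world-frame anchor, and every subsequent frame re-expresses that anchor in the camera frame using odometry. This keeps the goal geometrically consistent without re-querying the VLM on every frame.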
sequenceDiagram
participant Cam as RGB-D Camera
participant ROS as ROS2 Node (+ Odometry)
participant S2 as S2 Server (VLM)
participant Pipe as LegoNavPipeline
participant S1 as S1 Policy
participant Robot as Robot Actuator
ROS->>S2: POST /s2_step {image, instruction}
S2-->>ROS: Structured task list (JSON) — pixel goal / rotation / stop
loop Per Frame
Cam->>ROS: RGB-D frame + odom [x, y, yaw]
ROS->>Pipe: step(rgb, depth, odom)
alt First step of pixel_point task
Pipe->>Pipe: pixel + depth + odom → world_target [wx, wy, wz]
Note over Pipe: 3D anchor locked once, never re-identified
end
Pipe->>Pipe: world_target + odom → camera_goal [x_fwd, y_left, z_up]
Pipe->>S1: pointgoal_step(camera_goal, rgb, depth)
S1-->>Pipe: trajectory (B, T, 3)
Pipe-->>ROS: {mode, trajectory, camera_goal, …}
ROS->>Robot: MPC / PID execution
end
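The goal-locking step in the sequence diagram above can be sketched as two small transforms. This is a minimal sketch, not LegoNav's actual implementation: the function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the sample numbers, and the assumption that the camera sits at the robot origin with a planar `(x, y, yaw)` pose are all illustrative.

```python
import math

def pixel_to_world(u, v, depth_m, odom, fx, fy, cx, cy):
    """Back-project a pixel to a world-frame anchor (hypothetical helper)."""
    x, y, yaw = odom
    # Pinhole back-projection in the optical frame (right, down, forward).
    right = (u - cx) * depth_m / fx
    down = (v - cy) * depth_m / fy
    # Re-express in the frame used by camera_goal: forward, left, up.
    fwd, left, up = depth_m, -right, -down
    # Rotate by yaw and translate by the robot position (planar assumption).
    wx = x + math.cos(yaw) * fwd - math.sin(yaw) * left
    wy = y + math.sin(yaw) * fwd + math.cos(yaw) * left
    return (wx, wy, up)

def world_to_camera_goal(world_target, odom):
    """Express a locked world anchor in the current camera frame."""
    wx, wy, wz = world_target
    x, y, yaw = odom
    dx, dy = wx - x, wy - y
    # Inverse planar rotation: world offset -> [x_fwd, y_left].
    x_fwd = math.cos(yaw) * dx + math.sin(yaw) * dy
    y_left = -math.sin(yaw) * dx + math.cos(yaw) * dy
    return (x_fwd, y_left, wz)

# Lock the anchor once, on the first step of a pixel_point task.
anchor = pixel_to_world(u=640, v=360, depth_m=2.0, odom=(0.0, 0.0, 0.0),
                        fx=600.0, fy=600.0, cx=640.0, cy=360.0)
# On later frames only odometry changes; the anchor is reused as-is.
goal = world_to_camera_goal(anchor, odom=(1.0, 0.0, math.pi / 2))
```

In this example the robot has moved one metre forward and turned 90° left, so the anchor that started two metres straight ahead now sits one metre to its right (`y_left` close to -1).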
graph LR
subgraph ModeA["Mode A: Local S1 (Recommended)"]
J1["Jetson"] -- "local inference" --> N1["NavDP (--local_s1)"]
J1 <-- "HTTP :8890" --> G1["GPU Server (S2)"]
end
subgraph ModeB["Mode B: Remote S1"]
J2["Jetson"] <-- "HTTP :8901" --> N2["S1 Policy Server"]
J2 <-- "HTTP :8890" --> G2["GPU Server (S2)"]
end
subgraph ModeC["Mode C: S2 API (No GPU Required)"]
PC["Any Machine"] -- "runs S2 server" --> SRV["S2 Server :8890"]
SRV <-- "HTTPS" --> API["Cloud VLM API\nOpenAI / Gemini / Kimi / Qwen"]
end
LegoNav/
├── legonav/
│ ├── server/ # S2: VLM HTTP server (port 8890)
│ ├── clients/ # S1: Policy clients (NavDP, GNM, ViNT, etc.)
│ ├── core/ # Orchestration (LegoNavPipeline)
│ ├── robot/ # ROS2 node & MPC/PID controllers
│ └── utils/ # Shared utilities
├── scripts/ # Launch scripts for Jetson & S2
├── tests/ # Connectivity & pipeline tests
└── requirements_*.txt # Environment-specific dependencies
| Component | Minimum | Recommended |
|---|---|---|
| S2 GPU (local) | 16 GB VRAM | 24+ GB VRAM |
| S1 Edge | Jetson Orin NX 8 GB | Jetson Orin NX 16 GB |
| Camera | Astra S (640×480) | Gemini 336L (1280×720) |
- Python 3.10+, PyTorch 2.1+, ROS2 Humble, CUDA 11.8+.
conda create -n legonav_s2 python=3.10 && conda activate legonav_s2
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -e . && pip install -r requirements_server.txt

conda create -n legonav_s1 python=3.10 && conda activate legonav_s1
# Clone NavDP as a sibling: git clone https://github.com/InternRobotics/NavDP
# Install prebuilt torchvision wheel for aarch64 if available
pip install -e . && pip install -r requirements_jetson.txt
sudo apt install ros-humble-cv-bridge ros-humble-message-filters

# Local GPU (Qwen2.5-VL)
python -m legonav.server.s2_server --model_path /path/to/Qwen2.5-VL-7B-Instruct
# Cloud API (GPT-4o)
OPENAI_API_KEY=sk-xxx python -m legonav.server.s2_server --backend api --provider openai --model_path gpt-4o

python tests/test_s2_client.py --host 127.0.0.1 --port 8890 --random --instruction "Go to the chair"

python -m legonav.core.pipeline \
--s2_host 127.0.0.1 --s2_port 8890 \
--random --skip_s1 \
--instruction "Turn left, go to the door"

# Terminal 1: Robot base & Camera
ros2 launch wheeltec_robot base_node.launch.py
ros2 launch orbbec_camera gemini_336l.launch.py
# Terminal 2: LegoNav ROS2 Node
conda activate legonav_s1
python -m legonav.robot.ros_client \
--instruction "Go to the black chair" \
--s2_host 192.168.1.100 \
--local_s1 --s1_checkpoint /path/to/navdp.ckpt --s1_half

- Local (--backend local): Load model weights on your GPU (Qwen2.5-VL, Qwen3-VL).
- API (--backend api): Call an external VLM via an OpenAI-compatible API.
| Provider | --provider | Example Models | Env Var |
|---|---|---|---|
| OpenAI | openai | gpt-4o, gpt-4-turbo | OPENAI_API_KEY |
| Google | gemini | gemini-1.5-pro, gemini-2.0-flash | GEMINI_API_KEY |
| Moonshot | kimi | moonshot-v1-vision | MOONSHOT_API_KEY |
| DashScope | qwen | qwen-vl-max | DASHSCOPE_API_KEY |
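The provider table maps directly onto a small lookup. The sketch below shows one way a launcher script could check that the right key is exported before starting the S2 server; `PROVIDER_ENV_VARS` and `resolve_api_key` are hypothetical names for illustration, not part of LegoNav.

```python
import os

# Provider flag -> environment variable, copied from the table above.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "kimi": "MOONSHOT_API_KEY",
    "qwen": "DASHSCOPE_API_KEY",
}

def resolve_api_key(provider, environ=None):
    """Return the API key for a provider, failing early with a clear message."""
    environ = os.environ if environ is None else environ
    if provider not in PROVIDER_ENV_VARS:
        known = ", ".join(sorted(PROVIDER_ENV_VARS))
        raise ValueError(f"unknown provider {provider!r}; expected one of: {known}")
    var = PROVIDER_ENV_VARS[provider]
    key = environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set; export it before starting the S2 server")
    return key
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.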
All S1 clients inherit BaseS1Client, making them easily swappable in the LegoNavPipeline.
| Client | Base Model | Goal Support | Stop Mechanism |
|---|---|---|---|
| NavDPClient | NavDP | Pixel, Point, Image, No-goal | Learned Critic |
| GNMClient | GNM | Image, No-goal | Distance |
| ViNTClient | ViNT | Image, No-goal | Distance |
| NoMaDClient | NoMaD | Image, No-goal | Distance |
| DDPPOClient | DD-PPO | Pixel, Point | Action=STOP |
| ViPlannerClient | ViPlanner | Pixel, Point | Distance |
| mode | Key Fields |
|---|---|
| "trajectory" | trajectory (1,T,3), all_trajectory, values, target, camera_goal [x,y,z] |
| "rotate" | rotation_rad (positive = CCW / left) |
| "stop" | (none) |
| "error" | message, s2 (raw S2 response) |
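A consumer of the pipeline output typically branches on `mode`. The sketch below is illustrative only: it assumes the step result is a plain dict carrying the fields listed in the table and that `trajectory` can be indexed like a nested list shaped (1, T, 3). `summarize_step` is a hypothetical helper, not a LegoNav function.

```python
def summarize_step(result):
    """Turn one pipeline step result into a short human-readable action."""
    mode = result.get("mode")
    if mode == "trajectory":
        waypoints = result["trajectory"][0]  # drop the batch dimension
        last = tuple(waypoints[-1])
        return f"follow {len(waypoints)} waypoints, ending at {last}"
    if mode == "rotate":
        rad = result["rotation_rad"]
        # Positive rotation is counter-clockwise (left), per the table.
        direction = "left (CCW)" if rad > 0 else "right (CW)"
        return f"rotate {abs(rad):.2f} rad {direction}"
    if mode == "stop":
        return "stop"
    if mode == "error":
        return f"error: {result.get('message', 'unknown')}"
    raise ValueError(f"unexpected mode: {mode!r}")
```

Raising on an unknown mode, rather than silently stopping, makes a schema mismatch between pipeline and controller visible during integration.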
Dropping in a new S1 policy is as simple as implementing BaseS1Client:
from legonav.clients.base_client import BaseS1Client
from legonav.core.pipeline import LegoNavPipeline

class MyPolicyClient(BaseS1Client):
    algo_name = "my_policy"

    def pixelgoal_step(self, pixel_goals, rgb_images, depth_images):
        # Your model logic here; it should produce traj with shape (B, T, 3)
        return self._wrap_single_trajectory(traj)

# Usage in pipeline
pipeline = LegoNavPipeline(s2_host="...", s1_client=MyPolicyClient(...))

Helper methods available in BaseS1Client:
- _wrap_single_trajectory(traj): Wraps a (B,T,3) trajectory into the standard response tuple.
- _waypoints_to_trajectory(wps, T): Converts waypoints to a trajectory.
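For policies that emit a handful of sparse waypoints, a helper in the spirit of `_waypoints_to_trajectory(wps, T)` has to produce exactly T points. The source does not say how the real helper does this; the sketch below assumes plain linear interpolation along the waypoint sequence, and `resample_waypoints` is a hypothetical stand-in written in pure Python.

```python
def resample_waypoints(wps, T):
    """Linearly resample a list of (x, y, z) waypoints to T points, shaped (1, T, 3)."""
    if T < 1 or not wps:
        raise ValueError("need at least one waypoint and T >= 1")
    n = len(wps)
    out = []
    for i in range(T):
        # Position along the waypoint list, as a fractional index in [0, n - 1].
        s = i * (n - 1) / max(T - 1, 1)
        lo = int(s)
        hi = min(lo + 1, n - 1)
        frac = s - lo
        point = tuple(
            (1.0 - frac) * a + frac * b for a, b in zip(wps[lo], wps[hi])
        )
        out.append(point)
    return [out]  # leading batch dimension of 1

trajectory = resample_waypoints([(0, 0, 0), (2, 0, 0)], T=3)
```

Interpolating by index treats every waypoint segment as equally long; resampling by arc length would space the points more evenly when segments differ in length.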
| Camera | Resolution | Constant |
|---|---|---|
| Gemini 336L (default) | 1280×720 | GEMINI_336L_INTRINSIC |
| Astra S | 640×480 | ASTRA_S_INTRINSIC |
- ModuleNotFoundError: Run pip install -e . from the repository root.
- NAVDP_ROOT error: navdp_agent.py expects NavDP/ as a sibling directory.
- Jetson OOM: Use --s1_half for FP16 inference.
- S2 503 error: The model is still loading; wait for /health to return "ok".
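Waiting for the S2 server to finish loading can be automated with a small polling loop. This is a sketch under assumptions: the exact `/health` response schema is not documented here, so the check only looks for the substring "ok" in the body, and `fetch_health` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request

def fetch_health(url, timeout=2.0):
    """Return the response body, or None if the server is unreachable or not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # HTTP errors such as 503 raise HTTPError, a subclass of URLError.
        return None

def wait_until_ready(url, fetch=fetch_health, retries=30, delay_s=2.0, sleep=time.sleep):
    """Poll the health endpoint until its body contains "ok" or retries run out."""
    for _ in range(retries):
        body = fetch(url)
        if body is not None and "ok" in body:
            return True
        sleep(delay_s)
    return False
```

A launcher could call `wait_until_ready("http://192.168.1.100:8890/health")` before starting the ROS2 node, using the S2 host and port from the Quick Start commands.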
Special thanks to the authors of NavDP, Qwen, VisualNav Transformer, and Habitat for their foundational models and frameworks.