Skip to content

RoboMME/robomme_policy_learning

Repository files navigation

🚀 MME-VLA Policy Learning and Evaluation

🧭 Outline

🗞️ Updates

  • [03/2026] We use MME-VLA (FrameSamp+Modul) as an example for RoboMME Challenge @ CVPR2026 Submission. Please see here for more details.
  • [03/2026] We provide MME-VLA as a submission example for CVPR RoboMME Challenge. More details can be found here.
  • [03/2026] 🚀 We release MME-VLA Suite, a family of memory-augmented vision-language-action (VLA) models based on the $\pi_{0.5}$ backbone. See our paper and leaderboard for more details and analysis.

📦 Installation

🧩 Install with UV

📥 Install Policy Learning Repo

GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .

Set the OPENPI_DATA_HOME path in your ~/.bashrc, e.g. export OPENPI_DATA_HOME=<your_openpi_homedir>. For more details, please refer to OpenPi.

🎮 Install RoboMME Simulator

Clone the RoboMME submodule:

git submodule update --init

Then install the RoboMME environment following the documentation here. We use separate environments for VLA training/inference and the RoboMME simulator. During evaluation, we use a WebSocket connection between them, following OpenPi.

🐳 Install with Docker

After downloading the data in the data directory and setting up runs in the following structure. Update the RoboMME submodule with git submodule update --init. Then build the Docker image following this.

⚡ QuickStart

Evaluation

After installing everything correctly, download our best MME-VLA model (i.e., framesamp-modul) from Hugging Face.

git clone https://huggingface.co/Yinpei/perceptual-framesamp-modul <your_specify_model_path>

Then run

# terminal 0
CUDA_VISIBLE_DEVICES=0 uv run scripts/serve_policy.py --seed=7  --port=8000 policy:checkpoint --policy.dir=<your_specify_model_path>/79999 --policy.config=mme_vla_suite

# terminal 1 
micromamba activate robomme
CUDA_VISIBLE_DEVICES=1 python examples/robomme/eval.py --args.model_seed=7 --args.port=8000 --args.policy_name=<your_specify_policy_name> --args.model_ckpt_id=79999

Then the evaluations results will be stored in runs/evaluation/<your_specify_policy_name>/ckpt79999/seed7

Remember to manually set CUDA_VISIBLE_DEVICES using one card for serve_policy.py, as JAX will automatically use all GPUs by default.

Training

# Build the assets for MME-VLA and copy the provided norm_stats.json
mkdir -p runs/assets/mme_vla_suite 
cp -r assets/norm_stats.json runs/assets/mme_vla_suite

# Download a small set of preprocessed training dataset for quick training 
mkdir data
git clone git@hf.co:datasets/Yinpei/robomme_preprocessed_data_sample data/robomme_preprocessed_data_sample
# unzip the data file

# Train the MME-VLA model
# Memory Requirement: 4xA40 40GB GPU or 1xH100 80GB GPU
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MME_VLA_TYPE="perceptual-framesamp-modul"
XLA_PYTHON_CLIENT_MEM_FRACTION=0.95 uv run scripts/train.py mme_vla_suite \
--exp-name=<model_name> \
--batch-size=64 \
--num-workers=4 \
--fsdp-devices=4 \
--dataset-path=data/robomme_preprocessed_data_sample \
--model.use_history \
--model.history_config="${MME_VLA_TYPE}.yaml"

All possible configs can be found in src/mme_vla_suite/models/config/robomme

🗂️ Repository Structure

.
├── data
│   ├── robomme_h5_data                 # download robomme raw h5 files here
│   └── robomme_preprocessed_data
│   │   ├── data                        # pickle files
│   │   ├── features                    # precompute siglip token embeddings
│   │   ├── meta                        # statistics for robomme
│   │   ├── memer                       # VLM subgoal training data for MemER
│   │   └── qwenvl                      # VLM subgoal training data for QwenVL
├── examples
│   └── robomme                         # RoboMME simulator evaluation code
├── packages
│   └── openpi-client                   # VLA client & server interface
├── runs
│   ├── assets                          # save norm_stats json files
│   ├── ckpts                           # fine-tuned checkpoints
│   └── evaluation                      # evaluation results
├── scripts                             # train/eval/data_generation scripts
├── src
│   ├── mme_vla_suite                   # MME_VLA code, follows openpi structure 
│   └── openpi                          # original OpenPi code with minor changes
└── third_party

This repository is built on top of OpenPi. We highly recommend becoming familiar with OpenPi first before working with this repo.

⬇️ Download

🧱 Download Training Data

Place all data under the data directory:

mkdir data && cd data

Download the raw RoboMME training files here:

git clone git@hf.co:datasets/Yinpei/robomme_data_h5 data/robomme_data_h5

Then run

uv run scripts/tarxz_h5.py decompress --input_dir data/robomme_data_h5 --jobs 16 --remove_archive

to extract the *.xz files.

(Optional) Download preprocessed RoboMME data here:

git clone git@hf.co:datasets/Yinpei/robomme_preprocessed_data data/robomme_preprocessed_data

and run uv run scripts/unzip_data.py data/robomme_preprocessed_data to unzip the files.

Alternatively, you can run uv run scripts/build_dataset.py to generate the preprocessed pickle files (takes about 2–3 hours) and/or the VLM subgoal predictor training data (takes about 30–60 minutes).

We also provide data in the LeRobot format here. In our experiments, however, the LeRobot dataloader significantly increased CPU memory usage during training, which can be a bottleneck in shared training environments (e.g., on HPC clusters). For this reason, we use our custom data format and dataloader in this repository.

🧠 Download Pre-trained Models

Download the $\pi_{0.5}$-base backbone:

uv run scripts/download_pi05_base.py

Download the pi05_vision_encoder, which is a subset of the $\pi_{0.5}$ parameters used for dataset feature construction without loading the full model. Visual token embeddings are computed and cached for training, and the vision encoder remains frozen in our experiment:

cd $OPENPI_DATA_HOME
git clone git@hf.co:Yinpei/pi05_vision_encoder

🧪 Download Fine-tuned VLA/VLM Checkpoints (Optional)

Fine-tuned models and evaluation results are stored under the runs directory. Create it if needed:

mkdir runs
mkdir runs/ckpts        # save all trained models here
mkdir runs/evaluation   # evaluation results
mkdir runs/assets       # save all normalization statistics files here

You can skip the following steps if you plan to fine-tune your own VLA/VLM models directly; see Model Training.

Download MME-VLA variants here:

git clone git@hf.co:Yinpei/mme_vla_suite runs/ckpts/mme_vla_suite

We release all checkpoints for symbolic and perceptual memory, and a subset of recurrent memory variants for research. Recurrent memory is still underperforming; we will release more recurrent variants as results improve.

Download VLM subgoal predictors here:

git clone git@hf.co:Yinpei/vlm_subgoal_predictor runs/ckpts/vlm_subgoal_predictor

Download the fine-tuned $\pi_{0.5}$ baseline here:

git clone git@hf.co:Yinpei/pi05_baseline runs/ckpts/pi05_baseline

After downloading fine-tuned checkpoints, you can run

uv run ./scripts/unzip_ckpt.py runs/ckpts

to unzip all of them.

🏋️ Model Training

🧰 Data Preparation

Prepare training data by either downloading preprocessed files or running:

uv run scripts/build_dataset.py --dataset_type robomme_pkl --raw_data_path <downloaded_h5_data_dir> --preprocessed_data_path <your_target_dir>

Then compute normalization statistics (this takes about 3 minutes):

uv run scripts/compute_norm_stats.py --config-name mme_vla_suite --repo-id robomme --dataset-path="data/robomme_preprocessed_data"
uv run scripts/compute_norm_stats.py --config-name pi05_baseline --repo-id robomme --dataset-path="data/robomme_preprocessed_data"

This produces the following structure under runs:

.
├── assets
│   ├── mme_vla_suite
│   │   └── robomme
│   │       └── norm_stats.json
│   └── pi05_baseline
│       └── robomme
│           └── norm_stats.json

You can also compare against our reference norm_stats.json provided here to check whether your processing is correct. Small differences are acceptable.

🎛️ Train π₀.₅ baseline

This variant does not use history and fine-tunes the $\pi_{0.5}$ checkpoints with the vision encoder frozen (for comparison with MME-VLA):

bash scripts/finetune_pi05_baseline.sh

You can change --exp-name to suit your own experiment naming.

🧠 Train MME-VLA policies

bash scripts/finetune_mme_vla_suite.sh

Set MME_VLA_TYPE to train a specific model variant. You can also change --exp-name to suit your own experiment naming.

We provide a sample training-curve description in docs/training_curve_sample.md.

🧭 Train VLM subgoal predictor

robomme_preprocessed_data already contains VLM subgoal prediction data, but you can also generate it with:

uv run scripts/build_dataset.py --dataset_type vlm_subgoal_qwenvl  --raw_data_path=<downloaded_h5_data_dir> --preprocessed_data_path=<your_target_dir>
uv run scripts/build_dataset.py --dataset_type vlm_subgoal_memer  --raw_data_path=<downloaded_h5_data_dir> --preprocessed_data_path=<your_target_dir>

After the data is ready, run:

micromamba activate robomme
bash scripts/finetune_vlm_subgoal_predictor.sh

Set DATASET_PATH according to which VLM you are training: (1) simple subgoals, (2) grounded subgoals, or (3) MemER-style subgoals.

🧪 Evaluation

🚀 Evaluation with the integrated script

After downloading the fine-tuned checkpoints, run:

bash scripts/eval.sh

Set the MODEL_TYPE variable to one of the following:

  1. Prior methods: pi05_baseline, MemER
  2. Symbolic MME-VLA: symbolic_simpleSG_oracle, symbolic_simpleSG_gemini, symbolic_simpleSG_qwenvl, symbolic_groundedSG_oracle, symbolic_groundedSG_gemini, symbolic_groundedSG_qwenvl
  3. Perceptual MME-VLA: perceptual-framesamp-context, perceptual-framesamp-modul, perceptual-framesamp-expert, perceptual-tokendrop-context, perceptual-tokendrop-modul, perceptual-tokendrop-expert
  4. Recurrent MME-VLA: recurrent-rmt-context, recurrent-rmt-modul, recurrent-rmt-expert, recurrent-ttt-context, recurrent-ttt-modul, recurrent-ttt-expert

Running eval.sh automatically starts two tmux windows: one for the policy server and one for RoboMME evaluation. If the evaluation is interrupted, you can rerun the script; it will automatically resume from the generated progress.json.

✍️ Manual evaluation (per model)

Details are provided here.

🏆 RoboMME Challenge Example

We provide a policy-serving example in the challenge_interface directory for RoboMME Challenge submission.

We offer three ways for model submission:

  1. Docker-based submission: see details here.
  2. Remote API submission: see details here.
  3. GitHub repo submission: we will git clone your repo, install the environment, and run the policy server on our machine.

We highly recommend that you first fully understand the MME-VLA policy learning pipeline before diving into this section.

🛠️ Troubleshooting

Q1: Vulkan installation fails.
A1: Please refer to the ManiSkill solution. If it still does not work, we recommend reinstalling the NVIDIA driver and Vulkan packages. We use NVIDIA driver 570.211.01 and Vulkan 1.3.275. You can also switch to CPU rendering:

os.environ['SAPIEN_RENDER_DEVICE'] = 'cpu'
os.environ['MUJOCO_GL'] = 'osmesa'

Q2: Why does the evaluation stop?
A2: We observed that, on long-horizon tasks such as VideoPlaceButton, the WebSocket connection can break due to large video frames. If the evaluation process is interrupted, you can rerun scripts/eval.sh, and the program will resume based on the generated progress.json.

Q3: CUDA runs out of memory when training VLA models.
A3: You can set the environment variable XLA_PYTHON_CLIENT_MEM_FRACTION=0.95 to allow JAX to use more GPU memory.

🙏 Acknowledgement

This work was supported in part by NSF SES-2128623, NSF CAREER #2337870, NSF NRI #2220876, NSF NAIRR250085, and NSF IIS-1949634. We would also like to thank the excellent OpenPi codebase from Physical-Intelligence.

📝 Citation

@article{dai2026robomme,
  title={RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies},
  author={Dai, Yinpei and Fu, Hongze and Lee, Jayjun and and Liu, Yuejiang and Zhang, Haoran and Yang, Jianing and Finn, Chelsea and Fazeli, Nima and Chai, Joyce},
  journal={arXiv preprint arXiv:2603.04639},
  year={2026}
}

About

MME-VLA-Suite

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages