MBench

MBench is a benchmark for the memory capability of long-video world models. Most existing benchmarks reward single-frame quality or short-horizon prompt following; MBench targets a harder question: when a subject leaves the frame and returns, when the camera departs from a viewpoint and comes back, or when an off-screen physical process keeps evolving, can the model maintain a consistent world state? We decompose this into three orthogonal capability axes (Entity, Environment, and Causal Consistency) and evaluate them under two complementary settings: MBench-A, for action-conditioned world models, and MBench-T, for text-segment-conditioned long-video continuation models.

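This repository ships the full evaluation pipeline: a contract-driven, plugin-based CLI tool, `mbench`; official metric implementations for all 12 evaluation dimensions across both settings (23 registered metrics in total, since Action Interaction has no MBench-T counterpart); and an integrated VLM trigger judge for trigger-conditioned scoring.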

If you run into problems, please open an issue.

Overview

MBench formalises "long-video memory capability" as three orthogonal capability axes:

Entity Consistency — does a subject's identity, appearance, geometry, and texture stay consistent after it exits the frame and returns?
Environment Consistency — after the camera departs from a viewpoint and returns, does the 3D spatial layout, lighting, and rendering style recover?
Causal Consistency — does an off-screen physical process continue plausibly? Does the model faithfully respond to external instructions (camera actions, segment-level captions)?

Each axis decomposes into 4 sub-dimensions for a total of 12 evaluation dimensions. Each dimension is computed by a dedicated automatic metric and normalised to the [0, 1] range. Where the setting calls for it, the metric is gated by a Vision-Language Model trigger so that samples that never enter the memory challenge (e.g. completely static videos) cannot inflate model rankings. See the Trigger-Conditioned Scoring section below.

The two settings:

| Setting | `dataset_id` | Conditioning | Typical models |
| --- | --- | --- | --- |
| MBench-A | `mbencha` | Camera action + a single caption sentence | hy_worldplay, matrix_game_3, yume, infinite_world, lingbot_world, matrix_game_2 |
| MBench-T | `mbencht` | Five-segment text prompts (`condition_id=text`) | cosmos, longcat, skyreels, longlive, helios, memflow, self_forcing, causal_forcing |

Both settings share the same samples/{subset}/{sample_id}/ directory layout; at run time only the metrics under the matching prefix are dispatched.

Updates

2026-06 — Initial public release: 12 metric implementations, MBench-A / MBench-T data adapter, canonical score schema, integrated VLM trigger judge, contract-validated CLI.

Installation

Requirements: Python ≥ 3.10.

git clone https://github.com/study-overflow/MBench.git
cd MBench
pip install -e .

Additional dependencies:

| Metric family | Required extras |
| --- | --- |
| Spatial epipolar / reprojection, object geometry | `numpy`, `opencv-python`, plus a precomputed camera-pose artifact (you can produce it with DepthAnything v3) |
| Rendering style | `torch`, `torchvision`, `lpips`, `transformers` (DINOv2), VGG-19 weights |
| Rendering lighting | `opencv-python` |
| Entity human (identity / appearance) | `insightface` (Buffalo_L), `torch`, `transformers` (DINOv2) |
| Entity object (texture / geometry) | `groundingdino`, `segment-anything-2`, `transformers` (DINOv2) |
| Causal text / action interaction | `open_clip_torch` |
| Causal self-evolution (state / correctness) | An OpenAI-compatible VLM endpoint |

A typical full setup:

pip install torch torchvision torchaudio   # match your CUDA
pip install opencv-python lpips transformers open_clip_torch insightface
pip install "segment-anything-2 @ git+https://github.com/facebookresearch/segment-anything-2.git"
pip install "groundingdino @ git+https://github.com/IDEA-Research/GroundingDINO.git"

Verify the install:

mbench list-metrics    # should print 23 registered metrics

Quick Start

Assuming your data is already laid out under data/MBench-A-Setup/ (see the next section):

# 1. Validate that the metric's required fields are present on a few samples.
mbench validate data/MBench-A-Setup \
    --metrics mbencha.environment.spatial_epipolar \
    --models my_model --subsets environment --limit 4

# 2. Run a real evaluation: one metric, one model.
mbench eval data/MBench-A-Setup \
    --metrics mbencha.environment.spatial_epipolar \
    --models my_model --subsets environment \
    --output runs/my_first_run

For an MBench-T run with a VLM trigger:

export MBENCH_VLM_BASE_URL="https://your-openai-compat-endpoint.example.com/v1"
export MBENCH_VLM_API_KEY="sk-..."
export MBENCH_VLM_MODEL="gpt-4o"

mbench eval data/MBench-T-Setup \
    --metrics mbencht.entity.human_identity_consistency \
    --models my_model --subsets human \
    --vlm-judge openai-compatible \
    --output runs/my_t_run_with_trigger

Without `--vlm-judge`, the same command runs in raw-consistency mode (no API call): the metric still emits `score`, but without trigger gating.

Data Preparation

To evaluate your own model, lay out the data it produces as follows:

data/MBench-{A|T}-Setup/
├── dataset.yaml
├── samples/{subset}/{sample_id}/
│   ├── sample.json                   # metadata (object_anchor, caption_segments, …)
│   └── reference.png / source_video.mp4         # optional ground-truth assets
└── models/{model_id}/
    ├── samples.jsonl                 # one line per (sample × condition) generation
    ├── outputs/{subset}/{sample_id}/{condition_id}/video.mp4
    └── artifacts/{subset}/{sample_id}/{condition_id}/da3/results.npz
                                      # only required for spatial / object-geometry / causal-action metrics

The four subsets (human, object, environment, causal) match the metric subsets one-to-one. Currently condition_id is {action}_{length} for MBench-A (e.g. left_then_right_25s, forward_then_backward_10s) and the literal string text for MBench-T. You are free to add new conditions of your own (e.g. new camera-trajectory conditions).
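As an illustration of the `{action}_{length}` convention, the sketch below splits an MBench-A style `condition_id` at its last underscore. The helper name `split_condition_id` is hypothetical and not part of the `mbench` package, and it assumes the length suffix itself never contains an underscore.

```python
def split_condition_id(condition_id: str) -> tuple[str, str]:
    """Split an MBench-A style condition_id into (action, length).

    Hypothetical helper: assumes the last underscore separates the action
    name from the length suffix, as in "left_then_right_25s".
    """
    action, sep, length = condition_id.rpartition("_")
    if not sep or not action or not length:
        raise ValueError(f"not an '{{action}}_{{length}}' id: {condition_id!r}")
    return action, length


if __name__ == "__main__":
    for cid in ("left_then_right_25s", "forward_then_backward_10s"):
        print(cid, "->", split_condition_id(cid))
```

The MBench-T literal `text` has no underscore, so this helper rejects it with a `ValueError`; handle that case separately if you mix both settings.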

A minimal dataset.yaml:

dataset_id: mbencha           # or mbencht
dataset_name: My MBench-A Setup
version: '1.0'
path_mode: relative_to_dataset_root

subsets:
  environment: {description: Environment / spatial memory.}
  human:       {description: Human entity identity memory.}
  object:      {description: Object entity identity memory.}
  causal:      {description: Causal / interaction memory.}

condition_id:
  pattern: "{action}_{length}"

paths:
  output_video: models/{model_id}/outputs/{subset}/{sample_id}/{condition_id}/video.mp4
  artifact_da3: models/{model_id}/artifacts/{subset}/{sample_id}/{condition_id}/da3/results.npz
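To sanity-check your layout before running `mbench validate`, the path templates above can be expanded by hand. The following is a minimal sketch, assuming that `path_mode: relative_to_dataset_root` means the expanded template is joined onto the dataset root; the `resolve` helper is hypothetical and does not reflect how `mbench` resolves paths internally.

```python
from pathlib import Path

# Templates copied from the `paths:` block of the example dataset.yaml.
PATH_TEMPLATES = {
    "output_video": "models/{model_id}/outputs/{subset}/{sample_id}/{condition_id}/video.mp4",
    "artifact_da3": "models/{model_id}/artifacts/{subset}/{sample_id}/{condition_id}/da3/results.npz",
}


def resolve(dataset_root: str, key: str, **fields: str) -> Path:
    """Expand one path template and join it onto the dataset root (assumed behaviour)."""
    return Path(dataset_root) / PATH_TEMPLATES[key].format(**fields)


if __name__ == "__main__":
    video = resolve(
        "data/MBench-A-Setup",
        "output_video",
        model_id="my_model",
        subset="human",
        sample_id="my_sample_001",
        condition_id="left_then_right_25s",
    )
    print(video.as_posix(), "exists" if video.exists() else "missing")
```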

A samples.jsonl line per generated rollout:

{"item_id": "human:my_sample_001:left_then_right_25s", "dataset_id": "mbencha", "subset": "human", "sample_id": "my_sample_001", "condition_id": "left_then_right_25s", "model_id": "my_model", "media": {"videos": [{"path": "outputs/human/my_sample_001/left_then_right_25s/video.mp4", "role": "generated"}]}, "artifacts": {"da3": {"path": "artifacts/human/my_sample_001/left_then_right_25s/da3/results.npz"}}}
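If you generate `samples.jsonl` programmatically, a record like the one above can be assembled as in the sketch below. The `make_record` helper is hypothetical. The `item_id` format (`subset:sample_id:condition_id`) and the relative paths are inferred from the example line, and omitting `artifacts` when no DA3 file exists is an assumption, so check the result with `mbench validate`.

```python
import json


def make_record(dataset_id: str, subset: str, sample_id: str,
                condition_id: str, model_id: str, with_da3: bool = True) -> dict:
    """Build one samples.jsonl record mirroring the example line (hypothetical helper)."""
    rel = f"{subset}/{sample_id}/{condition_id}"
    record = {
        "item_id": f"{subset}:{sample_id}:{condition_id}",  # format inferred from the example
        "dataset_id": dataset_id,
        "subset": subset,
        "sample_id": sample_id,
        "condition_id": condition_id,
        "model_id": model_id,
        "media": {"videos": [{"path": f"outputs/{rel}/video.mp4", "role": "generated"}]},
    }
    if with_da3:
        # Only some metrics need the DA3 artifact; leaving it out is an assumption.
        record["artifacts"] = {"da3": {"path": f"artifacts/{rel}/da3/results.npz"}}
    return record


if __name__ == "__main__":
    rec = make_record("mbencha", "human", "my_sample_001", "left_then_right_25s", "my_model")
    print(json.dumps(rec))  # one line of samples.jsonl
```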

sample.json required fields by subset:

object: T side requires metadata.object_card, A side requires metadata.object_anchor, describing the target object.
causal: T side requires metadata.caption_segments, A side requires annotations.action or metadata.caption.
environment / human: only media.videos is required; segment-level captions help the T-side trigger judge.

DA3 (DepthAnything v3) camera-pose artifacts are produced externally (see the DA3 project); MBench only consumes the resulting results.npz.

Evaluation Dimensions

The 12 dimensions and their registration names:

| Axis | Sub-dimension | MBench-A | MBench-T |
| --- | --- | --- | --- |
| Entity | Human Identity | `mbencha.entity.human_identity_consistency` | `mbencht.entity.human_identity_consistency` |
| | Human Appearance | `mbencha.entity.human_appearance_consistency` | `mbencht.entity.human_appearance_consistency` |
| | Object Texture | `mbencha.entity.object_texture_consistency` | `mbencht.entity.object_texture_consistency` |
| | Object Geometry | `mbencha.entity.object_geometry_consistency` | `mbencht.entity.object_geometry_consistency` |
| Environment | Spatial Epipolar | `mbencha.environment.spatial_epipolar` | `mbencht.environment.spatial_epipolar` |
| | Spatial Reprojection | `mbencha.environment.spatial_reprojection` | `mbencht.environment.spatial_reprojection` |
| | Rendering Style | `mbencha.environment.rendering_style` | `mbencht.environment.rendering_style` |
| | Rendering Lighting | `mbencha.environment.rendering_lighting` | `mbencht.environment.rendering_lighting` |
| Causal | Action Interaction | `mbencha.causal.camera_interaction` | – |
| | Text Interaction | `mbencha.causal.prompt_interaction` | `mbencht.causal.prompt_interaction` |
| | Self-Evolution: State | `mbencha.causal.state_progress` | `mbencht.causal.state_progress` |
| | Self-Evolution: Correctness | `mbencha.causal.progress_correctness` | `mbencht.causal.progress_correctness` |

mbench list-metrics prints the live registry with one-line descriptions per metric.

Trigger-Conditioned Scoring

A model can game a memory benchmark by generating static, overly conservative content — never engaging with the challenge — and walk away with an inflated consistency score. MBench prevents this by decoupling reliability from coverage:

Memory Reliability $S^{\text{rel}}$ — average consistency, computed only on samples that actually triggered the memory challenge.
Trigger Coverage $C^{\text{trig}}$ — fraction of samples that triggered the memory challenge.
M-Score — the harmonic mean of the two: $\text{M-Score} = 2 \cdot S^{\text{rel}} \cdot C^{\text{trig}} / (S^{\text{rel}} + C^{\text{trig}})$.
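The three quantities above can be computed from per-sample results as in the following sketch. It is an illustration of the formula only, not the `mbench` aggregator: the function name and the `(triggered, score)` input format are hypothetical, and returning 0 when nothing triggers is an assumed convention.

```python
from typing import Iterable, Optional, Tuple


def m_score_from_samples(
    samples: Iterable[Tuple[bool, Optional[float]]],
) -> Tuple[float, float, float]:
    """Return (S_rel, C_trig, M-Score) for (triggered, score) pairs.

    S_rel averages the score over triggered samples only; C_trig is the
    fraction of triggered samples; M-Score is their harmonic mean.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one sample")
    triggered_scores = [score for triggered, score in samples if triggered]
    coverage = len(triggered_scores) / len(samples)
    # Assumed convention: reliability is 0 when no sample triggered.
    reliability = sum(triggered_scores) / len(triggered_scores) if triggered_scores else 0.0
    denom = reliability + coverage
    m_score = 2 * reliability * coverage / denom if denom > 0 else 0.0
    return reliability, coverage, m_score


if __name__ == "__main__":
    demo = [(True, 0.8), (True, 0.6), (False, None), (True, 0.7)]
    s_rel, c_trig, m = m_score_from_samples(demo)
    print(f"S_rel={s_rel:.3f}  C_trig={c_trig:.3f}  M-Score={m:.3f}")
```

Because the harmonic mean is dominated by the smaller term, a model that scores high consistency on very few triggered samples still gets a low M-Score.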

Where the trigger fires:

| Setting | Trigger source | Default behaviour |
| --- | --- | --- |
| MBench-T | A VLM judge over 8 uniformly sampled frames + caption segments | OFF; opt in by passing `--vlm-judge openai-compatible` |
| MBench-A | Not used (action-conditioned rollouts deterministically actuate the memory challenge; paper §App) | Always OFF, even if `--vlm-judge` is passed |
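For intuition about "uniformly sampled frames", the sketch below picks `k` evenly spaced frame indices that include the first and last frame. This is one common scheme and an assumption here; the exact sampling used by the MBench trigger judge is not specified in this README, and `uniform_frame_indices` is a hypothetical helper.

```python
def uniform_frame_indices(num_frames: int, k: int = 8) -> list[int]:
    """Pick k evenly spaced frame indices in [0, num_frames - 1].

    Assumed scheme: endpoints included, intermediate positions rounded to
    the nearest frame. Short videos (num_frames <= k) return every frame.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if k < 2:
        raise ValueError("k must be at least 2")
    if num_frames <= k:
        return list(range(num_frames))
    step = (num_frames - 1) / (k - 1)
    return [round(i * step) for i in range(k)]


if __name__ == "__main__":
    print(uniform_frame_indices(100))  # 8 indices spanning frames 0..99
```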

Configure the VLM via CLI flags or environment variables:

export MBENCH_VLM_BASE_URL="..."     # OpenAI-compatible /v1 endpoint
export MBENCH_VLM_API_KEY="..."
export MBENCH_VLM_MODEL="..."        # any vision-capable model id

# or pass per-run:
mbench eval ... --vlm-judge openai-compatible \
    --vlm-base-url "$MBENCH_VLM_BASE_URL" \
    --vlm-api-key  "$MBENCH_VLM_API_KEY" \
    --vlm-model    "$MBENCH_VLM_MODEL"

For ablation runs, pass --ignore-trigger to force the trigger gate off even when --vlm-judge is set.

Score Schema

Every metric, in either setting and whatever its internal computation, emits the same score shape at the unit, item, and summary layers:

| Key | Range | When present | Meaning |
| --- | --- | --- | --- |
| `score` | [0, 1] | Always (when valid) | The normalised score. |
| `score_raw` | native unit | Distance- or VLM-1-5-based metrics | Raw signal before normalisation; omitted when `raw == score`. |
| `triggered` | {0.0, 1.0} | T-side metrics only | Whether the VLM trigger fired (0 → sample skipped from the reliability average). |
| `trigger_score` | [0, 1] | T-side metrics only | Normalised VLM trigger confidence. |

Examples (after mbench eval):

// mbencha.environment.spatial_epipolar  units.jsonl line
{"unit_id": "...__f12_f48", "scores": {"score": 0.5523, "score_raw": 2.964}}

// mbencht.causal.progress_correctness  summary.json group
{"score": 0.74, "score_raw": 3.7, "triggered": 1.0, "trigger_score": 0.6}

// mbencht.* untriggered sample
{"triggered": 0.0, "trigger_score": 0.2}
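When post-processing results, it can help to check score dicts against the ranges in the table above. The sketch below is a hypothetical checker, not part of `mbench`; it only verifies value ranges for keys that are present and makes no claim about which keys a given metric must emit.

```python
def check_score_ranges(scores: dict) -> list[str]:
    """Return a list of range violations for a canonical score dict."""
    problems = []
    # `score` and `trigger_score` are documented as normalised to [0, 1].
    for key in ("score", "trigger_score"):
        if key in scores and not 0.0 <= scores[key] <= 1.0:
            problems.append(f"{key}={scores[key]} is outside [0, 1]")
    # `triggered` is documented as a 0.0 / 1.0 flag.
    if "triggered" in scores and scores["triggered"] not in (0.0, 1.0):
        problems.append(f"triggered={scores['triggered']} is not 0.0 or 1.0")
    return problems


if __name__ == "__main__":
    examples = [
        {"score": 0.5523, "score_raw": 2.964},
        {"score": 0.74, "score_raw": 3.7, "triggered": 1.0, "trigger_score": 0.6},
        {"triggered": 0.0, "trigger_score": 0.2},
    ]
    for scores in examples:
        print(scores, "->", check_score_ranges(scores) or "ok")
```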

Usage Reference

List registered metrics

mbench list-metrics

Validate (strict)

mbench validate <source> --metrics <csv> [--models …] [--subsets …] [--limit N]

Any contract violation exits non-zero.

Eval (lenient — skips bad items, runs the rest)

mbench eval <source> --metrics <csv> --output runs/<run_name>
    [--models <csv>] [--subsets <csv>] [--conditions <csv>]
    [--sample-ids <csv>] [--limit N]
    [--vlm-judge openai-compatible
       --vlm-base-url <URL> --vlm-api-key <KEY> --vlm-model <ID>
       --vlm-n-frames 8 --vlm-max-retries 3]
    [--workers N]                 # parallel across items
    [--ignore-trigger]            # bypass VLM trigger gate (ablation only)
    [--no-progress-bar] [--quiet] [--no-log-file]

Prepare auxiliary artifacts (never auto-invoked during eval)

mbench prepare <source> --producer <name> --output artifacts/

Python API

import mbench.cli
from mbench.core.pipeline import run_eval
from mbench.core.registry import adapter_registry, aggregator_registry, metric_registry

mbench.cli._bootstrap()  # register all built-in plugins

adapter = adapter_registry.get("mbench-dir")
items = adapter.load("data/MBench-A-Setup", subsets=["environment"], limit=4)
metric = metric_registry.get("mbencha.environment.spatial_epipolar")
aggregator = aggregator_registry.get("mean")

summary = run_eval(items, [metric], aggregator, "runs/python_api_demo")

Output Layout

runs/<run_name>/
├── items.jsonl              # one line per evaluated sample (item_id, model_id)
├── eval.log                 # per-item progress log (timestamps, scores, errors)
└── metrics/<metric_name>/
    ├── units.jsonl          # one UnitResult per evaluation unit (frame pair, track, segment, …)
    ├── items.jsonl          # one ItemResult per sample (aggregated from units)
    └── summary.json         # SummaryResult grouped by model_id

UnitResult.metadata['raw_scores'] preserves the metric's original internal fields (the ones dropped by the canonical projection), useful for after-the-fact debugging or re-deriving alternative score keys offline.
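For quick offline inspection, `units.jsonl` can be read with the standard library. The sketch below assumes each line is a JSON object with a `scores.score` field, as in the Score Schema example; the helper names are hypothetical, and the plain mean over units is for illustration only and may differ from how MBench aggregates units into items and summaries.

```python
import json
from pathlib import Path
from typing import Iterable, Optional


def mean_score_from_lines(lines: Iterable[str]) -> Optional[float]:
    """Average `scores.score` over JSONL lines, skipping blanks and units without a score."""
    values = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        unit = json.loads(line)
        value = unit.get("scores", {}).get("score")
        if value is not None:
            values.append(float(value))
    return sum(values) / len(values) if values else None


def mean_unit_score(units_path: str) -> Optional[float]:
    """Convenience wrapper that reads a units.jsonl file from disk."""
    with Path(units_path).open(encoding="utf-8") as fh:
        return mean_score_from_lines(fh)


if __name__ == "__main__":
    path = Path("runs/my_first_run/metrics/mbencha.environment.spatial_epipolar/units.jsonl")
    if path.exists():
        print(mean_unit_score(str(path)))
```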

Citation

If MBench helps your research, please cite us:

@article{zhang2026mbench,
  title         = {MBench: A Comprehensive Benchmark on Memory Capability for Video World Models},
  author        = {Zhang, Shengjun and Zhang, Zhang and Huang, Simin and Tang, Zhenyu and Wang, Hanyang and Dai, Chensheng and Chen, Min and Li, Yifan and Li, Yuxin and Chen, Yingjie and Liu, Hao and Li, Chen and Duan, Yueqi},
  year          = {2026},
  eprint        = {2606.00793},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2606.00793}
}

License

The code is released under the MIT License.
