[Feature] New evaluation dimension: long_horizon_coherence #206

@dussaud

Motivation
All existing VBench dimensions that assess temporal quality (temporal_flickering, motion_smoothness, subject_consistency) are local metrics — they compare adjacent or nearby frames. None measure how visual coherence evolves over long rollouts.
This creates a measurable blind spot. A model can score perfectly on temporal_flickering while accumulating significant semantic drift over 50+ frames. Our experiments across 8 models and 3 scene categories confirm that the proposed metric consistently reveals model characteristics invisible to existing dimensions.

Proposed metric
Sample frame pairs at increasing temporal distances Δt = [2, 5, 10, 20, 50] frames. At each Δt, compute the mean cosine similarity between the DINO features of the two frames, averaged across all valid pairs. Return a per-Δt degradation curve and a scalar summary score.

Flat curve → strong long-horizon coherence
Steep curve → compounding drift, invisible to adjacent-frame metrics
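
The sampling scheme above can be sketched in a few lines. This is a minimal, framework-free illustration and not the code on the branch. It assumes per-frame feature vectors (for example, DINO embeddings) have already been extracted, that "all valid pairs" means every pair (i, i + Δt) that fits inside the clip, and that the scalar summary is the unweighted mean over Δt. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]

def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0

def coherence_curve(features: List[Vector],
                    deltas: Sequence[int] = (2, 5, 10, 20, 50)) -> Dict[int, float]:
    """Mean similarity of frame pairs (i, i + dt) for each temporal distance dt.

    Assumption: every in-range pair is used; distances that do not fit
    inside the clip are skipped rather than padded.
    """
    n = len(features)
    curve: Dict[int, float] = {}
    for dt in deltas:
        if dt >= n:
            continue  # no valid pair at this distance
        sims = [cosine_similarity(features[i], features[i + dt])
                for i in range(n - dt)]
        curve[dt] = sum(sims) / len(sims)
    return curve

def summary_score(curve: Dict[int, float]) -> float:
    """Scalar summary: unweighted mean over the available distances."""
    if not curve:
        raise ValueError("clip is too short for every requested distance")
    return sum(curve.values()) / len(curve)

# Synthetic stand-ins for extracted features (hypothetical data):
# identical frames, and features that rotate slowly away from the first frame.
stable = [[1.0, 2.0, 3.0] for _ in range(60)]
drifting = [[math.cos(0.01 * i), math.sin(0.01 * i)] for i in range(60)]

stable_curve = coherence_curve(stable)
drifting_curve = coherence_curve(drifting)
```

With these inputs the stable clip gives a flat curve at 1.0, while the drifting clip gives a curve that decreases as Δt grows.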

Automated evaluation results
Sanity checks
Synthetic stable video (identical frames): score 1.0000, perfectly flat ✅
Synthetic drifting video (linearly growing noise): score 0.9025, monotonically degrading ✅
Big Buck Bunny (complex natural motion): score 0.8422, steep curve ✅
Results reproducible — identical scores across two independent runs on NVIDIA A40.

Cross-model comparison — Camera Motion
Same prompt (Alhambra, pan left), 8 models, VBench-2.0 sampled videos:
| Model | Score | dt=2 | dt=5 | dt=10 | dt=20 | dt=50 |
|---|---|---|---|---|---|---|
| StepVideo | 0.9839 | 0.995 | 0.990 | 0.983 | 0.980 | 0.971 |
| CogVideo | 0.9819 | 0.993 | 0.988 | 0.983 | 0.977 | 0.969 |
| HunyuanVideo | 0.9800 | 0.990 | 0.982 | 0.978 | 0.984 | 0.966 |
| Kling | 0.9785 | 0.994 | 0.988 | 0.983 | 0.976 | 0.951 |
| Sora | 0.9665 | 0.991 | 0.984 | 0.978 | 0.967 | 0.912 |
| Wanx | 0.9623 | 0.988 | 0.981 | 0.975 | 0.960 | 0.908 |
| Vidu_Q1 | 0.9572 | 0.982 | 0.975 | 0.963 | 0.955 | 0.911 |
| Veo3 | 0.9458 | 0.997 | 0.989 | 0.972 | 0.939 | 0.831 |

Finding: Veo3 scores highest at dt=2 (0.997 — near-perfect local smoothness) but lowest overall (0.9458) due to steep long-horizon drift. This pattern is completely invisible to existing adjacent-frame metrics.

Cross-category comparison
3 models × 2 scene categories:
| Model | Category | Score | dt=2 | dt=50 |
|---|---|---|---|---|
| Sora | Complex Landscape | 0.9873 | 0.996 | 0.980 |
| Sora | Instance Preservation | 0.9637 | 0.994 | 0.899 |
| CogVideo | Complex Landscape | 0.9162 | 0.984 | 0.774 |
| CogVideo | Instance Preservation | 0.9116 | 0.955 | 0.848 |
| HunyuanVideo | Complex Landscape | 0.8979 | 0.992 | 0.690 |
| HunyuanVideo | Instance Preservation | 0.8907 | 0.988 | 0.662 |

Finding 1: Rankings are category-dependent. Sora ranks 5th of 8 on Camera Motion, behind both CogVideo and HunyuanVideo, but 1st of the 3 models compared on Complex Landscape and Instance Preservation, by a wide margin.
Finding 2: HunyuanVideo shows severe long-horizon drift on complex scenes. Its dt=50 similarity drops to 0.690 and 0.662 despite strong local smoothness at dt=2 (0.992, 0.988).
Finding 3: The metric is sensitive to physical scene changes. "A ball rolls off the table" scores lower than landscape scenes, which is the expected behavior: rapid subject disappearance naturally degrades long-horizon feature similarity.

Human alignment validation (n=30, Prolific)
We recruited 30 participants via Prolific to validate whether long_horizon_coherence aligns with human perception of temporal consistency. Participants were shown 8 video pairs and asked: "Which video maintains its visual consistency better from start to finish?" Videos were labeled only as Video A / Video B — no model names or scores shown.
Spearman rank correlation between metric score differences and human preference rates across the 8 pairs: ρ = 0.854 (p = 0.007).
Per-pair results:
| Pair | Category | Higher-scoring model | Score diff | % chose higher | % equal | % chose lower | Agrees |
|---|---|---|---|---|---|---|---|
| p06 | Complex Landscape | Sora | 0.0711 | 93.3% | 6.7% | 0.0% | ✅ |
| p07 | Instance Preservation | Sora | 0.0730 | 83.3% | 13.3% | 3.3% | ✅ |
| p01 | Camera Motion | StepVideo | 0.0381 | 63.3% | 13.3% | 23.3% | ✅ |
| p02 | Camera Motion | CogVideo | 0.0247 | 50.0% | 30.0% | 20.0% | ➖ |
| p03 | Camera Motion | Kling | 0.0162 | 46.7% | 36.7% | 16.7% | ❌ |
| p04 | Camera Motion | Sora | 0.0207 | 46.7% | 16.7% | 36.7% | ❌ |
| p05 | Camera Motion | StepVideo | 0.0216 | 46.7% | 40.0% | 13.3% | ❌ |
| p08 | Instance Preservation | CogVideo | 0.0209 | 23.3% | 10.0% | 66.7% | ❌ |

The pattern is interpretable and consistent:
When the metric score difference is large (>0.05), humans agree strongly: 83–93% chose the higher-scoring model. When the difference is small (<0.03), only 23–50% chose the higher-scoring model, and in four of the five such pairs no option wins a majority, consistent with videos that are genuinely hard to distinguish perceptually. The one intermediate pair (p01, diff=0.038) sits between the two regimes at 63.3%. This is the behavior expected of a valid metric operating near the threshold of human perception.
The one reversal (p08: CogVideo vs HunyuanVideo, diff=0.021) falls in the small-difference range where human judgment is unreliable, and should be interpreted as ambiguity rather than metric error.
The low overall inter-annotator agreement (Fleiss κ = 0.04) reflects the high ambiguity of close pairs. On large-difference pairs, annotators were highly consistent: 93.3% chose the same video on p06 and 83.3% on p07.
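
For readers who want to check the rank correlation, it can be recomputed from the per-pair table alone. The sketch below is a dependency-free illustration, not the analysis script used for the study. It assumes the correlation is taken between the "Score diff" column and the share of participants who chose the higher-scoring model, with tied values given their average rank. The helper names are hypothetical, and the p-value is not recomputed here.

```python
import math
from typing import List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of positions start..end, 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Values copied from the per-pair table, in row order p06, p07, p01, p02, p03, p04, p05, p08.
score_diff = [0.0711, 0.0730, 0.0381, 0.0247, 0.0162, 0.0207, 0.0216, 0.0209]
pct_higher = [93.3, 83.3, 63.3, 50.0, 46.7, 46.7, 46.7, 23.3]

rho = spearman(score_diff, pct_higher)
print(f"Spearman rho = {rho:.3f}")
```

With only 8 pairs and a three-way tie at 46.7%, the coefficient is sensitive to how ties are ranked, so a different tie convention would give a slightly different value.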

Implementation

No new model downloads — reuses DINO weights from subject_consistency
5 files: vbench/long_horizon_coherence.py (new), vbench/__init__.py, vbench/utils.py, vbench/VBench_full_info.json (+5 prompts), prompts/prompts_per_dimension/long_horizon_coherence.txt (100 prompts across 8 scene categories)
Validated on NVIDIA A40 GPU — results reproducible across independent runs
Branch: dussaud:feature/long-horizon-coherence

Open questions for maintainers

Is this dimension direction welcome for VBench staging/v1?
Any preference on scalar summary — mean across Δt (current) vs AUC weighted by interval width?
Should the prompt suite be expanded further beyond 100 prompts before a formal PR?
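
To make the scalar-summary question concrete, here is a small comparison of the two candidates on rows from the Camera Motion table. It is only a sketch: "AUC weighted by interval width" is interpreted here as a trapezoidal area over Δt divided by the covered range (50 - 2), which is an assumption about the intended definition, and the function names are hypothetical.

```python
from typing import Sequence

DELTAS = (2, 5, 10, 20, 50)

def mean_score(curve: Sequence[float]) -> float:
    """Current summary: unweighted mean of the per-dt similarities."""
    return sum(curve) / len(curve)

def auc_score(curve: Sequence[float], deltas: Sequence[int] = DELTAS) -> float:
    """Trapezoidal area under the curve, normalized by the covered dt range.

    Assumed interpretation of "AUC weighted by interval width": each segment
    between neighboring distances contributes in proportion to its width.
    """
    area = 0.0
    for k in range(len(deltas) - 1):
        width = deltas[k + 1] - deltas[k]
        area += 0.5 * (curve[k] + curve[k + 1]) * width
    return area / (deltas[-1] - deltas[0])

# Per-dt values copied from the Camera Motion table.
stepvideo = (0.995, 0.990, 0.983, 0.980, 0.971)
veo3 = (0.997, 0.989, 0.972, 0.939, 0.831)

for name, curve in (("StepVideo", stepvideo), ("Veo3", veo3)):
    print(f"{name}: mean={mean_score(curve):.4f}  auc={auc_score(curve):.4f}")
```

Under this interpretation the 20 to 50 segment carries 30/48 of the weight, so long-horizon drift dominates the summary: Veo3 moves from roughly 0.946 (mean) to roughly 0.916 (AUC), while StepVideo barely changes.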

Happy to iterate based on feedback.
