[Feature] New evaluation dimension: long_horizon_coherence #206

Motivation
All existing VBench dimensions that assess temporal quality (temporal_flickering, motion_smoothness, subject_consistency) are local metrics — they compare adjacent or nearby frames. None measure how visual coherence evolves over long rollouts.
This creates a measurable blind spot. A model can score perfectly on temporal_flickering while accumulating significant semantic drift over 50+ frames. Our experiments across 8 models and 3 scene categories show that the proposed dimension consistently reveals model characteristics that the existing dimensions do not capture.

Proposed metric
Sample frame pairs at increasing temporal distances Δt = [2, 5, 10, 20, 50] frames. At each Δt, compute the mean DINO cosine similarity across all valid pairs. Return a per-Δt degradation curve and a scalar summary score (currently the mean across Δt).

Flat curve → strong long-horizon coherence
Steep curve → compounding drift, invisible to adjacent-frame metrics
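
A minimal sketch of the computation described above, not the code on the branch. It assumes per-frame embeddings (e.g. from DINO) have already been extracted into an array, uses every pair (t, t+Δt) that fits inside the video, and takes the scalar as the plain mean across Δt. The names `coherence_curve` and `summary_score` are illustrative only.

```python
import numpy as np

DELTAS = (2, 5, 10, 20, 50)  # temporal distances from the proposal


def coherence_curve(features, deltas=DELTAS):
    """Mean cosine similarity between frames t and t+dt, for each dt.

    features: (num_frames, dim) array of per-frame embeddings.
    Distances with no valid pair (dt >= num_frames) are skipped.
    """
    feats = np.asarray(features, dtype=np.float64)
    # L2-normalise so that a row-wise dot product equals cosine similarity.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    feats = feats / np.clip(norms, 1e-12, None)
    curve = {}
    for dt in deltas:
        if dt >= len(feats):
            continue
        sims = np.sum(feats[:-dt] * feats[dt:], axis=1)
        curve[dt] = float(sims.mean())
    return curve


def summary_score(curve):
    """Scalar summary: unweighted mean of the per-dt values."""
    return float(np.mean(list(curve.values())))


# Toy inputs (hypothetical, not the synthetic videos used in the sanity checks):
# identical features, and a 2-D feature rotating by 0.01 rad per frame,
# for which the similarity at distance dt is cos(0.01 * dt).
stable = np.ones((60, 2))
angles = 0.01 * np.arange(60)
drifting = np.stack([np.cos(angles), np.sin(angles)], axis=1)

for name, feats in [("stable", stable), ("drifting", drifting)]:
    curve = coherence_curve(feats)
    print(name, round(summary_score(curve), 4), curve)
```

The stable input gives a flat curve at 1.0, while the rotating input gives a curve that falls with Δt, which is the qualitative behaviour the two bullet points above describe.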


Automated evaluation results
Sanity checks
Synthetic stable video (identical frames): score 1.0000, perfectly flat ✅
Synthetic drifting video (linearly growing noise): score 0.9025, monotonically degrading ✅
Big Buck Bunny (complex natural motion): score 0.8422, steep curve ✅
Results reproducible — identical scores across two independent runs on NVIDIA A40.

Cross-model comparison — Camera Motion
Same prompt (Alhambra, pan left), 8 models, VBench-2.0 sampled videos:
Model           Score   dt=2   dt=5   dt=10  dt=20  dt=50
StepVideo       0.9839  0.995  0.990  0.983  0.980  0.971
CogVideo        0.9819  0.993  0.988  0.983  0.977  0.969
HunyuanVideo    0.9800  0.990  0.982  0.978  0.984  0.966
Kling           0.9785  0.994  0.988  0.983  0.976  0.951
Sora            0.9665  0.991  0.984  0.978  0.967  0.912
Wanx            0.9623  0.988  0.981  0.975  0.960  0.908
Vidu_Q1         0.9572  0.982  0.975  0.963  0.955  0.911
Veo3            0.9458  0.997  0.989  0.972  0.939  0.831
Finding: Veo3 scores highest at dt=2 (0.997 — near-perfect local smoothness) but lowest overall (0.9458) due to steep long-horizon drift. This pattern is completely invisible to existing adjacent-frame metrics.

Cross-category comparison
3 models × 2 scene categories:
Model           Category                Score   dt=2   dt=50
Sora            Complex Landscape       0.9873  0.996  0.980
Sora            Instance Preservation   0.9637  0.994  0.899
CogVideo        Complex Landscape       0.9162  0.984  0.774
CogVideo        Instance Preservation   0.9116  0.955  0.848
HunyuanVideo    Complex Landscape       0.8979  0.992  0.690
HunyuanVideo    Instance Preservation   0.8907  0.988  0.662
Finding 1: Rankings are category-dependent. Sora ranks 5th of 8 on Camera Motion but 1st of the 3 models compared on both Complex Landscape and Instance Preservation, by a wide margin.
Finding 2: HunyuanVideo shows severe long-horizon drift on complex scenes. Its dt=50 similarity drops to 0.690 and 0.662 despite strong local smoothness at dt=2 (0.992, 0.988).
Finding 3: The metric is physically sensitive. "A ball rolls off the table" scores lower than landscape scenes. This is expected, since rapid subject disappearance naturally degrades long-horizon feature similarity.

Human alignment validation (n=30, Prolific)
We recruited 30 participants via Prolific to validate whether long_horizon_coherence aligns with human perception of temporal consistency. Participants were shown 8 video pairs and asked: "Which video maintains its visual consistency better from start to finish?" Videos were labeled only as Video A / Video B — no model names or scores shown.
Spearman rank correlation between metric score differences and human preference rates across the 8 pairs: ρ = 0.854 (p = 0.007).
Per-pair results:
Pair  Category              Higher model   Score diff  %Higher  %Equal  %Lower  Agrees
p06   Complex Landscape     Sora           0.0711      93.3%    6.7%    0.0%    ✅
p07   Instance Preservation Sora           0.0730      83.3%    13.3%   3.3%    ✅
p01   Camera Motion         StepVideo      0.0381      63.3%    13.3%   23.3%   ✅
p02   Camera Motion         CogVideo       0.0247      50.0%    30.0%   20.0%   ➖
p03   Camera Motion         Kling          0.0162      46.7%    36.7%   16.7%   ❌
p04   Camera Motion         Sora           0.0207      46.7%    16.7%   36.7%   ❌
p05   Camera Motion         StepVideo      0.0216      46.7%    40.0%   13.3%   ❌
p08   Instance Preservation CogVideo       0.0209      23.3%    10.0%   66.7%   ❌
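
For readers who want to check the correlation, here is a small self-contained sketch. It assumes the reported value was computed between the Score diff and %Higher columns of the table above, using average ranks for the three tied 46.7% entries. With those assumptions it yields ρ ≈ 0.854. The helper names are illustrative, and `scipy.stats.spearmanr` would be the usual library route.

```python
from math import sqrt

# Columns copied from the per-pair table, in row order:
# p06, p07, p01, p02, p03, p04, p05, p08
score_diff = [0.0711, 0.0730, 0.0381, 0.0247, 0.0162, 0.0207, 0.0216, 0.0209]
pct_higher = [93.3, 83.3, 63.3, 50.0, 46.7, 46.7, 46.7, 23.3]


def average_ranks(values):
    """Rank values ascending from 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


rho = spearman(score_diff, pct_higher)
print(f"rho = {rho:.3f}")
```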
The pattern is interpretable and consistent:
When the metric score difference is large (>0.05), humans agree strongly: 83–93% chose the higher-scoring model. When the difference is small (<0.03), no pair reaches a majority for the higher-scoring model (23–50%), consistent with videos that are genuinely hard to distinguish perceptually. The one pair in between (p01, diff=0.038) shows a moderate preference of 63.3%. This is the behavior expected of a valid metric operating near the threshold of human perception.
Only p08 (CogVideo vs HunyuanVideo, diff=0.021) is a true reversal, with 66.7% preferring the lower-scoring model. In p03 to p05 the higher-scoring model was still chosen more often than the lower-scoring one, but without reaching a majority. The p08 reversal falls in the small-difference range where human judgment is unreliable, and should be interpreted as ambiguity rather than metric error.
Overall inter-annotator agreement is low (Fleiss κ = 0.04), which reflects the high ambiguity of the close pairs. On large-difference pairs, annotators were highly consistent: 93.3% chose the same video on p06 and 83.3% on p07.

Implementation

No new model downloads — reuses DINO weights from subject_consistency
5 files: vbench/long_horizon_coherence.py (new), vbench/__init__.py, vbench/utils.py, vbench/VBench_full_info.json (+5 prompts), prompts/prompts_per_dimension/long_horizon_coherence.txt (100 prompts across 8 scene categories)
Validated on NVIDIA A40 GPU — results reproducible across independent runs
Branch: [dussaud:feature/long-horizon-coherence](https://github.com/dussaud/VBench/tree/feature/long-horizon-coherence)


Open questions for maintainers

Is this dimension direction welcome for VBench staging/v1?
Any preference on scalar summary — mean across Δt (current) vs AUC weighted by interval width?
Should the prompt suite be expanded further beyond 100 prompts before a formal PR?
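
To make the scalar-summary question concrete, here is a small sketch comparing the two options on the Veo3 row of the Camera Motion table. "AUC weighted by interval width" is interpreted here as a trapezoidal area over Δt divided by the Δt span, which is an assumption about what that option would mean. The function names are illustrative.

```python
DELTAS = [2, 5, 10, 20, 50]
# Per-dt similarities for Veo3, copied from the Camera Motion table.
veo3 = [0.997, 0.989, 0.972, 0.939, 0.831]


def mean_summary(curve):
    """Current summary: unweighted mean across the sampled distances."""
    return sum(curve) / len(curve)


def auc_summary(deltas, curve):
    """Trapezoidal area under the curve, normalised by the dt span."""
    points = list(zip(deltas, curve))
    area = sum(
        (d1 - d0) * (s0 + s1) / 2
        for (d0, s0), (d1, s1) in zip(points, points[1:])
    )
    return area / (deltas[-1] - deltas[0])


print(f"mean = {mean_summary(veo3):.4f}")
print(f"auc  = {auc_summary(DELTAS, veo3):.4f}")
```

On these rounded table values the mean comes out near 0.946 and the width-normalised AUC near 0.916. The AUC variant gives the 20 to 50 interval 30 of the 48 units of weight, so it penalises long-horizon drift more heavily than the unweighted mean does.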

Happy to iterate based on feedback.
