πŸ” LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Qixin Hu · Shuai Yang · Wei Huang · Song Han · Yukang Chen

💡 TL;DR

LongLive-RAG turns long video generation into a retrieval problem. Instead of attending only to the most recent sliding window, an autoregressive (AR) video generator looks back over the video it has already generated and pulls in the most relevant past latents as extra context. This cuts error accumulation, identity drift, and background flicker over long horizons, without retraining the base generator.

Figure: LongLive-RAG framework overview.

📰 News

  • 🔥 [2026.06] We release the LongLive-RAG paper and code!

🎬 Demo

🌐 More results and video comparisons on the project page.

Long-horizon comparisons. The native sliding-window baseline (left) accumulates errors and drifts over time, while adding LongLive-RAG (right) preserves subject identity and visual quality.

| Native (baseline) | Native + LongLive-RAG (Ours) |
| --- | --- |
| native_1.mp4 | native_ours_1.mp4 |
| native_2.mp4 | native_ours_2.mp4 |

✨ Highlights

  • 🥇 First of its kind. Among open-ended AR long video generation methods, the first to formulate self-generated latent history as content-addressable retrieval memory.
  • 🔌 Plug-and-play. Works across Causal-Forcing, Self-Forcing, and LongLive with the base generator frozen.
  • 🔎 Searchable history. Retrieves the most relevant past latents as extra context for each new block.
  • ⚡ Consistent wins. Best average VBench-Long rank across lengths and backbones.

🔬 Method Overview

At block t, a standard AR model attends to a sliding-window context. LongLive-RAG inserts retrieved historical entries M_t between the sink and local windows:

Sliding window:   A_sw  = [ C_sink ‖           C_loc ]
LongLive-RAG:     A_rag = [ C_sink ‖  M_t  ‖   C_loc ]

| Stage | What happens |
| --- | --- |
| 1. Indexing | Encode each completed latent block into a compact embedding and store it. |
| 2. Retrieval | Match the current block against past embeddings and pull in the top-K as extra context. |
| 3. Embedding training | Train the encoder offline on self-generated latents, with the base generator frozen. |
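
The indexing and retrieval stages above can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: the `LatentMemory` class, the cosine-similarity scoring, the exclusion of blocks already in the sink or local window, and the temporal ordering of retrieved entries are all assumptions made for illustration. Blocks are represented by their indices and toy 2-D embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class LatentMemory:
    """Toy content-addressable memory over past block embeddings (hypothetical)."""

    def __init__(self):
        self.entries = []  # list of (block_index, embedding)

    def index(self, block_index, embedding):
        # Stage 1: store a compact embedding for each completed block.
        self.entries.append((block_index, embedding))

    def retrieve(self, query, k, exclude=()):
        # Stage 2: score stored blocks against the current query and keep the top-K.
        skip = set(exclude)
        scored = [(cosine(query, emb), idx) for idx, emb in self.entries if idx not in skip]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        # Assumption: retrieved blocks are returned in temporal order.
        return sorted(idx for _, idx in scored[:k])


def assemble_context(sink, retrieved, local):
    # A_rag = [ C_sink ‖ M_t ‖ C_loc ]
    return list(sink) + list(retrieved) + list(local)


memory = LatentMemory()
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.6, 0.4]]
for block_index, embedding in enumerate(toy_embeddings):
    memory.index(block_index, embedding)

sink, local = [0], [4, 5]
retrieved = memory.retrieve(query=[1.0, 0.1], k=2, exclude=sink + local)
context = assemble_context(sink, retrieved, local)
print(retrieved, context)
```

With these toy embeddings, blocks 2 and 3 score highest among the candidates outside the sink and local windows, so the assembled context is `[0, 2, 3, 4, 5]`. In the real system the embeddings come from the trained retrieval autoencoder rather than hand-written vectors.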

🏁 Getting Started

📦 Installation

LongLive-RAG shares its environment with LongLive. Just follow the upstream LongLive installation guide.

🚀 Inference

1. Download everything (two commands). All LongLive-RAG assets (AR backbones, retrieval AE, prompt files, and the toy latent set) live in a single Hugging Face repo; the base WAN VAE comes from Wan:

# Base WAN VAE: LongLive-RAG operates in its latent space
hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir wan_models/Wan2.1-T2V-1.3B

# All LongLive-RAG assets: restores checkpoints/ and toydatasets/ in place
hf download qixinhu11/LongLive-RAG --local-dir . --include "checkpoints/*" "toydatasets/*"

Older setups can swap hf download for huggingface-cli download (same arguments).
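
For scripted setups, the same two downloads can be done from Python with `huggingface_hub.snapshot_download`. This is a minimal sketch, not part of the repository: the `download_assets` helper and its constants are hypothetical names, and the repo IDs and patterns are copied from the commands above. The library import is deferred so the module loads even where `huggingface_hub` is not installed.

```python
import os

WAN_REPO = "Wan-AI/Wan2.1-T2V-1.3B"
RAG_REPO = "qixinhu11/LongLive-RAG"
RAG_PATTERNS = ["checkpoints/*", "toydatasets/*"]


def download_assets(local_dir="."):
    """Fetch the base WAN model and the LongLive-RAG assets (hypothetical helper)."""
    # Deferred import: requires `pip install huggingface_hub`.
    from huggingface_hub import snapshot_download

    # Same target as: hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir wan_models/Wan2.1-T2V-1.3B
    snapshot_download(
        repo_id=WAN_REPO,
        local_dir=os.path.join(local_dir, "wan_models", "Wan2.1-T2V-1.3B"),
    )
    # Same filter as: --include "checkpoints/*" "toydatasets/*"
    snapshot_download(
        repo_id=RAG_REPO,
        local_dir=local_dir,
        allow_patterns=RAG_PATTERNS,
    )


# download_assets()  # uncomment to start the download
```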

The second command lays out:

checkpoints/
├── causal_forcing.pt              # Causal-Forcing AR backbone
├── self_forcing.pt                # Self-Forcing AR backbone
├── longlive_base.pt               # LongLive AR backbone
├── longlive_lora.pt               # LongLive LoRA (paired with longlive_base.pt)
├── ae_latent_mem.pt               # Retrieval autoencoder (default for inference)
├── moviegenbench_128_refined.txt  # 128 MovieGenBench prompts
└── vidprom_filtered_extended.txt  # Self-Forcing prompt pool (for generate_latent.py)
toydatasets/
└── latent_0000xx.pt               # tiny example latent set for the training demo

To train your own retrieval AE instead of using ae_latent_mem.pt, see Training.
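
Before launching inference, it can help to confirm that the download produced the layout listed above. The following check is a sketch and not shipped with the repository; `missing_assets` is a hypothetical helper, and the file names are taken from the tree above. Because the toy latent file names are only given as `latent_0000xx.pt`, the check accepts any `latent_*.pt` file.

```python
from pathlib import Path

# File names copied from the checkpoints/ listing above.
EXPECTED_CHECKPOINTS = [
    "causal_forcing.pt",
    "self_forcing.pt",
    "longlive_base.pt",
    "longlive_lora.pt",
    "ae_latent_mem.pt",
    "moviegenbench_128_refined.txt",
    "vidprom_filtered_extended.txt",
]


def missing_assets(root="."):
    """Return the expected asset paths that are absent under `root`."""
    root = Path(root)
    missing = [
        f"checkpoints/{name}"
        for name in EXPECTED_CHECKPOINTS
        if not (root / "checkpoints" / name).is_file()
    ]
    # The toy set only needs at least one latent file to be present.
    if not any((root / "toydatasets").glob("latent_*.pt")):
        missing.append("toydatasets/latent_*.pt")
    return missing


if __name__ == "__main__":
    for path in missing_assets("."):
        print("missing:", path)
```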

2. Run. The repo ships a 3 × 2 grid (three backbones × two context-assembly methods) in configs/:

| Backbone \ Method | native (sliding-window) | latentmem (LongLive-RAG, ours) |
| --- | --- | --- |
| causal_forcing | causal_forcing_native.yaml | causal_forcing_latentmem.yaml |
| self_forcing | self_forcing_native.yaml | self_forcing_latentmem.yaml |
| longlive | longlive_native.yaml | longlive_latentmem.yaml |

# Main result: Causal-Forcing backbone + LongLive-RAG retrieval
bash inference.sh causal_forcing latentmem

# Baselines: native sliding-window
bash inference.sh causal_forcing native

# GPU / port overrides
GPU=4 PORT=29510 bash inference.sh causal_forcing latentmem
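
The config grid follows a `<backbone>_<method>.yaml` naming pattern, which makes it easy to enumerate all six runs from a script. The helper below is a sketch built only on that naming pattern; `config_path` is a hypothetical function, and whether `inference.sh` resolves its arguments the same way is an assumption.

```python
BACKBONES = ("causal_forcing", "self_forcing", "longlive")
METHODS = ("native", "latentmem")


def config_path(backbone, method, config_dir="configs"):
    """Map a (backbone, method) pair to its YAML path, following the table's naming."""
    if backbone not in BACKBONES:
        raise ValueError(f"unknown backbone: {backbone!r}")
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    return f"{config_dir}/{backbone}_{method}.yaml"


# Enumerate the full 3 x 2 grid.
grid = [config_path(b, m) for b in BACKBONES for m in METHODS]
for path in grid:
    print(path)
```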

πŸ‹οΈ Training

The base generator stays frozen; the only trainable component is the retrieval encoder (a small latent autoencoder). Training has two steps:

Step 1: Build a latent corpus. Run a frozen generator over a prompt pool to collect the clean latent blocks it produces; these become the training samples. The launcher shards generation across multiple GPUs.

bash generate_latent.sh
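
The README does not say how the launcher splits the prompt pool across GPUs. One common scheme is round-robin sharding, sketched below as an illustration only; `shard_prompts` is a hypothetical helper and may not match what generate_latent.py does.

```python
def shard_prompts(prompts, world_size):
    """Split prompts round-robin so that rank r gets items r, r + world_size, ..."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return [prompts[rank::world_size] for rank in range(world_size)]


prompts = [f"prompt_{i}" for i in range(10)]
shards = shard_prompts(prompts, world_size=4)
for rank, shard in enumerate(shards):
    print(rank, shard)
```

Round-robin keeps shard sizes within one item of each other and assigns every prompt to exactly one rank, so the shards jointly cover the prompt pool without duplicates.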

Step 2: Train the retrieval autoencoder. Fit the encoder on the collected latents with a reconstruction loss plus the Window Temporal Delta and trajectory-smoothing terms. Default hyperparameters live in ae/configs/.

bash train_ae_delta.sh

Retraining the base AR backbones is out of scope; backbone checkpoints are consumed as-is. See upstream LongLive / Self-Forcing to train one from scratch.

πŸ—‚οΈ Repository Layout

β”œβ”€β”€ ae/               # Retrieval autoencoder (model, configs, training)
β”œβ”€β”€ checkpoints/      # AR backbones, AE checkpoint, prompt .txt files (gitignored)
β”œβ”€β”€ configs/          # Inference YAMLs (3 backbones Γ— 2 methods) + generate_latent
β”œβ”€β”€ datasets/         # AE training latents (output of generate_latent.sh, gitignored)
β”œβ”€β”€ toydatasets/      # Tiny example latent set for the training demo (from HF, gitignored)
β”œβ”€β”€ pipeline/         # Causal inference pipeline (drives all backbones)
β”œβ”€β”€ utils/            # Dataset, memory, scheduler, lora, wan-wrapper utilities
β”œβ”€β”€ wan/, wan_models/ # WAN VAE backbone (T2V-1.3B)
β”œβ”€β”€ inference.py      # Inference entry point
β”œβ”€β”€ inference.sh      # Launcher: bash inference.sh <backbone> <method>
β”œβ”€β”€ generate_latent.py / .sh  # Latent corpus generation (multi-GPU sharded)
└── train_ae_delta.sh         # Retrieval AE launcher

📄 Citation

📜 Paper: arXiv:2606.02553

@article{longliverag2026,
  title         = {LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation},
  author        = {Hu, Qixin and Yang, Shuai and Huang, Wei and Han, Song and Chen, Yukang},
  journal       = {arXiv preprint arXiv:2606.02553},
  archivePrefix = {arXiv},
  eprint        = {2606.02553},
  year          = {2026}
}

πŸ™ Acknowledgements

LongLive-RAG builds on the codebases and ideas of:

  • LongLive: the AR long-video framework this codebase forks from.
  • Self-Forcing: causal AR training recipe and prompt pool.
  • Causal-Forcing: one of the AR backbones evaluated in this work.
  • Wan: the base video generation model and VAE latent space.

πŸ“ License

Released under the Apache 2.0 license.
