Official PyTorch code for From Ideal to Real: Stable Video Object Removal under Imperfect Conditions
⭐ If SVOR is helpful to your projects, please help star this repo. Thanks! 🤗

- Apr. 21st, 2026: Inference code and pretrained LoRAs are now available. 🎉
- Apr. 16th, 2026: 🏆 Our SVOR-based solution won 1st place at the Physics-aware Video Instance Removal Challenge, CVPR 2026. 🎉
- Apr. 10th, 2026: The Video Removal Skill is now live on ClawHub! Powered by SVOR (an internally updated version) and MiMo-V2-Omni, it removes objects from your videos using just a text prompt, no mask required. Pro tip: pair it with MiMo-V2-Pro for the ultimate experience! 🎉
- Apr. 10th, 2026: GitHub repository and project page are now available. 🎉
- Mar. 10th, 2026: We released our paper on arXiv.
- Release Inference Code and Pretrained Models
- Release Skill, use this SVOR_API_KEY: `sk-mipixgen-test`
- Release GitHub Repository and Project Page
- Release Paper
Removing objects from videos remains difficult in the presence of real-world imperfections such as shadows, abrupt motion, and defective masks. Existing diffusion-based video inpainting models often struggle to maintain temporal stability and visual consistency under these challenges. We propose Stable Video Object Removal (SVOR), a robust framework that achieves shadow-free, flicker-free, and mask-defect-tolerant removal through three key designs: (1) Mask Union for Stable Erasure (MUSE), a windowed union strategy applied during temporal mask downsampling to preserve all target regions observed within each window, effectively handling abrupt motion and reducing missed removals; (2) Denoising-Aware Segmentation (DA-Seg), a lightweight segmentation head on a decoupled side branch equipped with Denoising-Aware AdaLN and trained with mask degradation to provide an internal diffusion-aware localization prior without affecting content generation; and (3) Curriculum Two-Stage Training, where Stage I performs self-supervised pretraining on unpaired real-background videos with online random masks to learn realistic background and temporal priors, and Stage II refines on synthetic pairs using mask degradation and side-effect-weighted losses, jointly removing objects and their associated shadows/reflections while improving cross-domain robustness. Extensive experiments show that SVOR attains new state-of-the-art results across multiple datasets and degraded-mask benchmarks, advancing video object removal from ideal settings toward real-world applications.
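To make the first design concrete, here is a minimal PyTorch sketch of a MUSE-style windowed mask union; the function name, window size, and tensor layout are illustrative assumptions, not the repo's actual API:

```python
import torch

def muse_downsample(masks: torch.Tensor, window: int = 4) -> torch.Tensor:
    """Illustrative sketch of MUSE-style temporal mask downsampling.

    masks: binary masks of shape (T, H, W). Instead of subsampling one
    frame per window (which can drop fast-moving targets), take the
    union (elementwise max) over each temporal window so every region
    observed anywhere in the window stays marked for removal.
    """
    t, h, w = masks.shape
    t_out = t // window                                  # output frame count
    windows = masks[: t_out * window].reshape(t_out, window, h, w)
    return windows.amax(dim=1)                           # union of binary masks
```

And a sketch of a denoising-aware AdaLN layer in the standard DiT style, i.e., a LayerNorm modulated by the diffusion timestep embedding; the exact conditioning used by DA-Seg may differ:

```python
import torch
import torch.nn as nn

class DenoisingAwareAdaLN(nn.Module):
    """DiT-style AdaLN: LayerNorm whose scale/shift are predicted from the
    diffusion timestep embedding, letting a side-branch segmentation head
    adapt its features to the current noise level."""

    def __init__(self, dim: int, t_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(t_dim, 2 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) tokens; t_emb: (B, t_dim) timestep embedding
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```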
For more visual results, please check out our project page.
| Masked Input | Result |
|---|---|
The code is tested with Python 3.10.
- Clone Repo

```bash
git clone https://github.com/xiaomi-research/SVOR.git
```
- Create Conda Environment and Install Dependencies

```bash
# create new anaconda env
conda create -n svor python=3.10 -y
conda activate svor

# install pytorch
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0

# install other python dependencies
pip install -r requirements.txt
```
- [Optional] Install flash-attn; refer to flash-attention

```bash
pip install packaging ninja psutil
pip install flash-attn==2.7.4.post1 --no-build-isolation
```
Alternatively, build and run with Docker (note that Docker image names must be lowercase):

```bash
docker build -f Dockerfile.ds -t svor:latest .
docker run --gpus all -it --rm -v /path/to/videos:/data -v /path/to/models:/root/models svor:latest
```

Download pretrained weights and put them in `models/`:
- Download Wan-AI/Wan2.1-VACE-1.3B
- Download our two trained LoRAs from HigherHu/SVOR (or script both downloads as sketched below)
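If you prefer to script the download, here is a minimal sketch using `huggingface_hub`; the LoRA file names are assumed to match the `models/` tree below:

```python
from huggingface_hub import hf_hub_download, snapshot_download

# base model: Wan2.1-VACE-1.3B
snapshot_download("Wan-AI/Wan2.1-VACE-1.3B", local_dir="models/Wan2.1-VACE-1.3B")

# the two trained LoRAs (file names assumed from the models/ tree below)
for name in ["remove_model_stage1.safetensors", "remove_model_stage2.safetensors"]:
    hf_hub_download(repo_id="HigherHu/SVOR", filename=name, local_dir="models")
```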
The files in `models/` are as follows:

```
models/
├── put models here.txt
├── remove_model_stage1.safetensors
├── remove_model_stage2.safetensors
└── Wan2.1-VACE-1.3B/
```
Run the following script, and results will be saved to `samples/SVOR/`:
```bash
python predict_SVOR.py \
    --input_video samples/input/bmx-bumps_raw.mp4 \
    --input_mask_video samples/input/bmx-bumps_mask.mp4
```

Usage:

```bash
python predict_SVOR.py [options]
```
Some key options:

```
--input_video          Path to input video
--input_mask_video     Path to mask video
--num_inference_steps  Inference steps (default: 20)
--save_dir             Output directory
--sample_size          Frame size: height,width (default: 720,1280)
```
ATTENTION:

- By default, inference uses about 33 GB of GPU memory.
- To run inference on a GPU with 24 GB of memory (e.g., RTX 3090, RTX 4090), set `--gpu_memory_mode` to `model_cpu_offload`.
- To further reduce GPU memory usage, set `--sample_size` to `480,832` or smaller.
- Install SAM2 and download the pretrained weights `sam2.1_hiera_large.pt` to `models/`.
- Start the Gradio demo:

```bash
python -m demo.gradio_app
```

Ensure it prints the following information:

```
... [Info] SAM2 Predictor initialized successfully
... [Info] Removal model Predictor initialized successfully
Running on local URL: http://0.0.0.0:7861
```

- Open the web page: http://[ServerIP]:7861
Usage:
1. Upload a video and click the "Process video" button in the "1. Upload and Preprocess" tab.
2. Switch to the "2. Annotate and Propagate" tab and click to segment the objects.
3. Click "Add annotation" and "Propagate masks" to finish the segmentation.
4. Check the object ID in "Display object list", then switch to the "3. Remove Objects" tab.
5. Click "Preview video" to preview the input video and mask video.
6. Click "Start removal" to run the SVOR algorithm.
The RORD-50 dataset can be downloaded from HigherHu/RORD-50.
Our work benefits from the following open-source projects:
If you find our repo useful for your research, please consider citing our paper:
```bibtex
@article{hu2026svor,
  title={From Ideal to Real: Stable Video Object Removal under Imperfect Conditions},
  author={Hu, Jiagao and Chen, Yuxuan and Li, Fuhao and Wang, Zepeng and Wang, Fei and Zhou, Daiguo and Luan, Jian},
  journal={arXiv preprint arXiv:2603.09283},
  year={2026}
}
```