OmnimatteZero

Official implementation of OmnimatteZero: Fast Training-free Omnimatte with Pre-trained Video Diffusion Models

OmnimatteZero is a training-free approach for video matting, object removal, and layer composition using pre-trained video diffusion models. It leverages the powerful priors learned by video generation models to achieve high-quality results without any task-specific training.

Features

Object Removal: Remove objects and their effects (shadows, reflections) from videos
Foreground Extraction: Extract foreground layers with associated effects
Layer Composition: Compose extracted foreground layers onto new backgrounds
All operations are training-free and work with off-the-shelf video diffusion models

Installation

Requirements

Python 3.8+
CUDA-capable GPU (32GB+ VRAM recommended)
PyTorch 2.4+

Setup

# Clone the repository
git clone https://github.com/your-repo/OmnimatteZero.git
cd OmnimatteZero

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Dependencies

The main dependencies include:

torch>=2.4.0
diffusers>=0.31.0
transformers>=4.49.0
accelerate>=1.1.1

Quick Start

Data Preparation

Your input data should be organized as follows:

example_videos/
├── your_video_name/
│   ├── video.mp4          # Original video with object
│   ├── object_mask.mp4    # Mask of the object only
│   └── total_mask.mp4     # Mask including object + effects (shadows, reflections)

video.mp4: The original input video
object_mask.mp4: Binary mask video showing only the object pixels (white = object, black = background)
total_mask.mp4: Binary mask video showing the object AND its effects (shadows, reflections, etc.)

You can generate masks using SAM2.
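Before running the scripts, it can help to confirm that each video folder contains the expected files. The following is a minimal sketch and not part of the repository: the helper name `missing_inputs` is hypothetical, and the file names are taken from the layout above.

```python
from pathlib import Path

# File names taken from the folder layout described above.
REQUIRED_FILES = ("video.mp4", "object_mask.mp4", "total_mask.mp4")


def missing_inputs(video_folder, required=REQUIRED_FILES):
    """Return the required file names that are absent from video_folder."""
    folder = Path(video_folder)
    return [name for name in required if not (folder / name).is_file()]


if __name__ == "__main__":
    base_dir = Path("example_videos")
    if base_dir.is_dir():
        for folder in sorted(p for p in base_dir.iterdir() if p.is_dir()):
            missing = missing_inputs(folder)
            status = "ok" if not missing else "missing: " + ", ".join(missing)
            print(f"{folder.name}: {status}")
```

If you plan to generate total_mask.mp4 automatically, pass `required=("video.mp4", "object_mask.mp4")` so that the missing total mask is not reported.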

Object Removal

Remove an object and its effects (shadows, reflections) from a video.

Usage

python object_removal.py

Configuration

Edit the following parameters in object_removal.py:

# Input directory containing video folders
base_dir = "example_videos"

# Output resolution
expected_height, expected_width = 512, 768

# Generation parameters
num_inference_steps = 30  # More steps = higher quality but slower

How it works

1. Loads the original video and the total mask.
2. Uses the mask to mark the regions to inpaint.
3. The diffusion model fills in the masked regions while maintaining temporal consistency.
4. Outputs a clean background video without the object.
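To make the role of the mask concrete, here is a small NumPy sketch of mask-guided blending, a common way to restrict generated content to a masked region. It is an illustration only: the function name `blend_with_mask` and the array shapes are assumptions, and object_removal.py may combine the mask with the diffusion model differently (for example, in latent space).

```python
import numpy as np


def blend_with_mask(original, generated, mask, threshold=0.5):
    """Keep `original` outside the mask and `generated` inside it.

    original, generated: float arrays of shape (frames, height, width, channels).
    mask: float array of shape (frames, height, width) with values in [0, 1],
          where white (1.0) marks pixels to inpaint.
    """
    # Binarize the mask and add a channel axis so it broadcasts over RGB.
    binary = (mask >= threshold).astype(original.dtype)[..., None]
    return binary * generated + (1.0 - binary) * original


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = rng.random((2, 4, 6, 3))
    generated = rng.random((2, 4, 6, 3))
    mask = np.zeros((2, 4, 6))
    mask[:, 1:3, 2:5] = 1.0  # stand-in for the total mask region
    result = blend_with_mask(original, generated, mask)
    print("unmasked pixels unchanged:", np.allclose(result[:, 0], original[:, 0]))
```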

Output

Results are saved to the results/ directory:

results/
└── video_name.mp4

Generating total_mask from object_mask

If you only have the object_mask, you can automatically generate total_mask using self-attention:

python self_attention_map.py --video_folder ./example_videos/your_video_name

This uses the diffusion model's self-attention to find regions that are semantically related to the object (like shadows and reflections).

How It Works

The self-attention mask generation leverages the internal attention patterns of the video diffusion model:

1. Video Encoding: The input video is encoded to latent space using the VAE.
2. Noise Injection: A controlled amount of noise is added to the latents (flow matching at t=0.5).
3. Attention Extraction: A forward pass through the transformer extracts self-attention maps from all 48 layers.
4. Spatial-Temporal Attention: For each frame, the script computes how much each spatial position attends to object regions across all frames.
5. Mask Generation: Attention values are upsampled, thresholded, and combined with the object mask.
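The noise injection and mask generation steps can be sketched in NumPy as follows. This is a simplified illustration and not the code in self_attention_map.py. It assumes that noise is added with the linear flow-matching interpolation x_t = (1 - t) * x_0 + t * noise, that a single attention matrix (already averaged over layers and heads) is available with queries on rows and keys on columns, and that tokens are flattened in (frame, height, width) order. Upsampling is omitted, and all function names are hypothetical.

```python
import numpy as np


def add_flow_noise(latents, noise, t=0.5):
    """Assumed flow-matching interpolation: x_t = (1 - t) * x_0 + t * noise."""
    return (1.0 - t) * latents + t * noise


def dilate(mask, kernel_size=3):
    """Binary dilation of a 2D boolean mask with a square kernel."""
    pad = kernel_size // 2
    padded = np.pad(mask, pad, mode="constant", constant_values=False)
    height, width = mask.shape
    out = np.zeros_like(mask, dtype=bool)
    # OR together every shifted copy of the mask covered by the kernel.
    for dy in range(kernel_size):
        for dx in range(kernel_size):
            out |= padded[dy:dy + height, dx:dx + width]
    return out


def total_mask_from_attention(attn, object_mask, threshold=None, dilation=3):
    """Combine attention-derived effect regions with the object mask.

    attn: (N, N) attention weights, N = frames * height * width.
    object_mask: (frames, height, width) boolean mask of object tokens.
    """
    object_mask = np.asarray(object_mask, dtype=bool)
    frames, height, width = object_mask.shape
    object_tokens = object_mask.reshape(-1)
    # Attention mass each query position places on object tokens in any frame.
    score = attn[:, object_tokens].sum(axis=1).reshape(frames, height, width)
    if threshold is None:
        # Adaptive threshold from the parameter table: mean + 0.5 * std.
        threshold = score.mean() + 0.5 * score.std()
    combined = (score > threshold) | object_mask
    return np.stack([dilate(frame, dilation) for frame in combined])
```

A position that attends strongly to the object, such as a shadow or reflection, ends up above the threshold and is added to the mask, while positions with near-uniform attention are left out.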

The key insight is that regions affected by the object (shadows, reflections) will have high attention weights to the object itself.

Parameters

| Parameter | Default | Description |
|---|---|---|
| `--video_folder` | (required) | Folder containing `video.mp4` and `object_mask.mp4` |
| `--height` | 512 | Processing height |
| `--width` | 768 | Processing width |
| `--threshold` | adaptive | Attention threshold (None = mean + 0.5*std) |
| `--dilation` | 3 | Morphological dilation kernel size for smoothing |

Example

# Recommended settings (works well for most videos)
python self_attention_map.py --video_folder example_videos/cat_reflection --height 512 --width 768 --threshold 0.07 --dilation 3

# Basic usage with adaptive threshold
python self_attention_map.py --video_folder ./example_videos/cat_reflection

# With custom threshold for more/less effect coverage
python self_attention_map.py --video_folder ./example_videos/cat_reflection --threshold 0.08

# Higher resolution processing
python self_attention_map.py --video_folder ./example_videos/cat_reflection --height 720 --width 1280
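To process several videos in one go, the commands above can be wrapped in a short Python driver. This is a sketch under the assumption that the script is run from the repository root. The helper name `build_command` is hypothetical, and only the flags listed in the parameter table are used.

```python
import subprocess
import sys
from pathlib import Path


def build_command(video_folder, height=512, width=768, threshold=None, dilation=3):
    """Build the argument list for one self_attention_map.py run."""
    cmd = [
        sys.executable, "self_attention_map.py",
        "--video_folder", str(video_folder),
        "--height", str(height),
        "--width", str(width),
        "--dilation", str(dilation),
    ]
    # Leave --threshold out to keep the script's adaptive default.
    if threshold is not None:
        cmd += ["--threshold", str(threshold)]
    return cmd


if __name__ == "__main__":
    base_dir = Path("example_videos")
    folders = sorted(p for p in base_dir.iterdir() if p.is_dir()) if base_dir.is_dir() else []
    for folder in folders:
        if (folder / "total_mask.mp4").exists():
            continue  # a total mask is already present
        subprocess.run(build_command(folder, threshold=0.07), check=True)
```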

Output

The script generates total_mask.mp4 in the same folder as the input, which includes:

The original object mask
Detected effects regions (shadows, reflections, etc.)

Typical mask expansion ratios are 1.5x-2.0x depending on the scene's effects.
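To check the expansion ratio for your own masks, a short helper such as the one below can be used. It assumes the ratio is measured as the white-pixel area of the total mask divided by that of the object mask. The function name `mask_expansion_ratio` is hypothetical, and the masks are expected as arrays with values in [0, 1].

```python
import numpy as np


def mask_expansion_ratio(object_mask, total_mask, threshold=0.5):
    """Ratio of total-mask area to object-mask area, counted in white pixels."""
    object_area = np.count_nonzero(np.asarray(object_mask) >= threshold)
    total_area = np.count_nonzero(np.asarray(total_mask) >= threshold)
    if object_area == 0:
        raise ValueError("object mask is empty")
    return total_area / object_area
```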

💡 Refine with SAM2

For even better results, you can use the attention-based mask as a prompt for SAM2. This two-stage approach combines the semantic understanding of the diffusion model's attention (which knows what regions are related to the object) with SAM2's precise boundary detection (which knows exactly where those regions are).
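One possible way to turn a frame of the attention-based mask into a prompt is to take its bounding box. The sketch below only computes the box. The SAM2 call itself is left out, and the (x_min, y_min, x_max, y_max) ordering is an assumption that should be checked against the SAM2 documentation. The function name `mask_to_box` is hypothetical.

```python
import numpy as np


def mask_to_box(mask, threshold=0.5):
    """Return (x_min, y_min, x_max, y_max) of the white region, or None if empty.

    mask: 2D array for a single frame with values in [0, 1].
    """
    ys, xs = np.nonzero(np.asarray(mask) >= threshold)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```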

Important Note on Attention Guidance

⚠️ Note (Updated 1/26): Our paper describes Temporal Attention Guidance and Spatial Attention Guidance, which use TAP-Net for improved temporal consistency. These features were originally developed for LTX-Video-0.9.1. As of 1/26, fetching the LTX-Video-0.9.1 model triggers a bug that we did not encounter at the time of paper submission (5/25). We are working to fix this issue.

In the meantime, we have upgraded to LTX-Video-0.9.7, which we found achieves good object removal results without the explicit temporal and spatial attention guidance. The current code runs without attention guidance, but the implementation can be found in attention_guidance.py for reference.

Foreground Extraction & Layer Composition

Extract the foreground layer (object + its effects) from a video and compose it onto a new background.

Prerequisites

Before running, you need:

The original video with the object
A clean background video (from object removal)
Object mask and total mask videos
A new background video to compose onto

Usage

python foreground_composition.py

Configuration

Edit the following parameters in foreground_composition.py:

# Output resolution
w, h = 768, 512

# Video folder name
video_folder = "swan_lake"

# New background to compose onto
video_new_bg = load_video("./results/cat_reflection.mp4")

How it works

1. Encodes both the original video and the clean background to latent space.
2. Computes the latent difference (foreground = original - background).
3. Uses pixel injection for the object region to preserve details.
4. Encodes the new background video to latent space.
5. Adds the foreground latents to the new background latents.
6. Applies refinement through a few noising-denoising steps.
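The latent arithmetic at the core of these steps can be sketched in a few lines of NumPy. Random arrays stand in for the VAE latents, and the shape (channels, frames, height, width) is an assumption. Pixel injection and refinement are omitted, and the function names are hypothetical rather than taken from foreground_composition.py.

```python
import numpy as np


def extract_foreground(z_original, z_background):
    """Foreground layer as the latent difference: original - background."""
    return z_original - z_background


def compose(z_foreground, z_new_background):
    """Add the foreground latents onto the new background latents."""
    return z_new_background + z_foreground


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shape = (8, 4, 16, 24)  # assumed (channels, frames, height, width)
    z_original = rng.standard_normal(shape)
    z_background = rng.standard_normal(shape)
    z_new_background = rng.standard_normal(shape)

    z_foreground = extract_foreground(z_original, z_background)
    z_composite = compose(z_foreground, z_new_background)
    print("composite latent shape:", z_composite.shape)
```

As a sanity check, composing the extracted foreground back onto the original clean background reproduces the original latents.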

Output

Results are saved to results/:

results/
├── foreground.mp4       # Extracted foreground layer
├── latent_addition.mp4  # Latent addition result (before refinement)
└── refinement.mp4       # Final refined composition

Generation Parameters Tips

num_inference_steps: 30 is a good default; increase for better quality
guidance_scale: Default is 3.0; adjust based on prompt specificity
denoise_strength: For refinement, 0.3 works well; lower values preserve more details
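To get a feel for how denoise_strength interacts with num_inference_steps, the sketch below assumes the common image-to-image convention in which only the last `int(num_inference_steps * denoise_strength)` steps of the schedule are run. This is an assumption about the convention and foreground_composition.py may behave differently. The function name `refinement_steps` is hypothetical.

```python
def refinement_steps(num_inference_steps=30, denoise_strength=0.3):
    """Number of denoising steps actually run under the assumed convention."""
    if not 0.0 <= denoise_strength <= 1.0:
        raise ValueError("denoise_strength must be in [0, 1]")
    return min(int(num_inference_steps * denoise_strength), num_inference_steps)


if __name__ == "__main__":
    for strength in (0.1, 0.3, 0.5, 1.0):
        print(strength, refinement_steps(30, strength))
```

Under this convention, a lower strength adds less noise and runs fewer steps, which is consistent with lower values preserving more of the composed details.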

Citation

If you find this work useful, please cite our paper:

@inproceedings{samuel2025omnimattezero,
  author    = {Dvir Samuel and Matan Levy and Nir Darshan and Gal Chechik and Rami Ben-Ari},
  title     = {OmnimatteZero: Fast Training-free Omnimatte with Pre-trained Video Diffusion Models},
  booktitle = {SIGGRAPH Asia 2025 Conference Papers},
  year      = {2025}
}
