Danah Yatim*, Rafail Fridman*, Omer Bar-Tal, Tali Dekel
Weizmann Institute of Science
(* equal contribution)
This repository contains the official implementation of the paper DynVFX: Augmenting Real Videos with Dynamic Content
DynVFX augments real-world videos with new dynamic content described by a simple user-provided text instruction. The framework automatically infers where the synthesized content should appear, how it should move, and how it should harmonize at the pixel level with the scene, without requiring any additional user input. The key idea is to selectively extend the attention mechanism in a pre-trained text-to-video diffusion model, enforcing the generation to be content-aware of existing scene elements (anchors) from the original video. This allows the model to generate content that naturally interacts with the environment, producing complex and realistic video edits in a fully automated way.
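The attention-extension idea can be sketched in a few lines: keys and values cached from the original video (the anchors) are concatenated to the generated video's own keys and values, so every query can also attend to existing scene content. This is a minimal single-head NumPy illustration of the general technique, not the paper's actual AnchorExtAttn implementation; all names and shapes are our assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def anchor_extended_attention(q, k, v, k_anchor, v_anchor):
    """q, k, v: (N, D) tokens of the generated video.
    k_anchor, v_anchor: (M, D) keys/values cached from the source video."""
    # Extend the key/value sequences with the anchors along the token axis.
    k_ext = np.concatenate([k, k_anchor], axis=0)   # (N + M, D)
    v_ext = np.concatenate([v, v_anchor], axis=0)   # (N + M, D)
    scores = q @ k_ext.T / np.sqrt(q.shape[-1])     # (N, N + M)
    return softmax(scores) @ v_ext                  # (N, D)
```

The queries stay untouched; only the attended-to context grows, which is what lets new content stay aware of the original scene without regenerating it.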
For more, visit the project webpage.
conda create -n dynvfx python=3.12
conda deactivate
conda activate dynvfx
pip install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
git clone --recursive https://github.com/DanahYatim/dynvfx.git
cd dynvfx
If you already cloned without --recursive:
git submodule update --init --recursive
pip install -r requirements.txt
cd third_party/evfsam2
pip install -r requirements.txt
cd model/segment_anything_2
python setup.py build_ext --inplace
cd ../../..
This repository uses OpenAI's GPT-4o as the VFX Assistant. Create an API key at OpenAI Platform.
Save your key in vfx_assistant/.env:
OPENAI_API_KEY=<your_key>
# 1. Prepare your video frames (720x480, 49 frames at 8fps)
ffmpeg -i input.mp4 -vf "scale=720:480,fps=8" data/my_video/%05d.png
# 2. Edit configs/user_config.yaml with your paths and desired content
# 3. Run inversion to extract reference keys and values
python inversion.py --user_config_path configs/user_config.yaml
# 4. Run DynVFX
python run.py --user_config_path configs/user_config.yaml
Edit configs/user_config.yaml with the following parameters:
| Parameter | Description |
|---|---|
| `data_path` | Path to the input video frames directory |
| `new_content` | Text instruction describing the new content to add |
| `output_path` | Directory where output files will be saved |
| `target_folder` | Name of the edit; subdirectory where the edited video will be saved |
| `masks_dir` | Directory for prominent-element segmentation masks |
| `latents_path` | Directory for inverted latents |
| `mode` | Run mode: `"auto"`, `"generate"`, or `"execute"` |
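A filled-in config might look like the following. The field names come from the table above; every value here is an illustrative placeholder, not a shipped example.

```yaml
# Example user_config.yaml (placeholder values)
data_path: data/my_video/original
new_content: "a large whale breaching out of the water"
output_path: outputs/my_video
target_folder: whale
masks_dir: outputs/my_video/masks
latents_path: outputs/my_video/latents
mode: "auto"
```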
See Tips section for configuration options.
Your input video should be provided as individual frames in a directory:
data/input_frames/
├── 00000.png
├── ...
└── 00048.png
The method works best with:
- Resolution: 720×480
- Frame rate: 8 fps
- Frame count: 49 frames (~6 seconds)
Resize the video and extract the frames:
ffmpeg -i input.mp4 -vf "scale=720:480,fps=8" data/my_video/original/%05d.png
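A quick, stdlib-only sanity check of the resulting frames directory can catch mismatches before running the pipeline. The function names below are ours, and the defaults simply encode the recommendations above (49 PNG frames at 720×480); the image size is read directly from each file's PNG IHDR chunk.

```python
import struct
from pathlib import Path

def png_size(path):
    # The first 24 bytes of a PNG are: 8-byte signature, 4-byte IHDR length,
    # 4-byte "IHDR" tag, then big-endian width and height.
    with open(path, "rb") as f:
        header = f.read(24)
    assert header[:8] == b"\x89PNG\r\n\x1a\n", f"{path} is not a PNG"
    width, height = struct.unpack(">II", header[16:24])
    return width, height

def check_frames(frames_dir, expected=(720, 480), n_frames=49):
    frames = sorted(Path(frames_dir).glob("*.png"))
    assert len(frames) == n_frames, f"expected {n_frames} frames, got {len(frames)}"
    for p in frames:
        size = png_size(p)
        assert size == expected, f"{p.name} is {size}, not {expected}"
```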
To extract the reference keys and values, we first obtain the intermediate latents by inverting the input video:
python inversion.py --user_config_path configs/user_config.yaml
Configuration - Make sure video_path and latents_path are set in your user_config.yaml file.
Note:
- For paper comparison: this step is REQUIRED
- For best quality: run inversion for optimal scene alignment
- For quick testing: can be skipped, but results may drift
The pipeline consists of three stages:
- VFX Assistant: GPT-4o interprets the edit instruction and generates captions
- Text-based Segmentation: EVF-SAM extracts masks of scene elements
- DynVFX Pipeline: iterative refinement with AnchorExtAttn
Run the entire pipeline in one command:
# In configs/user_config.yaml
mode: "auto"

python run.py --user_config_path configs/user_config.yaml
Stage 1: Generate the VFX Assistant and EVF-SAM outputs, then review them
# In configs/user_config.yaml
mode: "generate"

python run.py --user_config_path configs/user_config.yaml
Review the generated protocol at output_path/output_for_vfx_protocol.json and the masks in masks_dir.
Stage 2: Execute with the approved protocol
# In configs/user_config.yaml
mode: "execute"

python run.py --user_config_path configs/user_config.yaml
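The relationship between the three modes can be sketched as follows: "auto" chains the other two, while "generate" stops after producing the protocol so you can review it, and "execute" reuses an already-approved one. This is a hypothetical sketch; the function names are illustrative and the stage callables are passed in explicitly to keep the example self-contained, which is not the repo's actual API.

```python
def run(mode, config, generate_protocol, load_protocol, execute_edit):
    """Dispatch the pipeline according to the run mode in user_config.yaml."""
    if mode in ("auto", "generate"):
        protocol = generate_protocol(config)   # Stage 1: VFX Assistant + EVF-SAM
    else:
        protocol = load_protocol(config)       # reuse the reviewed protocol
    if mode in ("auto", "execute"):
        return execute_edit(config, protocol)  # Stage 2: DynVFX refinement
    return protocol                            # "generate": stop for review
```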
dynvfx/
├── configs/
│   ├── base_config.yaml        # Pipeline hyperparameters
│   ├── user_config.yaml        # User-specific settings
│   └── inversion_config.yaml   # Inversion settings
├── models/
│   ├── get_masks_from_sam.py   # SAM mask generation
│   └── get_source_mask.py      # Source mask extraction
├── utilities/
│   ├── attention_utils.py      # Extended attention modules
│   ├── masking_utils.py        # Mask processing utilities
│   └── utils.py                # General utilities
├── vfx_assistant/
│   ├── protocol.py             # VFX Assistant (GPT-4o)
│   ├── system_prompts.py       # System prompts
│   └── .env                    # API keys (create this)
├── third_party/
│   └── evfsam2/                # EVF-SAM installation
├── dynvfx_pipeline.py          # Main pipeline
├── inversion.py                # DDIM inversion
├── run.py                      # Entry point
└── requirements.txt            # Dependencies
Enable logging to save intermediate results:
# In configs/base_config.yaml
with_logger: True

This saves to output_path:
- Input video and source masks
- Intermediate samples and target masks
- Latent mask visualizations
This work builds on:
- EVF-SAM - Base text-prompted segmentation model
- CogVideoX-5B - Base text-to-video model
- ChatGPT - Base vision-language model
If you use this work, please cite:
@misc{yatim2025dynvfxaugmentingrealvideos,
title={DynVFX: Augmenting Real Videos with Dynamic Content},
author={Danah Yatim and Rafail Fridman and Omer Bar-Tal and Tali Dekel},
year={2025},
eprint={2502.03621},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.03621},
}