Taewoong Kang*, Kinam Kim*, Dohyeon Kim*, Minho Park, Junha Hyung, and Jaegul Choo
DAVIAN Robotics, KAIST AI, SNU
arXiv 2025. (* indicates equal contribution)
Teaser video: teaser.mp4
- Release inference code
- Release model weights
- Release data preprocessing code (for inference)
- Release training code
- Release data preprocessing code (for training)
- Release user-friendly interface
- GPU: < 80 GB of memory (for inference), < 140 GB (for training)
- CUDA: 12.1 or higher
- Python: 3.10
- PyTorch: Compatible with CUDA 12.1
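After setting up the environment in the next step, a quick sanity check like the one below (a suggestion, not part of the repository) confirms that PyTorch sees a CUDA-enabled GPU and reports how much memory it has:
# check_env.py -- optional GPU sanity check (not part of the repo)
import torch

assert torch.cuda.is_available(), "CUDA is not available to PyTorch"
print("PyTorch:", torch.__version__, "| CUDA build:", torch.version.cuda)
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")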
Create a conda environment and install dependencies:
# Create conda environment
conda create -n egox python=3.10 -y
conda activate egox
# Install PyTorch with CUDA 12.1
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# Install other dependencies
pip install -r requirements.txt
Download the Wan2.1-I2V-14B model and save it to the checkpoints/pretrained_model/ folder.
pip install huggingface_hub
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='Wan-AI/Wan2.1-I2V-14B-480P-Diffusers', local_dir='./checkpoints/pretrained_model/Wan2.1-I2V-14B-480P-Diffusers')"Download the trained EgoX LoRA weights using one of the following methods:
Option 1: Hugging Face
pip install huggingface_hub
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='DAVIAN-Robotics/EgoX', local_dir='./checkpoints/EgoX', allow_patterns='*.safetensors')"Option 2: Google Drive
- Download from Google Drive and save to the checkpoints/EgoX/ folder.
For quick testing, the codebase includes example data in the example/ directory. You can run inference immediately:
# For in-the-wild example
bash scripts/infer_itw.sh
# For Ego4D example
bash scripts/infer_ego4d.sh
Edit the GPU ID and seed in the script if needed. Results will be saved to ./results/.
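Under the hood, these scripts call infer.py with the pretrained Wan2.1-I2V-14B-480P-Diffusers backbone and the EgoX LoRA weights (the full command is shown later in this README). The snippet below is only a rough sketch of loading that backbone and attaching the LoRA with diffusers; it is not the repository's actual inference code, which additionally handles the ego prior, geometry guidance (GGA), and camera conditioning:
# Rough sketch only: load the Wan2.1 I2V backbone and attach the EgoX LoRA.
# The real pipeline in infer.py does considerably more (ego prior, GGA, cameras).
import torch
from diffusers import WanImageToVideoPipeline

pipe = WanImageToVideoPipeline.from_pretrained(
    "./checkpoints/pretrained_model/Wan2.1-I2V-14B-480P-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.load_lora_weights("./checkpoints/EgoX/pytorch_lora_weights.safetensors")
pipe.to("cuda")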
To run inference with your own data, prepare the following file structure:
your_dataset/                 # Your custom dataset folder
├── meta.json                 # Meta information for each video
├── videos/                   # Videos directory
│   └── take_name/
│       ├── ego_Prior.mp4
│       ├── exo.mp4
│       └── ...
└── depth_maps/               # Depth maps directory
    └── take_name/
        ├── frame_000.npy
        └── ...
meta.json - Meta information for each video
A JSON file containing the exocentric video path, egocentric prior video path, prompt, and camera intrinsic and extrinsic parameters for each video. The structure includes a test_datasets array with one entry per video.
Example:
{
  "test_datasets": [
    {
      "exo_path": "./example/in_the_wild/videos/joker/exo.mp4",
      "ego_prior_path": "./example/in_the_wild/videos/joker/ego_Prior.mp4",
      "prompt": "[Exo view]\n**Scene Overview:**\nThe scene is set on a str...\n\n[Ego view]\n**Scene Overview:**\nFrom the inferred first-person perspective, the environment appears chaotic and filled with sm...",
      "camera_intrinsics": [
        [634.47327, 0.0, 392.0],
        [0.0, 634.4733, 224.0],
        [0.0, 0.0, 1.0]
      ],
      "camera_extrinsics": [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0]
      ],
      "ego_intrinsics": [
        [150.0, 0.0, 255.5],
        [0.0, 150.0, 255.5],
        [0.0, 0.0, 1.0]
      ],
      "ego_extrinsics": [
        [[0.6263, 0.7788, -0.0336, 0.3432],
         [-0.0557, 0.0018, -0.9984, 2.3936],
         [-0.7776, 0.6272, 0.0445, 0.1299]],
        ...
      ]
    },
    ...
  ]
}
To prepare your own dataset, follow the instructions from here.
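Before running inference on custom data, it can help to sanity-check that every entry in your meta.json contains the fields shown above and that the referenced videos exist. The short script below is only a suggestion (not part of the repository):
# validate_meta.py -- optional sanity check for a custom meta.json (not part of the repo)
import json, os, sys

REQUIRED = ["exo_path", "ego_prior_path", "prompt", "camera_intrinsics",
            "camera_extrinsics", "ego_intrinsics", "ego_extrinsics"]

with open(sys.argv[1]) as f:
    meta = json.load(f)

for i, entry in enumerate(meta["test_datasets"]):
    missing = [k for k in REQUIRED if k not in entry]
    if missing:
        print(f"entry {i}: missing keys {missing}")
    for key in ("exo_path", "ego_prior_path"):
        if key in entry and not os.path.exists(entry[key]):
            print(f"entry {i}: {key} not found at {entry[key]}")
print("done")
Usage: python validate_meta.py ./your_dataset/meta.json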
Since EgoX is trained on the Ego-Exo4D dataset, where exocentric camera poses are fixed, you must provide exocentric videos captured with a fixed camera pose as input during inference. The model is also trained at 448×448 (ego) and 448×784 (exo) resolutions with 49 frames, so please preprocess your videos to match.
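If your exocentric clips do not already match this format, the sketch below resizes frames to 448×784 (height×width) and keeps the first 49 frames. OpenCV, the fps value, and the raw_exo.mp4 filename are all assumptions for illustration; ffmpeg or any other tool works equally well:
# resize_exo.py -- sketch: resample an exo clip to 448x784 and 49 frames (not part of the repo)
import cv2

def preprocess(src, dst, height=448, width=784, num_frames=49, fps=16):
    # Naive resize; consider center-cropping first if you want to preserve aspect ratio.
    cap = cv2.VideoCapture(src)
    writer = cv2.VideoWriter(dst, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))
    count = 0
    while count < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(cv2.resize(frame, (width, height)))
        count += 1
    cap.release()
    writer.release()

preprocess("./your_dataset/videos/take_name/raw_exo.mp4",
           "./your_dataset/videos/take_name/exo.mp4")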
Custom dataset initial structure
Before running the script, you need to create a custom dataset folder with the following structure:
your_dataset/                 # Your custom dataset folder
└── videos/                   # Videos directory
    └── take_name/
        └── exo.mp4
Then, use meta_init.py to create a meta.json file with the following command:
python meta_init.py --folder_path ./your_dataset --output_json ./your_dataset/meta.json --overwrite
This produces the following structure:
your_dataset/                 # Your custom dataset folder
├── meta.json                 # Meta information for each video
└── videos/                   # Videos directory
    └── take_name/
        └── exo.mp4
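For reference, the snippet below is only a rough approximation of what such an initialization step might do, assuming meta_init.py simply enumerates videos/*/exo.mp4 and writes skeleton entries; consult the actual script for the exact fields it fills in:
# Sketch only: NOT the repository's meta_init.py. Assumes it scans videos/*/exo.mp4
# and writes placeholder entries; check the real script for the exact behavior.
import glob, json, os

def init_meta(folder_path, output_json):
    entries = []
    for exo in sorted(glob.glob(os.path.join(folder_path, "videos", "*", "exo.mp4"))):
        entries.append({"exo_path": exo, "prompt": ""})  # placeholders filled in later steps
    with open(output_json, "w") as f:
        json.dump({"test_datasets": entries}, f, indent=2)

init_meta("./your_dataset", "./your_dataset/meta.json")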
Then, you can use caption.py to generate a caption for each video with this command:
python caption.py --json_file ./your_dataset/meta.json --output_json ./your_dataset/meta.json --overwrite
Make sure that your API key is properly set in caption.py.
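How the key is provided depends on the script; one common pattern (an assumption here, verify against caption.py) is to read it from an environment variable rather than hard-coding it:
# Assumption: caption.py (or your own wrapper) reads the key from the environment.
# The variable name OPENAI_API_KEY is illustrative; use whatever caption.py expects.
import os

api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("Set the captioning API key before running caption.py")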
Finally, follow the instructions from here to obtain depth maps, camera intrinsics, and ego camera extrinsics for each video.
The final structure should look like this:
your_dataset/                 # Your custom dataset folder
├── meta.json                 # Meta information for each video
├── videos/                   # Videos directory
│   └── take_name/
│       ├── ego_Prior.mp4
│       ├── exo.mp4
│       └── ...
└── depth_maps/               # Depth maps directory
    └── take_name/
        ├── frame_000.npy
        └── ...
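Depth maps are stored as one .npy file per frame. A quick way to inspect them (a suggestion, not part of the repository; the exact shape and units depend on the preprocessing pipeline) is:
# Inspect a per-frame depth map; shape/units depend on your preprocessing pipeline.
import numpy as np

depth = np.load("./your_dataset/depth_maps/take_name/frame_000.npy")
print("shape:", depth.shape, "dtype:", depth.dtype)
print("min/max depth:", float(depth.min()), float(depth.max()))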
Then, modify scripts/infer_itw.sh (or create a new script) to point to your data paths:
python3 infer.py \
--meta_data_file ./example/your_dataset/meta.json \
--model_path ./checkpoints/pretrained_model/Wan2.1-I2V-14B-480P-Diffusers \
--lora_path ./checkpoints/EgoX/pytorch_lora_weights.safetensors \
--lora_rank 256 \
--out ./results \
--seed 42 \
--use_GGA \
--cos_sim_scaling_factor 3.0 \
--in_the_wild
This project is built upon the following works:
If you use this dataset or code in your research, please cite our paper:
@misc{kang2025egoxegocentricvideogeneration,
title={EgoX: Egocentric Video Generation from a Single Exocentric Video},
author={Taewoong Kang and Kinam Kim and Dohyeon Kim and Minho Park and Junha Hyung and Jaegul Choo},
year={2025},
eprint={2512.08269},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.08269},
}