InfCam: Infinite-Homography as Robust Conditioning for Camera-Controlled
Video Generation

Min-Jung Kim*, Jeongho Kim*, Hoiyeong Jin, Junha Hyung, Jaegul Choo
*Indicates Equal Contribution
GSAI

Abstract

Recent progress in video diffusion models has spurred growing interest in camera-controlled novel-view video generation for dynamic scenes, aiming to provide creators with cinematic camera control in post-production. A key challenge in camera-controlled video generation is ensuring fidelity to the specified camera pose while maintaining view consistency and reasoning about occluded geometry from limited observations. To address this, existing methods either train a trajectory-conditioned video generation model on trajectory–video paired datasets, or estimate depth from the input video, reproject it along a target trajectory, and generate the unprojected regions. Nevertheless, existing methods struggle to generate camera-pose-faithful, high-quality videos for two main reasons: (1) reprojection-based approaches are highly susceptible to errors caused by inaccurate depth estimation; and (2) the limited diversity of camera trajectories in existing datasets restricts the range of camera motions that learned models can faithfully reproduce.
To address these limitations, we present InfCam, a depth-free, camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components: (1) infinite homography warping, which encodes 3D camera rotations directly within the 2D latent space of a video diffusion model; conditioned on this noise-free rotational information, the model predicts the residual parallax term through end-to-end training, achieving high camera-pose fidelity; and (2) a data augmentation pipeline that transforms existing synthetic multiview datasets into sequences with diverse trajectories and focal lengths. Experimental results demonstrate that InfCam outperforms baseline methods in camera-pose accuracy and visual fidelity, generalizing well from synthetic to real-world data.

Motivation


In the reprojection-based approach, inaccuracies in depth estimation lead to unreliable conditioning and consequently introduce artifacts in the generated frames. In contrast, we exploit the fact that reprojection can be decomposed as

x' ≃ K'RK^{-1} x + (1/Z) K't = H_∞ x + (1/Z) e',

where H_∞ = K'RK^{-1} is the infinite homography, e' = K't is the epipole in the target view, and Z is the depth of the source pixel x.
Our infinite-homography-based approach conditions on the noise-free frame warped by H_∞, which forces the model to concentrate on learning the parallax relative to the plane at infinity. This parallax is spatially constrained to the segment between the epipole e' and the warped point x_∞ = H_∞ x, visualized as the yellow segment on the epipolar line l'. This constraint reduces the search space, enabling the model to achieve higher camera-pose fidelity.
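To make the decomposition concrete, the following minimal NumPy sketch (our own illustration, not the paper's code; all names and values are hypothetical) checks numerically that full depth-based reprojection and the infinite-homography-plus-parallax form yield the same pixel, with the parallax term shrinking as the depth Z grows.

    import numpy as np

    def reproject(x, Z, K_src, K_tgt, R, t):
        # Full reprojection: back-project pixel x (homogeneous) at depth Z, transform, project.
        X_src = Z * np.linalg.inv(K_src) @ x
        X_tgt = R @ X_src + t
        x_tgt = K_tgt @ X_tgt
        return x_tgt / x_tgt[2]

    def decompose(x, Z, K_src, K_tgt, R, t):
        # Same point via infinite homography plus parallax: x' ~ H_inf x + (1/Z) e'.
        H_inf = K_tgt @ R @ np.linalg.inv(K_src)   # depth-independent rotational part
        e_tgt = K_tgt @ t                          # epipole in the target view
        x_tgt = H_inf @ x + e_tgt / Z              # parallax vanishes as Z -> infinity
        return x_tgt / x_tgt[2]

    K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
    R, _ = np.linalg.qr(np.random.randn(3, 3))     # random orthonormal matrix as a stand-in rotation
    if np.linalg.det(R) < 0:                       # ensure a proper rotation
        R[:, 0] *= -1
    t = np.array([0.1, 0.0, 0.05])
    x = np.array([400.0, 200.0, 1.0])
    assert np.allclose(reproject(x, 5.0, K, K, R, t), decompose(x, 5.0, K, K, R, t))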

Approach

Model Architecture

Figure: Infinite homography vs. reprojection-based conditioning

(a) DiT block with homography-guided self-attention layer. The homography-guided self-attention layer takes the source, target, and warped latents, combined with camera embeddings, as input and performs per-frame attention to ensure temporal alignment. Conditioning on the warped latents enables rotation-aware reasoning and constrains parallax estimation. Only the source and target latents proceed to the subsequent Wan2.1 layers.
(b) Warping module. This module warps the input latent with the infinite homography to handle rotation, then adds camera embeddings to account for translation (a minimal sketch of this step is given below). This decomposition reduces reprojection to parallax estimation relative to the plane at infinity, enabling higher camera-trajectory fidelity.
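As a rough illustration of the warping step in (b), the sketch below (our assumption about shapes and naming, not the released implementation) warps a source latent with the infinite homography H_∞ = K'RK^{-1} via grid sampling and then adds a camera embedding that carries the translation information.

    import torch
    import torch.nn.functional as F

    def warp_with_infinite_homography(latent, K_src, K_tgt, R, cam_embed):
        # latent: (B, C, H, W); K_*: (B, 3, 3) intrinsics at latent resolution;
        # R: (B, 3, 3) relative rotation; cam_embed: (B, C, H, W) translation embedding.
        B, C, H, W = latent.shape
        H_inf = K_tgt @ R @ torch.linalg.inv(K_src)    # target <- source, rotation only
        H_inv = torch.linalg.inv(H_inf)                # sample source pixels per target pixel

        ys, xs = torch.meshgrid(
            torch.arange(H, dtype=latent.dtype, device=latent.device),
            torch.arange(W, dtype=latent.dtype, device=latent.device),
            indexing="ij",
        )
        ones = torch.ones_like(xs)
        tgt_pix = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3)   # (H*W, 3)
        src_pix = (H_inv @ tgt_pix.T.unsqueeze(0)).transpose(1, 2)     # (B, H*W, 3)
        src_pix = src_pix[..., :2] / src_pix[..., 2:3].clamp(min=1e-6)

        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        grid = torch.stack(
            [2.0 * src_pix[..., 0] / (W - 1) - 1.0,
             2.0 * src_pix[..., 1] / (H - 1) - 1.0], dim=-1
        ).reshape(B, H, W, 2)

        warped = F.grid_sample(latent, grid, mode="bilinear",
                               padding_mode="zeros", align_corners=True)
        # Rotation is handled by the warp; translation is injected via the camera embedding.
        return warped + cam_embed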

Data Augmentation


To remove bias from existing datasets, we augment the MultiCamVideo dataset into a new dataset, AugMCV. Unlike the existing SynCamVideo and MultiCamVideo datasets, AugMCV includes camera trajectories with varying initial camera poses and diverse focal lengths.
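The exact augmentation pipeline is not spelled out here; the hedged sketch below merely illustrates one way such diversity could be introduced: compose each source trajectory with a random rigid transform (changing its initial pose relative to the scene) and rescale its focal length. All names and ranges are illustrative assumptions, and in the actual dataset the frames would need to be re-rendered with the new cameras.

    import numpy as np

    def random_rotation(max_deg=30.0):
        # Small random rotation about a random axis (Rodrigues' formula).
        axis = np.random.randn(3); axis /= np.linalg.norm(axis)
        angle = np.deg2rad(np.random.uniform(-max_deg, max_deg))
        K = np.array([[0, -axis[2], axis[1]],
                      [axis[2], 0, -axis[0]],
                      [-axis[1], axis[0], 0]])
        return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * K @ K

    def augment_trajectory(extrinsics, intrinsics, focal_scale_range=(0.7, 1.4)):
        # extrinsics: (T, 4, 4) world-to-camera matrices; intrinsics: (3, 3).
        # Compose a random rigid transform on the world side, shifting the whole
        # trajectory's starting pose relative to the scene.
        T0 = np.eye(4)
        T0[:3, :3] = random_rotation()
        T0[:3, 3] = np.random.uniform(-0.5, 0.5, size=3)
        new_extrinsics = extrinsics @ np.linalg.inv(T0)

        # Randomize the focal length (zoom), keeping the principal point fixed.
        s = np.random.uniform(*focal_scale_range)
        new_K = intrinsics.copy()
        new_K[0, 0] *= s
        new_K[1, 1] *= s
        return new_extrinsics, new_K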

Experimental Results

Qualitative Comparison (AugMCV Test Set)

Qualitative Comparison (In-the-Wild Dataset)

Quantitative Comparison

Table: Quantitative results on the AugMCV dataset
Table: Quantitative results on the WebVid dataset

AugMCV dataset. We evaluate our method under two scenarios: (1) source and target videos with identical camera intrinsics, and (2) source and target videos with different camera intrinsics. Across both settings and all metrics, our approach consistently outperforms the baselines, producing videos that are clearly closer to the ground truth.
WebVid dataset. We further validate our method on the WebVid dataset, where it again consistently outperforms baseline approaches in terms of both camera pose accuracy and visual fidelity, with particularly pronounced gains in camera pose accuracy.
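For reference, camera-pose accuracy in this setting is typically measured by comparing the trajectory estimated from the generated video against the target trajectory; the sketch below shows one common formulation (geodesic rotation error and translation error), which may differ from the exact metrics used in the paper.

    import numpy as np

    def rotation_error_deg(R_est, R_gt):
        # Geodesic distance between two rotation matrices, in degrees.
        cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
        return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

    def translation_error(t_est, t_gt):
        # Euclidean distance between (scale-aligned) camera positions.
        return np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt))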