Novel View Synthesis as Video Completion

Carnegie Mellon University

We repurpose pretrained video diffusion models for sparse-view novel view synthesis by treating unordered multi-view images as independent single-frame videos.


From image to video priors for sparse NVS. We repurpose pretrained video diffusion models for permutation-invariant novel view synthesis with only 1K training scenes and ~1% trainable parameters.

Method Overview


Each input image is encoded independently by a frozen video VAE as a single-frame “video”, ensuring permutation-invariant encoding. Camera poses are injected via Plücker ray maps concatenated channel-wise with image latents. The resulting view tokens are concatenated along the temporal dimension, and only the predicted query latent is used for decoding. We also remove the temporal component of Rotary Positional Embeddings (RoPE) so the transformer cannot rely on frame order, making the model invariant to input order. We fine-tune only the patch embedding layer and LoRA modules in the DiT backbone.
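The Plücker-ray conditioning described above can be sketched in a few lines. The following is a minimal NumPy illustration of building a 6-channel ray map (unit direction plus moment) from camera intrinsics and a camera-to-world pose; the function name and conventions are ours for illustration, not the released code:

```python
import numpy as np

def plucker_ray_map(K, c2w, H, W):
    """Build a 6-channel Plucker ray map (direction, moment) for an HxW image.

    K   : (3, 3) camera intrinsics
    c2w : (4, 4) camera-to-world extrinsics
    """
    # Pixel-center grid in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # (H, W, 3)

    # Unproject to camera space, then rotate into world space.
    dirs = (pix @ np.linalg.inv(K).T) @ c2w[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)        # unit directions

    # Plucker coordinates: moment m = o x d, with o the camera origin.
    origin = np.broadcast_to(c2w[:3, 3], dirs.shape)
    moment = np.cross(origin, dirs)
    return np.concatenate([dirs, moment], axis=-1)              # (H, W, 6)
```

The resulting 6 channels are what gets concatenated channel-wise with the image latents before the patch embedding layer.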

Qualitative Comparisons

All outputs are center-cropped to 480×480 for fair comparison.

6 Input Views

Target View 1

Ground Truth
Ours
SEVA
LVSM

Target View 2

Ground Truth
Ours
SEVA
LVSM

Continuous Trajectory Rendering

Although each target view is generated independently as a separate forward pass, densely sampling cameras along a smooth trajectory and concatenating the results produces coherent, temporally consistent videos. This demonstrates that the strong 3D priors from video diffusion ensure cross-view consistency even without explicit temporal modeling.
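As a sketch of this rendering procedure (illustrative pseudocode with NumPy, not the authors' implementation), one can densely sample camera-to-world poses along a smooth orbit and render each pose with an independent forward pass, then stack the frames into a video:

```python
import numpy as np

def orbit_cameras(n_frames, radius=2.0, height=0.3):
    """Camera-to-world poses on a circular orbit, all looking at the origin."""
    poses = []
    for t in np.linspace(0.0, 2 * np.pi, n_frames, endpoint=False):
        eye = np.array([radius * np.cos(t), height, radius * np.sin(t)])
        fwd = -eye / np.linalg.norm(eye)                  # look at the origin
        right = np.cross(np.array([0.0, 1.0, 0.0]), fwd)
        right /= np.linalg.norm(right)
        up = np.cross(fwd, right)
        c2w = np.eye(4)
        c2w[:3, :3] = np.stack([right, up, fwd], axis=1)  # columns: right/up/fwd
        c2w[:3, 3] = eye
        poses.append(c2w)
    return np.stack(poses)

def render_trajectory(model, inputs, poses):
    """Each target view is a separate forward pass; frames are concatenated."""
    return np.stack([model(inputs, pose) for pose in poses])
```

Here `model` stands in for one conditional generation pass; cross-view consistency comes from the video prior, not from any coupling between the per-frame calls.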

Blue markers indicate the input views.

NVS with Varying Number of Input Views

Although trained with a fixed number of input views, our model generalizes to different numbers of inputs at test time. Below we show the same target view generated from K = 3, 4, 5, and 6 input views.

K = 3

K = 4

K = 5

K = 6

Visual Ablations

We visualize the effect of key design choices on the same target view. Each example shows the 6 input views and the ground-truth target, followed by ablation variants: (a) w/o per-view encoding, (b) w/o pixel unshuffle, and (c) our full model.

Without per-view independent encoding, the causal VAE entangles information across frames, producing severely corrupted outputs that bear little resemblance to the target view. Without pixel unshuffle, the ray map geometry is lost during downsampling, leading to plausible but inaccurately posed generations that fail to align with the target camera.
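The pixel-unshuffle operation referenced in ablation (b) is a lossless space-to-depth rearrangement: spatial detail moves into channels instead of being averaged away, so per-pixel ray geometry survives patchification. A minimal NumPy version (our sketch, equivalent in spirit to `torch.nn.PixelUnshuffle`):

```python
import numpy as np

def pixel_unshuffle(x, r):
    """Space-to-depth: (C, H*r, W*r) -> (C*r*r, H, W).

    Each r x r spatial block becomes r*r channels, so no per-pixel
    information (e.g. ray-map geometry) is discarded by downsampling.
    """
    C, Hr, Wr = x.shape
    H, W = Hr // r, Wr // r
    x = x.reshape(C, H, r, W, r)
    return x.transpose(0, 2, 4, 1, 3).reshape(C * r * r, H, W)
```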

Ablation comparison (same 6 input views and ground-truth target for all variants): (a) w/o per-view encoding, (b) w/o pixel unshuffle, (c) ours (full model).

Permutation Invariance

A key design goal is invariance to the ordering of input views. Below we show outputs from 10 random permutations of the same input set. With our permutation-invariant design, the generated target views are virtually identical regardless of input order, whereas a model with temporal positional encoding produces noticeably different outputs.

6 input views and the ground-truth target.

With Temporal RoPE (not permutation-invariant)

Ours (permutation-invariant)
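The mechanism behind this invariance can be illustrated with a toy aggregator: once temporal position codes are removed, attention over view tokens depends only on the unordered set of views. The stand-in below (our illustration, not the paper's transformer) is exactly invariant to reordering its inputs:

```python
import numpy as np

def aggregate_views(views, query_ray_map):
    """Toy order-free aggregation over a set of view features.

    With no positional encoding tied to frame index, the softmax-weighted
    sum below is a function of the *set* of views, not their order.
    """
    feats = views.reshape(len(views), -1)         # (K, D) flattened views
    scores = feats @ query_ray_map.reshape(-1)    # query-view affinities
    weights = np.exp(scores - scores.max())       # softmax over views
    weights /= weights.sum()
    return (weights[:, None] * feats).sum(axis=0)
```

Permuting `views` leaves the output unchanged up to floating-point summation order, mirroring the behavior shown above for the full model.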

Data Efficiency

Our model demonstrates remarkable data efficiency. Even when trained on only 20 scenes, it surpasses EscherNet trained on 10K scenes. Performance improves consistently as training data increases, demonstrating that video diffusion priors provide a strong geometric foundation that requires minimal multi-view supervision to unlock.

Data scaling curve

Performance vs. number of training scenes.