VGGT-Edit

Overview

Existing 3D editing pipelines often rely on multi-view 2D editing and re-projection, which leads to geometric inconsistency, drifting backgrounds, and unstable results. VGGT-Edit instead performs editing directly in native 3D space.

Figure 1. Comparison between existing 3D editing pipelines and VGGT-Edit. Existing methods rely on multi-view 2D editing, while VGGT-Edit performs editing directly in native 3D space.

Native 3D Editing

Avoids 2D lifting pipelines and performs scene editing directly in geometry space.

Feed-Forward Inference

Maintains low-latency editing with stable performance across different scenes.

View Consistency

Produces coherent multi-view editing results while preserving structural integrity.

Interactive Editing

Enables practical, near-real-time 3D scene manipulation (about 5 s per scene) for spatial intelligence systems.

Core Contributions

VGGT-Edit introduces a unified native 3D editing framework with residual field learning, synchronized semantic injection, and view-aware geometric fusion.

Residual Field Prediction

Editing as 3D Residual Learning

Instead of reconstructing the entire scene, VGGT-Edit predicts localized 3D residual transformations. Static regions remain stable while editable regions are selectively modified.
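The residual formulation can be illustrated with a minimal sketch. It assumes the scene is a list of 3D points, that the network outputs one offset per point, and that a per-point edit mask in [0, 1] localizes the change. The mask, the function name `apply_residual_edit`, and the toy values are illustrative assumptions, not the released implementation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def apply_residual_edit(
    points: Sequence[Point],
    residuals: Sequence[Point],
    edit_mask: Sequence[float],
) -> List[Point]:
    """Return points + mask * residual, computed per point.

    A mask of 0 keeps a point exactly where it was (static region);
    a mask of 1 applies the full predicted offset (edited region).
    """
    if not (len(points) == len(residuals) == len(edit_mask)):
        raise ValueError("points, residuals and edit_mask must have equal length")
    edited = []
    for point, delta, mask in zip(points, residuals, edit_mask):
        edited.append(tuple(p + mask * d for p, d in zip(point, delta)))
    return edited


# Toy scene: only the second point lies inside the edited region.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
residuals = [(0.5, 0.5, 0.5), (0.0, 0.0, 0.2), (0.1, 0.1, 0.1)]
edit_mask = [0.0, 1.0, 0.0]

edited = apply_residual_edit(points, residuals, edit_mask)
```

Because the output is the input plus a masked offset, regions outside the mask are reproduced exactly rather than re-synthesized, which is one way to obtain the stability of static regions described above.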

Semantic Alignment

Depth-Synchronized Text Injection

Text semantics are continuously injected across multiple feature depths, enabling precise alignment between language instructions and spatial regions.
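One plausible reading of depth-synchronized injection is that the same text embedding is fused into the visual tokens before every backbone layer, rather than once at the input. The sketch below assumes a gated cross-attention update and uses stand-in identity layers. The function names, the gating scheme, and the toy dimensions are all hypothetical and are not taken from the paper.

```python
import math
from typing import Callable, List, Sequence

Tokens = List[List[float]]


def _softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inject_text(tokens: Tokens, text_tokens: Tokens, gate: float) -> Tokens:
    """Add gated cross-attention over text tokens to every visual token."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for tok in tokens:
        # Scaled dot-product similarity between this visual token and each text token.
        scores = [sum(a * b for a, b in zip(tok, txt)) * scale for txt in text_tokens]
        weights = _softmax(scores)
        attended = [
            sum(w * txt[i] for w, txt in zip(weights, text_tokens))
            for i in range(dim)
        ]
        out.append([x + gate * a for x, a in zip(tok, attended)])
    return out


def run_backbone(
    tokens: Tokens,
    text_tokens: Tokens,
    layers: Sequence[Callable[[Tokens], Tokens]],
    gates: Sequence[float],
) -> Tokens:
    """Inject the text condition before every layer, not only at the input."""
    if len(layers) != len(gates):
        raise ValueError("one gate per layer is required")
    for layer, gate in zip(layers, gates):
        tokens = inject_text(tokens, text_tokens, gate)
        tokens = layer(tokens)
    return tokens


# Toy example: two visual tokens, one text token, two identity "layers".
tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 1.0]]
layers = [lambda t: t, lambda t: t]

out = run_backbone(tokens, text_tokens, layers, gates=[0.5, 0.5])
unchanged = run_backbone(tokens, text_tokens, layers, gates=[0.0, 0.0])
```

Setting every gate to zero recovers the unconditioned backbone, so in this sketch the text pathway acts as an additive refinement at each depth.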

Geometry Fusion

View-Aware Weighting

The framework dynamically estimates view reliability to reduce artifacts, improve boundary consistency, and maintain geometric coherence.
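As a rough illustration of view-aware weighting, the sketch below fuses per-view residual predictions for each 3D point with a softmax over per-view reliability scores. It assumes those scores are already available as logits. How VGGT-Edit actually estimates reliability is not specified here, so `fuse_views` and its inputs are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def fuse_views(
    per_view_residuals: Sequence[Sequence[Vec3]],
    reliability_logits: Sequence[Sequence[float]],
) -> List[Vec3]:
    """Fuse residuals indexed [view][point] into one residual per point.

    Each point's weights are a softmax over views, so an unreliable view
    (low logit) contributes little to the fused residual for that point.
    """
    num_views = len(per_view_residuals)
    num_points = len(per_view_residuals[0])
    fused = []
    for p in range(num_points):
        logits = [reliability_logits[v][p] for v in range(num_views)]
        peak = max(logits)
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        fused.append(
            tuple(
                sum(weights[v] * per_view_residuals[v][p][axis] for v in range(num_views))
                for axis in range(3)
            )
        )
    return fused


# Two views disagree about the offset of a single point.
residuals = [[(1.0, 0.0, 0.0)], [(0.0, 1.0, 0.0)]]

equal = fuse_views(residuals, [[0.0], [0.0]])      # both views trusted equally
skewed = fuse_views(residuals, [[10.0], [-10.0]])  # first view strongly trusted
```

With equal logits the fused residual is the plain average of the two views; with a large logit gap it follows the more reliable view almost entirely.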

Editing Head

Dedicated Native 3D Editing Head

A specialized editing branch directly predicts residual deformation fields, allowing VGGT-like models to evolve from reconstruction systems into editable 3D world models.

Pipeline

VGGT-Edit casts instruction-driven scene editing as a unified, feed-forward 3D residual prediction process.

Overall architecture of VGGT-Edit. The framework predicts localized residual deformation fields directly in geometry space.

Input Views

Sparse multi-view images

Text Instruction

Semantic editing guidance

Residual Prediction

Localized 3D deformation field

Native Editing

View-consistent edited scene

DeltaScene Dataset

To support native 3D editing research, we construct DeltaScene, a large-scale dataset with nearly 100K instruction-driven 3D editing pairs.

Figure 2. Overview of the DeltaScene dataset.

Figure 3. Automated construction pipeline for DeltaScene using LLM and VLM consensus filtering.
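Consensus filtering can be sketched as keeping a candidate editing pair only when every judge accepts it. The two judges below are trivial stand-ins for an LLM check and a VLM check. The sample fields, the judge logic, and the function names are hypothetical and do not describe the actual DeltaScene pipeline.

```python
from typing import Callable, Dict, List, Sequence

Sample = Dict[str, object]
Judge = Callable[[Sample], bool]


def consensus_filter(samples: Sequence[Sample], judges: Sequence[Judge]) -> List[Sample]:
    """Keep only the samples that every judge accepts."""
    return [s for s in samples if all(judge(s) for judge in judges)]


def llm_judge(sample: Sample) -> bool:
    # Stand-in for an LLM check that the instruction is well-formed.
    return bool(str(sample["instruction"]).strip())


def vlm_judge(sample: Sample) -> bool:
    # Stand-in for a VLM check that the edited render matches the instruction.
    return bool(sample["render_matches"])


candidates = [
    {"id": "a", "instruction": "move the chair left", "render_matches": True},
    {"id": "b", "instruction": "", "render_matches": True},
    {"id": "c", "instruction": "remove the lamp", "render_matches": False},
]

kept = consensus_filter(candidates, [llm_judge, vlm_judge])
kept_ids = [s["id"] for s in kept]
```

Requiring agreement from both judges trades dataset size for precision: a pair rejected by either check is dropped.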

100K

Editing Pairs

5 s

Per-Scene Editing

120×

Acceleration over Optimization Methods

Native

Feed-Forward 3D Editing

Results

VGGT-Edit achieves stable multi-view editing results while maintaining fast feed-forward inference.

Figure 4. Qualitative comparisons across different 3D editing tasks.

Figure 5. Generalization to unseen editing instructions.

Citation

@article{vggtedit2026,
  title={VGGT-Edit: Feed-Forward Native 3D Scene Editing via Residual Field Prediction},
  author={Anonymous Authors},
  journal={arXiv preprint arXiv:2605.15186},
  year={2026}
}