VGGT-Edit is a feed-forward framework for instruction-driven native 3D scene editing. Built on VGGT-like 3D reconstruction models, it enables stable, view-consistent, real-time 3D editing directly in geometry space.
Existing 3D editing pipelines often rely on multi-view 2D editing and re-projection, which leads to geometric inconsistency, drifting backgrounds, and unstable results. VGGT-Edit instead performs editing directly in native 3D space.
Avoids 2D lifting pipelines and performs scene editing directly in geometry space.
Maintains low-latency editing with stable performance across different scenes.
Produces coherent multi-view editing results while preserving structural integrity.
Enables practical real-time 3D scene manipulation for spatial intelligence systems.
VGGT-Edit introduces a unified native 3D editing framework with residual field learning, synchronized semantic injection, and view-aware geometric fusion.
Instead of reconstructing the entire scene, VGGT-Edit predicts localized 3D residual transformations. Static regions remain stable while editable regions are selectively modified.
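The residual idea can be illustrated with a minimal sketch. It assumes the scene is a list of 3D points, the residual is a per-point offset, and a soft mask in [0, 1] marks editable regions. The function name `apply_residual_field` and the mask-gated formulation are illustrative assumptions, not the framework's actual interface.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def apply_residual_field(
    points: Sequence[Point],
    residuals: Sequence[Point],
    edit_mask: Sequence[float],
) -> List[Point]:
    """Return points + mask * residual, computed per point.

    A mask value of 0 leaves a point exactly where it was (static region);
    a mask value of 1 applies the full predicted offset (edited region).
    """
    if not (len(points) == len(residuals) == len(edit_mask)):
        raise ValueError("points, residuals and edit_mask must have equal length")

    edited: List[Point] = []
    for p, d, m in zip(points, residuals, edit_mask):
        # Gate the offset by the mask so the edit stays localized.
        edited.append((p[0] + m * d[0], p[1] + m * d[1], p[2] + m * d[2]))
    return edited


# Toy usage: the first point is static, the second is shifted along x.
scene = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
offsets = [(0.5, 0.0, 0.0), (0.5, 0.0, 0.0)]
mask = [0.0, 1.0]
edited_scene = apply_residual_field(scene, offsets, mask)
```

Because the output is the original geometry plus a gated offset, any point with a zero mask is returned unchanged, which is one simple way to keep static regions stable.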
Text semantics are continuously injected across multiple feature depths, enabling precise alignment between language instructions and spatial regions.
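As a rough sketch of injecting text at several depths, the snippet below re-adds a text embedding to the features before every layer instead of only once at the input. The additive, gated form is an assumption chosen for brevity; the mechanism used by VGGT-Edit is not specified here (it could, for example, be attention-based), and `inject_text_at_every_depth`, `layers`, and `gates` are hypothetical names.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def inject_text_at_every_depth(
    features: Sequence[float],
    text_embedding: Sequence[float],
    layers: Sequence[Layer],
    gates: Sequence[float],
) -> Vector:
    """Run features through `layers`, adding gate * text_embedding before each one."""
    if len(layers) != len(gates):
        raise ValueError("need exactly one gate per layer")
    if len(features) != len(text_embedding):
        raise ValueError("features and text_embedding must share a dimension")

    x = list(features)
    for layer, gate in zip(layers, gates):
        # Re-inject the instruction at this depth so it is not washed out.
        x = [xi + gate * ti for xi, ti in zip(x, text_embedding)]
        x = layer(x)
    return x


# Toy usage with two stand-in layers that simply double their input.
def double(v: Vector) -> Vector:
    return [2.0 * a for a in v]


out = inject_text_at_every_depth([1.0, 0.0], [0.5, 1.0], [double, double], [1.0, 0.5])
```

Using a separate gate per depth lets each layer decide how strongly the instruction should influence its features.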
The framework dynamically estimates view reliability to reduce artifacts, improve boundary consistency, and maintain geometric coherence.
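One simple way to picture reliability-weighted fusion is a softmax-weighted average of per-view predictions for the same 3D point. This is a sketch under that assumption, not the framework's fusion module; `fuse_views` and the use of raw reliability scores as softmax logits are illustrative choices.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def fuse_views(
    per_view_points: Sequence[Point],
    reliability_scores: Sequence[float],
) -> Tuple[Point, List[float]]:
    """Fuse per-view estimates of one 3D point using softmax reliability weights."""
    if not per_view_points or len(per_view_points) != len(reliability_scores):
        raise ValueError("need one reliability score per view, and at least one view")

    # Numerically stable softmax: subtract the maximum score before exponentiating.
    top = max(reliability_scores)
    exps = [math.exp(s - top) for s in reliability_scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted average per coordinate axis.
    fused = tuple(
        sum(w * p[axis] for w, p in zip(weights, per_view_points))
        for axis in range(3)
    )
    return fused, weights


# Toy usage: the second view is scored as far less reliable, so it barely contributes.
point, weights = fuse_views([(0.0, 0.0, 0.0), (2.0, 2.0, 2.0)], [10.0, -10.0])
```

Down-weighting unreliable views in this way keeps a single bad estimate, for instance near an occlusion boundary, from pulling the fused geometry away from the consensus.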
A specialized editing branch directly predicts residual deformation fields, allowing VGGT-like models to evolve from reconstruction systems into editable 3D world models.
VGGT-Edit transforms instruction-driven scene editing into a unified feed-forward 3D residual prediction process.
Sparse multi-view images
Semantic editing guidance
Localized 3D deformation field
View-consistent edited scene
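The input and output contract listed above can be sketched as a single forward function. Everything named here (`EditRequest`, `EditResult`, `edit_scene`, and the two callables passed in) is hypothetical and stands in for the real reconstruction backbone and editing branch, which are not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class EditRequest:
    images: Sequence[object]  # sparse multi-view images, in any representation
    instruction: str          # semantic editing guidance


@dataclass
class EditResult:
    residual_field: List[Point]  # localized 3D deformation, one offset per point
    edited_points: List[Point]   # edited scene geometry


def edit_scene(
    request: EditRequest,
    reconstruct: Callable[[Sequence[object]], List[Point]],
    predict_residual: Callable[[List[Point], str], List[Point]],
) -> EditResult:
    """One feed-forward pass: reconstruct, predict residual, add it to the geometry."""
    base = reconstruct(request.images)
    residual = predict_residual(base, request.instruction)
    if len(residual) != len(base):
        raise ValueError("residual field must provide one offset per base point")

    edited = [
        (p[0] + d[0], p[1] + d[1], p[2] + d[2]) for p, d in zip(base, residual)
    ]
    return EditResult(residual_field=list(residual), edited_points=edited)


# Toy usage with stand-in callables in place of learned networks.
result = edit_scene(
    EditRequest(images=["view_a", "view_b"], instruction="raise the second point"),
    reconstruct=lambda imgs: [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    predict_residual=lambda pts, text: [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
)
```

No per-scene optimization loop appears anywhere in this flow, which is what makes the process feed-forward.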
To support native 3D editing research, we construct DeltaScene, a large-scale dataset with nearly 100K instruction-driven 3D editing pairs.
Editing Pairs
Per-Scene Editing
Acceleration over Optimization Methods
Feed-Forward 3D Editing
VGGT-Edit achieves stable multi-view editing results while maintaining fast feed-forward inference.
@article{vggtedit2026,
  title={VGGT-Edit: Feed-Forward Native 3D Scene Editing via Residual Field Prediction},
  author={Anonymous Authors},
  journal={arXiv preprint arXiv:2605.15186},
  year={2026}
}