Native 3D Editing · Feed-Forward Framework

VGGT-Edit
From 3D Reconstruction
to Native 3D Editing

VGGT-Edit is a feed-forward framework for instruction-driven native 3D scene editing. Built on VGGT-like 3D reconstruction models, it enables stable, view-consistent, real-time 3D editing directly in geometry space.

Overview

Existing 3D editing pipelines often rely on multi-view 2D editing and re-projection, which leads to geometric inconsistency, drifting backgrounds, and unstable results. VGGT-Edit instead performs editing directly in native 3D space.

Figure 1. Comparison between existing 3D editing pipelines and VGGT-Edit. Existing methods rely on multi-view 2D editing, while VGGT-Edit performs editing directly in native 3D space.

Native 3D Editing

Avoids 2D lifting pipelines and performs scene editing directly in geometry space.

Feed-Forward Inference

Maintains low-latency editing (about 5 s per scene) with stable performance across different scenes.

View Consistency

Produces coherent multi-view editing results while preserving structural integrity.

Interactive Editing

Enables practical real-time 3D scene manipulation for spatial intelligence systems.

Core Contributions

VGGT-Edit introduces a unified native 3D editing framework with residual field learning, synchronized semantic injection, and view-aware geometric fusion.

Residual Field Prediction

Editing as 3D Residual Learning

Instead of reconstructing the entire scene, VGGT-Edit predicts localized 3D residual transformations. Static regions remain stable while editable regions are selectively modified.
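The residual formulation can be illustrated with a minimal sketch. It assumes the scene is a list of 3D points, the residual is a per-point offset, and a soft edit mask in [0, 1] marks the editable region; the names apply_residual, residuals, and edit_mask are hypothetical and are not taken from the VGGT-Edit code.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def apply_residual(
    points: List[Point],
    residuals: List[Point],
    edit_mask: List[float],
) -> List[Point]:
    """Return points + mask * residual, computed per point.

    Assumption: the edit is a per-point offset gated by a soft mask.
    A mask value of 0 leaves the point untouched (static region);
    a value of 1 applies the full predicted offset (edited region).
    """
    if not (len(points) == len(residuals) == len(edit_mask)):
        raise ValueError("points, residuals and edit_mask must have equal length")
    edited = []
    for (x, y, z), (dx, dy, dz), m in zip(points, residuals, edit_mask):
        edited.append((x + m * dx, y + m * dy, z + m * dz))
    return edited


# Toy scene: only the second point lies inside the editable region.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
residuals = [(0.5, 0.5, 0.5), (0.0, 0.0, 0.25), (0.5, 0.5, 0.5)]
edit_mask = [0.0, 1.0, 0.0]

edited = apply_residual(points, residuals, edit_mask)
```

In this toy case the masked-out points keep their original coordinates even though the residual field is non-zero there, which is the behavior the static-region claim describes.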

Semantic Alignment

Depth-Synchronized Text Injection

Text semantics are injected at multiple feature depths of the network, enabling precise alignment between language instructions and spatial regions.
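One way to picture injection at several depths is the sketch below. The page does not specify the injection mechanism, so the gated additive form used here is an assumption (cross-attention or feature modulation would be equally plausible), and inject_text, forward_with_injection, and the toy layers are hypothetical names.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def inject_text(features: Vector, text_embedding: Vector, gate: float) -> Vector:
    """Add the gated text embedding to the features (assumed additive form)."""
    return [f + gate * t for f, t in zip(features, text_embedding)]


def forward_with_injection(
    features: Vector,
    text_embedding: Vector,
    layers: List[Layer],
    gates: List[float],
) -> Vector:
    """Run the layers in order, re-injecting the same text embedding before each one."""
    if len(layers) != len(gates):
        raise ValueError("one gate is required per layer")
    for layer, gate in zip(layers, gates):
        features = inject_text(features, text_embedding, gate)
        features = layer(features)
    return features


def double(v: Vector) -> Vector:
    """Stand-in for a network block: scales every feature by 2."""
    return [2.0 * x for x in v]


def identity(v: Vector) -> Vector:
    """Stand-in for a network block: returns the features unchanged."""
    return list(v)


output = forward_with_injection(
    features=[1.0, 2.0],
    text_embedding=[0.5, -0.5],
    layers=[double, identity],
    gates=[1.0, 0.5],
)
```

The point of the sketch is only that the instruction reaches every depth, instead of being consumed once at the input and diluted by later layers.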

Geometry Fusion

View-Aware Weighting

The framework dynamically estimates view reliability to reduce artifacts, improve boundary consistency, and maintain geometric coherence.
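As a rough illustration of reliability-weighted fusion, the sketch below assumes each view proposes a 3D position for the same point and carries a scalar reliability score; the scores are turned into weights with a softmax and the proposals are averaged. The softmax choice and the names softmax, fuse_views, and reliability_scores are assumptions for illustration, not the method described in the paper.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(per_view_points: List[Point], reliability_scores: List[float]) -> Point:
    """Fuse per-view proposals for one point using softmax reliability weights."""
    if len(per_view_points) != len(reliability_scores):
        raise ValueError("one reliability score is required per view")
    weights = softmax(reliability_scores)
    x = sum(w * p[0] for w, p in zip(weights, per_view_points))
    y = sum(w * p[1] for w, p in zip(weights, per_view_points))
    z = sum(w * p[2] for w, p in zip(weights, per_view_points))
    return (x, y, z)


proposals = [(0.0, 0.0, 0.0), (2.0, 2.0, 2.0)]

# Equal reliability: the fused point is the plain average of the proposals.
fused_equal = fuse_views(proposals, [0.0, 0.0])

# One view far more reliable: the fused point stays close to that view.
fused_skewed = fuse_views(proposals, [10.0, -10.0])
```

With strongly skewed scores the unreliable view contributes almost nothing, which is how such a weighting can suppress artifacts from occluded or oblique views.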

Editing Head

Dedicated Native 3D Editing Head

A specialized editing branch directly predicts residual deformation fields, allowing VGGT-like models to evolve from reconstruction systems into editable 3D world models.

Pipeline

VGGT-Edit transforms instruction-driven scene editing into a unified feed-forward 3D residual prediction process.

VGGT-Edit Pipeline
Overall architecture of VGGT-Edit. The framework predicts localized residual deformation fields directly in geometry space.

1. Input Views: sparse multi-view images
2. Text Instruction: semantic editing guidance
3. Residual Prediction: localized 3D deformation field
4. Native Editing: view-consistent edited scene

DeltaScene Dataset

To support native 3D editing research, we construct DeltaScene, a large-scale dataset with nearly 100K instruction-driven 3D editing pairs.

Figure 2. Overview of the DeltaScene dataset.
Figure 3. Automated construction pipeline for DeltaScene using LLM and VLM consensus filtering.

100K · Editing Pairs
5s · Per-Scene Editing
120× · Acceleration over Optimization Methods
Native · Feed-Forward 3D Editing

Results

VGGT-Edit achieves stable multi-view editing results while maintaining fast feed-forward inference.

Figure 4. Qualitative comparisons across different 3D editing tasks.
Figure 5. Generalization to unseen editing instructions.

Citation

@article{vggtedit2026,
  title={VGGT-Edit: Feed-Forward Native 3D Scene Editing via Residual Field Prediction},
  author={Anonymous Authors},
  journal={arXiv preprint arXiv:2605.15186},
  year={2026}
}