A hybrid world model that combines explicit 3D memory with 2D video generation, built to preserve geometry, sustain long-term coherence, and unlock controllable scene evolution.

We present 4World, a hybrid 4D world model that combines the explicit memory capacity of 3D representations with the generative power of 2D video generation. The result is a system that preserves spatial structure while still being able to synthesize, extend, and evolve scenes over time, with a long-term goal of rethinking parts of the conventional graphics engine pipeline for interactive worlds.
Existing paradigms often force a compromise. Pure 2D video generation can look compelling but tends to lose geometric consistency over long horizons, while explicit 3D reconstruction preserves structure but lacks strong generative capability for unseen motion, dynamic edits, and future evolution. 4World bridges that gap by pairing an explicit 3D memory with a 2D video generation model.
A strong world model should preserve geometric consistency as viewpoints change, while also generating plausible dynamics as the scene evolves over time. In practice, most existing approaches tend to be better at one of these goals than the other.
In 4World, an explicit 3D representation serves as persistent scene memory, while a 2D video generation model provides temporal refinement, restoration, and completion. This hybrid design allows 4World to maintain long-term spatial structure while still synthesizing coherent unobserved content and future scene evolution.
Persistent explicit 3D memory helps stabilize structure across long sequences and viewpoint changes, while also making scene editing easier and creating a more practical bridge to existing 3D assets and production workflows.
Video priors restore imperfect projections, fill disocclusions, and extend scenes beyond observed content.
Explicit 3D modeling enables more precise camera control, while text guidance and dynamic masks provide additional user-facing action signals.
Each generation step extends the persistent world state, making the environment more explorable, more editable, and able to grow beyond its initial observations.
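One reason explicit 3D memory enables precise camera control is that cached geometry can be projected deterministically under any user-specified pose, rather than hoping a generative model infers the viewpoint. The sketch below illustrates this with a standard pinhole projection; the intrinsics, pose, and point values are illustrative assumptions, not values from 4World itself.

```python
import numpy as np

def project(points_w, R, t, K):
    """Project world-space points (N, 3) into pixel coordinates (N, 2)
    using extrinsics (R, t) and intrinsics K -- the deterministic
    world-to-image mapping that explicit 3D memory makes available."""
    cam = points_w @ R.T + t          # world -> camera coordinates
    pix = cam @ K.T                   # camera -> image plane
    return pix[:, :2] / pix[:, 2:3]   # perspective divide

# Illustrative camera: focal length 500, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
R = np.eye(3)                         # camera looks down +z
t = np.array([0.0, 0.0, 2.0])         # scene point 2 units in front
pts = np.array([[0.0, 0.0, 0.0]])     # a cached point at the origin
uv = project(pts, R, t, K)
print(uv)                             # lands at the principal point (320, 240)
```

Because the same cached points can be re-projected under any trajectory the user supplies, camera control reduces to choosing (R, t) per frame rather than prompting for it.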
The workflow is organized as an iterative loop: initialize memory, condition on controls, render a reference, generate new content, and update the 3D cache. That loop lets 4World keep a stable scene identity while continuously synthesizing new observations.
Input video is reconstructed into a 3D Gaussian representation that serves as the scene memory for later rendering and editing.
Camera trajectories, text prompts, and optional dynamic masks define how the world should evolve.
The cached world is projected into a reference frame and boundary mask that guide completion and inpainting.
A video model repairs rendering artifacts, fills missing content, and produces a coherent new sequence from the reference render.
The generated result is aligned back into the 3D cache so future generations inherit the new world state.
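The five steps above can be sketched as one loop body. The code below is a minimal, runnable illustration under stated assumptions: the dict-like `SceneMemory`, the placeholder `generate` function, and the masking rule all stand in for the real 3D Gaussian cache and learned video model, and are not the authors' actual API.

```python
import numpy as np

class SceneMemory:
    """Stand-in for the persistent 3D Gaussian cache."""
    def __init__(self, init_frames):
        self.content = list(init_frames)  # memory initialized from input video

    def render(self, camera_pose):
        # Project the cached world into a reference frame plus a boundary
        # mask marking regions the cache cannot explain (1 = needs generation).
        reference = self.content[-1]
        boundary_mask = (reference < 0).astype(np.uint8)  # placeholder rule
        return reference, boundary_mask

    def update(self, new_frames):
        # Align generated content back into memory so that future
        # generations inherit the extended world state.
        self.content.extend(new_frames)

def generate(reference, mask, prompt):
    # Placeholder for the video model: keeps the reference where it is
    # reliable and fills masked regions (here, trivially with zeros).
    out = reference.copy()
    out[mask == 1] = 0.0
    return [out]

def world_step(memory, camera_pose, prompt):
    """One iteration: render from memory, generate, update memory."""
    reference, mask = memory.render(camera_pose)
    frames = generate(reference, mask, prompt)
    memory.update(frames)
    return frames

memory = SceneMemory([np.random.randn(4, 4)])
frames = world_step(memory, camera_pose=None, prompt="extend the scene")
print(len(memory.content))  # prints 2: the world state grows each step
```

The essential design point is that `world_step` never discards the cache; generation only adds to or repairs what memory already holds, which is how scene identity stays stable across iterations.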
Raw 3D projections are rarely perfect. They can contain artifacts, missing regions, noise, and disocclusion gaps. Instead of treating these as fatal weaknesses, 4World turns them into a handoff point. The generative model acts as a restoration layer that refines projected views into coherent video frames while respecting scene structure.
This matters because it lets the system remain grounded in explicit geometry while still producing high-quality outputs. Rather than replacing the 3D world with pure image synthesis, 4World uses generation to complete and elevate what explicit memory makes possible.
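The handoff can be pictured as a masked composite: pixels the 3D render covers well are kept as-is, while low-coverage pixels (disocclusions, holes) are delegated to the generative model. The sketch below assumes a per-pixel coverage map and a 0.5 threshold, both illustrative choices rather than details from 4World.

```python
import numpy as np

def restoration_composite(render, coverage, generated, thresh=0.5):
    """Keep geometry-grounded pixels where coverage is high; hand
    low-coverage regions to the generative model's output."""
    mask = coverage < thresh                 # regions needing repair
    out = np.where(mask, generated, render)  # trust geometry elsewhere
    return out, mask

render = np.array([[1.0, 0.2],
                   [0.8, 0.0]])              # imperfect 3D projection
coverage = np.array([[0.9, 0.1],
                     [0.7, 0.0]])            # holes at low coverage
generated = np.full((2, 2), 0.5)             # stand-in model output
out, mask = restoration_composite(render, coverage, generated)
print(out)  # prints [[1.0 0.5], [0.8 0.5]]
```

In the real system the "generated" values come from the video model conditioned on the full reference render, so filled regions stay consistent with the surrounding geometry rather than being patched independently.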
A persistent world representation is useful beyond one benchmark. By combining controllable video generation with explicit spatial memory, 4World becomes a practical foundation for interactive simulation, editable environments, and viewpoint-consistent content creation.
The same foundation supports exploration along unseen camera trajectories with stronger geometric consistency.


4World advances a hybrid direction for world modeling by combining explicit spatial memory with generative temporal dynamics. The result is a system designed not only to render scenes, but to preserve them, evolve them, and keep them coherent as interaction unfolds.