Hybrid 4D World Model

4World

A hybrid world model that combines explicit 3D memory with 2D video generation, built to preserve geometry, sustain long-term coherence, and unlock controllable scene evolution.

4World teaser

Explicit 3D memory + 2D video generation

4World combines the persistent memory capacity of explicit 3D scene representations with the generative power of 2D video models, keeping worlds structurally grounded while allowing them to evolve, respond, and remain coherent over time. As a game research effort, it also points toward a future alternative to parts of the traditional graphics engine pipeline.

We present 4World, a hybrid 4D world model that combines the explicit memory capacity of 3D representations with the generative power of 2D video generation. The result is a system that preserves spatial structure while still synthesizing, extending, and evolving scenes over time, with the long-term goal of rethinking parts of the conventional graphics engine pipeline for interactive worlds.

Existing paradigms often force a compromise. Pure 2D video generation can look compelling but tends to lose geometric consistency over long horizons, while explicit 3D reconstruction preserves structure but lacks strong generative capability for unseen motion, dynamic edits, and future evolution. 4World bridges that gap by pairing an explicit 3D memory with a 2D video generation model.

4World Video Demo.

Why a hybrid world model matters

A strong world model should preserve geometric consistency as viewpoints change, while also generating plausible dynamics as the scene evolves over time. In practice, most existing approaches tend to be better at one of these goals than the other.

In 4World, an explicit 3D representation serves as persistent scene memory, while a 2D video generation model provides temporal refinement, restoration, and completion. This hybrid design allows 4World to maintain long-term spatial structure while still synthesizing coherent unobserved content and future scene evolution.

Geometric consistency

Persistent explicit 3D memory helps stabilize structure across long sequences and viewpoint changes, while also making scene editing easier and creating a more practical bridge to existing 3D assets and production workflows.

Generative flexibility

Video priors restore imperfect projections, fill disocclusions, and extend scenes beyond observed content.

Interactive control

Explicit 3D modeling enables more precise camera control, while text guidance and dynamic masks provide additional user-facing action signals.
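To make the camera-control signal concrete, here is a minimal sketch of how a user-specified trajectory might be expressed as a sequence of camera-to-world poses. The function name, the orbit shape, and the 4x4 column convention (right, up, forward, position) are illustrative assumptions; the source does not specify 4World's pose format.

```python
import numpy as np

def orbit_trajectory(n_frames: int, radius: float = 3.0, height: float = 0.5):
    """Build a simple orbit of camera-to-world poses around the scene origin.

    Hypothetical helper: each pose is a 4x4 matrix whose rotation columns are
    (right, up, forward) and whose last column is the camera position.
    """
    poses = []
    for i in range(n_frames):
        theta = 2.0 * np.pi * i / n_frames
        pos = np.array([radius * np.cos(theta), height, radius * np.sin(theta)])
        forward = -pos / np.linalg.norm(pos)              # look at the origin
        right = np.cross(np.array([0.0, 1.0, 0.0]), forward)
        right /= np.linalg.norm(right)
        up = np.cross(forward, right)
        pose = np.eye(4)
        pose[:3, 0], pose[:3, 1], pose[:3, 2] = right, up, forward
        pose[:3, 3] = pos
        poses.append(pose)
    return np.stack(poses)
```

A trajectory like this, together with a text prompt and optional dynamic masks, would form the full control bundle for one generation step.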

Autoregressive updates

Each generation step extends the persistent world state, making the environment more explorable, more editable, and able to grow beyond its initial observations.

Autoregressive generation pipeline

The workflow is organized as an iterative loop: initialize memory, condition on controls, render a reference, generate new content, and update the 3D cache. That loop lets 4World keep a stable scene identity while continuously synthesizing new observations.

01

Initialize a persistent 3D cache

Input video is reconstructed into a 3D Gaussian representation that serves as the scene memory for later rendering and editing.

02

Specify control signals

Camera trajectories, text prompts, and optional dynamic masks define how the world should evolve.

03

Render a novel reference view

The cached world is projected into a reference frame and boundary mask that guide completion and inpainting.

04

Generate and refine new video

A video model repairs rendering artifacts, fills missing content, and produces a coherent new sequence from the reference render.

05

Update memory with new content

The generated result is aligned back into the 3D cache so future generations inherit the new world state.
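The five steps above can be sketched as a single loop. Everything below is a toy stand-in under stated assumptions: the real system uses a 3D Gaussian cache and a video generation model, while this sketch fakes both with stubs so the control flow is visible.

```python
import numpy as np

class GaussianCache:
    """Persistent scene memory, reduced here to a growing point set."""
    def __init__(self, points: np.ndarray):
        self.points = points

    def render(self, pose: np.ndarray, size=(64, 64)):
        """Project cached content into the requested view; return a reference
        image plus a coverage mask marking pixels memory can explain.
        (Real rendering would splat Gaussians; we fake partial coverage.)"""
        img = np.zeros((*size, 3))
        covered = np.zeros(size, dtype=bool)
        covered[: size[0] // 2] = True
        return img, covered

    def integrate(self, frame: np.ndarray, pose: np.ndarray):
        """Align generated content back into memory (stubbed as appending)."""
        self.points = np.vstack([self.points, pose[:3, 3][None]])

def refine_with_video_model(reference, fill_mask, prompt: str):
    """Stub for the video model: fills unexplained pixels, keeps the rest."""
    out = reference.copy()
    out[fill_mask] = 0.5  # pretend generated content lands here
    return out

def step(cache: GaussianCache, pose: np.ndarray, prompt: str):
    reference, covered = cache.render(pose)                       # step 03
    frame = refine_with_video_model(reference, ~covered, prompt)  # step 04
    cache.integrate(frame, pose)                                  # step 05
    return frame
```

Running `step` repeatedly with new poses and prompts mirrors the autoregressive loop: each call consumes the current cache and leaves it larger, so later generations inherit earlier ones.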

Video refinement of 3DGS renderings

Raw 3D projections are rarely perfect. They can contain artifacts, missing regions, noise, and disocclusion gaps. Instead of treating these as fatal weaknesses, 4World turns them into a handoff point. The generative model acts as a restoration layer that refines projected views into coherent video frames while respecting scene structure.

This matters because it lets the system remain grounded in explicit geometry while still producing high-quality outputs. Rather than replacing the 3D world with pure image synthesis, 4World uses generation to complete and elevate what explicit memory makes possible.
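One way to picture this handoff is a compositing rule: keep pixels the explicit memory can vouch for, and accept the video model's output only where coverage is weak. The accumulated-opacity threshold and the hard blend below are illustrative simplifications, not 4World's actual refinement mechanism.

```python
import numpy as np

def composite(rendered: np.ndarray, alpha: np.ndarray,
              generated: np.ndarray, thresh: float = 0.5):
    """Preserve geometry-grounded pixels from the 3D render; take the video
    model's output only in weakly covered regions (disocclusions, edges).

    `alpha` is assumed to be the per-pixel accumulated opacity of the
    splatted scene; `thresh` is an arbitrary illustrative cutoff.
    """
    mask = alpha < thresh            # pixels the explicit memory cannot explain
    out = rendered.copy()
    out[mask] = generated[mask]
    return out, mask
```

In the real system the boundary mask guides inpainting inside the video model rather than a post-hoc blend, but the division of labor is the same: geometry anchors, generation completes.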

Applications

A persistent world representation is useful beyond one benchmark. By combining controllable video generation with explicit spatial memory, 4World becomes a practical foundation for interactive simulation, editable environments, and viewpoint-consistent content creation.

Free-viewpoint navigation

4World supports exploration along unseen camera trajectories with stronger viewpoint consistency.

Scene editing

Users can freely edit the 3D scene while preserving a coherent spatial representation.
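Because the scene lives in an explicit cache, an edit can be expressed as a direct operation on stored primitives. The sketch below, a toy stand-in with hypothetical names, moves every cached center inside a spherical region, leaving the rest of the scene untouched.

```python
import numpy as np

def translate_region(points: np.ndarray, center: np.ndarray,
                     radius: float, offset: np.ndarray):
    """Shift cached primitive centers within `radius` of `center` by `offset`.

    Illustrative object-level edit on persistent 3D memory; a real edit would
    also carry rotations, scales, and appearance attributes along.
    """
    sel = np.linalg.norm(points - center, axis=1) < radius
    edited = points.copy()
    edited[sel] += offset
    return edited, sel
```

After such an edit, subsequent renders and generations see the modified world, so the change persists across the autoregressive loop.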

Dynamic memory

The explicit 3D memory is continuously maintained and updated as the world evolves.

Toward persistent, controllable world simulation

4World advances a hybrid direction for world modeling by combining explicit spatial memory with generative temporal dynamics. The result is a system designed not only to render scenes, but to preserve them, evolve them, and keep them coherent as interaction unfolds.