The ambition of world models is to simulate reality well enough that you can live inside it. Not watch it — live in it. Move through it, interact with it, build things in it, come back to it. The research community has made extraordinary progress on one dimension of this problem: making generated images and videos look real. And yet the interactive, persistent world remains out of reach.
Two directions have dominated the pursuit of AI-generated worlds, and both have run into the same wall from different sides. Systems like Marble build explicit 3D representations — neural radiance fields, 3D Gaussian Splatting — and render them in real time. The visual quality can be high. But the world is static. Physical interaction is absent. You can look at these worlds, but you cannot change them. Systems like Genie 3 condition video diffusion models on past frames and player actions. The visual imagination is impressive. But without a unified 3D coordinate system, there is nothing to hold the world together. Objects drift. Rooms rearrange. After a few minutes, the world has quietly become a different world.
The problem is not visual quality. It is structure. At Seele, we believe you cannot trade away geometric consistency, physical interaction, or persistent identity and still have a world.
In this post, we introduce NeuralG-Bridge, a new world model training paradigm that bridges game engines and video generation.
The Gap Between Physics and Light
A game engine knows where everything is. Every object, every surface, every moving part — the engine tracks it all with precision. It knows that the wooden chair is 1.2 meters from the wall, rotated seventeen degrees, and falling after a physics impulse knocked it over. This knowledge is exact. It updates every frame.
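To make this concrete, here is a minimal sketch of the kind of per-object state an engine tracks and advances each frame. The names, fields, and units are illustrative, not an actual engine API.

```python
from dataclasses import dataclass

@dataclass
class RigidBodyState:
    """Exact per-object state the engine knows at every frame (illustrative)."""
    position: tuple[float, float, float]  # meters, world space
    rotation_y: float                     # radians about the vertical axis
    velocity: tuple[float, float, float]  # meters per second
    object_id: int                        # stable identity across frames

def step(state: RigidBodyState, dt: float) -> RigidBodyState:
    """Advance one frame. Nothing here is guessed: the chair's new position
    follows exactly from its velocity, gravity included."""
    x, y, z = state.position
    vx, vy, vz = state.velocity
    return RigidBodyState(
        position=(x + vx * dt, y + vy * dt, z + vz * dt),
        rotation_y=state.rotation_y,
        velocity=(vx, vy - 9.81 * dt, vz),  # the physics impulse that tips the chair
        object_id=state.object_id,
    )
```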
What the engine does not know is what that chair looks like. The grain of the wood catching the afternoon light. The shadow it throws across the floor. The way the varnish catches a highlight at glancing angles. These have to be calculated — and traditional rendering fills this gap with rules. Path tracing. Rasterization. Physically-based materials. The rules work, and they scale to extraordinary realism. But they are still rules. They describe light rather than understand it.
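To see what a rendering "rule" looks like in practice, consider Lambertian diffuse shading, one of the simplest physically based formulas. The sketch below is generic textbook shading, not any particular renderer; everything about the light's behavior is specified by hand.

```python
def lambert_diffuse(normal, light_dir, albedo, intensity):
    """Textbook diffuse rule: brightness scales with the cosine of the angle
    between the surface normal and the light direction, clamped at zero."""
    n_dot_l = max(0.0, sum(n * l for n, l in zip(normal, light_dir)))
    return tuple(channel * intensity * n_dot_l for channel in albedo)

# A surface facing the light head-on receives full intensity...
print(lambert_diffuse((0, 1, 0), (0, 1, 0), (0.8, 0.6, 0.4), 1.0))
# ...while one at a glancing angle receives almost none.
print(lambert_diffuse((0, 1, 0), (0.99, 0.14, 0.0), (0.8, 0.6, 0.4), 1.0))
```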
When you close your eyes and picture a ball thrown across a room, you do not compute a differential equation. You feel the arc. The mental image is physically plausible without being physically precise. A generative model trained on the full range of real visual experience can do the same thing for light — filling in everything that the rules would approximate, with something that has actually seen how the world looks. The engine draws the skeleton. The AI paints what you see.
Between them sits a lightweight data stream that every modern game engine already produces: the G-Buffer. It carries the engine's complete summary of the scene — where surfaces are, which direction they face, how fast they are moving, which object they belong to — plus a compact learned signal that gives the AI model the identity of each object. The G-Buffer is not a new invention. It is an existing interface, connected to a generative model for the first time.
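A sketch of what such a per-frame G-Buffer might carry, written as named arrays. The resolution, channel counts, and the 16-dimensional identity code are assumptions for illustration, not the exact NeuralG-Bridge layout.

```python
import numpy as np

H, W = 720, 1280  # illustrative resolution

g_buffer = {
    "depth":      np.zeros((H, W, 1), np.float32),  # where surfaces are
    "normals":    np.zeros((H, W, 3), np.float32),  # which direction they face
    "motion":     np.zeros((H, W, 2), np.float32),  # how fast each pixel is moving
    "object_ids": np.zeros((H, W, 1), np.int32),    # which object each pixel belongs to
    # The one non-standard channel: a compact learned identity code per pixel,
    # telling the generative model *which* object it is painting.
    "identity":   np.zeros((H, W, 16), np.float32),
}
```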
NeuralG-Bridge
Here is what video generation models cannot do: remember.
Move through a room, look at the chair, walk to the window, come back — and the chair has changed. The oak color shifted to walnut. The grain runs in a different direction. The legs are slightly different heights. The model did not forget the chair. It never had a stable representation of it in the first place. Without something to anchor appearance to geometry, every new viewpoint is a fresh guess.
Every object in NeuralG-Bridge that requires stable identity gets a NeuralAsset: a compact learned representation of how that specific object looks, stored alongside its geometry. Not a texture. Not a material parameter. A learned encoding that captures the object's appearance from any angle, under any lighting — something the AI model can use to reconstruct exactly what that object should look like whenever it appears in the scene.
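A minimal sketch of the idea, with hypothetical names and sizes: the asset is an embedding stored next to the geometry, broadcast per pixel through the G-Buffer's identity channel, so the generator is told which object it is painting rather than asked to guess.

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class NeuralAsset:
    """Sketch only: appearance lives in a learned code, not in textures
    or material parameters."""
    asset_id: int
    appearance_code: np.ndarray  # e.g. 256 floats learned from the object itself

def identity_channel(object_ids: np.ndarray, assets: dict[int, NeuralAsset]) -> np.ndarray:
    """Fill the G-Buffer identity channel: every pixel carries the code of
    the object it belongs to, so the same object always conditions the
    generator the same way."""
    h, w = object_ids.shape
    codes = np.stack([assets[int(i)].appearance_code for i in object_ids.flat])
    return codes.reshape(h, w, -1)
```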
The result is deterministic. The same object, the same viewpoint, always produces the same appearance. The model is anchored to what it has already learned about each asset. No drift. No reinterpretation. Come back after an hour — the chair is still the same chair.
A NeuralAsset is also portable. Once generated, it behaves like any conventional digital asset: exportable, shareable, loadable in a different scene. You can place ten instances of the same chair across different rooms and each renders consistently. You can send a NeuralAsset to another developer the way you send a mesh or a texture file, and it works wherever it goes.
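Portability falls out of the representation being just data. A sketch of what export and import could look like, reusing the NeuralAsset dataclass from the sketch above; the file format here is an assumption.

```python
import numpy as np

def export_asset(asset: NeuralAsset, path: str) -> None:
    """Serialize a NeuralAsset like any other asset file (format hypothetical)."""
    np.savez(path, asset_id=asset.asset_id, appearance_code=asset.appearance_code)

def import_asset(path: str) -> NeuralAsset:
    data = np.load(path)
    return NeuralAsset(int(data["asset_id"]), data["appearance_code"])

# Ten chairs in three rooms all reference one appearance code,
# so each instance renders consistently wherever it is placed:
# chair = import_asset("oak_chair.npz")
# instances = [(chair, room) for room in ("study", "kitchen", "hall")]
```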
You can also edit it. Describe the change in plain language: "change the material to marble." The object's appearance updates. Its geometry, its physics, its position in the scene — nothing else moves.
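One way such an edit could be wired, as a sketch: the language instruction moves only the appearance code. The text_encoder and edit_model below are hypothetical stand-ins, not components the system names.

```python
def edit_asset(asset: NeuralAsset, instruction: str, text_encoder, edit_model) -> NeuralAsset:
    """Hypothetical edit path: the instruction nudges the appearance code.
    Geometry, physics, and placement are untouched because they live in
    the engine, not in the embedding."""
    delta = edit_model(text_encoder(instruction), asset.appearance_code)
    return NeuralAsset(asset.asset_id, asset.appearance_code + delta)

# marble_chair = edit_asset(chair, "change the material to marble", text_encoder, edit_model)
```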
Real-Time Interactive Worlds
The goal is not to generate a video.
A video plays once, in one direction, along a fixed camera path. A world responds to you, holds its state, and stays coherent no matter where you go or how long you stay. These are different problems. The existing approaches fail the second one in predictable ways.
Generate-then-Render systems — Marble and the broader family of 3D representation approaches — produce scenes with high visual quality. But the world is static. You can look. You cannot change anything. Video World Model systems — Genie 3 and similar — generate freely and respond to input. But without a 3D coordinate system holding things in place, the world reshapes itself underneath you. Over minutes, it becomes a different place.
NeuralG-Bridge sits between these two failure modes. The engine provides the geometric ground truth (physics, collision, motion) at full fidelity. The generative model provides the visual layer, conditioned on that geometry at every frame. The geometry is always correct. The appearance is always derived from the geometry. Neither can drift independently of the other.
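Put together, the frame loop might look like the sketch below (our reading of the design; every object and method name is hypothetical, and identity_channel is the helper from the NeuralAsset sketch above). The engine steps first, the generator paints second, and appearance is recomputed from geometry on every frame.

```python
def run_world(engine, generator, assets, display):
    """Geometry first, appearance second, every frame."""
    while display.is_open():
        actions = display.poll_input()
        engine.step(actions)                   # physics, collision, motion: exact
        g_buffer = engine.render_g_buffer()    # depth, normals, motion, object ids
        g_buffer["identity"] = identity_channel(g_buffer["object_ids"][..., 0], assets)
        frame = generator.generate(g_buffer)   # appearance derived from geometry
        display.present(frame)
```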
Running this in real time requires compressing the dozens of denoising steps a video generation model runs per frame down to two or three, without losing the visual quality or the identity consistency that makes the system work. Recent advances in video model distillation have shown this is achievable: state-of-the-art systems now run large generation models at 20–40 frames per second on a single GPU. We apply the same approach, with the additional constraint that the compression must preserve the deterministic anchoring that keeps every object looking exactly as it should.
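Concretely, the difference between a base sampler and a distilled one is the length of a loop like this. A generic sketch of few-step sampling, not our exact schedule or model interface:

```python
def sample(model, g_buffer, noise, timesteps):
    """Iterative denoising: each step refines the frame toward the G-Buffer
    conditioning. A base video model might take ~50 steps per frame; a
    distilled student is trained to land in nearly the same place in two
    or three."""
    x = noise
    for t in timesteps:
        x = model(x, t, g_buffer)
    return x

# frame = sample(teacher, g_buffer, noise, timesteps=range(50, 0, -1))  # offline budget
# frame = sample(student, g_buffer, noise, timesteps=[3, 2, 1])         # real-time budget
```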
NeuralG-Bridge runs at over 25 frames per second with interaction latency under 100 ms. Scenes hold together through continuous free-roaming sessions of an hour or more, with no structural drift and no visual flickering. Not a demo that runs for ninety seconds before the scene destabilizes. A world that holds together for as long as you want to stay in it.
What Comes Next
Our early models are running. The architecture is validated end to end, and the first results confirm the core thesis: geometric anchoring gives the generative model what it needs to hold a scene together over time. We will continue publishing — covering the data strategy, the conditioning architecture, the distillation approach, and the physics integration that makes the world genuinely interactive.
We are SeeleAI. If you want to be among the first to experience a generative world that holds together for as long as you want to stay in it, join the waitlist.