Shading the World in Real Time via PEGA

PEGA: Physics Embedded Generative Architecture — a real-time world model system where graphics engines and video models are trained together through a shared physics-embedded representation.

[Figure: PEGA system diagram. Three blocks are connected left to right: the Graphics Engine (NeuralAsset: identity, VFX, dynamic geometry), the PEGA Physics-Embedded Representation (physics state, asset memory, VFX, dynamic geometry, scene continuity, latent identity), and the Video Model (generative renderer). Forward passes run left to right at inference; gradients flow right to left during training. Caption: PEGA trains engine-side representations and video models end to end.]

The ambition of world models is to simulate reality well enough that you can live inside it. Not watch it — live in it. Move through it, interact with it, build things in it, come back to it. The research community has made extraordinary progress on one dimension of this problem: making generated images and videos look real. And yet the interactive, persistent world remains out of reach.

Two directions have dominated the pursuit of AI-generated worlds, and both have run into the same wall from different sides. Systems like Marble build explicit 3D representations — neural radiance fields, 3D Gaussian Splatting — and render them in real time. The visual quality can be high. But the world is static. Physical interaction is absent. You can look at these worlds, but you cannot change them. Systems like Genie 3 condition video diffusion models on past frames and player actions. The visual imagination is impressive. But without a unified 3D coordinate system, there is nothing to hold the world together. Objects drift. Rooms rearrange. After a few minutes, the world has quietly become a different world.

The problem is not visual quality. It is structure. At Seele, we believe you cannot trade any of the following away and still have a world.

Persistence. Not three minutes of coherence. Not one hour. The vase on the table is still there when you come back from the next room. Persistence must be unconditional and permanent, the same way it is in every game that has ever shipped. A world that forgets is not a world — it is a hallucination that happens to look like one.
Real-time. Interaction requires frame-rate generation. The moment latency exceeds the threshold of perception, the loop between action and response breaks. A world model that cannot run at interactive frame rates is a video generator with extra steps. You are no longer inside a world. You are watching an offline render of one.
Reproducibility. Any traditional mesh can be exported from one engine and imported into another. The geometry travels. The material travels. The worlds a user creates must be shareable, transferable, and reproducible by anyone who receives them. Not locked inside a single model's latent space.
Integrity. The engine and the renderer are not two independent systems stitched together at the output. They are two halves of one system. During training, gradients must flow across the boundary — from generated frames back into the engine-side representation that carries the world's physical state. Integrity means there are no seams.

In this post, we introduce PEGA: Physics Embedded Generative Architecture. Physics is the concrete world state maintained by the engine. Embedded is the mechanism that makes that state available to a learned representation. Generative is the capability that turns it into real-time visual experience.

The Gap Between Physics and Light

A game engine knows where everything is. Every object, every surface, every moving part — the engine tracks it all with precision. It knows that the wooden chair is 1.2 meters from the wall, rotated seventeen degrees, and falling after a physics impulse knocked it over. This knowledge is exact. It updates every frame.
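As a minimal sketch of what "exact, updated every frame" means, the snippet below models the chair as a small state record and advances it by one physics step. The `BodyState` fields, the `step` function, the 60 Hz tick, and the starting height of 0.5 m are all assumptions made for illustration; nothing here describes the actual engine.

```python
from dataclasses import dataclass, replace

GRAVITY = -9.81  # m/s^2 along the vertical axis


@dataclass(frozen=True)
class BodyState:
    """Hypothetical engine-side state for one rigid body."""
    name: str
    distance_to_wall_m: float
    yaw_deg: float
    height_m: float
    vertical_velocity: float


def step(state: BodyState, dt: float) -> BodyState:
    """Advance one frame with semi-implicit Euler and clamp at the floor."""
    velocity = state.vertical_velocity + GRAVITY * dt
    height = state.height_m + velocity * dt
    if height <= 0.0:
        height, velocity = 0.0, 0.0
    return replace(state, height_m=height, vertical_velocity=velocity)


# The chair from the text: 1.2 m from the wall, rotated 17 degrees, falling.
chair = BodyState("wooden_chair", 1.2, 17.0, 0.5, 0.0)
after_one_frame = step(chair, 1.0 / 60.0)
```

The state carries position, orientation, and motion, but no field says anything about how the chair looks. That missing piece is the gap the next paragraphs describe.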

What the engine does not know is what that chair looks like. The grain of the wood catching the afternoon light. The shadow it throws across the floor. The way the varnish catches a highlight at glancing angles. These have to be calculated — and traditional rendering fills this gap with rules. Path tracing. Rasterization. Physically-based materials. The rules work, and they scale to extraordinary realism. But they are still rules. They describe light rather than understand it.

When you close your eyes and picture a ball thrown across a room, you do not compute a differential equation. You feel the arc. The mental image is physically plausible without being physically precise. A generative model trained on the full range of real visual experience can do the same thing for light — filling in everything that the rules would approximate, with something that has actually seen how the world looks. The engine draws the skeleton. The AI paints what you see.

PEGA connects these two kinds of knowledge through a physics-embedded representation: a compact, trainable form of engine-side state that preserves object identity, physical interaction, scene continuity, and editable appearance. It is not an adapter placed between two finished systems. It is the training surface where the engine and the video model become one loop.
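The idea of a single training loop can be illustrated with a deliberately tiny model: one engine-side embedding and one renderer weight, both updated from the same frame-level loss. This is a toy under stated assumptions, not PEGA's training code. The scalar "frame", the squared-error loss, the learning rate, and every name below are hypothetical.

```python
# Toy joint training: the gradient of the frame loss crosses the boundary
# from the renderer weight back into the engine-side embedding.

def forward(embedding, weight, physics_state):
    representation = embedding * physics_state  # engine-side representation
    frame = weight * representation             # renderer output (a scalar "frame")
    return representation, frame


def train(embedding, weight, physics_state, target, lr=0.05, steps=200):
    losses = []
    for _ in range(steps):
        representation, frame = forward(embedding, weight, physics_state)
        error = frame - target
        losses.append(0.5 * error * error)
        grad_weight = error * representation                  # dL/dweight
        grad_representation = error * weight                  # crosses the boundary
        grad_embedding = grad_representation * physics_state  # dL/dembedding
        weight -= lr * grad_weight
        embedding -= lr * grad_embedding
    return embedding, weight, losses


embedding, weight, losses = train(0.5, 0.5, physics_state=1.5, target=3.0)
```

Both parameters move away from their initial values and the loss falls. That is the property the text asks for: the representation is trained by the same signal as the renderer rather than frozen in front of it.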

Demo 1 — Real-time inference recorded in-engine: PEGA-guided video rendering for temporally stable world shading. All base game footage captured live from games built on seeles.ai.

NeuralAsset Memory

Here is what video generation models cannot do: remember.

Move through a room, look at the chair, walk to the window, come back — and the chair has changed. The oak color shifted to walnut. The grain runs in a different direction. The legs are slightly different heights. The model did not forget the chair. It never had a stable representation of it in the first place. Without something to anchor appearance to geometry, every new viewpoint is a fresh guess.

Every object in PEGA that requires stable identity gets a NeuralAsset: a latent asset memory stored alongside the engine's representation of that object. It is not a texture. It is not a material parameter. It is a learned memory that can encode identity, appearance, VFX behavior, and dynamic geometry — something the video model can use to reconstruct what that object should be whenever it appears in the scene.

The result is deterministic. The same object seen from the same viewpoint always produces the same appearance. The model is anchored to what it has already learned about each asset. No drift. No reinterpretation. Come back after an hour, and the chair is still the same chair.
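A minimal sketch of that determinism follows, assuming the renderer can be treated as a pure function of an asset's latent memory and the viewpoint. The `NeuralAsset` fields and the hash-based `shade` stand-in are hypothetical; a real video model would replace `shade`, and the hash is only there to give a reproducible output.

```python
import hashlib
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class NeuralAsset:
    """Hypothetical container: an id plus a learned latent vector."""
    asset_id: str
    latent: tuple


def shade(asset, yaw_deg):
    """Stand-in renderer: depends only on (latent, viewpoint), no history."""
    payload = struct.pack(f"{len(asset.latent)}d", *asset.latent)
    payload += struct.pack("d", float(yaw_deg))
    digest = hashlib.sha256(payload).digest()
    return tuple(byte / 255.0 for byte in digest[:3])  # fake RGB in [0, 1]


chair = NeuralAsset("oak_chair", (0.12, -0.48, 0.93, 0.05))
walk = [0.0, 90.0, 180.0, 0.0]  # look at the chair, walk away, come back
frames = [shade(chair, yaw) for yaw in walk]
```

Because `shade` takes no previous frame as input, the first and last entries of `frames` are identical, however long the walk in between.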

A NeuralAsset is also portable. Once generated, it behaves like any conventional digital asset: exportable, shareable, loadable in a different scene. You can place ten instances of the same chair across different rooms and each renders consistently. You can send a NeuralAsset to another developer the way you send a mesh or a visual asset, and it works wherever it goes.
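Portability can be sketched as a plain serialization round trip. The JSON layout, the `export_asset` and `import_asset` helpers, and the `NeuralAsset` fields are assumptions chosen for illustration; the actual file format is not described in this post.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class NeuralAsset:
    """Hypothetical container: an id plus a learned latent vector."""
    asset_id: str
    latent: tuple


def export_asset(asset):
    """Serialize to a JSON string that can be sent like any other asset file."""
    return json.dumps({"asset_id": asset.asset_id, "latent": list(asset.latent)})


def import_asset(text):
    """Rebuild the asset on the receiving side."""
    data = json.loads(text)
    return NeuralAsset(data["asset_id"], tuple(data["latent"]))


chair = NeuralAsset("oak_chair", (0.12, -0.48, 0.93, 0.05))
received = import_asset(export_asset(chair))

# Ten placements across different rooms all point at the same memory.
placements = [{"room": f"room_{i}", "asset": received} for i in range(10)]
```

The received asset compares equal to the original, and every placement shares one latent. That sharing is what would let each instance render consistently.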

You can also edit it. Describe the change in plain language: "turn the impact effect into blue plasma," or "make the surface bloom as it opens." The object's memory updates. Its physical role, placement, and surrounding scene state remain intact.
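One way to picture such an edit is an update that touches only the latent memory and leaves the engine-owned fields alone. Everything below is hypothetical: `encode_prompt` is a deterministic stub standing in for a text encoder, and adding a delta to the latent is just one possible update rule.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SceneObject:
    name: str
    position: tuple  # engine-owned physical placement
    collider: str    # engine-owned physical role
    latent: tuple    # NeuralAsset memory (appearance, VFX)


def encode_prompt(prompt, dim):
    """Hypothetical text-encoder stub: deterministic and never all zeros."""
    seed = sum(prompt.encode("utf-8"))
    return tuple(((seed + i) % 5 + 1) / 10.0 for i in range(dim))


def apply_edit(obj, prompt):
    """Return a copy whose latent is shifted; physics fields are untouched."""
    delta = encode_prompt(prompt, len(obj.latent))
    new_latent = tuple(a + d for a, d in zip(obj.latent, delta))
    return replace(obj, latent=new_latent)


impact = SceneObject("impact_effect", (4.0, 0.0, 1.5), "sphere", (0.2, -0.1, 0.7))
edited = apply_edit(impact, "turn the impact effect into blue plasma")
```

After the edit, `position` and `collider` are unchanged while `latent` differs, matching the claim that the object's physical role and placement stay intact.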

Demo 2 — NeuralAsset latent asset memory for VFX and identity consistency: weapon replacement and enemy replacement preserve identity and appearance across viewpoints and sessions.
Demo 3 — Semantic VFX and dynamic geometry editing in real time: surface behavior and visual response change while the surrounding scene state remains intact.

Real-Time Interactive Worlds

The goal is not to generate a video.

A video plays once, in one direction, along a fixed camera path. A world responds to you, holds its state, and stays coherent no matter where you go or how long you stay. These are different problems. The existing approaches fail the second one in predictable ways.

Generate-then-Render systems — Marble and the broader family of 3D representation approaches — produce scenes with high visual quality. But the world is static. You can look. You cannot change anything. Video World Model systems — Genie 3 and similar — generate freely and respond to input. But without a 3D coordinate system holding things in place, the world reshapes itself underneath you. Over minutes, it becomes a different place.

PEGA sits between these two failure modes. The graphics engine provides the persistent physical state: collision, interaction, placement, and causality. The video model provides the visual layer, trained against that state at every step. The physical world stays coherent. The visual world is generated from it. Neither can drift independently of the other.
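The contrast between the two regimes can be sketched numerically. The sketch assumes that a frame conditioned only on earlier frames accumulates a small error each step, modeled here as a random walk, while a frame derived from engine state is recomputed from the ground truth every time. The noise level, the one-dimensional "position", and the function names are illustrative assumptions, not measurements.

```python
import random

FPS = 25
OBJECT_X = 2.0  # engine-held position of a vase that never moves


def anchored_render(engine_x):
    """Frame content derived from engine state each frame (no history)."""
    return engine_x


def autoregressive_render(previous_x, rng, noise=0.001):
    """Frame content derived from the previous frame plus a small error."""
    return previous_x + rng.gauss(0.0, noise)


rng = random.Random(0)  # fixed seed so the run is repeatable
anchored_x = OBJECT_X
drifting_x = OBJECT_X
for _ in range(FPS * 60):  # one minute of frames
    anchored_x = anchored_render(OBJECT_X)
    drifting_x = autoregressive_render(drifting_x, rng)

anchored_error = abs(anchored_x - OBJECT_X)
drifting_error = abs(drifting_x - OBJECT_X)
```

The anchored error stays at exactly zero because nothing carries over between frames. The autoregressive error is whatever the accumulated noise happens to sum to, and its expected size grows with session length.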

Running this in real time requires compressing the dozens of processing steps a video generation model spends on each frame down to two or three, without losing the visual quality or the identity consistency that makes the system work. Recent advances in video model distillation have shown this is achievable: state-of-the-art systems now run large generation models at roughly 20–40 frames per second on a single GPU. We apply the same approach, with the additional constraint that the compression must preserve the deterministic anchoring that keeps every object looking exactly as it should.

PEGA runs at over 25 frames per second with interaction latency under 100ms. Scenes hold together through continuous free-roaming sessions of 1 hour or more — zero structural drift, zero visual flickering. Not a demo that runs for ninety seconds before the scene destabilizes. A world that holds together for as long as you want to stay in it.
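These figures translate into simple time budgets. The helpers below are a back-of-the-envelope sketch. They assume the whole frame budget is available to the generation steps unless an overhead is passed in, and the function names are hypothetical.

```python
def frame_budget_ms(fps):
    """Wall-clock time available per frame at a given frame rate."""
    return 1000.0 / fps


def per_step_budget_ms(fps, steps, overhead_ms=0.0):
    """Time each generation step may take if `steps` steps share one frame."""
    return (frame_budget_ms(fps) - overhead_ms) / steps


def latency_in_frames(latency_ms, fps):
    """How many frame intervals fit inside an interaction latency bound."""
    return latency_ms / frame_budget_ms(fps)


budget = frame_budget_ms(25)           # 40.0 ms per frame at 25 fps
two_step = per_step_budget_ms(25, 2)   # 20.0 ms per step
three_step = per_step_budget_ms(25, 3) # about 13.3 ms per step
frames_of_latency = latency_in_frames(100, 25)  # 2.5 frame intervals
frames_per_hour = 25 * 60 * 60         # 90000 frames in a one-hour session
```

At 25 frames per second, a two- or three-step model has between about 13 and 20 ms per step. A 100 ms latency bound spans two and a half frame intervals, and a one-hour session means 90,000 consecutive frames that must stay consistent.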

What Comes Next

Our early models are running. The architecture is validated end to end, and the first results confirm the core thesis: geometric anchoring gives the generative model what it needs to hold a scene together over time. We will continue publishing — covering the data strategy, the representation architecture, the distillation approach, and the physics integration that makes the world genuinely interactive.

We are SeeleAI. If you want to be among the first to experience a generative world that holds together for as long as you want to stay in it, join the waitlist below.

Join the Waitlist