Native Multimodal Foundation Model
A native multimodal foundation model built on a Mixture-of-Transformers (MoT) architecture.
Text, 3D, and spatial signals share one representation, which supports native understanding, generation, and multi-turn editing.
MoT · Architecture
Seele02 trains in a multimodal space from day one. Two expert networks — understanding and generation — keep their parameters fully decoupled, while a single shared self-attention handles cross-modal alignment and fusion.
The understanding expert extracts semantics and relationships from text, 3D structure, and spatial signals, and feeds stable context to reasoning, planning, tool calls, and multi-turn editing.
The generation expert specializes in geometry, texture, and spatiotemporal consistency. Its outputs stay controllable and editable, and they preserve identity across multi-turn interactions.
All modalities share one global self-attention, which acts as a universal bus. Both experts read and align inside a single context, end to end, so information is preserved across the boundary between understanding and generation.
Decoupled experts give independently scalable training and inference paths. A unified interface lets Agents on top stay focused on the goal while modality switching happens underneath.
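The split between decoupled expert parameters and one shared attention can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not Seele02's implementation: the page does not say which parameters are per-expert, how tokens are routed, or how attention is masked. Here every token carries a hypothetical route label, each expert owns its own query, key, value, and output projections, and attention runs once over the whole sequence. Layer norms, feed-forward blocks, multiple heads, and masking are left out.

```python
import math
import random

DIM = 4  # toy hidden size; real models are far larger


def rand_matrix(rows, cols, rng):
    """Small random weight matrix as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(rng):
    """Each expert owns its projections; nothing is shared between experts."""
    return {name: rand_matrix(DIM, DIM, rng) for name in ("q", "k", "v", "out")}


def mot_attention(tokens, routes, experts):
    """One attention step: per-expert projections, one global attention.

    tokens: list of DIM-length vectors (the shared context)
    routes: list of expert names, one per token (assumed routing scheme)
    """
    # 1. Project every token with the weights of the expert it is routed to.
    q = [matvec(experts[r]["q"], t) for t, r in zip(tokens, routes)]
    k = [matvec(experts[r]["k"], t) for t, r in zip(tokens, routes)]
    v = [matvec(experts[r]["v"], t) for t, r in zip(tokens, routes)]

    outputs = []
    for i, route in enumerate(routes):
        # 2. Shared attention: token i scores against ALL tokens,
        #    regardless of which expert produced their keys.
        scores = [sum(a * b for a, b in zip(q[i], k_j)) / math.sqrt(DIM) for k_j in k]
        weights = softmax(scores)
        mixed = [sum(w * v_j[d] for w, v_j in zip(weights, v)) for d in range(DIM)]
        # 3. Back to expert-specific parameters for the output projection.
        outputs.append(matvec(experts[route]["out"], mixed))
    return outputs


rng = random.Random(0)
experts = {"understanding": make_expert(rng), "generation": make_expert(rng)}
tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(5)]
routes = ["understanding"] * 3 + ["generation"] * 2
outputs = mot_attention(tokens, routes, experts)
```

Because step 2 attends over the full sequence, changing a token handled by the understanding expert changes the outputs of tokens handled by the generation expert, even though the two experts share no weights.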
Three Modalities · Native
Text, 3D, and spatial signals share a unified representation space inside Seele02 and are pre-trained against the same global attention, so cross-modal understanding, generation, and multi-turn editing are native capabilities rather than add-ons.
Text: instruction following, reasoning, long-context dialogue, and tool use. It is the semantic backbone of the Agent and the lingua franca across modalities.
3D: understands and generates geometric assets in a native 3D representation. Multi-turn, language-driven edits keep geometry, topology, and asset identity stable.
Spatial: a unified view of spatial relationships, camera motion, and scene structure, for interactive worlds, spatial planning, and controllable scene evolution.
One model reads a 3D scene, explains it in text, renders a new viewpoint, and edits the spatial layout from a natural-language instruction — all in a single context, end to end.
Agent Base Model
Tool use is built in at pre-training time, as the model's native interface. Seele02 drives complex tasks that require multi-step planning, cross-modal decisions, and calls to external capabilities.
Native function calling. Structured arguments and return values flow seamlessly through cross-modal context.
Long context plus shared cross-modal attention lets Agents iterate on the same asset across multiple turns and modalities.
One model shifts the basis of its decisions among text, 3D, and spatial signals; a single backbone covers every modality.
Objects, scenes, and characters keep their identity across multi-turn interactions and cross-modal transforms — stable, end to end.
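The function-calling flow described above, where structured arguments and return values travel through one context, might look roughly like this. It is a hedged sketch only: the page does not publish a call schema, so `ToolCall`, `ToolResult`, `Context`, the `scale_asset` tool, and the `chair-01` asset ID are all hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ToolCall:
    """A structured call the model might emit (hypothetical schema)."""
    name: str
    arguments: dict


@dataclass
class ToolResult:
    """The structured return value fed back into the context."""
    name: str
    content: dict


@dataclass
class Context:
    """One shared context holding every turn of the interaction."""
    turns: list = field(default_factory=list)

    def add(self, role, payload):
        self.turns.append({"role": role, "payload": payload})


def scale_asset(asset_id, factor):
    """Stand-in for an external capability; not a real Seele02 tool."""
    return {"asset_id": asset_id, "scale": factor}


# Registry mapping tool names to callables.
TOOLS = {"scale_asset": scale_asset}


def dispatch(call):
    """Look up the tool by name and run it with the structured arguments."""
    result = TOOLS[call.name](**call.arguments)
    return ToolResult(name=call.name, content=result)


ctx = Context()
ctx.add("user", {"text": "Make the chair 20% larger."})

# The model's turn: a structured call that refers to the asset by a stable ID.
call = ToolCall(name="scale_asset", arguments={"asset_id": "chair-01", "factor": 1.2})
ctx.add("assistant", asdict(call))

# The tool's return value re-enters the same context for the next turn.
result = dispatch(call)
ctx.add("tool", asdict(result))

# The whole exchange serializes as one context.
serialized = json.dumps(ctx.turns, indent=2)
```

Carrying the same asset ID through the call and its result is one simple way to keep an object's identity stable across turns; how Seele02 tracks identity internally is not stated on the page.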
Two SKUs · One Base
Two product lines for different workloads. Both share the same MoT base and Agent interface, so applications can switch between them on demand.
Tuned for real-time conversation, instant generation, and high-frequency Agent calls. Lower latency and higher throughput for experiences that demand sub-second feedback.
Designed for long-horizon planning, multi-step reasoning, and high-fidelity generation. Suited for content production pipelines, Agent orchestration, and quality-critical workflows.
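Since both product lines share one interface, an application could choose between them per request. The sketch below is an assumption-heavy illustration: the page does not name the two lines, so the tier labels are placeholders, and the `Request` fields and the 1000 ms cutoff (taken from "sub-second feedback") are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt: str
    latency_budget_ms: int    # how long the caller can wait
    needs_long_horizon: bool  # multi-step planning or high-fidelity output


# Placeholder tier labels; the page does not name the two product lines.
LOW_LATENCY = "low-latency-tier"
HIGH_FIDELITY = "high-fidelity-tier"


def pick_tier(request, realtime_cutoff_ms=1000):
    """Pick a tier from the workload shape.

    Quality-critical work goes to the high-fidelity tier even when the
    latency budget is tight; that priority is a design choice of this sketch.
    """
    if request.needs_long_horizon:
        return HIGH_FIDELITY
    if request.latency_budget_ms <= realtime_cutoff_ms:
        return LOW_LATENCY
    return HIGH_FIDELITY


chat_turn = Request("Rotate the lamp a little.", 300, False)
batch_job = Request("Plan and build the full scene.", 60000, True)
chat_tier = pick_tier(chat_turn)
batch_tier = pick_tier(batch_job)
```

Because the Agent interface is the same on both tiers, only the tier label changes between calls; the request payload stays as it is.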
Looking for a native foundation for next-generation multimodal experiences? Let products and Agents think, generate, and act inside one unified MoT framework.