Native Multimodal Foundation Model

Seele02

A native multimodal foundation model built on a Mixture-of-Transformers (MoT) architecture.
Text, 3D, and spatial signals share one representation, with native understanding, generation, and multi-turn editing.

Mixture-of-Transformers · Text · 3D · Spatial · Agent-Ready

MoT · Architecture

Understanding and generation experts,
fused by a shared global self-attention.

Seele02 trains in a multimodal space from day one. Two expert networks — understanding and generation — keep their parameters fully decoupled, while a single shared self-attention handles cross-modal alignment and fusion.

[Diagram: Understanding Expert (Semantics · Reasoning · Planning) and Generation Expert (Geometry · Texture · Consistency), joined by the Shared Global Self-Attention (cross-modal fusion bus) across Text, 3D, and Spatial.]

Understanding Expert
Reasoning side, focused.

Extracts semantics and relationships from text, 3D structure, and spatial signals, and feeds stable context to reasoning, planning, tool calls, and multi-turn editing.

Generation Expert
Generation side, focused.

Specializes in geometry, texture, and spatiotemporal consistency. Outputs stay controllable, editable, and preserve identity across multi-turn interactions.

Shared Self-Attention
One global bus across modalities.

All modalities share one global self-attention that acts as a universal bus. Both experts read and align within a single context, end to end, so information carries across the boundary between understanding and generation.

Decoupled · Unified
Decoupled parameters, unified interface.

Decoupled experts give independently scalable training and inference paths. A unified interface lets Agents on top stay focused on the goal while modality switching happens underneath.
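
To make the layout above concrete, here is a minimal sketch of one MoT-style layer in plain Python. It is an illustration under stated assumptions, not Seele02's implementation: the toy hidden size, the random weights, the per-token routing labels, and the names `make_expert` and `mot_layer` are all hypothetical. It shows only the two ideas described here: each expert owns its own projection and feed-forward weights, while attention is computed once over every token.

```python
import math
import random

DIM = 4  # toy hidden size, chosen only for illustration


def matvec(weights, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(rng):
    """One expert owns its own Q/K/V and feed-forward weights (decoupled parameters)."""
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]
    return {"q": mat(), "k": mat(), "v": mat(), "ffn": mat()}


def mot_layer(tokens, routes, experts):
    """tokens: list of DIM-vectors; routes: the expert name assigned to each token."""
    # 1. Each token is projected with the weights of its own expert.
    q = [matvec(experts[r]["q"], t) for t, r in zip(tokens, routes)]
    k = [matvec(experts[r]["k"], t) for t, r in zip(tokens, routes)]
    v = [matvec(experts[r]["v"], t) for t, r in zip(tokens, routes)]

    # 2. Shared global self-attention: every token attends to every token,
    #    regardless of which expert produced its projections.
    scale = math.sqrt(DIM)
    mixed = []
    for qi in q:
        attn = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        mixed.append([sum(w * vj[d] for w, vj in zip(attn, v)) for d in range(DIM)])

    # 3. Residual connection, then the expert-specific feed-forward (ReLU).
    out = []
    for t, m, r in zip(tokens, mixed, routes):
        hidden = [a + b for a, b in zip(t, m)]
        ffn = matvec(experts[r]["ffn"], hidden)
        out.append([h + max(0.0, f) for h, f in zip(hidden, ffn)])
    return out


# Hypothetical routing: two tokens per expert, all in one shared context.
rng = random.Random(0)
experts = {"understanding": make_expert(rng), "generation": make_expert(rng)}
tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(4)]
routes = ["understanding", "understanding", "generation", "generation"]
outputs = mot_layer(tokens, routes, experts)
```

Because step 2 runs over the full token list, changing a token routed to the generation expert also changes the outputs of tokens routed to the understanding expert, even though the two experts share no weights.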

Three Modalities · Native

Three modalities, one representation.

Text, 3D, and spatial signals share a unified representation space inside Seele02 and are pre-trained against the same global attention, so cross-modal understanding, generation, and multi-turn editing all run on one backbone.

01 · Text

Instruction following, reasoning, long-context dialogue, and tool use — the semantic backbone of the Agent and the lingua franca across modalities.

  • Multi-turn instructions and tool calls
  • Cross-modal description and explanation
  • Structured outputs and function schemas

02 · 3D

Understand and generate geometric assets in a native 3D representation. Multi-turn, language-driven edits keep geometry, topology, and asset identity stable.

  • 3D asset generation and editing
  • View-consistent identity preservation
  • Production-ready exportable formats

03 · Spatial

A unified view of spatial relationships, camera motion, and scene structure — for interactive worlds, spatial planning, and controllable scene evolution.

  • Spatial relationships and scene graphs
  • Camera trajectories and viewpoint control
  • Persistent, explorable world state
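
As a rough picture of what a scene graph with persistent state can look like from the application side, the sketch below keeps nodes under stable IDs and derives one spatial relation from their positions. Everything in it is an assumption made for illustration: the `SceneGraph` and `Node` classes, the node IDs, and the convention that a smaller x coordinate means "left" are hypothetical and do not describe Seele02's internal representation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str     # stable identity, never changed by an edit
    label: str
    position: tuple  # (x, y, z) in arbitrary scene units


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)

    def add(self, node_id, label, position):
        self.nodes[node_id] = Node(node_id, label, position)

    def move(self, node_id, new_position):
        # An edit updates the position in place; the node keeps its ID.
        self.nodes[node_id].position = new_position

    def relation(self, a, b):
        # Assumed convention: smaller x means further left.
        ax = self.nodes[a].position[0]
        bx = self.nodes[b].position[0]
        if ax < bx:
            return "left_of"
        if ax > bx:
            return "right_of"
        return "aligned"


scene = SceneGraph()
scene.add("chair-1", "chair", (0.0, 0.0, 0.0))
scene.add("table-1", "table", (2.0, 0.0, 0.0))
before = scene.relation("chair-1", "table-1")

# An instruction such as "move the chair to the other side of the table"
# would reduce to an edit like this one.
scene.move("chair-1", (3.0, 0.0, 0.0))
after = scene.relation("chair-1", "table-1")
```

The relation between the two objects changes after the edit, while both nodes remain addressable under the same IDs.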

One model reads a 3D scene, explains it in text, renders a new viewpoint, and edits the spatial layout from a natural-language instruction — all in a single context, end to end.

Agent Base Model

Native tool use. A foundation model for general-purpose Agents.

Tool use is built in at pre-training time, as the model's native interface. Seele02 drives complex tasks that require multi-step planning, cross-modal decisions, and calls to external capabilities.

Native Tool Use

Native function calling. Structured arguments and return values flow seamlessly through cross-modal context.
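
For a sense of what structured arguments and return values look like in practice, here is a small sketch of a tool definition and a check on an incoming call. The tool name `rotate_asset`, its parameters, the JSON call shape, and the `validate_call` helper are all hypothetical; Seele02's actual function-calling format is not specified here.

```python
import json

# Hypothetical tool definition: parameter names mapped to expected Python types.
ROTATE_ASSET_TOOL = {
    "name": "rotate_asset",
    "description": "Rotate a 3D asset around one axis.",
    "parameters": {"asset_id": str, "axis": str, "degrees": float},
}


def validate_call(tool, raw_call):
    """Parse a JSON tool call and check its arguments against the tool definition."""
    call = json.loads(raw_call)
    if call.get("name") != tool["name"]:
        raise ValueError(f"unexpected tool: {call.get('name')!r}")
    args = call.get("arguments", {})
    checked = {}
    for key, expected in tool["parameters"].items():
        if key not in args:
            raise ValueError(f"missing argument: {key}")
        value = args[key]
        # JSON has a single number type, so accept ints where floats are expected.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected):
            raise TypeError(f"{key} should be {expected.__name__}")
        checked[key] = value
    return checked


# A call as a model might emit it (assumed shape), and a structured return value.
raw = '{"name": "rotate_asset", "arguments": {"asset_id": "chair-1", "axis": "y", "degrees": 90}}'
args = validate_call(ROTATE_ASSET_TOOL, raw)
result = json.dumps({"asset_id": args["asset_id"], "status": "ok"})
```

Validating arguments before executing a tool keeps malformed calls out of the pipeline, and returning JSON lets the result be fed back into the next turn as structured context.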

Multi-turn planning and editing

Long context plus shared cross-modal attention lets Agents iterate on the same asset across multiple turns and modalities.

Cross-modal decisions

One model draws on text, 3D, and spatial signals when making decisions, switching between them as the task requires. A single backbone covers all three modalities.

Stable asset identity

Objects, scenes, and characters keep their identity across multi-turn interactions and cross-modal transforms.

Two SKUs · One Base

Model lineup.

Two product lines for different workloads. Both share the same MoT base and Agent interface, so applications can switch between them on demand.

Flash

Built for low-latency interaction.

Tuned for real-time conversation, instant generation, and high-frequency Agent calls. Lower latency and higher throughput for experiences that demand sub-second feedback.

  • Real-time multimodal dialogue
  • High-frequency tool calls and orchestration
  • Instant generation inside product UI

Pro

Built for complex reasoning and high-quality production.

Designed for long-horizon planning, multi-step reasoning, and high-fidelity generation. Suited for content production pipelines, Agent orchestration, and quality-critical workflows.

  • Long, multi-step reasoning
  • High-quality 3D and spatial generation
  • Agent orchestration and multi-agent collaboration
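
Since both lines share one base and one Agent interface, an application could choose between them per request. The sketch below is one possible routing rule, not an official recommendation: the model identifiers `seele02-flash` and `seele02-pro`, the `Task` fields, and both thresholds are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; the real names may differ.
FLASH = "seele02-flash"
PRO = "seele02-pro"


@dataclass(frozen=True)
class Task:
    latency_budget_ms: int
    planning_steps: int
    needs_high_fidelity: bool = False


def pick_tier(task, realtime_budget_ms=1000, max_flash_steps=3):
    """Pick a model line for one request (thresholds are illustrative)."""
    # Sub-second interactions go to Flash, the low-latency line.
    if task.latency_budget_ms < realtime_budget_ms:
        return FLASH
    # Long-horizon planning or quality-critical output goes to Pro.
    if task.needs_high_fidelity or task.planning_steps > max_flash_steps:
        return PRO
    # Everything else defaults to the cheaper, faster path.
    return FLASH


chat_turn = pick_tier(Task(latency_budget_ms=300, planning_steps=1))
asset_job = pick_tier(Task(latency_budget_ms=60000, planning_steps=8, needs_high_fidelity=True))
```

In this rule the latency budget takes priority, so a request that needs sub-second feedback stays on Flash even if it also asks for high fidelity. A production router would likely weigh those goals against each other instead.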

Build with Seele02

Looking for a native foundation for next-generation multimodal experiences? Let products and Agents think, generate, and act inside one unified MoT framework.