RESULTS

Model results, measured by validation.

Ranked by Pass@1 on GameDevBench. Each score reflects one final effective attempt per task, evaluated by hidden Godot validation scripts.

LEADERBOARD

GameDevBench Pass@1

SEELE02 rows use the final effective validation results from the Seele report. External rows follow the GameDevBench README-style leaderboard reference and list Pass@1 only, without pass, fail, or total counts.

Rank  Model                   Harness      Feedback             Pass  Fail  Total  Pass@1
1     SEELE02-pro             Seele Claw   Final effective run  209   124   333    62.8%
2     SEELE02-flash           Seele Claw   Final effective run  183   150   333    55.0%
3     gemini-3-pro-preview    Gemini CLI   Screenshot + Video   -     -     -      53.8%
4     gpt-5.4                 Codex        Screenshot + Video   -     -     -      52.0%
5     gemini-3-flash-preview  Gemini CLI   Video                -     -     -      46.9%
6     gpt-5.4-mini            Codex        Video                -     -     -      43.2%
7     gpt-5.4-mini            OpenHands    Baseline             -     -     -      38.4%
8     claude-sonnet-4-5       Claude Code  Screenshot + Video   -     -     -      34.8%
9     gemini-3-flash-preview  OpenHands    Screenshot + Video   -     -     -      31.8%
10    kimi-k2.5               OpenHands    Screenshot + Video   -     -     -      20.7%
11    claude-haiku-4-5        Claude Code  Video                -     -     -      18.6%
12    claude-haiku-4-5        OpenHands    Screenshot + Video   -     -     -      17.7%
13    qwen3.5-397b            OpenHands    Baseline             -     -     -      5.4%

WHAT IS GAMEDEVBENCH?

A real game-development benchmark, not a code-only test.

GameDevBench evaluates agents on real Godot projects derived from web and video tutorials. Tasks require edits across scenes, scripts, UI, physics, shaders, TileSets, particles, resources, and runtime behavior. A submission passes only when Godot validation scripts say it passes.
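
The pass/fail rule above can be pictured as a small harness that runs a validation script headlessly and trusts only its outcome. The following is a minimal sketch, not the benchmark's actual runner: the function names, the script file name, and the convention that exit code 0 means "pass" are all assumptions made for illustration. The `--headless`, `--path`, and `--script` flags are standard Godot 4 command-line options.

```python
import subprocess


def build_validation_command(godot_bin, project_dir, script_path):
    """Build a headless Godot invocation for one validation script.

    Hypothetical helper: the real benchmark's invocation is not documented here.
    """
    return [
        godot_bin,
        "--headless",            # run without opening a window
        "--path", str(project_dir),
        "--script", str(script_path),
    ]


def run_validation(command, timeout_s=300):
    """Return True only if the command exits with code 0.

    Assumption: the validation script signals success via exit code 0.
    A timeout or a missing binary counts as a failure, never as a pass.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0
```

A call such as `run_validation(build_validation_command("godot", "task_0002", "validate.gd"))` would then yield a single boolean per task, independent of anything the model claims about its own work.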

333 real Godot tasks
4 task categories
4.7 files changed on average
114 lines changed on average

Task taxonomy

Official categories cover 2D Graphics & Animation, 3D Graphics & Animation, User Interface, and Gameplay Logic.

Scoring protocol

Pass@1 counts one final effective attempt per task. Model self-reporting is not used as evidence of success.
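
With one attempt per task, Pass@1 reduces to the fraction of tasks that passed. The sketch below reproduces the two SEELE02 percentages from the pass and total counts on the leaderboard; the function name `pass_at_1` is an illustrative choice, not part of any benchmark tooling.

```python
def pass_at_1(passed: int, total: int) -> float:
    """Fraction of tasks passed when each task gets exactly one scored attempt."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return passed / total


# Pass and total counts taken from the leaderboard rows.
rows = {"SEELE02-pro": (209, 333), "SEELE02-flash": (183, 333)}
for model, (passed, total) in rows.items():
    print(f"{model}: {pass_at_1(passed, total):.1%}")
# SEELE02-pro: 62.8%
# SEELE02-flash: 55.0%
```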

EXAMPLE TASKS

What the benchmark actually asks agents to do.

These examples are condensed from real GameDevBench task_config instructions. They preserve the substance of each task in a form that is readable on the page.

Gameplay logic (task_0002)

Projectile quest progression

Update projectile behavior so enemy hits advance a QuestManager kill step, remove the projectile on impact, and clean it up after it travels off screen.

Animation / runtime state (task_0009)

Unsafe platform warning cycle

Clone a SafePlatform into an UnsafePlatform with an AnimationPlayer that shifts warning colors and disables Area2D processing during the red danger window.

UI scene construction (task_0020)

Menus with exact layout and nodes

Create a viewport-filling Control scene with launch, pause, and restart panels, exact node names, offsets, labels, buttons, font overrides, CanvasLayer, and script wiring.

Shader / material setup (task_0100)

HQX-style map smoothing

Update a Godot shader with border-smoothing uniforms and wire the required ShaderMaterial parameters on the MeshInstance3D while preserving its texture input.

TileSet resource editing (task_0180)

Waterfall tile render order

Modify TileSet metadata so six waterfall atlas tiles render above the player with a higher z-index and half-opacity modulation, without changing the map structure.

Visual geometry (task_0220)

Coin highlight placement

Place six tight semi-transparent Polygon2D circles over yellow star coins in a platformer image, minimizing spill outside each coin.
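
To give a sense of the geometry behind this last task: a Polygon2D has no built-in circle shape, so a circular highlight has to be approximated by a regular polygon. The sketch below computes such vertices in Python for illustration only. The center, radius, and segment count are hypothetical values, not taken from the task, and an actual solution would set these points on a Polygon2D node inside the Godot scene.

```python
import math


def circle_polygon(cx, cy, radius, segments=32):
    """Vertices of a regular polygon approximating a circle centered at (cx, cy)."""
    if segments < 3:
        raise ValueError("a polygon needs at least 3 segments")
    return [
        (
            cx + radius * math.cos(2 * math.pi * i / segments),
            cy + radius * math.sin(2 * math.pi * i / segments),
        )
        for i in range(segments)
    ]


# Hypothetical coin center and radius, in pixels.
points = circle_polygon(cx=120.0, cy=80.0, radius=14.0, segments=24)
```

More segments give a rounder outline. Keeping the highlight tight depends mainly on choosing a radius that does not exceed the coin's own, since every vertex lies exactly on the circle of that radius.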

SOURCES

Evidence and references