Picture a seven-year-old who just finished a chapter of Harry Potter. She closes the book, looks up at the ceiling, and says: "I want to go there."
Not watch it. Not read about it again. Go there.
That is the gap WorldGen is aimed at. Existing tools get part of the way there, but stop short: videos are passive, chat is text-only, image models make postcards, and full game production is far too heavy. Kids do not imagine in paragraphs. They imagine in places.
There is a deeper use case too. Safe, fantastical worlds can help children practice conversation, confidence, and emotional regulation with lower stakes. A child who struggles to look a classmate in the eye may first try it with a fictional wizard.
So the goal was not "AI generates a 3D screenshot." The goal was a world a child can describe in one sentence, enter quickly, talk to, shape, and return to. The technical approach follows from that: agentic models work over a structured scene format, deterministic systems validate and compile it, and the browser runtime turns it into a playable space with embodied characters.
That's WorldGen. Prompt → Scene Intent → Scene DSL → Compilation → Playable World → Persistent State.
WorldGen is a three-layer system:
A Python orchestrator runs a multi-role LLM loop against a typed Scene DSL:
- Director model parses user intent and proposes a candidate scene description — which world family, which layout modules, which atmospheric presets, which characters
- Deterministic validator enforces hard structural rules: connector compatibility, asset availability, overlap detection, walkability chains, and scene budget
- Critic model scores the draft against a formal rubric — prompt alignment, spatial legibility, NPC accessibility, atmosphere coherence — and proposes bounded revisions
- Director refines and the loop repeats up to 4 iterations, or until the quality threshold is met
- Compiler resolves the accepted spec into a deterministic `CompiledSceneManifest`: exact module transforms, prop placements, navmesh inputs, character spawn anchors
The models never touch runtime geometry directly. They operate on a constrained DSL, and the engine handles everything else.
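The loop above can be sketched roughly as follows. This is a minimal illustration, not the orchestrator's actual code: `plan_scene`, `Critique`, the callable signatures, and the `QUALITY_THRESHOLD` value are all assumptions made for the example. Only the four-iteration cap and the order of the steps come from the description.

```python
from dataclasses import dataclass
from typing import Callable

MAX_ITERATIONS = 4        # from the text: the loop repeats up to 4 times
QUALITY_THRESHOLD = 0.8   # assumed value; the source does not state the threshold

Draft = dict  # a Scene DSL draft, kept as a plain dict in this sketch


@dataclass
class Critique:
    score: float          # aggregate rubric score, assumed to lie in [0, 1]
    revisions: list[str]  # bounded revision suggestions for the director


def plan_scene(
    prompt: str,
    propose: Callable[[str, list[str]], Draft],
    validate: Callable[[Draft], list[str]],
    critique: Callable[[Draft], Critique],
) -> Draft:
    """Run propose -> validate -> critique until accepted or out of iterations."""
    feedback: list[str] = []
    best: tuple[Draft, float] | None = None
    for _ in range(MAX_ITERATIONS):
        draft = propose(prompt, feedback)
        errors = validate(draft)
        if errors:
            # Hard structural violations skip the critic and go back to the director.
            feedback = errors
            continue
        result = critique(draft)
        if best is None or result.score > best[1]:
            best = (draft, result.score)
        if result.score >= QUALITY_THRESHOLD:
            break
        feedback = result.revisions
    if best is None:
        raise ValueError("no structurally valid draft was produced")
    return best[0]
```

In this sketch the deterministic validator always runs before the critic, so a draft that breaks a hard rule never reaches scoring. If the threshold is never met, the best-scoring valid draft is returned.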
The planning pipeline is powered by three specialized agents, each with a distinct role:
| Agent | Default Model | Role |
|---|---|---|
| Director | MBZUAI-IFM/K2-Think-v2 | Parses user intent, proposes and refines Scene DSL drafts, runs the critic-refine loop |
| Builder | google/gemini-3.1-flash-lite-preview | Powers NPC dialogue, outputs structured actor turns (emotion, gesture, movement intent, memory write) |
| Seer | Qwen/Qwen3-VL-32B-Instruct (local, ASUS GPU) | Multimodal visual critic: inspects rendered scene previews and scores against the quality rubric before the scene loads |
All three agents route through the same OpenRouter-compatible API abstraction — swap any model via environment variable without touching orchestration logic. The Seer agent runs locally on the ASUS GPU (vLLM), keeping visual critique off the OpenRouter bill and sub-second for iteration loops.
The compiled manifest loads directly in the browser:
- React Three Fiber + three.js renders the full scene with physically-based materials, atmospheric fog, and dynamic lighting
- Rapier handles rigid-body physics and player collision
- recast-navigation-js generates a navmesh at runtime from the resolved scene geometry — no pre-baked assets, fully dynamic
- First-person and orbit camera modes with WASD movement and physics-driven character interactions
- NPC embodiment: characters pathfind toward the player, orient to face them, idle with body animation, and trigger dialogue on proximity
Characters are not chatbots floating over the scene. They are embodied:
- Generated avatars from a three-stage pipeline: text prompt → FLUX image generation → Meshy image-to-3D → facial rig post-processing → runtime GLB spawn
- Azure Speech TTS with neural voices and viseme-driven lip sync — blend shape events from Azure drive real-time facial animation frame by frame
- Azure STT for push-to-talk voice input
- Actor model (OpenRouter-routed) drives dialogue and returns structured JSON: spoken text, emotion tag, gesture, movement intent, and memory write
- Persistent character memory across turns — characters accumulate context, not just conversation history
User Prompt
│
▼
┌─────────────────────────────────────────────────────────┐
│ PYTHON ORCHESTRATOR (FastAPI) │
│ │
│ Director Model ──▶ Scene DSL Draft │
│ ▲ │ │
│ │ Deterministic Validator │
│ │ (schema, assets, │
│ │ connectors, walkability) │
│ │ │ │
│ Critic Model ◀──── Scene Quality Rubric │
│ (prompt alignment, spatial legibility, │
│ NPC accessibility, atmosphere coherence) │
│ │ │
│ └──── max 4 refinement iterations │
│ │ │
│ Scene Compiler │
│ (resolves transforms, anchors, │
│ navmesh inputs, spawn points) │
│ │ │
│ CompiledSceneManifest (JSON) │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ TYPESCRIPT BROWSER RUNTIME (Next.js) │
│ │
│ SceneCanvas (R3F) WorldRoot PlayerController │
│ NPCManager NavMesh PhysicsWorld │
│ DialogueOverlay VoiceInput SyncBridge (Convex) │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ CHARACTER GENERATION PIPELINE │
│ │
│ Prompt ──▶ FLUX Image ──▶ Meshy 3D ──▶ Facial Rig │
│ (FAL.ai) (image-to-3D) (blend shapes) │
│ │ │
│ Runtime GLB spawn + lip sync │
└─────────────────────────────────────────────────────────┘
This is the key insight. Models don't write code. They write structured world descriptions that the engine knows how to validate, compile, and render:
{
"scene_id": "uuid",
"world_family": "magical_forest",
"layout": {
"modules": [
{
"module_id": "forest_path_A",
"instance_id": "m1",
"transform": { "position": [0, 0, 0], "rotation": [0, 0, 0], "scale": [1, 1, 1] }
},
{
"module_id": "forest_clearing_B",
"instance_id": "m2",
"transform": { "position": [0, 0, -18], "rotation": [0, 1.57, 0], "scale": [1, 1, 1] }
}
],
"connections": [
{ "from": "m1.exit_north", "to": "m2.entry_south" }
]
},
"atmosphere": {
"skybox": "night_fog_soft",
"lighting": "moonlit_blue",
"fog_density": 0.25,
"ambient_audio": "forest_wind_low"
},
"characters": [
{
"character_archetype": "young_supportive_wizard",
"spawn_anchor": "m2.anchor_npc_01",
"behavior_profile": "follow_nearby",
"conversation_profile": "supportive_companion_v1"
}
]
}
The engine validates every field. If a module doesn't exist in the registry, the spec is rejected. If two modules overlap, the compiler catches it. If no path exists from the player spawn to the NPC, the critic flags it before the user ever sees it.
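To make the structural checks concrete, here is a small Python sketch of registry, connector, and overlap validation over a spec shaped like the one above. The real validator lives in the shared TypeScript packages. The `REGISTRY` contents, the footprint sizes, and `validate_scene` are invented for illustration, and the overlap test is a simplification that treats footprints as axis-aligned rectangles and ignores rotation.

```python
from itertools import combinations

# Hypothetical registry entries: connector names, NPC anchors, and XZ half-extents.
REGISTRY = {
    "forest_path_A": {
        "connectors": {"exit_north", "entry_south"},
        "anchors": set(),
        "half_extents": (4.0, 9.0),
    },
    "forest_clearing_B": {
        "connectors": {"entry_south"},
        "anchors": {"anchor_npc_01"},
        "half_extents": (8.0, 8.0),
    },
}


def validate_scene(spec: dict, registry: dict = REGISTRY) -> list[str]:
    """Return a list of rule violations; an empty list means the spec passed."""
    errors: list[str] = []
    instances: dict[str, dict] = {}
    for module in spec["layout"]["modules"]:
        if module["module_id"] not in registry:
            errors.append(f"unknown module: {module['module_id']}")
            continue
        instances[module["instance_id"]] = module

    def ref_ok(ref: str, kind: str) -> bool:
        # References look like "m1.exit_north": instance id, then a named point.
        instance_id, _, name = ref.partition(".")
        module = instances.get(instance_id)
        return module is not None and name in registry[module["module_id"]][kind]

    for conn in spec["layout"].get("connections", []):
        for end in (conn["from"], conn["to"]):
            if not ref_ok(end, "connectors"):
                errors.append(f"bad connector: {end}")

    for a, b in combinations(instances.values(), 2):
        ax, _, az = a["transform"]["position"]
        bx, _, bz = b["transform"]["position"]
        ahx, ahz = registry[a["module_id"]]["half_extents"]
        bhx, bhz = registry[b["module_id"]]["half_extents"]
        if abs(ax - bx) < ahx + bhx and abs(az - bz) < ahz + bhz:
            errors.append(f"overlap: {a['instance_id']}/{b['instance_id']}")

    for character in spec.get("characters", []):
        if not ref_ok(character["spawn_anchor"], "anchors"):
            errors.append(f"bad spawn anchor: {character['spawn_anchor']}")
    return errors
```

Returning a list of messages rather than raising on the first problem lets the director see every violation in a single refinement pass.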
Characters don't get free-form animation control. The actor model outputs semantic action tags from a controlled vocabulary:
| Action Tag | What It Does |
|---|---|
| `face_player` | Smooth orient rotation toward player |
| `take_small_step_closer` | Single short navmesh-guided step |
| `walk_to_anchor` | Full pathfinding to a named anchor |
| `listen_neutral` | Idle pose with attention signal |
| `talk_calm` | Talking body animation, calm variant |
| `talk_excited` | Talking body animation, high energy |
| `gesture_reassure` | Hand-raise reassurance gesture |
| `gesture_point` | Directional point toward object |
| `look_at_object` | Head/gaze orient to named prop anchor |
The runtime maps these to actual clips and motion logic. The model never controls bones. That's how characters look intentional rather than uncanny.
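A controlled vocabulary like this usually reduces to a lookup table plus a safe default. The sketch below uses the nine tags from the table. The clip names, `resolve_action`, and the choice of `listen_neutral` as the fallback for unknown tags are assumptions for illustration, and the actual runtime does this mapping in TypeScript.

```python
# Tag -> animation clip. Tags come from the table above; clip names are hypothetical.
ACTION_CLIPS = {
    "face_player": "turn_to_target",
    "take_small_step_closer": "step_forward_short",
    "walk_to_anchor": "walk_cycle",
    "listen_neutral": "idle_attentive",
    "talk_calm": "talk_calm",
    "talk_excited": "talk_excited",
    "gesture_reassure": "hand_raise_reassure",
    "gesture_point": "point_forward",
    "look_at_object": "head_look_at",
}

FALLBACK_TAG = "listen_neutral"  # assumed default when the model emits an unknown tag


def resolve_action(tag: str) -> tuple[str, str]:
    """Return (accepted_tag, clip_name), substituting the fallback for unknown tags."""
    if tag not in ACTION_CLIPS:
        tag = FALLBACK_TAG
    return tag, ACTION_CLIPS[tag]
```

With this shape, a model that invents a tag degrades to a neutral idle instead of producing undefined motion.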
Every NPC dialogue turn returns strict typed JSON:
{
"spoken_text": "Stay close. I heard it too, but it sounds farther away now.",
"emotion": "calm_supportive",
"gesture": "gesture_reassure",
"movement_intent": "face_player",
"memory_write": "Player became uneasy after hearing movement in the trees.",
"state_delta": {
"trust": 1,
"tension": -0.05
}
}
Spoken text routes directly to Azure Speech TTS. Viseme events from Azure drive blend-shape animation frame by frame. Emotion and gesture tags drive the animation state machine. Memory writes persist to Convex for future turns.
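As an illustration of what "strict typed JSON" can mean in practice, the following sketch parses one actor turn, rejects unexpected or mistyped fields, and applies `state_delta` to a running state. `ActorTurn`, `parse_actor_turn`, and `apply_state_delta` are hypothetical names, and treating `state_delta` as optional is an assumption. The field names match the example above.

```python
import json
from dataclasses import dataclass, field

REQUIRED_FIELDS = ("spoken_text", "emotion", "gesture", "movement_intent", "memory_write")


@dataclass
class ActorTurn:
    spoken_text: str
    emotion: str
    gesture: str
    movement_intent: str
    memory_write: str
    state_delta: dict[str, float] = field(default_factory=dict)


def parse_actor_turn(raw: str) -> ActorTurn:
    """Parse a model response, raising ValueError on any schema violation."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("actor turn must be a JSON object")
    unexpected = set(data) - set(REQUIRED_FIELDS) - {"state_delta"}
    if unexpected:
        raise ValueError(f"unexpected fields: {sorted(unexpected)}")
    for name in REQUIRED_FIELDS:
        if not isinstance(data.get(name), str):
            raise ValueError(f"missing or mistyped field: {name}")
    delta = data.get("state_delta", {})
    numeric = (int, float)
    if not isinstance(delta, dict) or not all(
        isinstance(v, numeric) and not isinstance(v, bool) for v in delta.values()
    ):
        raise ValueError("state_delta must map names to numbers")
    return ActorTurn(**{name: data[name] for name in REQUIRED_FIELDS}, state_delta=delta)


def apply_state_delta(state: dict[str, float], delta: dict[str, float]) -> dict[str, float]:
    """Return a new state with each delta added; keys absent from state start at 0."""
    return {key: state.get(key, 0) + delta.get(key, 0) for key in state.keys() | delta.keys()}
```

Rejecting unknown fields at this boundary is one way to keep the actor model from inventing new controls, in the same spirit as the scene DSL.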
- Node.js 20+, pnpm 9+
- Python 3.11+
- A Convex account (`npx convex dev` to initialize)
pnpm install
Create a .env.local in the repo root (and optionally in services/orchestrator/):
# ── LLM ──────────────────────────────────────────────────────
OPENROUTER_API_KEY= # openrouter.ai — routes Director, Builder, Critic
K2_THINK_API_KEY= # k2think.ai — Director agent (MBZUAI/K2-Think-v2)
K2_THINK_BASE_URL=https://api.k2think.ai/v1 # optional override
# ── Local Seer (ASUS GPU) ────────────────────────────────────
VLLM_SEER_BASE_URL= # your local vLLM endpoint (Qwen3-VL-32B)
VLLM_SEER_API_KEY= # set to "EMPTY" for local vLLM
VLLM_SEER_MODEL=Qwen/Qwen3-VL-32B-Instruct
# ── Voice ────────────────────────────────────────────────────
NEXT_PUBLIC_AZURE_SPEECH_KEY= # Azure Speech resource key
NEXT_PUBLIC_AZURE_SPEECH_REGION= # e.g. eastus
# ── Character pipeline ───────────────────────────────────────
FAL_KEY= # fal.ai — FLUX image generation
MESHY_API_KEY= # meshy.ai — image-to-3D
# ── Persistence ──────────────────────────────────────────────
NEXT_PUBLIC_CONVEX_URL= # from `npx convex dev`
Minimum to run locally: OPENROUTER_API_KEY + NEXT_PUBLIC_CONVEX_URL. Everything else degrades gracefully (voice falls back to mocked TTS, character pipeline falls back to placeholder avatars, Seer visual critique is skipped).
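The fallback rules can be read as a small decision table over the environment. The sketch below is one way to express it. `resolve_features` and the returned labels are hypothetical, and requiring both Azure variables for voice and both `FAL_KEY` and `MESHY_API_KEY` for generated avatars is an assumption about how the checks are grouped.

```python
from typing import Mapping

REQUIRED_VARS = ("OPENROUTER_API_KEY", "NEXT_PUBLIC_CONVEX_URL")


def resolve_features(env: Mapping[str, str]) -> dict[str, str]:
    """Decide which optional subsystems are live and which fall back."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required variables: {', '.join(missing)}")

    def has(*names: str) -> bool:
        return all(env.get(name) for name in names)

    return {
        # Voice needs both the key and the region; otherwise use mocked TTS.
        "voice": "azure" if has("NEXT_PUBLIC_AZURE_SPEECH_KEY", "NEXT_PUBLIC_AZURE_SPEECH_REGION") else "mock_tts",
        # The avatar pipeline needs image generation and image-to-3D together.
        "avatars": "generated" if has("FAL_KEY", "MESHY_API_KEY") else "placeholder",
        # Without a local endpoint, visual critique is skipped.
        "seer": "local_vllm" if has("VLLM_SEER_BASE_URL") else "skipped",
    }
```

For example, passing `os.environ` at startup would give a single place to log which subsystems are running in fallback mode.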
# Terminal 1 — Next.js web app
pnpm --filter @world-gen/web dev
# Terminal 2 — Python orchestrator (scene planning)
cd services/orchestrator
python -m uvicorn app.main:app --reload --port 8000
# Terminal 3 — Convex dev server
npx convex dev
App runs at http://localhost:3000.
- `/`: character runtime (talk to NPCs, voice I/O, 3D world)
- `/create`: prompt-to-world builder with live agent planning trace
- `/preview`: deterministic scene manifest preview shell
The Seer agent runs locally via vLLM. See services/vllm-feedback-service/ for the service wrapper. Set VLLM_SEER_BASE_URL to your vLLM endpoint and the scene feedback loop will route visual critique there instead of OpenRouter.
| Layer | Technology |
|---|---|
| Frontend | Next.js 16, React 19, TypeScript |
| 3D Runtime | React Three Fiber, three.js |
| Physics | @react-three/rapier |
| Pathfinding | recast-navigation-js (runtime navmesh) |
| Voice I/O | Azure Speech SDK (STT + TTS + visemes) |
| Persistence | Convex (real-time sync) |
| LLM Layer | OpenRouter (model-agnostic, OpenAI-compatible) |
| Orchestration | Python + FastAPI |
| Image Gen | FAL.ai FLUX |
| Image-to-3D | Meshy API |
| Facial Rigging | Reallusion Character Creator (local service) |
| State | zustand |
| Schema | Zod (shared TypeScript packages) |
| Monorepo | pnpm workspaces + Turborepo |
world-gen/
├── apps/
│ └── web/ ← Next.js + R3F client
│ ├── app/page.tsx ← Character runtime home
│ ├── app/create/page.tsx ← Prompt-to-world builder
│ ├── components/ ← Scene, NPC, player, UI components
│ └── lib/
│ ├── scene/ ← Director, critic, compiler, atmosphere
│ ├── character/ ← Archetypes, spawn plans, store
│ ├── character-pipeline/← Image gen → 3D → facial rig
│ ├── actor/ ← Dialogue context, LLM turn generation
│ ├── voice/ ← Azure Speech wrapper, viseme handling
│ ├── nav/ ← Navmesh integration, pathfinding
│ └── persistence/ ← Convex sync, world snapshots
│
├── packages/
│ ├── scene-schema/ ← Shared Zod DSL schemas (source of truth)
│ ├── asset-registry/ ← Module, prop, character, skybox catalogs
│ └── world-engine/ ← Deterministic validator + compiler
│
├── services/
│ ├── orchestrator/ ← Python FastAPI planning service
│ └── facial-rig-service/ ← GLB post-processing microservice
│
├── convex/ ← Real-time sync schema + mutations
│ ├── schema.ts ← worlds, sessions, characters, memories, events
│ ├── worlds.ts
│ └── characters.ts
│
└── content/
└── scenes/ ← Reference DSL fixtures
├── moonlit-forest-v0.json
├── castle-courtyard-v0.json
└── neon-city-plaza-v0.json
Four world families are currently supported, each with a catalog of connectable layout modules, prop anchors, atmosphere presets, and lighting rigs:
| World Family | Modules | Mood | Characters |
|---|---|---|---|
| `magical_forest` | Forest paths, clearings, grove circles | Night, tense-but-safe | Wizard companion, forest spirit |
| `castle_hallway` | Corridors, antechambers, throne rooms | Grand, ominous | Knight, court mage |
| `neon_city_plaza` | Street blocks, alleyways, overlooks | Cyberpunk, electric | Street vendor, hacker |
| `wizard_academy` | Classrooms, libraries, courtyards | Warm, scholarly | Student, professor |
Worlds persist via a dual model:
- Local snapshots (`localStorage`) for instant resume on reload, with no re-generation
- Convex real-time sync for cross-device persistence, session state, character memory, and world event logs
The Worlds panel (accessible in-game) lists all saved worlds per session. Each world can be resumed or permanently deleted. Deletion cascades through all associated records — character instances, memories, conversation turns, player states, and events.
Conversation history and character memory are persisted to Convex after every turn and fully restored on world resume. If you talk to character A, switch to B, then return to A — character A has the full prior conversation in context. History is keyed per character per world; switching worlds resets the active context to that world's stored history.
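The keying scheme described here can be shown with an in-memory stand-in for the Convex tables. `ConversationStore` and its methods are hypothetical and omit everything Convex provides (sync, persistence, auth). The point is only that history is addressed by the `(world_id, character_id)` pair and that deleting a world removes every record under it.

```python
from collections import defaultdict


class ConversationStore:
    """In-memory stand-in: turns are keyed by (world_id, character_id)."""

    def __init__(self) -> None:
        self._turns: dict[tuple[str, str], list[dict]] = defaultdict(list)

    def append(self, world_id: str, character_id: str, speaker: str, text: str) -> None:
        self._turns[(world_id, character_id)].append({"speaker": speaker, "text": text})

    def history(self, world_id: str, character_id: str) -> list[dict]:
        # .get avoids creating empty entries for characters never spoken to.
        return list(self._turns.get((world_id, character_id), []))

    def delete_world(self, world_id: str) -> None:
        # Cascade: drop every character's history that belongs to this world.
        for key in [k for k in self._turns if k[0] == world_id]:
            del self._turns[key]


store = ConversationStore()
store.append("forest", "wizard", "player", "Did you hear that?")
store.append("forest", "spirit", "player", "Hello?")
store.append("forest", "wizard", "npc", "Stay close.")
```

After talking to the wizard, then the spirit, then the wizard again, `store.history("forest", "wizard")` holds both wizard turns in order, while the same character id in a different world starts empty.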
Letting models operate on structure, not code. The hardest constraint to enforce was keeping the planning loop from drifting into arbitrary code generation. Every time the DSL was too sparse, the models would try to "solve" it by inventing fields. Every DSL extension required updating the validator, compiler, and renderer simultaneously — schema discipline was the core engineering challenge.
Dynamic navmesh at runtime. Scenes are composed from modules at runtime, not pre-authored in a game editor. That means no pre-baked navmesh. recast-navigation-js solves this, but merging walkable surfaces across dynamically-placed modules required a geometry resolution pass before any pathfinding queries could run.
Viseme-driven lip sync on generated avatars. Azure Speech produces per-phoneme blend shape weights in real time. But the generated character meshes from Meshy don't come with standard blend shape rigs — we built a facial rig post-processing microservice that attaches a speech-ready blend shape set to any incoming GLB before it reaches the browser.
Streamed agent progress without blocking. The planning loop has multiple LLM calls, tool invocations, and compilation steps. Users shouldn't stare at a spinner. The orchestrator streams structured progress events (planner: selecting modules, tool: validating scene, compiler: merging navmesh) that the frontend surfaces as a live agent trace panel.
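A generator is a natural fit for this kind of progress stream. The sketch below emits the three example events from the paragraph above as newline-delimited JSON. The wire format, the `stage` and `message` keys, the final `done` event, and the function names are assumptions, since the source does not specify how events are encoded.

```python
import json
from typing import Iterator


def progress_event(stage: str, message: str) -> str:
    """Encode one progress event as a single JSON line."""
    return json.dumps({"stage": stage, "message": message}) + "\n"


def run_planning(prompt: str) -> Iterator[str]:
    """Yield progress lines between (elided) planning steps."""
    yield progress_event("planner", "selecting modules")
    # ... director and critic calls would run here ...
    yield progress_event("tool", "validating scene")
    # ... deterministic validation and compilation would run here ...
    yield progress_event("compiler", "merging navmesh")
    yield progress_event("done", f"manifest ready for: {prompt}")


for line in run_planning("a moonlit forest"):
    event = json.loads(line)
    print(f"[{event['stage']}] {event['message']}")
```

In a FastAPI service, a generator like this could be wrapped in `fastapi.responses.StreamingResponse` so the client receives each line as soon as it is yielded. Whether the orchestrator does exactly that is not stated in the source.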
- A fully deterministic scene compiler that accepts only valid world descriptions — models can't hallucinate geometry
- Runtime navmesh generation across dynamically-composed module layouts
- End-to-end character pipeline: text prompt → image → 3D mesh → facial rig → talking NPC in one async flow
- Live agentic planning trace visible to the user — the agent swarm's reasoning is never a black box
- Viseme-driven lip sync on procedurally generated avatars, not hand-authored characters
- World persistence and instant resume without re-running the generation pipeline
- A scene critique loop that scores against a formal rubric and feeds back into the director — not vibe-based, but structured quality evaluation
- Unified flow: merge the `/create` scene builder and `/` character runtime into a single experience, so that when you enter a world the NPCs are already there
- Multiplayer: multiple users in the same Convex-backed world instance
- Vision-based critique: rendered screenshots passed to a multimodal seer model for visual quality feedback before the scene loads
- Cross-world continuity: characters remember events from prior sessions across different scenes
- Open world composition: dynamic module stitching as the player explores, not a fixed scene loaded upfront
Python · TypeScript · Next.js · React Three Fiber · three.js · Rapier · recast-navigation-js · Azure Speech SDK · OpenRouter · Convex · FAL.ai · Meshy · Zod · zustand · FastAPI · Turborepo · pnpm
The architecture is not "an LLM writes a 3D app." It's a runtime world engine where models operate over structured descriptions that the engine knows how to validate, compile, and render.