WorldGen: Prompt-Driven 3D Worlds with Agentic Scene Planning

Inspiration

Picture a seven-year-old who just finished a chapter of Harry Potter. She closes the book, looks up at the ceiling, and says: "I want to go there."

Not watch it. Not read about it again. Go there.

That is the gap WorldGen is aimed at. Existing tools get part of the way there, but stop short: videos are passive, chat is text-only, image models make postcards, and full game production is far too heavy. Kids do not imagine in paragraphs. They imagine in places.

There is a deeper use case too. Safe, fantastical worlds can help children practice conversation, confidence, and emotional regulation with lower stakes. A child who struggles to look a classmate in the eye may first try it with a fictional wizard.

So the goal was not "AI generates a 3D screenshot." The goal was a world a child can describe in one sentence, enter quickly, talk to, shape, and return to. The technical approach follows from that: agentic models work over a structured scene format, deterministic systems validate and compile it, and the browser runtime turns it into a playable space with embodied characters.

That's WorldGen. Prompt → Scene Intent → Scene DSL → Compilation → Playable World → Persistent State.

What It Does

WorldGen is a three-layer system:

1. Agentic Scene Planning Pipeline

A Python orchestrator runs a multi-role LLM loop against a typed Scene DSL:

Director model parses user intent and proposes a candidate scene description — which world family, which layout modules, which atmospheric presets, which characters
Deterministic validator enforces hard structural rules: connector compatibility, asset availability, overlap detection, walkability chains, and scene budget
Critic model scores the draft against a formal rubric — prompt alignment, spatial legibility, NPC accessibility, atmosphere coherence — and proposes bounded revisions
Director refines and the loop repeats up to 4 iterations, or until the quality threshold is met
Compiler resolves the accepted spec into a deterministic CompiledSceneManifest: exact module transforms, prop placements, navmesh inputs, character spawn anchors

The models never touch runtime geometry directly. They operate on a constrained DSL, and the engine handles everything else.
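
The loop above can be sketched in a few lines of Python. This is a minimal illustration rather than the orchestrator's actual code: `plan_scene`, `Critique`, the three callables, and the `QUALITY_THRESHOLD` value of 0.8 are assumptions made for the example. Only the four-iteration cap comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_ITERATIONS = 4        # "up to 4 iterations" from the pipeline description
QUALITY_THRESHOLD = 0.8   # assumed value; the README does not state the threshold


@dataclass
class Critique:
    score: float                                  # aggregate rubric score in [0, 1]
    revisions: list[str] = field(default_factory=list)


def plan_scene(
    prompt: str,
    direct: Callable[[str, list[str]], dict],     # Director: prompt + feedback -> DSL draft
    validate: Callable[[dict], list[str]],        # Validator: draft -> hard-rule errors
    critique: Callable[[dict, str], Critique],    # Critic: draft + prompt -> score + revisions
) -> dict:
    feedback: list[str] = []
    best: tuple[float, dict] | None = None
    for _ in range(MAX_ITERATIONS):
        draft = direct(prompt, feedback)
        errors = validate(draft)
        if errors:
            # Hard-rule failures go straight back to the Director; the Critic never sees them.
            feedback = errors
            continue
        review = critique(draft, prompt)
        if best is None or review.score > best[0]:
            best = (review.score, draft)
        if review.score >= QUALITY_THRESHOLD:
            break
        feedback = review.revisions
    if best is None:
        raise ValueError("no draft passed validation within the iteration budget")
    return best[1]
```

In this sketch a failed validation consumes one iteration, and the highest-scoring valid draft is returned if the threshold is never reached. Both are design choices of the example, not confirmed behavior.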

Agent Swarm

The planning pipeline is powered by three specialized agents, each with a distinct role:

| Agent | Default Model | Role |
| --- | --- | --- |
| Director | `MBZUAI-IFM/K2-Think-v2` | Parses user intent, proposes and refines Scene DSL drafts, runs the critic-refine loop |
| Builder | `google/gemini-3.1-flash-lite-preview` | Powers NPC dialogue, outputs structured actor turns (emotion, gesture, movement intent, memory write) |
| Seer | `Qwen/Qwen3-VL-32B-Instruct` (local, ASUS GPU) | Multimodal visual critic: inspects rendered scene previews and scores them against the quality rubric before the scene loads |

All three agents sit behind the same OpenAI-compatible API abstraction, so any model can be swapped via environment variable without touching orchestration logic. The Seer agent runs locally on the ASUS GPU through vLLM, which keeps visual critique off the OpenRouter bill and sub-second for iteration loops.

2. Real-Time 3D Browser Runtime

The compiled manifest loads directly in the browser:

React Three Fiber + three.js renders the full scene with physically-based materials, atmospheric fog, and dynamic lighting
Rapier handles rigid-body physics and player collision
recast-navigation-js generates a navmesh at runtime from the resolved scene geometry — no pre-baked assets, fully dynamic
First-person and orbit camera modes with WASD movement and physics-driven character interactions
NPC embodiment: characters pathfind toward the player, orient to face them, idle with body animation, and trigger dialogue on proximity

3. Embodied Character Runtime

Characters are not chatbots floating over the scene. They are embodied:

Generated avatars from a three-stage pipeline (FLUX image generation → Meshy image-to-3D → facial rig post-processing) that turns a text prompt into a GLB spawned at runtime
Azure Speech TTS with neural voices and viseme-driven lip sync — blend shape events from Azure drive real-time facial animation frame by frame
Azure STT for push-to-talk voice input
Actor model (OpenRouter-routed) drives dialogue and returns structured JSON: spoken text, emotion tag, gesture, movement intent, and memory write
Persistent character memory across turns — characters accumulate context, not just conversation history

Architecture

User Prompt
    │
    ▼
┌─────────────────────────────────────────────────────────┐
│              PYTHON ORCHESTRATOR (FastAPI)               │
│                                                         │
│  Director Model ──▶ Scene DSL Draft                     │
│       ▲                   │                             │
│       │            Deterministic Validator              │
│       │              (schema, assets,                   │
│       │           connectors, walkability)              │
│       │                   │                             │
│  Critic Model ◀──── Scene Quality Rubric                │
│  (prompt alignment, spatial legibility,                 │
│   NPC accessibility, atmosphere coherence)              │
│       │                                                 │
│       └──── max 4 refinement iterations                 │
│                          │                              │
│                    Scene Compiler                       │
│             (resolves transforms, anchors,              │
│              navmesh inputs, spawn points)              │
│                          │                              │
│               CompiledSceneManifest (JSON)              │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│           TYPESCRIPT BROWSER RUNTIME (Next.js)          │
│                                                         │
│  SceneCanvas (R3F)  WorldRoot  PlayerController         │
│  NPCManager         NavMesh    PhysicsWorld              │
│  DialogueOverlay    VoiceInput SyncBridge (Convex)      │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│              CHARACTER GENERATION PIPELINE              │
│                                                         │
│  Prompt ──▶ FLUX Image ──▶ Meshy 3D ──▶ Facial Rig     │
│               (FAL.ai)    (image-to-3D)  (blend shapes) │
│                                │                        │
│                    Runtime GLB spawn + lip sync         │
└─────────────────────────────────────────────────────────┘

The Scene DSL

This is the key insight. Models don't write code. They write structured world descriptions that the engine knows how to validate, compile, and render:

{
  "scene_id": "uuid",
  "world_family": "magical_forest",
  "layout": {
    "modules": [
      {
        "module_id": "forest_path_A",
        "instance_id": "m1",
        "transform": { "position": [0, 0, 0], "rotation": [0, 0, 0], "scale": [1, 1, 1] }
      },
      {
        "module_id": "forest_clearing_B",
        "instance_id": "m2",
        "transform": { "position": [0, 0, -18], "rotation": [0, 1.57, 0], "scale": [1, 1, 1] }
      }
    ],
    "connections": [
      { "from": "m1.exit_north", "to": "m2.entry_south" }
    ]
  },
  "atmosphere": {
    "skybox": "night_fog_soft",
    "lighting": "moonlit_blue",
    "fog_density": 0.25,
    "ambient_audio": "forest_wind_low"
  },
  "characters": [
    {
      "character_archetype": "young_supportive_wizard",
      "spawn_anchor": "m2.anchor_npc_01",
      "behavior_profile": "follow_nearby",
      "conversation_profile": "supportive_companion_v1"
    }
  ]
}

The engine validates every field. If a module doesn't exist in the registry, the spec is rejected. If two modules overlap, the deterministic validator catches it. If no path exists from the player spawn to the NPC, the draft is flagged before the user ever sees it.
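
To make those checks concrete, here is a small sketch of three of them: registry lookup, connector existence, and spawn-to-NPC reachability over the connection graph. It is an illustration only. `MODULE_CONNECTORS`, `validate_scene`, and the `player_spawn_instance` parameter are hypothetical, the registry entries cover just the sample scene above, and overlap detection is left out.

```python
from collections import deque

# Hypothetical registry: module_id -> connector names it exposes.
# These entries only cover the sample scene shown above.
MODULE_CONNECTORS = {
    "forest_path_A": {"entry_south", "exit_north"},
    "forest_clearing_B": {"entry_south"},
}


def validate_scene(scene: dict, player_spawn_instance: str) -> list[str]:
    """Return hard-rule violations; an empty list means the draft passes."""
    errors: list[str] = []
    instances = {m["instance_id"]: m["module_id"] for m in scene["layout"]["modules"]}

    # Rule 1: every module must exist in the registry.
    for instance_id, module_id in instances.items():
        if module_id not in MODULE_CONNECTORS:
            errors.append(f"unknown module {module_id!r} on instance {instance_id!r}")

    # Rule 2: every connection must join connectors that exist on both modules.
    neighbors: dict[str, set[str]] = {iid: set() for iid in instances}
    for conn in scene["layout"]["connections"]:
        ends = []
        for ref in (conn["from"], conn["to"]):
            instance_id, _, connector = ref.partition(".")
            known = MODULE_CONNECTORS.get(instances.get(instance_id, ""), set())
            if connector in known:
                ends.append(instance_id)
            else:
                errors.append(f"unknown connector {ref!r}")
        if len(ends) == 2:
            neighbors[ends[0]].add(ends[1])
            neighbors[ends[1]].add(ends[0])

    # Rule 3: every NPC anchor must sit in a module reachable from the player spawn.
    reachable = {player_spawn_instance} if player_spawn_instance in neighbors else set()
    queue = deque(reachable)
    while queue:
        for nxt in neighbors[queue.popleft()] - reachable:
            reachable.add(nxt)
            queue.append(nxt)
    for character in scene["characters"]:
        instance_id = character["spawn_anchor"].partition(".")[0]
        if instance_id not in reachable:
            errors.append(f"NPC anchor {character['spawn_anchor']!r} is unreachable")
    return errors
```

Returning a list of messages rather than raising on the first failure lets the Director see every violation in a single refinement pass.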

Character Action Vocabulary

Characters don't get free-form animation control. The actor model outputs semantic action tags from a controlled vocabulary:

| Action Tag | What It Does |
| --- | --- |
| `face_player` | Smooth orient rotation toward player |
| `take_small_step_closer` | Single short navmesh-guided step |
| `walk_to_anchor` | Full pathfinding to a named anchor |
| `listen_neutral` | Idle pose with attention signal |
| `talk_calm` | Talking body animation, calm variant |
| `talk_excited` | Talking body animation, high energy |
| `gesture_reassure` | Hand-raise reassurance gesture |
| `gesture_point` | Directional point toward object |
| `look_at_object` | Head/gaze orient to named prop anchor |

The runtime maps these to actual clips and motion logic. The model never controls bones. That's how characters look intentional rather than uncanny.
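
A controlled vocabulary like this is easy to enforce with an enum and a lookup table. The sketch below is written in Python for consistency with the other examples, although the real mapping lives in the browser runtime. The clip names, the motion/clip split, and the fallback to `listen_neutral` for unknown tags are assumptions.

```python
from enum import Enum


class ActionTag(str, Enum):
    FACE_PLAYER = "face_player"
    TAKE_SMALL_STEP_CLOSER = "take_small_step_closer"
    WALK_TO_ANCHOR = "walk_to_anchor"
    LISTEN_NEUTRAL = "listen_neutral"
    TALK_CALM = "talk_calm"
    TALK_EXCITED = "talk_excited"
    GESTURE_REASSURE = "gesture_reassure"
    GESTURE_POINT = "gesture_point"
    LOOK_AT_OBJECT = "look_at_object"


# Tags handled by motion logic (rotation, pathfinding, gaze) rather than a body clip.
MOTION_TAGS = {
    ActionTag.FACE_PLAYER,
    ActionTag.TAKE_SMALL_STEP_CLOSER,
    ActionTag.WALK_TO_ANCHOR,
    ActionTag.LOOK_AT_OBJECT,
}

# Hypothetical clip names; the real animation assets are not listed in this README.
CLIP_FOR_TAG = {
    ActionTag.LISTEN_NEUTRAL: "idle_attentive",
    ActionTag.TALK_CALM: "talk_calm_loop",
    ActionTag.TALK_EXCITED: "talk_excited_loop",
    ActionTag.GESTURE_REASSURE: "gesture_hand_raise",
    ActionTag.GESTURE_POINT: "gesture_point_forward",
}


def resolve_action(raw_tag: str) -> tuple[str, str]:
    """Map a model-emitted tag to ("motion" | "clip", target); unknown tags fall back to listening."""
    try:
        tag = ActionTag(raw_tag)
    except ValueError:
        tag = ActionTag.LISTEN_NEUTRAL
    if tag in MOTION_TAGS:
        return ("motion", tag.value)
    return ("clip", CLIP_FOR_TAG[tag])
```

An invented tag such as `do_a_backflip` degrades to a neutral listening pose instead of reaching the animation system.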

Actor Model Contract

Every NPC dialogue turn returns strict typed JSON:

{
  "spoken_text": "Stay close. I heard it too, but it sounds farther away now.",
  "emotion": "calm_supportive",
  "gesture": "gesture_reassure",
  "movement_intent": "face_player",
  "memory_write": "Player became uneasy after hearing movement in the trees.",
  "state_delta": {
    "trust": 1,
    "tension": -0.05
  }
}

Spoken text routes directly to Azure Speech TTS. Viseme events from Azure drive blend-shape animation frame by frame. Emotion and gesture tags drive the animation state machine. Memory writes persist to Convex for future turns.
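
One way to enforce a contract like this on the receiving side is to parse the model output into a typed record and reject anything that does not fit. The following is a sketch under assumptions: `ActorTurn`, `parse_actor_turn`, and `apply_state_delta` are hypothetical names, and treating every field as required is a guess based on the example turn above.

```python
import json
from dataclasses import dataclass

# Expected top-level fields and their JSON types, taken from the example turn above.
REQUIRED_FIELDS = {
    "spoken_text": str,
    "emotion": str,
    "gesture": str,
    "movement_intent": str,
    "memory_write": str,
    "state_delta": dict,
}


@dataclass(frozen=True)
class ActorTurn:
    spoken_text: str
    emotion: str
    gesture: str
    movement_intent: str
    memory_write: str
    state_delta: dict[str, float]


def parse_actor_turn(raw: str) -> ActorTurn:
    """Parse one model response; raise ValueError if it breaks the contract."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("actor turn must be a JSON object")
    for key, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(key), expected):
            raise ValueError(f"field {key!r} is missing or not a {expected.__name__}")
    for name, value in data["state_delta"].items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"state_delta[{name!r}] must be numeric")
    return ActorTurn(**{key: data[key] for key in REQUIRED_FIELDS})


def apply_state_delta(state: dict[str, float], delta: dict[str, float]) -> dict[str, float]:
    """Return a new state with each delta added; keys missing from state start at 0."""
    return {key: state.get(key, 0) + delta.get(key, 0) for key in state.keys() | delta.keys()}
```

Applied to the example turn, a character with `trust` 2 and `tension` 0.5 would end the turn at `trust` 3 and `tension` roughly 0.45.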

Getting Started

Prerequisites

Node.js 20+, pnpm 9+
Python 3.11+
A Convex account (npx convex dev to initialize)

Install

pnpm install

Environment Variables

Create a .env.local in the repo root (and optionally in services/orchestrator/):

# ── LLM ──────────────────────────────────────────────────────
OPENROUTER_API_KEY=           # openrouter.ai — routes Director, Builder, Critic
K2_THINK_API_KEY=             # k2think.ai - Director agent (MBZUAI-IFM/K2-Think-v2)
K2_THINK_BASE_URL=https://api.k2think.ai/v1   # optional override

# ── Local Seer (ASUS GPU) ────────────────────────────────────
VLLM_SEER_BASE_URL=           # your local vLLM endpoint (Qwen3-VL-32B)
VLLM_SEER_API_KEY=            # set to "EMPTY" for local vLLM
VLLM_SEER_MODEL=Qwen/Qwen3-VL-32B-Instruct

# ── Voice ────────────────────────────────────────────────────
NEXT_PUBLIC_AZURE_SPEECH_KEY=     # Azure Speech resource key
NEXT_PUBLIC_AZURE_SPEECH_REGION=  # e.g. eastus

# ── Character pipeline ───────────────────────────────────────
FAL_KEY=                      # fal.ai — FLUX image generation
MESHY_API_KEY=                # meshy.ai — image-to-3D

# ── Persistence ──────────────────────────────────────────────
NEXT_PUBLIC_CONVEX_URL=       # from `npx convex dev`

Minimum to run locally: OPENROUTER_API_KEY + NEXT_PUBLIC_CONVEX_URL. Everything else degrades gracefully (voice falls back to mocked TTS, character pipeline falls back to placeholder avatars, Seer visual critique is skipped).

Run

# Terminal 1 — Next.js web app
pnpm --filter @world-gen/web dev

# Terminal 2 — Python orchestrator (scene planning)
cd services/orchestrator
python -m uvicorn app.main:app --reload --port 8000

# Terminal 3 — Convex dev server
npx convex dev

App runs at http://localhost:3000.

/ — character runtime (talk to NPCs, voice I/O, 3D world)
/create — prompt-to-world builder with live agent planning trace
/preview — deterministic scene manifest preview shell

Local Seer (Qwen3-VL-32B on ASUS GPU)

The Seer agent runs locally via vLLM. See services/vllm-feedback-service/ for the service wrapper. Set VLLM_SEER_BASE_URL to your vLLM endpoint and the scene feedback loop will route visual critique there instead of OpenRouter.

Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | Next.js 16, React 19, TypeScript |
| 3D Runtime | React Three Fiber, three.js |
| Physics | @react-three/rapier |
| Pathfinding | recast-navigation-js (runtime navmesh) |
| Voice I/O | Azure Speech SDK (STT + TTS + visemes) |
| Persistence | Convex (real-time sync) |
| LLM Layer | OpenRouter (model-agnostic, OpenAI-compatible) |
| Orchestration | Python + FastAPI |
| Image Gen | FAL.ai FLUX |
| Image-to-3D | Meshy API |
| Facial Rigging | Reallusion Character Creator (local service) |
| State | zustand |
| Schema | Zod (shared TypeScript packages) |
| Monorepo | pnpm workspaces + Turborepo |

Repository Structure

world-gen/
├── apps/
│   └── web/                      ← Next.js + R3F client
│       ├── app/page.tsx           ← Character runtime home
│       ├── app/create/page.tsx    ← Prompt-to-world builder
│       ├── components/            ← Scene, NPC, player, UI components
│       └── lib/
│           ├── scene/             ← Director, critic, compiler, atmosphere
│           ├── character/         ← Archetypes, spawn plans, store
│           ├── character-pipeline/← Image gen → 3D → facial rig
│           ├── actor/             ← Dialogue context, LLM turn generation
│           ├── voice/             ← Azure Speech wrapper, viseme handling
│           ├── nav/               ← Navmesh integration, pathfinding
│           └── persistence/       ← Convex sync, world snapshots
│
├── packages/
│   ├── scene-schema/              ← Shared Zod DSL schemas (source of truth)
│   ├── asset-registry/            ← Module, prop, character, skybox catalogs
│   └── world-engine/              ← Deterministic validator + compiler
│
├── services/
│   ├── orchestrator/              ← Python FastAPI planning service
│   └── facial-rig-service/        ← GLB post-processing microservice
│
├── convex/                        ← Real-time sync schema + mutations
│   ├── schema.ts                  ← worlds, sessions, characters, memories, events
│   ├── worlds.ts
│   └── characters.ts
│
└── content/
    └── scenes/                    ← Reference DSL fixtures
        ├── moonlit-forest-v0.json
        ├── castle-courtyard-v0.json
        └── neon-city-plaza-v0.json

World Families

Four world families are currently supported, each with a catalog of connectable layout modules, prop anchors, atmosphere presets, and lighting rigs:

| World Family | Modules | Mood | Characters |
| --- | --- | --- | --- |
| `magical_forest` | Forest paths, clearings, grove circles | Night, tense-but-safe | Wizard companion, forest spirit |
| `castle_hallway` | Corridors, antechambers, throne rooms | Grand, ominous | Knight, court mage |
| `neon_city_plaza` | Street blocks, alleyways, overlooks | Cyberpunk, electric | Street vendor, hacker |
| `wizard_academy` | Classrooms, libraries, courtyards | Warm, scholarly | Student, professor |

Persistence Model

Worlds persist via a dual model:

Local snapshots (localStorage) for instant resume on reload — no re-generation
Convex real-time sync for cross-device persistence, session state, character memory, and world event logs

Multi-world support

The Worlds panel (accessible in-game) lists all saved worlds per session. Each world can be resumed or permanently deleted. Deletion cascades through all associated records — character instances, memories, conversation turns, player states, and events.

Character conversation continuity

Conversation history and character memory are persisted to Convex after every turn and fully restored on world resume. If you talk to character A, switch to B, then return to A — character A has the full prior conversation in context. History is keyed per character per world; switching worlds resets the active context to that world's stored history.
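
The keying and cascade behavior described in this section can be modeled with an in-memory stand-in. The real store is Convex; `ConversationStore` and its method names are hypothetical and only illustrate the per-character, per-world key and the cascading delete.

```python
from collections import defaultdict


class ConversationStore:
    """In-memory stand-in for the persisted history, keyed by (world_id, character_id)."""

    def __init__(self) -> None:
        self._turns: dict[tuple[str, str], list[dict]] = defaultdict(list)

    def append_turn(self, world_id: str, character_id: str, speaker: str, text: str) -> None:
        self._turns[(world_id, character_id)].append({"speaker": speaker, "text": text})

    def history(self, world_id: str, character_id: str) -> list[dict]:
        # Return a copy so callers cannot mutate stored history by accident.
        return list(self._turns.get((world_id, character_id), []))

    def delete_world(self, world_id: str) -> int:
        """Remove every conversation belonging to a world; return how many were dropped."""
        doomed = [key for key in self._turns if key[0] == world_id]
        for key in doomed:
            del self._turns[key]
        return len(doomed)
```

Talking to character A, then B, then A again appends to two separate lists, so A's context is intact on return, and the same character ID in a different world starts from that world's own history.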

Challenges

Letting models operate on structure, not code. The hardest constraint to enforce was keeping the planning loop from drifting into arbitrary code generation. Every time the DSL was too sparse, the models would try to "solve" it by inventing fields. Every DSL extension required updating the validator, compiler, and renderer simultaneously — schema discipline was the core engineering challenge.

Dynamic navmesh at runtime. Scenes are composed from modules at runtime, not pre-authored in a game editor. That means no pre-baked navmesh. recast-navigation-js solves this, but merging walkable surfaces across dynamically-placed modules required a geometry resolution pass before any pathfinding queries could run.
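
The geometry resolution pass amounts to moving each module's local mesh into world space and concatenating the results into one vertex and index list, the kind of input a navmesh generator consumes. The sketch below is in Python for consistency with the other examples, although the real pass runs in the browser. It assumes the DSL `rotation` is Euler angles in radians, applies only the Y component (yaw), and uses scale, then rotate, then translate order; `to_world` and `merge_meshes` are hypothetical names.

```python
import math

Vec3 = tuple[float, float, float]
Tri = tuple[int, int, int]


def to_world(vertices: list[Vec3], transform: dict) -> list[Vec3]:
    """Apply scale, then yaw rotation about +Y, then translation to local-space vertices."""
    px, py, pz = transform["position"]
    _, yaw, _ = transform["rotation"]          # assumption: only yaw is non-zero
    sx, sy, sz = transform["scale"]
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    world = []
    for x, y, z in vertices:
        x, y, z = x * sx, y * sy, z * sz
        world.append((cos_y * x + sin_y * z + px, y + py, -sin_y * x + cos_y * z + pz))
    return world


def merge_meshes(modules: list[tuple[list[Vec3], list[Tri], dict]]) -> tuple[list[Vec3], list[Tri]]:
    """Concatenate per-module walkable meshes into one world-space mesh."""
    all_vertices: list[Vec3] = []
    all_triangles: list[Tri] = []
    for vertices, triangles, transform in modules:
        offset = len(all_vertices)             # indices shift by the vertices already merged
        all_vertices.extend(to_world(vertices, transform))
        all_triangles.extend((a + offset, b + offset, c + offset) for a, b, c in triangles)
    return all_vertices, all_triangles
```

Forgetting the index offset is the usual bug in a pass like this: every module's triangles would then point at the first module's vertices.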

Viseme-driven lip sync on generated avatars. Azure Speech produces per-phoneme blend shape weights in real time. But the generated character meshes from Meshy don't come with standard blend shape rigs — we built a facial rig post-processing microservice that attaches a speech-ready blend shape set to any incoming GLB before it reaches the browser.

Streamed agent progress without blocking. The planning loop has multiple LLM calls, tool invocations, and compilation steps. Users shouldn't stare at a spinner. The orchestrator streams structured progress events (planner: selecting modules, tool: validating scene, compiler: merging navmesh) that the frontend surfaces as a live agent trace panel.
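
A generator is a natural fit for this kind of streaming. The sketch below emits a frame before and after each step so the trace panel can update while work is in flight. It assumes Server-Sent Events framing, which the README does not confirm; `to_sse`, `run_with_progress`, and the event fields are hypothetical.

```python
import json
from typing import Callable, Iterable, Iterator


def to_sse(event: dict) -> str:
    """Encode one progress event as a Server-Sent Events frame."""
    return f"event: {event['stage']}\ndata: {json.dumps(event)}\n\n"


def run_with_progress(steps: Iterable[tuple[str, str, Callable[[], object]]]) -> Iterator[str]:
    """Run each (stage, message, work) step, yielding a frame before and after it."""
    for stage, message, work in steps:
        yield to_sse({"stage": stage, "message": message, "status": "started"})
        work()
        yield to_sse({"stage": stage, "message": message, "status": "done"})


if __name__ == "__main__":
    demo_steps = [
        ("planner", "selecting modules", lambda: None),
        ("tool", "validating scene", lambda: None),
        ("compiler", "merging navmesh", lambda: None),
    ]
    for frame in run_with_progress(demo_steps):
        print(frame, end="")
```

Because the generator is lazy, the "started" frame for a step reaches the client before that step's work begins. In a FastAPI service such a generator could be wrapped in a streaming response, though whether the orchestrator does so is not stated here.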

What We're Proud Of

A fully deterministic scene compiler that accepts only valid world descriptions — models can't hallucinate geometry
Runtime navmesh generation across dynamically-composed module layouts
End-to-end character pipeline: text prompt → image → 3D mesh → facial rig → talking NPC in one async flow
Live agentic planning trace visible to the user — the agent swarm's reasoning is never a black box
Viseme-driven lip sync on procedurally generated avatars, not hand-authored characters
World persistence and instant resume without re-running the generation pipeline
A scene critique loop that scores against a formal rubric and feeds back into the director — not vibe-based, but structured quality evaluation

What's Next

Unified flow: merge the /create scene builder and / character runtime into a single experience — enter a world, and the NPCs are already there
Multiplayer: multiple users in the same Convex-backed world instance
Vision-based critique: rendered screenshots passed to a multimodal seer model for visual quality feedback before the scene loads
Cross-world continuity: characters remember events from prior sessions across different scenes
Open world composition: dynamic module stitching as the player explores, not a fixed scene loaded upfront

Built With

Python · TypeScript · Next.js · React Three Fiber · three.js · Rapier · recast-navigation-js · Azure Speech SDK · OpenRouter · Convex · FAL.ai · Meshy · Zod · zustand · FastAPI · Turborepo · pnpm

The architecture is not "an LLM writes a 3D app." It's a runtime world engine where models operate over structured descriptions that your engine knows how to validate, compile, and render.
