mihirt2/world-gen

WorldGen: Prompt-Driven 3D Worlds with Agentic Scene Planning


Inspiration

Picture a seven-year-old who just finished a chapter of Harry Potter. She closes the book, looks up at the ceiling, and says: "I want to go there."

Not watch it. Not read about it again. Go there.

That is the gap WorldGen is aimed at. Existing tools get part of the way there, but stop short: videos are passive, chat is text-only, image models make postcards, and full game production is far too heavy. Kids do not imagine in paragraphs. They imagine in places.

There is a deeper use case too. Safe, fantastical worlds can help children practice conversation, confidence, and emotional regulation with lower stakes. A child who struggles to look a classmate in the eye may first try it with a fictional wizard.

So the goal was not "AI generates a 3D screenshot." The goal was a world a child can describe in one sentence, enter quickly, talk to, shape, and return to. The technical approach follows from that: agentic models work over a structured scene format, deterministic systems validate and compile it, and the browser runtime turns it into a playable space with embodied characters.

That's WorldGen. Prompt → Scene Intent → Scene DSL → Compilation → Playable World → Persistent State.


What It Does

WorldGen is a three-layer system:

1. Agentic Scene Planning Pipeline

A Python orchestrator runs a multi-role LLM loop against a typed Scene DSL:

  • Director model parses user intent and proposes a candidate scene description — which world family, which layout modules, which atmospheric presets, which characters
  • Deterministic validator enforces hard structural rules: connector compatibility, asset availability, overlap detection, walkability chains, and scene budget
  • Critic model scores the draft against a formal rubric — prompt alignment, spatial legibility, NPC accessibility, atmosphere coherence — and proposes bounded revisions
  • Director refines and the loop repeats up to 4 iterations, or until the quality threshold is met
  • Compiler resolves the accepted spec into a deterministic CompiledSceneManifest: exact module transforms, prop placements, navmesh inputs, character spawn anchors

The models never touch runtime geometry directly. They operate on a constrained DSL, and the engine handles everything else.
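
The loop above can be sketched in a few lines. This is a minimal illustration, not the orchestrator's actual code: the callables `propose`, `validate`, and `critique`, the `Critique` type, and the `QUALITY_THRESHOLD` value are all assumptions. Only the four-iteration cap comes from the description.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

MAX_ITERATIONS = 4        # cap stated in the pipeline description
QUALITY_THRESHOLD = 0.8   # assumed value; the README does not state the threshold


@dataclass
class Critique:
    score: float                                   # aggregate rubric score, assumed 0.0-1.0
    revisions: list[str] = field(default_factory=list)


def plan_scene(
    prompt: str,
    propose: Callable[[str, list[str]], dict],     # director: prompt + feedback -> draft
    validate: Callable[[dict], list[str]],         # validator: draft -> hard-rule errors
    critique: Callable[[dict, str], Critique],     # critic: draft + prompt -> rubric result
) -> tuple[dict | None, int]:
    """Return (accepted or best valid draft, iterations used)."""
    feedback: list[str] = []
    best: tuple[float, dict] | None = None
    for iteration in range(1, MAX_ITERATIONS + 1):
        draft = propose(prompt, feedback)
        errors = validate(draft)
        if errors:
            # Hard-rule failures go straight back to the director; the critic is skipped.
            feedback = errors
            continue
        result = critique(draft, prompt)
        if best is None or result.score > best[0]:
            best = (result.score, draft)
        if result.score >= QUALITY_THRESHOLD:
            return draft, iteration
        feedback = result.revisions
    # Budget exhausted: fall back to the best draft that passed validation, if any.
    return (best[1] if best else None), MAX_ITERATIONS
```

Running the validator before the critic means a model call is only spent on drafts that already satisfy the structural rules.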

Agent Swarm

The planning pipeline is powered by three specialized agents, each with a distinct role:

| Agent | Default Model | Role |
| --- | --- | --- |
| Director | MBZUAI-IFM/K2-Think-v2 | Parses user intent, proposes and refines Scene DSL drafts, runs the critic-refine loop |
| Builder | google/gemini-3.1-flash-lite-preview | Powers NPC dialogue, outputs structured actor turns (emotion, gesture, movement intent, memory write) |
| Seer | Qwen/Qwen3-VL-32B-Instruct (local, ASUS GPU) | Multimodal visual critic: inspects rendered scene previews and scores them against the quality rubric before the scene loads |

All three agents route through the same OpenAI-compatible API abstraction (the interface that both OpenRouter and vLLM expose), so any model can be swapped via environment variable without touching orchestration logic. The Seer agent runs locally on the ASUS GPU via vLLM, which keeps visual critique off the OpenRouter bill and sub-second for iteration loops.
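
As an illustration of swapping models through the environment, the sketch below resolves an endpoint per agent role. The `VLLM_SEER_*` variables and the default model names come from this README. The `DIRECTOR_MODEL` / `BUILDER_MODEL` style overrides, the `AgentEndpoint` type, and `resolve_agent` are hypothetical names, and the separate K2 Think key handling is left out for brevity.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"

# Defaults taken from the agent table in this README.
DEFAULT_MODELS = {
    "director": "MBZUAI-IFM/K2-Think-v2",
    "builder": "google/gemini-3.1-flash-lite-preview",
    "seer": "Qwen/Qwen3-VL-32B-Instruct",
}


@dataclass(frozen=True)
class AgentEndpoint:
    model: str
    base_url: str
    api_key: str


def resolve_agent(role: str, env: Mapping[str, str] | None = None) -> AgentEndpoint:
    """Pick model, base URL and key for a role, preferring environment overrides."""
    env = os.environ if env is None else env
    if role == "seer":
        # The Seer talks to a local vLLM server instead of OpenRouter.
        return AgentEndpoint(
            model=env.get("VLLM_SEER_MODEL", DEFAULT_MODELS["seer"]),
            base_url=env.get("VLLM_SEER_BASE_URL", ""),
            api_key=env.get("VLLM_SEER_API_KEY", "EMPTY"),
        )
    prefix = role.upper()  # hypothetical override names, e.g. BUILDER_MODEL
    return AgentEndpoint(
        model=env.get(f"{prefix}_MODEL", DEFAULT_MODELS[role]),
        base_url=env.get(f"{prefix}_BASE_URL", OPENROUTER_BASE_URL),
        api_key=env.get("OPENROUTER_API_KEY", ""),
    )
```

Because every role resolves to the same three fields, the orchestration code can build one client type regardless of where the model runs.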

2. Real-Time 3D Browser Runtime

The compiled manifest loads directly in the browser:

  • React Three Fiber + three.js renders the full scene with physically-based materials, atmospheric fog, and dynamic lighting
  • Rapier handles rigid-body physics and player collision
  • recast-navigation-js generates a navmesh at runtime from the resolved scene geometry — no pre-baked assets, fully dynamic
  • First-person and orbit camera modes with WASD movement and physics-driven character interactions
  • NPC embodiment: characters pathfind toward the player, orient to face them, idle with body animation, and trigger dialogue on proximity
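
The "orient to face the player" and "trigger dialogue on proximity" behaviours reduce to a little planar math. The sketch below is written in Python for readability even though the runtime itself is TypeScript. The radius, turn rate, and function names are assumptions, and it assumes a character's forward axis is +Z at yaw 0, as in three.js.

```python
import math

DIALOGUE_RADIUS = 2.5   # metres; assumed trigger distance
TURN_RATE = math.pi     # radians per second; assumed maximum turn speed


def yaw_towards(npc_xz, player_xz):
    """Yaw (rotation about Y) that points the NPC's +Z axis at the player."""
    dx = player_xz[0] - npc_xz[0]
    dz = player_xz[1] - npc_xz[1]
    return math.atan2(dx, dz)


def step_yaw(current, target, dt, turn_rate=TURN_RATE):
    """Rotate from current toward target by at most turn_rate * dt, the short way round."""
    diff = (target - current + math.pi) % (2 * math.pi) - math.pi
    max_step = turn_rate * dt
    return current + max(-max_step, min(max_step, diff))


def in_dialogue_range(npc_xz, player_xz, radius=DIALOGUE_RADIUS):
    """True when the player is close enough to start a dialogue turn."""
    return math.dist(npc_xz, player_xz) <= radius
```

Calling `step_yaw` once per frame with the frame delta gives a smooth turn instead of an instant snap.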

3. Embodied Character Runtime

Characters are not chatbots floating over the scene. They are embodied:

  • Generated avatars from a three-stage pipeline: text prompt → FLUX image generation → Meshy image-to-3D → facial rig post-processing → runtime GLB spawn
  • Azure Speech TTS with neural voices and viseme-driven lip sync — blend shape events from Azure drive real-time facial animation frame by frame
  • Azure STT for push-to-talk voice input
  • Actor model (OpenRouter-routed) drives dialogue and returns structured JSON: spoken text, emotion tag, gesture, movement intent, and memory write
  • Persistent character memory across turns — characters accumulate context, not just conversation history

Architecture

User Prompt
    │
    ▼
┌─────────────────────────────────────────────────────────┐
│              PYTHON ORCHESTRATOR (FastAPI)               │
│                                                         │
│  Director Model ──▶ Scene DSL Draft                     │
│       ▲                   │                             │
│       │            Deterministic Validator              │
│       │              (schema, assets,                   │
│       │           connectors, walkability)              │
│       │                   │                             │
│  Critic Model ◀──── Scene Quality Rubric                │
│  (prompt alignment, spatial legibility,                 │
│   NPC accessibility, atmosphere coherence)              │
│       │                                                 │
│       └──── max 4 refinement iterations                 │
│                          │                              │
│                    Scene Compiler                       │
│             (resolves transforms, anchors,              │
│              navmesh inputs, spawn points)              │
│                          │                              │
│               CompiledSceneManifest (JSON)              │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│           TYPESCRIPT BROWSER RUNTIME (Next.js)          │
│                                                         │
│  SceneCanvas (R3F)  WorldRoot  PlayerController         │
│  NPCManager         NavMesh    PhysicsWorld              │
│  DialogueOverlay    VoiceInput SyncBridge (Convex)      │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│              CHARACTER GENERATION PIPELINE              │
│                                                         │
│  Prompt ──▶ FLUX Image ──▶ Meshy 3D ──▶ Facial Rig     │
│               (FAL.ai)    (image-to-3D)  (blend shapes) │
│                                │                        │
│                    Runtime GLB spawn + lip sync         │
└─────────────────────────────────────────────────────────┘

The Scene DSL

This is the key insight. Models don't write code. They write structured world descriptions that the engine knows how to validate, compile, and render:

{
  "scene_id": "uuid",
  "world_family": "magical_forest",
  "layout": {
    "modules": [
      {
        "module_id": "forest_path_A",
        "instance_id": "m1",
        "transform": { "position": [0, 0, 0], "rotation": [0, 0, 0], "scale": [1, 1, 1] }
      },
      {
        "module_id": "forest_clearing_B",
        "instance_id": "m2",
        "transform": { "position": [0, 0, -18], "rotation": [0, 1.57, 0], "scale": [1, 1, 1] }
      }
    ],
    "connections": [
      { "from": "m1.exit_north", "to": "m2.entry_south" }
    ]
  },
  "atmosphere": {
    "skybox": "night_fog_soft",
    "lighting": "moonlit_blue",
    "fog_density": 0.25,
    "ambient_audio": "forest_wind_low"
  },
  "characters": [
    {
      "character_archetype": "young_supportive_wizard",
      "spawn_anchor": "m2.anchor_npc_01",
      "behavior_profile": "follow_nearby",
      "conversation_profile": "supportive_companion_v1"
    }
  ]
}

The engine validates every field. If a module doesn't exist in the registry, the spec is rejected. If two modules overlap, the validator catches it. If no path exists from the player spawn to the NPC, the critic flags it before the user ever sees it.
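
To make those checks concrete, here is a reduced sketch of a deterministic validation pass over the DSL shown above. It covers registry lookup, connector references, and reachability from the player's module to each character, and it omits overlap detection. The `REGISTRY` contents, the `player_instance` parameter, and the error strings are assumptions, and the project's real validator lives in the TypeScript `world-engine` package.

```python
from collections import deque

# Hypothetical registry: module_id -> connector and anchor names it exposes.
REGISTRY = {
    "forest_path_A": {"connectors": {"entry_south", "exit_north"}, "anchors": set()},
    "forest_clearing_B": {"connectors": {"entry_south"}, "anchors": {"anchor_npc_01"}},
}


def validate_scene(spec, registry=REGISTRY, player_instance="m1"):
    """Return a list of hard-rule violations; an empty list means the spec passes."""
    errors = []
    instances = {}
    for module in spec["layout"]["modules"]:
        if module["module_id"] not in registry:
            errors.append(f"unknown module: {module['module_id']}")
        else:
            instances[module["instance_id"]] = registry[module["module_id"]]

    # Build an undirected graph of module instances joined by valid connections.
    graph = {iid: set() for iid in instances}
    for conn in spec["layout"]["connections"]:
        ends = []
        for ref in (conn["from"], conn["to"]):
            iid, _, name = ref.partition(".")
            if iid in instances and name in instances[iid]["connectors"]:
                ends.append(iid)
            else:
                errors.append(f"bad connector: {ref}")
        if len(ends) == 2:
            graph[ends[0]].add(ends[1])
            graph[ends[1]].add(ends[0])

    # Breadth-first search from the module the player spawns in.
    reachable = set()
    if player_instance in graph:
        reachable.add(player_instance)
        queue = deque([player_instance])
        while queue:
            for neighbour in graph[queue.popleft()]:
                if neighbour not in reachable:
                    reachable.add(neighbour)
                    queue.append(neighbour)

    for character in spec.get("characters", []):
        ref = character["spawn_anchor"]
        iid, _, anchor = ref.partition(".")
        if iid not in instances or anchor not in instances[iid]["anchors"]:
            errors.append(f"unknown spawn anchor: {ref}")
        elif iid not in reachable:
            errors.append(f"unreachable character at {ref}")
    return errors
```

Returning every violation at once, rather than failing on the first, gives the director a complete list to fix in a single revision.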


Character Action Vocabulary

Characters don't get free-form animation control. The actor model outputs semantic action tags from a controlled vocabulary:

| Action Tag | What It Does |
| --- | --- |
| face_player | Smooth orient rotation toward player |
| take_small_step_closer | Single short navmesh-guided step |
| walk_to_anchor | Full pathfinding to a named anchor |
| listen_neutral | Idle pose with attention signal |
| talk_calm | Talking body animation, calm variant |
| talk_excited | Talking body animation, high energy |
| gesture_reassure | Hand-raise reassurance gesture |
| gesture_point | Directional point toward object |
| look_at_object | Head/gaze orient to named prop anchor |

The runtime maps these to actual clips and motion logic. The model never controls bones. That's how characters look intentional rather than uncanny.
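
A controlled vocabulary like this maps naturally onto an enum plus a lookup table. In the sketch below the tag names come from the table above, while the clip names, the `resolve_action` helper, and the choice of `listen_neutral` as the fallback are assumptions for illustration.

```python
from enum import Enum


class ActionTag(str, Enum):
    FACE_PLAYER = "face_player"
    TAKE_SMALL_STEP_CLOSER = "take_small_step_closer"
    WALK_TO_ANCHOR = "walk_to_anchor"
    LISTEN_NEUTRAL = "listen_neutral"
    TALK_CALM = "talk_calm"
    TALK_EXCITED = "talk_excited"
    GESTURE_REASSURE = "gesture_reassure"
    GESTURE_POINT = "gesture_point"
    LOOK_AT_OBJECT = "look_at_object"


# Hypothetical clip names; the real runtime may also attach motion logic per tag.
CLIP_FOR_TAG = {tag: f"clip_{tag.value}" for tag in ActionTag}


def resolve_action(raw_tag):
    """Map model output to a known tag, falling back to a safe idle for anything else."""
    try:
        tag = ActionTag(raw_tag)
    except ValueError:
        tag = ActionTag.LISTEN_NEUTRAL
    return tag, CLIP_FOR_TAG[tag]
```

With this shape, a tag the model invents degrades to an idle pose rather than reaching the animation system.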


Actor Model Contract

Every NPC dialogue turn returns strict typed JSON:

{
  "spoken_text": "Stay close. I heard it too, but it sounds farther away now.",
  "emotion": "calm_supportive",
  "gesture": "gesture_reassure",
  "movement_intent": "face_player",
  "memory_write": "Player became uneasy after hearing movement in the trees.",
  "state_delta": {
    "trust": 1,
    "tension": -0.05
  }
}

Spoken text routes directly to Azure Speech TTS. Viseme events from Azure drive blend-shape animation frame by frame. Emotion and gesture tags drive the animation state machine. Memory writes persist to Convex for future turns.
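
Enforcing the contract means rejecting any turn that is missing a field, has a wrong type, or carries extra keys. The sketch below does this with the Python standard library only. The project's own schemas are Zod-based, so `ActorTurn` and `parse_actor_turn` are hypothetical stand-ins, and treating extra keys as errors is an assumption about how strict the contract is.

```python
import json
from dataclasses import dataclass

# Field names and types follow the example turn above.
REQUIRED_FIELDS = {
    "spoken_text": str,
    "emotion": str,
    "gesture": str,
    "movement_intent": str,
    "memory_write": str,
    "state_delta": dict,
}


@dataclass(frozen=True)
class ActorTurn:
    spoken_text: str
    emotion: str
    gesture: str
    movement_intent: str
    memory_write: str
    state_delta: dict


def parse_actor_turn(raw):
    """Parse one model response; raise ValueError if it breaks the contract."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("actor turn must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            raise ValueError(f"field {name!r} is missing or not a {expected.__name__}")
    extra = set(data) - set(REQUIRED_FIELDS)
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    for key, value in data["state_delta"].items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"state_delta[{key!r}] must be a number")
    delta = {key: float(value) for key, value in data["state_delta"].items()}
    fields = {name: data[name] for name in REQUIRED_FIELDS if name != "state_delta"}
    return ActorTurn(state_delta=delta, **fields)
```

A turn that fails parsing can then be retried or replaced with a neutral fallback before anything reaches TTS or the animation state machine.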


Getting Started

Prerequisites

  • Node.js 20+, pnpm 9+
  • Python 3.11+
  • A Convex account (npx convex dev to initialize)

Install

pnpm install

Environment Variables

Create a .env.local in the repo root (and optionally in services/orchestrator/):

# ── LLM ──────────────────────────────────────────────────────
OPENROUTER_API_KEY=           # openrouter.ai — routes Director, Builder, Critic
K2_THINK_API_KEY=             # k2think.ai — Director agent (MBZUAI/K2-Think-v2)
K2_THINK_BASE_URL=https://api.k2think.ai/v1   # optional override

# ── Local Seer (ASUS GPU) ────────────────────────────────────
VLLM_SEER_BASE_URL=           # your local vLLM endpoint (Qwen3-VL-32B)
VLLM_SEER_API_KEY=            # set to "EMPTY" for local vLLM
VLLM_SEER_MODEL=Qwen/Qwen3-VL-32B-Instruct

# ── Voice ────────────────────────────────────────────────────
NEXT_PUBLIC_AZURE_SPEECH_KEY=     # Azure Speech resource key
NEXT_PUBLIC_AZURE_SPEECH_REGION=  # e.g. eastus

# ── Character pipeline ───────────────────────────────────────
FAL_KEY=                      # fal.ai — FLUX image generation
MESHY_API_KEY=                # meshy.ai — image-to-3D

# ── Persistence ──────────────────────────────────────────────
NEXT_PUBLIC_CONVEX_URL=       # from `npx convex dev`

Minimum to run locally: OPENROUTER_API_KEY + NEXT_PUBLIC_CONVEX_URL. Everything else degrades gracefully (voice falls back to mocked TTS, character pipeline falls back to placeholder avatars, Seer visual critique is skipped).
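
The degradation rule above can be expressed as a small capability check at startup. The variable names come from the list above. The `detect_capabilities` function, its return shape, and the decision to raise when a required variable is missing are assumptions.

```python
from __future__ import annotations

import os
from typing import Mapping

REQUIRED = ("OPENROUTER_API_KEY", "NEXT_PUBLIC_CONVEX_URL")


def detect_capabilities(env: Mapping[str, str] | None = None) -> dict[str, bool]:
    """Report which optional subsystems are configured; raise if the minimum is missing."""
    env = os.environ if env is None else env

    def has(*names: str) -> bool:
        return all(env.get(name, "").strip() for name in names)

    missing = [name for name in REQUIRED if not has(name)]
    if missing:
        raise RuntimeError(f"missing required environment variables: {missing}")
    return {
        # False here means: mocked TTS, placeholder avatars, no visual critique.
        "voice": has("NEXT_PUBLIC_AZURE_SPEECH_KEY", "NEXT_PUBLIC_AZURE_SPEECH_REGION"),
        "character_pipeline": has("FAL_KEY", "MESHY_API_KEY"),
        "seer": has("VLLM_SEER_BASE_URL"),
    }
```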

Run

# Terminal 1 — Next.js web app
pnpm --filter @world-gen/web dev

# Terminal 2 — Python orchestrator (scene planning)
cd services/orchestrator
python -m uvicorn app.main:app --reload --port 8000

# Terminal 3 — Convex dev server
npx convex dev

App runs at http://localhost:3000.

  • / — character runtime (talk to NPCs, voice I/O, 3D world)
  • /create — prompt-to-world builder with live agent planning trace
  • /preview — deterministic scene manifest preview shell

Local Seer (Qwen3-VL-32B on ASUS GPU)

The Seer agent runs locally via vLLM. See services/vllm-feedback-service/ for the service wrapper. Set VLLM_SEER_BASE_URL to your vLLM endpoint and the scene feedback loop will route visual critique there instead of OpenRouter.


Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | Next.js 16, React 19, TypeScript |
| 3D Runtime | React Three Fiber, three.js |
| Physics | @react-three/rapier |
| Pathfinding | recast-navigation-js (runtime navmesh) |
| Voice I/O | Azure Speech SDK (STT + TTS + visemes) |
| Persistence | Convex (real-time sync) |
| LLM Layer | OpenRouter (model-agnostic, OpenAI-compatible) |
| Orchestration | Python + FastAPI |
| Image Gen | FAL.ai FLUX |
| Image-to-3D | Meshy API |
| Facial Rigging | Reallusion Character Creator (local service) |
| State | zustand |
| Schema | Zod (shared TypeScript packages) |
| Monorepo | pnpm workspaces + Turborepo |

Repository Structure

world-gen/
├── apps/
│   └── web/                      ← Next.js + R3F client
│       ├── app/page.tsx           ← Character runtime home
│       ├── app/create/page.tsx    ← Prompt-to-world builder
│       ├── components/            ← Scene, NPC, player, UI components
│       └── lib/
│           ├── scene/             ← Director, critic, compiler, atmosphere
│           ├── character/         ← Archetypes, spawn plans, store
│           ├── character-pipeline/← Image gen → 3D → facial rig
│           ├── actor/             ← Dialogue context, LLM turn generation
│           ├── voice/             ← Azure Speech wrapper, viseme handling
│           ├── nav/               ← Navmesh integration, pathfinding
│           └── persistence/       ← Convex sync, world snapshots
│
├── packages/
│   ├── scene-schema/              ← Shared Zod DSL schemas (source of truth)
│   ├── asset-registry/            ← Module, prop, character, skybox catalogs
│   └── world-engine/              ← Deterministic validator + compiler
│
├── services/
│   ├── orchestrator/              ← Python FastAPI planning service
│   └── facial-rig-service/        ← GLB post-processing microservice
│
├── convex/                        ← Real-time sync schema + mutations
│   ├── schema.ts                  ← worlds, sessions, characters, memories, events
│   ├── worlds.ts
│   └── characters.ts
│
└── content/
    └── scenes/                    ← Reference DSL fixtures
        ├── moonlit-forest-v0.json
        ├── castle-courtyard-v0.json
        └── neon-city-plaza-v0.json

World Families

Four world families are currently supported, each with a catalog of connectable layout modules, prop anchors, atmosphere presets, and lighting rigs:

| World Family | Modules | Mood | Characters |
| --- | --- | --- | --- |
| magical_forest | Forest paths, clearings, grove circles | Night, tense-but-safe | Wizard companion, forest spirit |
| castle_hallway | Corridors, antechambers, throne rooms | Grand, ominous | Knight, court mage |
| neon_city_plaza | Street blocks, alleyways, overlooks | Cyberpunk, electric | Street vendor, hacker |
| wizard_academy | Classrooms, libraries, courtyards | Warm, scholarly | Student, professor |

Persistence Model

Worlds persist via a dual model:

  • Local snapshots (localStorage) for instant resume on reload — no re-generation
  • Convex real-time sync for cross-device persistence, session state, character memory, and world event logs

Multi-world support

The Worlds panel (accessible in-game) lists all saved worlds per session. Each world can be resumed or permanently deleted. Deletion cascades through all associated records — character instances, memories, conversation turns, player states, and events.

Character conversation continuity

Conversation history and character memory are persisted to Convex after every turn and fully restored on world resume. If you talk to character A, switch to B, then return to A — character A has the full prior conversation in context. History is keyed per character per world; switching worlds resets the active context to that world's stored history.
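
The keying and cascade rules described in this section can be illustrated with an in-memory stand-in. This is not the Convex schema: `ConversationStore` and its methods are hypothetical, and only the per-character-per-world keying and the delete cascade are taken from the text.

```python
from collections import defaultdict


class ConversationStore:
    """In-memory sketch: history is keyed by (world_id, character_id)."""

    def __init__(self):
        self._turns = defaultdict(list)

    def append_turn(self, world_id, character_id, speaker, text):
        self._turns[(world_id, character_id)].append({"speaker": speaker, "text": text})

    def history(self, world_id, character_id):
        # Return a copy so callers cannot mutate stored history by accident.
        return list(self._turns.get((world_id, character_id), []))

    def delete_world(self, world_id):
        """Cascade: drop every character's history for the world; return how many were removed."""
        doomed = [key for key in self._turns if key[0] == world_id]
        for key in doomed:
            del self._turns[key]
        return len(doomed)
```

Switching from character A to B and back only changes which key is read, so A's earlier turns are still present when the player returns.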


Challenges

Letting models operate on structure, not code. The hardest constraint to enforce was keeping the planning loop from drifting into arbitrary code generation. Every time the DSL was too sparse, the models would try to "solve" it by inventing fields. Every DSL extension required updating the validator, compiler, and renderer simultaneously — schema discipline was the core engineering challenge.

Dynamic navmesh at runtime. Scenes are composed from modules at runtime, not pre-authored in a game editor. That means no pre-baked navmesh. recast-navigation-js solves this, but merging walkable surfaces across dynamically-placed modules required a geometry resolution pass before any pathfinding queries could run.
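
A geometry resolution pass of this kind typically moves each module's walkable triangles into world space and concatenates them into one flat position array and one index array. The sketch below shows that idea under stated assumptions: each module carries hypothetical `vertices` and `triangles` fields, only the Y component of the rotation is applied, and the rotation follows the three.js right-handed convention. It is not the project's actual implementation.

```python
import math


def merge_walkable_geometry(modules):
    """Return (positions, indices) as flat lists covering every module in world space."""
    positions = []
    indices = []
    for module in modules:
        px, py, pz = module["transform"]["position"]
        yaw = module["transform"]["rotation"][1]    # simplification: yaw only
        sx, sy, sz = module["transform"]["scale"]
        cos_y, sin_y = math.cos(yaw), math.sin(yaw)
        base = len(positions) // 3                  # index offset for this module's vertices
        for x, y, z in module["vertices"]:
            x, y, z = x * sx, y * sy, z * sz        # scale first
            world_x = x * cos_y + z * sin_y         # then rotate about Y
            world_z = -x * sin_y + z * cos_y
            positions.extend((world_x + px, y + py, world_z + pz))  # then translate
        indices.extend(base + i for i in module["triangles"])
    return positions, indices
```

Offsetting each module's triangle indices by the number of vertices already emitted is what lets separately authored meshes share one buffer.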

Viseme-driven lip sync on generated avatars. Azure Speech produces per-phoneme blend shape weights in real time. But the generated character meshes from Meshy don't come with standard blend shape rigs — we built a facial rig post-processing microservice that attaches a speech-ready blend shape set to any incoming GLB before it reaches the browser.
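
On the playback side, one way to apply such weights is to sample them by audio position each render frame. The sketch below assumes the weights have already been collected into rows at a fixed frame rate. The 60 FPS default, the function name, and the row layout are assumptions, and nothing here calls the Azure SDK.

```python
def sample_blend_weights(frames, t_seconds, fps=60.0):
    """Linearly interpolate blend shape weights at playback time t_seconds.

    frames is a list of rows, one row of weights per animation frame.
    """
    if not frames:
        return []
    position = max(0.0, t_seconds) * fps
    lower = min(int(position), len(frames) - 1)
    upper = min(lower + 1, len(frames) - 1)
    # At or past the final frame there is nothing to blend toward.
    alpha = position - lower if upper != lower else 0.0
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(frames[lower], frames[upper])]
```

Sampling by audio time rather than by render frame count keeps the mouth in sync when the browser's frame rate drifts from the animation rate.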

Streamed agent progress without blocking. The planning loop has multiple LLM calls, tool invocations, and compilation steps. Users shouldn't stare at a spinner. The orchestrator streams structured progress events (planner: selecting modules, tool: validating scene, compiler: merging navmesh) that the frontend surfaces as a live agent trace panel.
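
A generator is a natural fit for this kind of streaming: each step yields an event as soon as it finishes. The sketch below formats events as Server-Sent Events, which is an assumption since the README does not name the transport. The event shape and the `planning_events` and `to_sse` names are also hypothetical.

```python
import json
from typing import Iterable, Iterator


def planning_events(prompt: str) -> Iterator[dict]:
    """Stand-in for the real planning loop: yields one event per completed step."""
    yield {"source": "planner", "message": "selecting modules"}
    yield {"source": "tool", "message": "validating scene"}
    yield {"source": "compiler", "message": "merging navmesh"}
    yield {"source": "done", "message": f"scene ready for: {prompt}"}


def to_sse(events: Iterable[dict]) -> Iterator[str]:
    """Encode each event as one Server-Sent Events message."""
    for event in events:
        yield f"data: {json.dumps(event)}\n\n"
```

In FastAPI, a generator like `to_sse(planning_events(prompt))` could be handed to `StreamingResponse` with `media_type="text/event-stream"`, so the frontend can render each step as it arrives.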


What We're Proud Of

  • A fully deterministic scene compiler that accepts only valid world descriptions — models can't hallucinate geometry
  • Runtime navmesh generation across dynamically-composed module layouts
  • End-to-end character pipeline: text prompt → image → 3D mesh → facial rig → talking NPC in one async flow
  • Live agentic planning trace visible to the user — the agent swarm's reasoning is never a black box
  • Viseme-driven lip sync on procedurally generated avatars, not hand-authored characters
  • World persistence and instant resume without re-running the generation pipeline
  • A scene critique loop that scores against a formal rubric and feeds back into the director — not vibe-based, but structured quality evaluation

What's Next

  • Unified flow: merge the /create scene builder and / character runtime into a single experience — enter a world, and the NPCs are already there
  • Multiplayer: multiple users in the same Convex-backed world instance
  • Vision-based critique: rendered screenshots passed to a multimodal seer model for visual quality feedback before the scene loads
  • Cross-world continuity: characters remember events from prior sessions across different scenes
  • Open world composition: dynamic module stitching as the player explores, not a fixed scene loaded upfront

Built With

Python · TypeScript · Next.js · React Three Fiber · three.js · Rapier · recast-navigation-js · Azure Speech SDK · OpenRouter · Convex · FAL.ai · Meshy · Zod · zustand · FastAPI · Turborepo · pnpm


The architecture is not "an LLM writes a 3D app." It's a runtime world engine where models operate over structured descriptions that the engine knows how to validate, compile, and render.

About

Yale Hack 2026
