Flow
speak a concept, step inside it in 3d
Inspiration
education is stuck with textbooks and powerpoints. what if you could just say "show me ancient rome" and walk around inside it? we wanted spatial learning that feels like stepping into the concept itself.
What It Does
flow converts voice commands into explorable 3d gaussian splat environments. you speak or type a concept, wait ~5 minutes while 6 apis chain together, then explore a photorealistic space in first person with educational overlays. press 't' mid-exploration to ask questions about what you're seeing and get voice responses.
the flow:
- deepgram captures your voice → gemini orchestrates educational content
- gemini generates cinematic image → marble converts to gaussian splat
- sparkjs renders .spz file at 60fps → you wasd around with collision detection
- screenshot → gemini vision → elevenlabs narration for contextual q&a
the scene library checks local files first (free), then mongodb-saved scenes, then generates a new world. generation is rate limited to prevent api abuse; admins bypass the cooldown.
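the lookup order above can be sketched roughly like this. everything here is a stand-in: `localScenes`, `savedScenes`, and `generateScene` are hypothetical names for the bundled file cache, the mongodb collection, and the full generation pipeline.

```typescript
type Scene = { concept: string; splatUrl: string };

// hypothetical in-memory stand-ins for the local file cache and mongodb
const localScenes = new Map<string, Scene>();
const savedScenes = new Map<string, Scene>();

// placeholder for the full 6-api pipeline (~5 minutes in reality)
async function generateScene(concept: string): Promise<Scene> {
  return { concept, splatUrl: `https://storage.example/${concept}.spz` };
}

async function resolveScene(concept: string): Promise<Scene> {
  const key = concept.toLowerCase().trim();
  return (
    localScenes.get(key) ??       // bundled scenes: free, instant
    savedScenes.get(key) ??       // previously generated, saved in mongodb
    (await generateScene(key))    // last resort: run the whole pipeline
  );
}
```

the `??` chain short-circuits, so the expensive generation step only runs when both caches miss.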
How We Built It
- frontend: react + typescript + three.js + sparkjs for gaussian splat rendering
- backend: express + socket.io for websocket pipeline updates
- storage: vultr object storage for .spz files, mongodb for scene metadata
- apis: deepgram stt → gemini orchestration + image gen → marble 3d conversion → elevenlabs tts
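the websocket progress updates can be sketched as a loop over named stages. the `emit` callback is a hypothetical stand-in for a real socket.io connection; only the stage names come from our pipeline.

```typescript
type Stage =
  | "orchestrating"
  | "generating_image"
  | "creating_world"
  | "loading_splat"
  | "complete";

async function runPipeline(
  emit: (stage: Stage) => void,
  work: Partial<Record<Stage, () => Promise<void>>>
): Promise<void> {
  const stages: Stage[] = [
    "orchestrating",
    "generating_image",
    "creating_world",
    "loading_splat",
  ];
  for (const stage of stages) {
    emit(stage);            // e.g. socket.emit("pipeline:progress", stage)
    await work[stage]?.();  // the deepgram/gemini/marble/spark step runs here
  }
  emit("complete");
}
```

emitting the stage name before each step is what lets the client show live progress during a ~5-minute generation instead of a silent spinner.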
the pipeline runs async with real-time progress updates (orchestrating → generating_image → creating_world → loading_splat → complete). collision uses sphere-based raycasting against glb meshes. voice q&a screenshots your current view, sends it to gemini vision, and responds through elevenlabs.
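the core of the wall-sliding behavior is plain vector math: when a movement vector points into a wall, drop the component along the wall normal and keep the tangential part. a minimal sketch (the real code raycasts a sphere against the glb meshes to find the wall normal):

```typescript
type Vec3 = { x: number; y: number; z: number };

const dot = (a: Vec3, b: Vec3): number => a.x * b.x + a.y * b.y + a.z * b.z;

// project `move` onto the plane of a wall with unit normal `n`,
// so the player slides along the wall instead of stopping dead
function slideAlongWall(move: Vec3, n: Vec3): Vec3 {
  const d = dot(move, n);
  if (d >= 0) return move; // moving away from the wall: nothing to clip
  return {
    x: move.x - d * n.x,
    y: move.y - d * n.y,
    z: move.z - d * n.z,
  };
}
```

with a wall normal of +x, a diagonal move like (-1, 0, 1) loses its into-the-wall x component and becomes (0, 0, 1), which is the smooth sliding feel we were after.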
Challenges We Ran Into
- the deepgram websocket died instantly until we explicitly declared linear16 pcm at 48khz mono
- gemini model compatibility issues, solved with a backend proxy and a fallback model chain
- the marble api's cors policy blocked client calls, so we built an express proxy for the full async workflow
- collision detection needed multiple raycasts per frame for smooth wall sliding
- converting data uris to file objects for formdata upload to the backend
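the data-uri conversion from the last point boils down to splitting off the base64 payload and decoding it. a node-flavored sketch using `Buffer` (in the browser you'd build a `Blob`/`File` from the same decoded bytes); the function name is ours, not a library api:

```typescript
// split a base64 data uri into its mime type and raw bytes,
// ready to append to formdata for the backend upload
function dataUriToBytes(uri: string): { mime: string; bytes: Buffer } {
  const match = /^data:([^;,]+);base64,(.*)$/s.exec(uri);
  if (!match) throw new Error("not a base64 data uri");
  const [, mime, b64] = match;
  return { mime, bytes: Buffer.from(b64, "base64") };
}
```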
Accomplishments That We're Proud Of
a 6-api integration with real-time websocket feedback for 5-minute world generation. the scene library prevents redundant api calls. gaussian splats render at 60fps with collision detection. voice q&a uses gemini vision to answer based on what you're actually looking at. production-ready with rate limiting, auth, and error handling.
What We Learned
gaussian splatting enables photorealistic browser 3d without traditional meshes. websockets are essential for long async operations. gemini image quality holds up for 3d conversion when prompts are optimized. a backend proxy solves cors and enables rate limiting. the scene library pays off fast for popular concepts.
What's Next
- improved collision mesh processing
- multi-user collaborative exploration
- vr/ar support
- an ai tutoring guide that follows you through scenes
- educator tools for custom experiences
- a community marketplace for user-generated worlds
Sponsor API Integration
- deepgram: streaming stt with flux model, voice q&a capture, command pattern matching
- gemini: orchestrates educational content, generates images via 2.0-flash-exp-image-generation, vision api for screenshot analysis, fallback model chain
- elevenlabs: educational narration, voice q&a responses, integrated real-time audio
- mongodb atlas: stores scenes with metadata, scene library queries, user collections
- vultr: object storage for .spz files, thumbnails, collider meshes, cors proxy
Built With
- css
- deepgram
- elevenlabs
- express.js
- firebase
- framer
- gemini
- html
- javascript
- mongodb
- motion
- node.js
- react
- socket.io
- sparkjs
- tailwind
- three.js
- typescript
- vite
- vultr
- webgl
- websocket
- worldlabsmarble