Inspiration

What if we could literally watch an AI build its own world, piece by piece, while explaining its thought process to us? We've seen AI generate text, images, and single 3D models, but the concept of an agentic, spatial environment builder is still largely unexplored. GME (Gemini Universe Engine) was inspired by the desire to break out of the traditional "text box" agent paradigm. I wanted users to witness a multi-agent system collaborating in a live, 3D space, where one agent acts as the hands and eyes (The Architect), and another serves as the omniscient narrator (The Narrator).

What it does

GME is a real-time, immersive generative engine running on a geodesic hex sphere containing 5,882 nodes. It uses a Dual-Agent Architecture:

  • The Architect (Gemini 3.1 Flash) visually observes the 3D canvas via 768x768 WebGL screenshots and emits structured JSON actions to dynamically spawn objects, modify terrain elevation, and alter colors based on the user's chosen theme.
  • The Narrator (Gemini 3.1 Flash-Lite) uses the Gemini Live API over WebSocket to stream a continuous, vivid voiceover of the world being built in real-time.
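The Architect's emissions follow a small structured-JSON contract. The sketch below is a hypothetical minimal validator illustrating the kinds of actions described above (spawning objects, changing elevation, recoloring tiles); the action names and field names are our invention, not the project's actual schema.

```python
# Hypothetical sketch of the Architect's structured-JSON action contract.
# Action types and required fields are illustrative, not the real schema.
import json

ALLOWED_ACTIONS = {
    "spawn_object": {"hex_id", "object"},   # place a mesh on a hex node
    "set_elevation": {"hex_id", "height"},  # raise or lower terrain
    "set_color": {"hex_id", "color"},       # retheme a tile
}

def validate_action(raw: str) -> dict:
    """Parse one Architect emission and check it against the contract."""
    action = json.loads(raw)
    required = ALLOWED_ACTIONS.get(action.get("type"))
    if required is None:
        raise ValueError(f"unknown action type: {action.get('type')}")
    missing = required - action.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return action

action = validate_action(
    '{"type": "set_elevation", "hex_id": "8928308280fffff", "height": 3}'
)
print(action["type"])  # set_elevation
```

Validating against a whitelist like this keeps a misfiring model from emitting arbitrary actions into the render loop.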

Behind the scenes, the Architect also dreams up 2D concept blueprints using Imagen 3, which are processed by a dedicated VM running TripoSR and Hunyuan3D-2.1 to inject entirely new, AI-generated 3D meshes into the world.
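The VM hands those meshes back as Base64-encoded GLB blobs. A minimal sketch of decoding and sanity-checking such a payload before injecting it into the scene (binary glTF files always begin with the `glTF` magic bytes and a version number); the function name is ours, not the project's:

```python
import base64
import struct

def decode_glb(payload_b64: str) -> bytes:
    """Decode a Base64 GLB payload and verify the binary-glTF header."""
    data = base64.b64decode(payload_b64)
    # A .glb file begins with the 4-byte magic "glTF", then a uint32 version.
    if data[:4] != b"glTF":
        raise ValueError("payload is not a binary glTF (GLB) file")
    version = struct.unpack_from("<I", data, 4)[0]
    if version != 2:
        raise ValueError(f"unsupported glTF version: {version}")
    return data

# Example with a stub header (magic + version 2 + declared total length):
stub = b"glTF" + struct.pack("<II", 2, 12)
print(decode_glb(base64.b64encode(stub).decode())[:4])  # b'glTF'
```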

How we built it

The project is built entirely on Google Cloud Platform, utilizing a modern microservices approach:

  • Frontend: Built with React 19, Vite, Three.js, React Three Fiber, and Zustand for state. It handles the hex math (via H3) and Rapier physics.
  • Gateway: A Node.js Express server handling rate-limiting and acting as a reverse proxy for HTTP and WebSocket connections.
  • ADK Backend: A Python FastAPI service leveraging the Google Agent Development Kit (ADK) to orchestrate the agents. It uses ChromaDB and Gemini Embedding 2 for long-term semantic memory of the universe's state.
  • Shape3D VM: An A100 GPU compute instance running Gradio to expose fast image-to-3D models (TripoSR/Hunyuan3D).
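The 5,882-node figure matches H3 at resolution 2: H3 starts from 122 base cells, and the published per-resolution counts follow the closed form 2 + 120·7^r. A quick check, no `h3` package needed:

```python
def h3_cell_count(res: int) -> int:
    """Total H3 cells at a given resolution, per the closed form
    2 + 120 * 7**res (consistent with H3's published resolution table)."""
    return 2 + 120 * 7**res

for r in range(4):
    print(r, h3_cell_count(r))

# Resolution 2 gives the sphere's 5,882 nodes.
assert h3_cell_count(2) == 5882
```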

We tied it all together with Cloud Build for automated CI/CD directly to Cloud Run, ensuring the system scales and deploys effortlessly.

Challenges we ran into

1. Context Window vs. Visual Speed: Getting the Architect agent to understand a massive 3D space was tough. Instead of sending an overwhelming array of 5,882 hex coordinates, we switched to a multimodal approach: we send the agent a live 768x768 pixel screenshot of the WebGL canvas on every tick, drastically dropping token usage while massively improving spatial reasoning.
2. The Shape3D Pipeline: Generating 3D objects in real time is notoriously slow. We had to build a complex asynchronous pipeline where vision models isolate 2D objects from a blueprint, send them to a dedicated GPU VM, and return Base64 GLB meshes without blocking the main agent generation loop.
3. Real-Time Audio WebSocket: Implementing the Gemini Live API required a reliable two-way WebSocket relay through our Node.js gateway to our Python backend to ensure the audio stream from the Narrator agent reached the browser flawlessly.
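The non-blocking shape of the Shape3D pipeline can be sketched with plain `asyncio`: the agent tick loop keeps running while mesh generation happens in a background task and finished meshes are delivered through a queue. All names here are illustrative stand-ins; in the real service the slow step is an HTTP call to the GPU VM.

```python
import asyncio

async def generate_mesh(blueprint_crop: str) -> str:
    """Stand-in for the slow image-to-3D call to the GPU VM."""
    await asyncio.sleep(0.01)  # simulates TripoSR/Hunyuan3D latency
    return f"glb::{blueprint_crop}"

async def main() -> list[str]:
    mesh_queue: asyncio.Queue[str] = asyncio.Queue()
    events: list[str] = []

    async def request_mesh(crop: str) -> None:
        # Fire-and-forget: the agent loop never awaits this directly.
        mesh_queue.put_nowait(await generate_mesh(crop))

    task = asyncio.create_task(request_mesh("tree_crop"))

    for tick in range(4):          # the Architect keeps ticking meanwhile
        events.append(f"tick-{tick}")
        try:
            events.append(mesh_queue.get_nowait())  # inject mesh when ready
        except asyncio.QueueEmpty:
            pass
        await asyncio.sleep(0.05)
    await task
    return events

events = asyncio.run(main())
print(events)
```

The key property is that a slow `generate_mesh` only delays its own injection, never the tick cadence.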

Accomplishments that we're proud of

  • Synchronizing a massive React Three Fiber physics sphere with constant, live agent actions without frame drops.
  • Successfully routing the new Gemini Live API all the way from the Python backend through the web gateway to play live audio in the browser.
  • Building a full end-to-end AI pipeline: Prompt → Text → Imagen → Vision Image Crop → VM 3D Generation → Live WebGL Render.

What we learned

We learned that multimodal grounding is the key to agentic systems. By giving the Architect agent the ability to actually see the UI screenshot rather than just reading a massive JSON state file, its reasoning capabilities and thematic coherence skyrocketed. We also learned how powerful the new Gemini Live API is for creating ambient, non-blocking user experiences.
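A back-of-the-envelope comparison shows why the screenshot won. Assuming roughly 4 characters per text token and roughly 258 tokens for a Gemini image tile of up to 768x768 (both figures are approximations, not exact tokenizer output), dumping all 5,882 hex states as JSON costs an order of magnitude more than one screenshot:

```python
# Rough token estimate; the chars-per-token and tokens-per-tile figures
# are approximations, not exact Gemini tokenizer output.
NUM_HEXES = 5882
CHARS_PER_HEX = 40      # e.g. '{"id":"...","h":3,"c":"#a3b2c1"},'
CHARS_PER_TOKEN = 4
IMAGE_TOKENS = 258      # one 768x768 image tile

json_tokens = NUM_HEXES * CHARS_PER_HEX // CHARS_PER_TOKEN
print(json_tokens, "tokens for the full JSON state vs",
      IMAGE_TOKENS, "for one screenshot")
```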

What's next for GME

We plan to expand the architecture into a multiplayer experience where multiple users can prompt different Architect agents that must compete or collaborate to build the dominant biome on the hex sphere. We also plan to optimize the image-to-3D pipeline to run entirely on edge devices or faster T4 endpoints.

Personal note:

With more budget I could make it even more powerful, and I'm looking forward to your optimized Common Sense Machines model (:

Built With

  • antigravity
  • cloudrun
  • gcp
  • gemini
  • vertexai
  • vm