AMA — Interactive Story Studio

Hackathon Submission · Gemini Live Agent Challenge 2026

AMA is a node-based story platform using Gemini. Creators build branching AI narratives with voice and images—no coding needed. Audiences speak to characters to shape any genre in real time.


Inspiration

Every child deserves a story that feels alive — one that listens, responds, and grows with them.

We wanted to give children a truly immersive reading experience, whether the book already has illustrations or is just plain text. With AI, any story can become a rich, visual, and interactive world tailored to each child's imagination.

The spark came from the Gemini Live API and its interleaved generation capability. A storybook doesn't have to be static — characters can speak, react to what a child says, and branch the narrative in real time. A child's response becomes part of the story. Every read-through is a unique adventure.

But building that kind of experience is hard. Generating AI content requires orchestrating dozens of API calls — image generation, video, speech, live streaming — and there's no good tool to tie it all together. So we built one: a studio where creators can upload any book, generate all the multimedia assets they need, and wire everything into an interactive story graph — without writing a single line of code.

What it does

AMA has two layers that work together: a Pipeline Editor for creators to build stories, and a Theater Mode for audiences to experience them.

1. Pipeline Editor (Authoring Tool)

A web-based visual node editor where creators build interactive stories — no coding required:

  • Start from anything — upload a PDF or describe a story in the AI studio interface; Gemini reads and understands it automatically, then generates the initial story graph
  • Generate multimedia assets per node: character sprites, animated backgrounds (video via Veo), and narration audio — all powered by Google's generative AI models
  • Build a branching story graph with drag-and-drop nodes and edges
  • Centralized asset management — every node is an independent workstation where you create and manage its assets; all assets are shared across the project so you can reuse them freely
  • Character art consistency — image generation uses Gemini's image reference ability, so characters look the same across every node and scene

The story graph has four node types, each a distinct creative tool:

  • Dream Node: captures real-time audience input via Gemini Live, then uses ✨ Gemini Interleaved Generation to produce story text and images simultaneously — the story writes and illustrates itself as the child speaks
  • Live Node: activates the Gemini Live API for real-time voice interaction; Gemini navigates story branches based on audience responses
  • Image Node: composes characters directly into the background using Gemini image generation, preserving art style consistency
  • Page Node: separates characters from the background as transparent sprites for programmatic position, scale, and movement control
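
Concretely, the four node types slot into an ordinary directed graph. A minimal sketch of that structure (class and field names are ours for illustration, not AMA's actual schema):

```python
from dataclasses import dataclass, field

# Illustrative sketch of a branching story graph; names are hypothetical.
NODE_TYPES = {"dream", "live", "image", "page"}

@dataclass
class StoryNode:
    node_id: str
    node_type: str                               # one of NODE_TYPES
    assets: dict = field(default_factory=dict)   # e.g. {"background": "bg_v3.mp4"}

@dataclass
class StoryGraph:
    nodes: dict = field(default_factory=dict)    # node_id -> StoryNode
    edges: dict = field(default_factory=dict)    # node_id -> [next node_ids]

    def add_node(self, node: StoryNode) -> None:
        assert node.node_type in NODE_TYPES
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, [])

    def branch(self, src: str, dst: str) -> None:
        self.edges[src].append(dst)

graph = StoryGraph()
graph.add_node(StoryNode("intro", "page"))
graph.add_node(StoryNode("ask_child", "live"))
graph.branch("intro", "ask_child")
```

The editor's drag-and-drop operations map directly onto `add_node` and `branch`.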

2. Theater Mode (Runtime Experience)

The audience-facing stage where the story comes alive:

  • Background videos play on loop (generated by Veo)
  • Character sprites animate with emotional states (idle, happy, surprised, and more)
  • Narration audio plays synchronized with visuals
  • Gemini Live agent listens to and watches the audience, speaking as the story character in real time
  • Branching navigation — Gemini decides which story branch to take based on what the audience says or does
  • Dream sequences — the audience describes something imaginary; Gemini generates a live storyboard and narrates it as images stream in
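
Branching navigation of this kind can be wired as a tool call that the runtime validates against the graph. A sketch under assumed names (the `navigate_to` tool and its payload shape are hypothetical, not the production wiring):

```python
# Hypothetical sketch: the live agent picks a branch via a tool call;
# the runtime checks the choice against the story graph's edges.
story_edges = {
    "forest_gate": ["meet_dragon", "follow_river"],
    "meet_dragon": ["dragon_friend"],
}

def handle_navigate(tool_call: dict, current_node: str) -> str:
    """Apply a (hypothetical) navigate_to tool call, falling back to the
    first outgoing edge if the model picks an invalid branch."""
    target = tool_call.get("args", {}).get("node_id")
    allowed = story_edges.get(current_node, [])
    if target in allowed:
        return target
    return allowed[0] if allowed else current_node  # safe fallback

nxt = handle_navigate(
    {"name": "navigate_to", "args": {"node_id": "meet_dragon"}},
    current_node="forest_gate",
)
```

Validating against the graph keeps a hallucinated node id from derailing the show: the story simply continues down a legal branch.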

How we built it

Tech Stack

  • Frontend: React 19.2, TypeScript, Vite, TailwindCSS 4
  • Node Graph Editor: XYFlow 12
  • Sprite Rendering: Pixi.js 8 (WebGL)
  • Backend: FastAPI 0.115, Python 3.12, asyncio
  • Package Manager: uv (fast Python packaging)
  • Story Understanding: Gemini 3 Flash
  • Image Generation: Gemini 3 Pro Image
  • Video Generation: Veo 3.1
  • Text-to-Speech: Gemini 2.5 Flash TTS
  • Live Real-Time: Gemini 2.5 Flash Live (audio + vision)
  • Infra: Google Cloud Run (Gen 2), GCS, Terraform, Cloud Build
  • Background Removal: rembg (ONNX, baked into Docker image)

Key Design Decision: Two-Layer Architecture

We made a deliberate choice to separate authoring (pipeline) from runtime (live agent):

  • Pipeline stage: all expensive generative assets (video, sprites, audio) are pre-generated and stored, so the child never waits for generation
  • Live stage: Gemini only handles real-time conversation and navigation decisions — everything else is instant playback

This gives us zero-latency narrative — the AI doesn't think about the plot, only about how to react to the child.


Challenges we ran into

1. Gemini's interleaved multimodal output is fragile — we had to prove it first

Before building Dream Nodes, we ran a dedicated experiment series to validate what the Gemini API actually supports versus what the documentation claims. The results were sobering:

  • TEXT + IMAGE interleaving works — but only with careful prompting. In multi-turn chat, once image history accumulates, the model silently drops all text output and returns images only. Sending more than one reference image on the first turn triggers the same failure. Fix: explicitly instruct "Respond with the scene text first, then the illustration." in every prompt, and switch from streaming to non-streaming calls for multi-image turns.
  • TEXT + VIDEO in a single call is impossible. "VIDEO" is not a valid response_modalities value — the API returns a hard 400 immediately.
  • TEXT + AUDIO interleaving causes hallucination. The model understood it should produce audio, but invented a fake <tool_call> TTS function and returned zero audio bytes. It knows what to do but fakes how to do it.

The conclusion: Gemini's claim of unified TEXT + VIDEO + AUDIO in one call is inaccurate. We had to abandon the idea of a single omnibus generation call and instead orchestrate separate purpose-built APIs in sequence — Flash for text, Pro Image for illustration, Veo for video, and the TTS pipeline for audio. This experiment report directly shaped our two-layer pipeline architecture.
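
That orchestration is easy to sketch with stubs standing in for the real model calls (stub names are ours; only the sequencing reflects our pipeline): text comes first because it drives every downstream asset, while the assets themselves have no ordering dependency on each other and can run concurrently.

```python
import asyncio

# Stubs standing in for the real purpose-built calls
# (Flash for text, Pro Image, Veo, TTS) — the point is the sequencing.
async def gen_text(prompt): return f"scene text for: {prompt}"
async def gen_image(text): return b"png-bytes"
async def gen_video(text): return b"mp4-bytes"
async def gen_audio(text): return b"wav-bytes"

async def generate_scene(prompt: str) -> dict:
    # Text first: it is the input to every other generator.
    text = await gen_text(prompt)
    # Image, video, and audio are independent of each other,
    # so they run concurrently.
    image, video, audio = await asyncio.gather(
        gen_image(text), gen_video(text), gen_audio(text)
    )
    return {"text": text, "image": image, "video": video, "audio": audio}

scene = asyncio.run(generate_scene("a dragon in the garden"))
```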

2. Gemini SDK blocking the event loop

The Google GenAI SDK makes synchronous blocking calls internally. Wrapping them naively inside FastAPI's async routes deadlocked the server. We fixed this with asyncio.to_thread() to offload blocking calls without freezing the event loop.
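
The pattern, reduced to a runnable sketch (the blocking function here is a stand-in for the SDK call, not the SDK itself):

```python
import asyncio
import time

def blocking_generate(prompt: str) -> str:
    """Stand-in for a synchronous SDK call that blocks its thread."""
    time.sleep(0.1)  # simulates network + model latency
    return f"result for {prompt}"

async def route_handler(prompt: str) -> str:
    # Offload the blocking call to a worker thread so the event loop
    # stays free to serve other concurrent requests.
    return await asyncio.to_thread(blocking_generate, prompt)

print(asyncio.run(route_handler("hello")))
```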

3. Veo 3 rate limits (2 RPM, 10 RPD)

Video generation is powerful but slow and heavily rate-limited. We built an asset versioning and reuse system so creators never have to regenerate a background unless they intentionally want to — the version picker lets them choose the best generated video across multiple runs.
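
The reuse logic boils down to: never regenerate when a cached version exists unless the creator asks. A sketch with illustrative names (an in-memory dict stands in for our asset store):

```python
# Hypothetical sketch of asset reuse under a tight rate limit.
versions = {"forest_bg": ["v1.mp4", "v2.mp4"]}  # asset -> generated versions

def get_background(asset_id, force_new=False, pick=None):
    existing = versions.get(asset_id, [])
    if existing and not force_new:
        # The creator may pin any earlier version; default to the latest.
        idx = pick if pick is not None else len(existing) - 1
        return existing[idx]
    new = f"v{len(existing) + 1}.mp4"          # real system would call Veo here
    versions.setdefault(asset_id, []).append(new)
    return new
```

Only `force_new=True` spends one of the day's ten Veo requests; every other path is a cache hit.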

4. Character art consistency with reference images

More reference images do not help — they hurt. Our experiments showed that sending 3 reference images caused the model to hallucinate character details and drop text output entirely. We settled on a single reference image per generation call combined with explicit style instructions to keep characters visually consistent across scenes.

5. Character persona consistency across branching

When a child navigates to a different story branch, how does Gemini maintain the character's personality and memory? We designed a "Context Packet" — a dynamic summary of the character's journey injected into the system prompt on every page transition. This remains an active area of refinement.
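
A minimal sketch of what such a packet could look like (the fields and wording are illustrative, not our production prompt):

```python
# Hypothetical "Context Packet": a compact journey summary injected
# into the system prompt at every page transition.
def build_context_packet(character, visited, facts):
    return (
        f"You are {character}. Stay in character.\n"
        f"Pages visited so far: {' -> '.join(visited)}.\n"
        f"Remember: {'; '.join(facts)}."
    )

packet = build_context_packet(
    "Milo the fox",
    ["intro", "forest_gate", "meet_dragon"],
    ["the child named the dragon Spark", "Milo promised to find the key"],
)
```

Because the packet is rebuilt from the traversal history on each transition, the character's memory follows the child across branches instead of living in one conversation thread.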

6. Per-page asset versioning complexity

Our initial storage model used global version counters per character. As story branching was added, this broke — a sprite version relevant on page 4 might not be correct for page 7. We redesigned to per-page, per-asset version tracking stored in meta.json.
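
A sketch of the per-page layout (key names are illustrative; our actual meta.json schema differs in detail):

```python
import json

# Hypothetical meta.json layout: versions are tracked per page AND per
# asset, so page 4 and page 7 can pin different sprite versions.
meta = {
    "pages": {
        "4": {"assets": {"milo_sprite": {"version": 3}}},
        "7": {"assets": {"milo_sprite": {"version": 5}}},
    }
}

def sprite_version(meta, page, asset):
    """Resolve which version of an asset a given page should play."""
    return meta["pages"][page]["assets"][asset]["version"]

# Round-trip through JSON, as it would be written to / read from meta.json.
restored = json.loads(json.dumps(meta))
```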

7. Background removal cold start

rembg downloads a ~87MB ONNX model on first use. On Cloud Run with a cold start, this caused 30+ second delays. We baked the model into the Docker image so it is always available from the first request.


Accomplishments that we're proud of

  • Full end-to-end pipeline: upload a PDF → generate assets → build a branching story → play it live with a real-time AI character. It all works.
  • Dream Nodes: a child says "I have a pet dragon" and within ~30 seconds Gemini has generated a 3-scene storyboard showing the dragon's adventure, narrating it live while images stream in. This felt genuinely magical in testing.
  • Zero-latency playback: by pre-generating all assets in the pipeline stage, the theater experience has no loading screens or generation pauses — page transitions are instant.
  • Production-grade infrastructure: Terraform IaC, Secret Manager, GCS FUSE mount, Cloud Build CI/CD, and Cloud Run Gen 2 — the whole stack is reproducible with a single ./deploy.sh command.
  • Visual node editor: the XYFlow-based pipeline editor lets non-technical storytellers see their story structure as a graph and modify it visually — no JSON editing required.

What we learned

  • Gemini Live API is more than a voice chat — the combination of native audio, real-time vision, and tool-calling makes it a genuine AI director capable of understanding context and making decisions, not just responding to prompts.
  • Pre-generation is the right split point — separating deterministic asset generation from real-time agent behavior is the key to a smooth experience. Trying to generate assets during live interaction kills immersion.
  • Rate limits shape product design — Veo's 10 requests/day limit forced us to think about asset reuse as a first-class feature, which actually made the authoring tool better.
  • Children as users require different trust models — the AI "hallucinating" story details is actually fine (it adds to the magic). The standard "AI mistakes are costly" assumption doesn't apply when the domain is creative play.
  • Infrastructure-first matters for hackathons too — having Terraform and deploy.sh set up early meant we never lost time to deployment issues during crunch time.

What's next for AMA — Interactive Story Studio

Near term:

  • Better persona consistency — implement the full "Context Packet" system so Gemini carries rich character memory across all branches of the story graph
  • Faster Dream Nodes — reduce the ~30s wait for imagination generation using streaming interleaved output and background preprocessing
  • Physical projection mode — connect to a projector and add OpenCV hand-tracking so children can "touch" characters on the wall

Medium term:

  • Music generation — integrate Lyria (Google's music generation model) to generate adaptive background music that responds to story mood
  • Text-to-story ingestion — allow creators to paste plain text (not just PDFs) to bootstrap a story graph automatically
  • Multiplayer — support multiple children in the same room, with Gemini managing a group dynamic instead of one-on-one

Longer term:

  • Projector + camera calibration — full spatial mapping so characters can be projected to specific locations on a wall and respond to the child's physical position
  • Any-book generalization — point a camera at any physical picture book; Gemini extracts characters and builds a live story graph on the fly (the original LuminaPages vision)
  • Model routing — use smaller/cheaper models (Flash) for simple narration, reserve more capable models for complex branching decisions, reducing cost per session
