
AMA — Interactive Story Studio

Hackathon Submission · Gemini Live Agent Challenge 2026

AMA is a node-based story platform using Gemini. Creators build branching AI narratives with voice and images — no coding needed. Audiences speak to characters to shape any genre in real time.

AMA Demo


Inspiration

Every child deserves a story that feels alive — one that listens, responds, and grows with them.

We wanted to give children a truly immersive reading experience, whether the book already has illustrations or is just plain text. With AI, any story can become a rich, visual, and interactive world tailored to each child's imagination.

The spark came from the Gemini Live API and its interleaved generation capability. A storybook doesn't have to be static — characters can speak, react to what a child says, and branch the narrative in real time. A child's response becomes part of the story. Every read-through is a unique adventure.

But building that kind of experience is hard. Generating AI content requires orchestrating dozens of API calls — image generation, video, speech, live streaming — and there's no good tool to tie it all together. So we built one: a studio where creators can upload any book, generate all the multimedia assets they need, and wire everything into an interactive story graph — without writing a single line of code.

System UI


What it does

AMA has two layers that work together: a Pipeline Editor for creators to build stories, and a Theater Mode for audiences to experience them.

1. Pipeline Editor (Authoring Tool)

A web-based visual node editor where creators build interactive stories — no coding required:

  • Start from anything — upload a PDF or describe a story in the AI studio interface; Gemini reads it and generates the initial story graph automatically
  • Generate multimedia assets per node: character sprites, animated backgrounds (video via Veo), and narration audio — all powered by Google's generative AI models
  • Build a branching story graph with drag-and-drop nodes and edges
  • Centralized asset management — every node is an independent workstation where you create and manage its assets; all assets are shared across the project so you can reuse them freely
  • Character art consistency — image generation uses Gemini's image reference ability, so characters look the same across every node and scene

The story graph has four node types, each a distinct creative tool:

| Node Type | What it does |
| --- | --- |
| Dream Node | Captures real-time audience input via Gemini Live, then uses ✨ Gemini Interleaved Generation to produce story text and images simultaneously — the story writes and illustrates itself as the child speaks |
| Live Node | Activates the Gemini Live API for real-time voice interaction; Gemini navigates story branches based on audience responses |
| Image Node | Composes characters directly into the background using Gemini image generation — preserving art style consistency |
| Page Node | Separates characters from the background as transparent sprites for programmatic position, scale, and movement control |

2. Theater Mode (Runtime Experience)

The audience-facing stage where the story comes alive:

  • Background videos play on loop (generated by Veo)
  • Character sprites animate with emotional states (idle, happy, surprised, and more)
  • Narration audio plays synchronized with visuals
  • Gemini Live agent listens to and watches the audience, speaking as the story character in real time
  • Branching navigation — Gemini decides which story branch to take based on what the audience says or does
  • Dream sequences — the audience describes something imaginary; Gemini generates a live storyboard and narrates it as images stream in

Theater mode demos:

  1. Image node with TTS narration
  2. Live node: Theater Mode with the Live API listening to the audience


How we built it

Architecture

Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | React 19.2, TypeScript, Vite, TailwindCSS 4 |
| Node Graph Editor | XYFlow 12 |
| Sprite Rendering | Pixi.js 8 (WebGL) |
| Backend | FastAPI 0.115, Python 3.12, asyncio |
| Package Manager | uv (fast Python packaging) |
| Story Understanding | Gemini 3 Flash |
| Image Generation | Gemini 3 Pro Image |
| Video Generation | Veo 3.1 |
| Text-to-Speech | Gemini 2.5 Flash TTS |
| Live Real-Time | Gemini 2.5 Flash Live (audio + vision) |
| Infra | Google Cloud Run (Gen 2), GCS, Terraform, Cloud Build |
| Background Removal | rembg (ONNX, baked into Docker image) |

Key Design Decision: Two-Layer Architecture

We made a deliberate choice to separate authoring (pipeline) from runtime (live agent):

  • Pipeline stage: all expensive generative assets (video, sprites, audio) are pre-generated and stored, so the child never waits for generation
  • Live stage: Gemini only handles real-time conversation and navigation decisions — everything else is instant playback

This gives us zero-latency narrative — the AI doesn't think about the plot, only about how to react to the child.


Challenges we ran into

1. Gemini's interleaved multimodal output is fragile — we had to prove it first

Before building Dream Nodes, we ran a dedicated experiment series to validate what the Gemini API actually supports versus what the documentation claims. The results were sobering:

  • TEXT + IMAGE interleaving works — but only with careful prompting. In multi-turn chat, once image history accumulates, the model silently drops all text output and returns images only. Sending more than one reference image on the first turn triggers the same failure. Fix: explicitly instruct "Respond with the scene text first, then the illustration." in every prompt, and switch from streaming to non-streaming calls for multi-image turns.
  • TEXT + VIDEO in a single call is impossible. "VIDEO" is not a valid response_modalities value — the API returns a hard 400 immediately.
  • TEXT + AUDIO interleaving causes hallucination. The model understood it should produce audio, but invented a fake <tool_call> TTS function and returned zero audio bytes. It knows what to do but fakes how to do it.

The conclusion: Gemini's claim of unified TEXT + VIDEO + AUDIO in one call is inaccurate. We had to abandon the idea of a single omnibus generation call and instead orchestrate separate purpose-built APIs in sequence — Flash for text, Pro Image for illustration, Veo for video, and the TTS pipeline for audio. This experiment report directly shaped our two-layer pipeline architecture.
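The resulting orchestration is a plain sequence of purpose-built calls. A minimal sketch, where the `generate_*` helpers are hypothetical stand-ins for the actual Flash, Pro Image, and TTS calls:

```python
# Hypothetical sketch of per-scene orchestration: one purpose-built call
# per modality, run in sequence instead of a single omnibus request.
def generate_text(prompt: str) -> str:
    return f"Scene: {prompt}"          # stand-in for a Gemini Flash text call

def generate_image(scene_text: str) -> bytes:
    return b"<png bytes>"              # stand-in for a Pro Image call

def synthesize_narration(scene_text: str) -> bytes:
    return b"<wav bytes>"              # stand-in for the TTS pipeline

def generate_scene(prompt: str) -> dict:
    # Text is generated first, then used as input for illustration and audio
    text = generate_text(prompt)
    return {
        "text": text,
        "image": generate_image(text),
        "audio": synthesize_narration(text),
    }
```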

2. Gemini SDK blocking the event loop

The Google GenAI SDK makes synchronous blocking calls internally. Wrapping them naively inside FastAPI's async routes deadlocked the server. We fixed this with asyncio.to_thread() to offload blocking calls without freezing the event loop.
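A minimal sketch of the pattern, with a `time.sleep` standing in for the SDK's blocking network call:

```python
import asyncio
import time

def blocking_generate(prompt: str) -> str:
    """Stand-in for a synchronous google-genai SDK call."""
    time.sleep(0.05)  # simulates blocking network I/O
    return f"asset for: {prompt}"

async def handler(prompt: str) -> str:
    # Awaiting to_thread runs the blocking call in a worker thread,
    # keeping the event loop free to serve other requests meanwhile.
    return await asyncio.to_thread(blocking_generate, prompt)

result = asyncio.run(handler("dragon sprite"))
```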

3. Veo 3 rate limits (2 RPM, 10 RPD)

Video generation is powerful but slow and heavily rate-limited. We built an asset versioning and reuse system so creators never have to regenerate a background unless they intentionally want to — the version picker lets them choose the best generated video across multiple runs.
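At its core the reuse logic is memoization keyed by page. A sketch with hypothetical names (the real version picker additionally tracks multiple versions per asset):

```python
def get_or_generate_background(cache: dict, page_id: str, generate) -> str:
    """Return the cached background for this page, generating only on a miss.

    `generate` stands in for the rate-limited Veo call, so it is invoked
    only when no version exists yet for this page.
    """
    if page_id not in cache:
        cache[page_id] = generate(page_id)
    return cache[page_id]
```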

4. Character art consistency with reference images

More reference images do not help — they hurt. Our experiments showed that sending 3 reference images caused the model to hallucinate character details and drop text output entirely. We settled on a single reference image per generation call combined with explicit style instructions to keep characters visually consistent across scenes.

5. Character persona consistency across branching

When a child navigates to a different story branch, how does Gemini maintain the character's personality and memory? We designed a "Context Packet" — a dynamic summary of the character's journey injected into the system prompt on every page transition. This remains an active area of refinement.
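A sketch of the idea (function and field names are hypothetical): on each page transition, the most recent story beats are summarized and prepended to the system prompt:

```python
def build_context_packet(persona: str, journey: list[str]) -> str:
    """Hypothetical sketch of the Context Packet: a rolling summary of the
    character's journey, injected into the system prompt on each transition."""
    recent = "; ".join(journey[-3:])  # keep only the most recent story beats
    return (
        f"You are {persona}. Stay in character at all times.\n"
        f"Story so far: {recent}"
    )
```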

6. Per-page asset versioning complexity

Our initial storage model used global version counters per character. As story branching was added, this broke — a sprite version relevant on page 4 might not be correct for page 7. We redesigned to per-page, per-asset version tracking stored in meta.json.
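A minimal sketch of the per-page tracking (the field names are illustrative, not the actual meta.json schema):

```python
import json

# Illustrative meta.json shape: versions are pinned per page AND per asset,
# so page 4 and page 7 can reference different sprite versions independently.
meta = {
    "pages": {
        "page_04": {"assets": {"dragon_sprite": {"version": 3}}},
        "page_07": {"assets": {"dragon_sprite": {"version": 5}}},
    }
}

def pinned_version(meta: dict, page: str, asset: str) -> int:
    """Look up the asset version pinned for a specific page."""
    return meta["pages"][page]["assets"][asset]["version"]

# meta.json round-trips as plain JSON on disk
restored = json.loads(json.dumps(meta))
```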

7. Background removal cold start

rembg downloads an ~87 MB ONNX model on first use. On Cloud Run with a cold start, this caused 30+ second delays. We baked the model into the Docker image so it is always available from the first request.
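The fix amounts to downloading the model at build time rather than on the first request. A sketch of the relevant Dockerfile step, assuming rembg's default u2net model (the exact invocation in our image may differ):

```dockerfile
# Bake the rembg ONNX model into the image at build time so cold starts
# never pay the ~87 MB download (hypothetical sketch of the relevant step)
RUN pip install rembg && \
    python -c "from rembg import new_session; new_session('u2net')"
```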


Accomplishments that we're proud of

  • Full end-to-end pipeline: upload a PDF → generate assets → build a branching story → play it live with a real-time AI character. It all works.
  • Dream Nodes: a child says "I have a pet dragon" and within ~30 seconds Gemini has generated a 3-scene storyboard showing the dragon's adventure, narrating it live while images stream in. This felt genuinely magical in testing.
  • Zero-latency playback: by pre-generating all assets in the pipeline stage, the theater experience has no loading screens or generation pauses — page transitions are instant.
  • Production-grade infrastructure: Terraform IaC, Secret Manager, GCS FUSE mount, Cloud Build CI/CD, and Cloud Run Gen 2 — the whole stack is reproducible with a single ./deploy.sh command.
  • Visual node editor: the XYFlow-based pipeline editor lets non-technical storytellers see their story structure as a graph and modify it visually — no JSON editing required.

What we learned

  • Gemini Live API is more than a voice chat — the combination of native audio, real-time vision, and tool-calling makes it a genuine AI director capable of understanding context and making decisions, not just responding to prompts.
  • Pre-generation is the right split point — separating deterministic asset generation from real-time agent behavior is the key to a smooth experience. Trying to generate assets during live interaction kills immersion.
  • Rate limits shape product design — Veo's 10 requests/day limit forced us to think about asset reuse as a first-class feature, which actually made the authoring tool better.
  • Children as users require different trust models — the AI "hallucinating" story details is actually fine (it adds to the magic). The standard "AI mistakes are costly" assumption doesn't apply when the domain is creative play.
  • Infrastructure-first matters for hackathons too — having Terraform and deploy.sh set up early meant we never lost time to deployment issues during crunch time.

What's next for AMA — Interactive Story Studio

Near term:

  • Better persona consistency — implement the full "Context Packet" system so Gemini carries rich character memory across all branches of the story graph
  • Faster Dream Nodes — reduce the ~30s wait for imagination generation using streaming interleaved output and background preprocessing
  • Physical projection mode — connect to a projector and add OpenCV hand-tracking so children can "touch" characters on the wall

Medium term:

  • Music generation — integrate Lyria (Google's music generation model) to generate adaptive background music that responds to story mood
  • Text-to-story ingestion — allow creators to paste plain text (not just PDFs) to bootstrap a story graph automatically
  • Multiplayer — support multiple children in the same room, with Gemini managing a group dynamic instead of one-on-one

Longer term:

  • Projector + camera calibration — full spatial mapping so characters can be projected to specific locations on a wall and respond to the child's physical position
  • Any-book generalization — point a camera at any physical picture book; Gemini extracts characters and builds a live story graph on the fly (the original LuminaPages vision)
  • Model routing — use smaller/cheaper models (Flash) for simple narration, reserve more capable models for complex branching decisions, reducing cost per session

Test Instructions

You can try AMA live without any local setup:

  1. Navigate to the app: https://ama-api-525glmjwra-uc.a.run.app/
  2. Create a new project or open the demo project directly: https://ama-api-525glmjwra-uc.a.run.app/pipeline/20260314_201134_untitled
  3. Enter Theater Mode using the button in the top-left corner of the Pipeline Editor.
  4. Testing the Dream Node: The Dream Node uses real-time camera input. Before entering Theater Mode, open the camera page and select the same project ID (20260314_201134_untitled) so your camera stream is sent to the server: https://ama-api-525glmjwra-uc.a.run.app/camera

The camera page must have the correct project ID selected to route your video feed into the Dream Node during the theater experience.


How to Run

Prerequisites

Local Development

Backend:

```bash
cd AMA/backend

# Copy environment template and add your API key
cp .env.example .env
# Edit .env: set GEMINI_API_KEY=your_key_here

# Install dependencies with uv (fast Python package manager)
pip install uv
uv sync

# Start the server (mock mode on by default — no real API calls)
uv run uvicorn main:app --reload --port 8000
```

The backend will be available at http://localhost:8000.

Frontend:

```bash
cd AMA/frontend

npm install
npm run dev
```

The app will be available at http://localhost:5173. The dev server proxies API calls to localhost:8000.

Environment Variables

Create AMA/backend/.env:

```bash
GEMINI_API_KEY=your_gemini_api_key_here

# Mock mode (default: true) — uses test fixtures, no real API calls
# Set to false to use real Gemini/Veo models
MOCK_MODE=true

# Vertex AI (optional) — use Vertex AI instead of the Gemini API directly
GOOGLE_GENAI_USE_VERTEXAI=false
# GOOGLE_CLOUD_PROJECT=your-gcp-project-id
# GOOGLE_CLOUD_LOCATION=us-central1
```

Start with MOCK_MODE=true to explore the app without spending API credits. Switch to false when you're ready to generate real assets.

Using the App (Local)

  1. Open http://localhost:5173
  2. Create a new project or upload a PDF storybook
  3. In the Pipeline Editor, run each stage in order:
    • Story Understanding → extracts pages, characters, and structure
    • Sprite Generation → creates character PNG sprites
    • Background Generation → generates video loops (watch Veo rate limits!)
    • TTS Narration → generates audio for each page
  4. Click on any story node to edit its assets, script, and characters
  5. Add edges between nodes to create branching paths
  6. Switch to Theater Mode and select a starting node to play

How to Deploy

Prerequisites

```bash
# Install tools (gcloud ships as the Google Cloud SDK cask)
brew install terraform
brew install --cask google-cloud-sdk

# Authenticate with Google Cloud
gcloud auth login
gcloud auth application-default login
gcloud config set project <name>
```

One-Command Deploy

```bash
cd AMA

# Set your API key (never commit this to git)
export GEMINI_API_KEY="your-gemini-api-key"

# Deploy everything
./deploy.sh
```

The script automates:

  1. Push GEMINI_API_KEY to Google Secret Manager
  2. Run Terraform to enable GCP APIs and create Artifact Registry
  3. Build the React frontend (with Cloud Run URL baked in)
  4. Build and push the Docker image via Cloud Build
  5. Run full Terraform apply (Cloud Run, GCS bucket, IAM, service accounts)
  6. On first deploy: rebuild frontend with real URL and redeploy

When done, the script prints:

```text
✅ Done!
   App    : https://ama-api-xxxx.run.app
   Health : https://ama-api-xxxx.run.app/api/health
```

Infrastructure Details

| Resource | Details |
| --- | --- |
| Cloud Run (Gen 2) | 2 CPU, 4 GB RAM, 0–3 instances, GCS FUSE mount |
| GCS Bucket | Stores all project data and generated assets |
| Artifact Registry | Docker image storage (us-central1) |
| Secret Manager | gemini-api-key secret (never in env vars) |
| Cloud Build | Builds Docker image on e2-highcpu-8 |
| Region | us-central1 |
| GCP Project | geminiliveagent-489401 |

Manual Terraform (Advanced)

```bash
cd AMA/infra

# Initialize
terraform init

# Preview changes
terraform plan

# Apply
terraform apply
```

Useful Commands

```bash
# View Cloud Run logs
gcloud run services logs read ama-api --region=us-central1

# Check health
curl https://your-cloud-run-url.run.app/api/health

# Update secret (rotate API key)
echo -n "new-key" | gcloud secrets versions add gemini-api-key --data-file=-
```

Project Structure

```text
AMA/
├── backend/
│   ├── main.py              # FastAPI app entry point
│   ├── config.py            # Model names, env vars, paths
│   ├── storage.py           # File-based persistence (project data)
│   ├── jobs.py              # Async job queue for pipeline stages
│   ├── routes/
│   │   ├── projects.py      # Project CRUD
│   │   ├── pipeline.py      # Asset generation triggers
│   │   ├── live.py          # WebSocket: Gemini Live agent
│   │   ├── dream.py         # WebSocket: Dream node generation
│   │   ├── assets.py        # Asset library management
│   │   └── camera.py        # Camera stream subscription
│   └── pipeline/
│       ├── story.py         # Stage 1: PDF → story structure
│       ├── assets.py        # Stage 2: sprite generation
│       ├── background.py    # Stage 3: Veo video generation
│       └── tts.py           # Stage 4: narration audio
│
├── frontend/
│   └── src/
│       ├── pages/
│       │   ├── ProjectsPage.tsx     # Project list
│       │   ├── PipelinePage.tsx     # Node graph editor
│       │   └── TheaterPage.tsx      # Live story playback
│       └── components/
│           ├── pipeline/            # Node editor components
│           └── theater/             # Playback + live session
│
├── infra/                   # Terraform IaC (Cloud Run, GCS, IAM)
├── deploy.sh                # One-command deployment script
└── Dockerfile               # Multi-stage build (frontend + backend)
```

GCP proof
