Hackathon Submission · Gemini Live Agent Challenge 2026

AMA is a node-based story platform using Gemini. Creators build branching AI narratives with voice and images—no coding needed. Audiences speak to characters to shape any genre in real time.
Every child deserves a story that feels alive — one that listens, responds, and grows with them.
We wanted to give children a truly immersive reading experience, whether the book already has illustrations or is just plain text. With AI, any story can become a rich, visual, and interactive world tailored to each child's imagination.
The spark came from the Gemini Live API and its interleaved generation capability. A storybook doesn't have to be static — characters can speak, react to what a child says, and branch the narrative in real time. A child's response becomes part of the story. Every read-through is a unique adventure.
But building that kind of experience is hard. Generating AI content requires orchestrating dozens of API calls — image generation, video, speech, live streaming — and there's no good tool to tie it all together. So we built one: a studio where creators can upload any book, generate all the multimedia assets they need, and wire everything into an interactive story graph — without writing a single line of code.
AMA has two layers that work together: a Pipeline Editor for creators to build stories, and a Theater Mode for audiences to experience them.
A web-based visual node editor where creators build interactive stories — no coding required:
- Start from anything — upload a PDF or use the AI studio interface to describe a story; Gemini reads and understands it automatically and generates the initial story graph
- Generate multimedia assets per node: character sprites, animated backgrounds (video via Veo), and narration audio — all powered by Google's generative AI models
- Build a branching story graph with drag-and-drop nodes and edges
- Centralized asset management — every node is an independent workstation where you create and manage its assets; all assets are shared across the project so you can reuse them freely
- Character art consistency — image generation uses Gemini's image reference ability, so characters look the same across every node and scene
The story graph has four node types, each a distinct creative tool:
The audience-facing stage where the story comes alive:
- Background videos play on loop (generated by Veo)
- Character sprites animate with emotional states (idle, happy, surprised, and more)
- Narration audio plays synchronized with visuals
- Gemini Live agent listens to and watches the audience, speaking as the story character in real time
- Branching navigation — Gemini decides which story branch to take based on what the audience says or does
- Dream sequences — the audience describes something imaginary; Gemini generates a live storyboard and narrates it as images stream in
- Story nodes: illustrated pages, pairing a generated image with TTS narration
- Live nodes: Theater Mode scenes where the Gemini Live API listens and speaks to the audience in real time
- Dream nodes: the audience describes something imaginary and Gemini generates a live, narrated storyboard
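In Theater Mode, branch decisions come from Gemini tool-calling. A minimal sketch of how a navigation tool could be declared and its result validated; the tool name, schema, and helper below are hypothetical, not the actual AMA code:

```python
# Hypothetical function declaration handed to the live agent (names are ours)
NAVIGATE_TOOL = {
    "name": "navigate_to_node",
    "description": "Jump to the story node that best matches what the child just said or did.",
    "parameters": {
        "type": "object",
        "properties": {
            "node_id": {"type": "string", "description": "ID of the target story node"},
            "reason": {"type": "string", "description": "Why this branch fits the child's input"},
        },
        "required": ["node_id"],
    },
}

def apply_navigation(call: dict, edges: dict[str, list[str]], current: str) -> str:
    """Accept the agent's choice only if an edge to that node actually exists."""
    target = call["node_id"]
    return target if target in edges.get(current, []) else current

# The agent picks "n2"; the story graph at "n1" allows it, so we move there
nxt = apply_navigation({"node_id": "n2"}, {"n1": ["n2", "n3"]}, "n1")
```

Validating the agent's choice against the graph keeps a hallucinated node ID from derailing playback: an invalid target simply leaves the story on the current node.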
| Layer | Technology |
|---|---|
| Frontend | React 19.2, TypeScript, Vite, TailwindCSS 4 |
| Node Graph Editor | XYFlow 12 |
| Sprite Rendering | Pixi.js 8 (WebGL) |
| Backend | FastAPI 0.115, Python 3.12, asyncio |
| Package Manager | uv (fast Python packaging) |
| Story Understanding | Gemini 3 Flash |
| Image Generation | Gemini 3 Pro Image |
| Video Generation | Veo 3.1 |
| Text-to-Speech | Gemini 2.5 Flash TTS |
| Live Real-Time | Gemini 2.5 Flash Live (audio + vision) |
| Infra | Google Cloud Run (Gen 2), GCS, Terraform, Cloud Build |
| Background Removal | rembg (ONNX, baked into Docker image) |
We made a deliberate choice to separate authoring (pipeline) from runtime (live agent):
- Pipeline stage: all expensive generative assets (video, sprites, audio) are pre-generated and stored, so the child never waits for generation
- Live stage: Gemini only handles real-time conversation and navigation decisions — everything else is instant playback
This gives us zero-latency narrative — the AI doesn't think about the plot, only about how to react to the child.
1. Gemini's interleaved multimodal output is fragile — we had to prove it first
Before building Dream Nodes, we ran a dedicated experiment series to validate what the Gemini API actually supports versus what the documentation claims. The results were sobering:
- TEXT + IMAGE interleaving works — but only with careful prompting. In multi-turn chat, once image history accumulates, the model silently drops all text output and returns images only. Sending more than one reference image on the first turn triggers the same failure. Fix: explicitly instruct "Respond with the scene text first, then the illustration." in every prompt, and switch from streaming to non-streaming calls for multi-image turns.
- TEXT + VIDEO in a single call is impossible. `"VIDEO"` is not a valid `response_modalities` value — the API returns a hard 400 immediately.
- TEXT + AUDIO interleaving causes hallucination. The model understood it should produce audio, but invented a fake `<tool_call>` TTS function and returned zero audio bytes. It knows what to do but fakes how to do it.
The conclusion: Gemini's claim of unified TEXT + VIDEO + AUDIO in one call is inaccurate. We had to abandon the idea of a single omnibus generation call and instead orchestrate separate purpose-built APIs in sequence — Flash for text, Pro Image for illustration, Veo for video, and the TTS pipeline for audio. This experiment report directly shaped our two-layer pipeline architecture.
2. Gemini SDK blocking the event loop
The Google GenAI SDK makes synchronous blocking calls internally. Wrapping them naively inside FastAPI's async routes deadlocked the server. We fixed this with asyncio.to_thread() to offload blocking calls without freezing the event loop.
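A minimal sketch of the pattern, with a placeholder standing in for the blocking SDK call (function names are illustrative, not the actual AMA code):

```python
import asyncio

def generate_sprite_sync(prompt: str) -> str:
    # Placeholder for a blocking SDK call such as a synchronous generate_content
    return f"sprite for: {prompt}"

async def generate_sprite(prompt: str) -> str:
    # Offload the blocking call to a worker thread so the event loop keeps serving
    # other requests instead of deadlocking inside an async route
    return await asyncio.to_thread(generate_sprite_sync, prompt)

result = asyncio.run(generate_sprite("a friendly dragon"))
```

Inside a FastAPI route the `asyncio.run` wrapper disappears; you simply `await` the async function.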
3. Veo 3 rate limits (2 RPM, 10 RPD)
Video generation is powerful but slow and heavily rate-limited. We built an asset versioning and reuse system so creators never have to regenerate a background unless they intentionally want to — the version picker lets them choose the best generated video across multiple runs.
4. Character art consistency with reference images
More reference images do not help — they hurt. Our experiments showed that sending 3 reference images caused the model to hallucinate character details and drop text output entirely. We settled on a single reference image per generation call combined with explicit style instructions to keep characters visually consistent across scenes.
5. Character persona consistency across branching
When a child navigates to a different story branch, how does Gemini maintain the character's personality and memory? We designed a "Context Packet" — a dynamic summary of the character's journey injected into the system prompt on every page transition. This remains an active area of refinement.
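One way such a packet could be assembled; the field names and wording are our illustration, not the shipped schema:

```python
def build_context_packet(character: str, visited_pages: list[dict]) -> str:
    """Summarize the character's journey for injection into the live agent's system prompt."""
    # Keep only the most recent events so the packet stays small on long read-throughs
    events = "; ".join(p["summary"] for p in visited_pages[-5:])
    return (
        f"You are {character}. So far in this read-through: {events}. "
        "Stay consistent with these events when you speak."
    )

packet = build_context_packet(
    "Milo the dragon",
    [{"summary": "met the child"}, {"summary": "found a glowing map"}],
)
```

Because the packet is rebuilt on every page transition, it naturally follows whichever branch the child takes.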
6. Per-page asset versioning complexity
Our initial storage model used global version counters per character. As story branching was added, this broke — a sprite version relevant on page 4 might not be correct for page 7. We redesigned to per-page, per-asset version tracking stored in meta.json.
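The redesigned model can be pictured as a nested mapping, page to asset to selected version; the exact keys below are illustrative, not the real meta.json schema:

```python
import json

# Per-page, per-asset version tracking: the same character can point at
# different sprite versions on different pages of a branching story
meta = {
    "pages": {
        "page_4": {"sprites": {"milo": {"version": 3}}, "background": {"version": 1}},
        "page_7": {"sprites": {"milo": {"version": 5}}, "background": {"version": 2}},
    }
}

def sprite_version(meta: dict, page: str, character: str) -> int:
    """Look up which sprite version a given page has pinned for a character."""
    return meta["pages"][page]["sprites"][character]["version"]

serialized = json.dumps(meta)  # serializes cleanly for a meta.json-style file
```

Under the old global counter, `milo` could only have one current version; here `page_4` and `page_7` can pin versions 3 and 5 independently.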
7. Background removal cold start
rembg downloads a ~87MB ONNX model on first use. On Cloud Run with a cold start, this caused 30+ second delays. We baked the model into the Docker image so it is always available from the first request.
- Full end-to-end pipeline: upload a PDF → generate assets → build a branching story → play it live with a real-time AI character. It all works.
- Dream Nodes: a child says "I have a pet dragon" and within ~30 seconds Gemini has generated a 3-scene storyboard showing the dragon's adventure, narrating it live while images stream in. This felt genuinely magical in testing.
- Zero-latency playback: by pre-generating all assets in the pipeline stage, the theater experience has no loading screens or generation pauses — page transitions are instant.
- Production-grade infrastructure: Terraform IaC, Secret Manager, GCS FUSE mount, Cloud Build CI/CD, and Cloud Run Gen 2 — the whole stack is reproducible with a single `./deploy.sh` command.
- Visual node editor: the XYFlow-based pipeline editor lets non-technical storytellers see their story structure as a graph and modify it visually — no JSON editing required.
- Gemini Live API is more than a voice chat — the combination of native audio, real-time vision, and tool-calling makes it a genuine AI director capable of understanding context and making decisions, not just responding to prompts.
- Pre-generation is the right split point — separating deterministic asset generation from real-time agent behavior is the key to a smooth experience. Trying to generate assets during live interaction kills immersion.
- Rate limits shape product design — Veo's 10 requests/day limit forced us to think about asset reuse as a first-class feature, which actually made the authoring tool better.
- Children as users require different trust models — the AI "hallucinating" story details is actually fine (it adds to the magic). The standard "AI mistakes are costly" assumption doesn't apply when the domain is creative play.
- Infrastructure-first matters for hackathons too — having Terraform and `deploy.sh` set up early meant we never lost time to deployment issues during crunch time.
Near term:
- Better persona consistency — implement the full "Context Packet" system so Gemini carries rich character memory across all branches of the story graph
- Faster Dream Nodes — reduce the ~30s wait for imagination generation using streaming interleaved output and background preprocessing
- Physical projection mode — connect to a projector and add OpenCV hand-tracking so children can "touch" characters on the wall
Medium term:
- Music generation — integrate Lyria (Google's music generation model) to generate adaptive background music that responds to story mood
- Text-to-story ingestion — allow creators to paste plain text (not just PDFs) to bootstrap a story graph automatically
- Multiplayer — support multiple children in the same room, with Gemini managing a group dynamic instead of one-on-one
Longer term:
- Projector + camera calibration — full spatial mapping so characters can be projected to specific locations on a wall and respond to the child's physical position
- Any-book generalization — point a camera at any physical picture book; Gemini extracts characters and builds a live story graph on the fly (the original LuminaPages vision)
- Model routing — use smaller/cheaper models (Flash) for simple narration, reserve more capable models for complex branching decisions, reducing cost per session
You can try AMA live without any local setup:
- Navigate to the app: https://ama-api-525glmjwra-uc.a.run.app/
- Create a new project or open the demo project directly: https://ama-api-525glmjwra-uc.a.run.app/pipeline/20260314_201134_untitled
- Enter Theater Mode using the button in the top-left corner of the Pipeline Editor.
- Testing the Dream Node: the Dream Node uses real-time camera input. Before entering Theater Mode, open the camera page at https://ama-api-525glmjwra-uc.a.run.app/camera and select the same project ID (`20260314_201134_untitled`) so your camera stream is sent to the server.
The camera page must have the correct project ID selected to route your video feed into the Dream Node during the theater experience.
- Python 3.12+
- Node.js 18+
- A Gemini API key
Backend:
```bash
cd AMA/backend

# Copy environment template and add your API key
cp .env.example .env
# Edit .env: set GEMINI_API_KEY=your_key_here

# Install dependencies with uv (fast Python package manager)
pip install uv
uv sync

# Start the server (mock mode on by default — no real API calls)
uv run uvicorn main:app --reload --port 8000
```

The backend will be available at http://localhost:8000.
Frontend:
```bash
cd AMA/frontend
npm install
npm run dev
```

The app will be available at http://localhost:5173. The dev server proxies API calls to localhost:8000.
Create `AMA/backend/.env`:

```bash
GEMINI_API_KEY=your_gemini_api_key_here

# Mock mode (default: true) — uses test fixtures, no real API calls
# Set to false to use real Gemini/Veo models
MOCK_MODE=true

# Vertex AI (optional) — use Vertex AI instead of Gemini API directly
GOOGLE_GENAI_USE_VERTEXAI=false
# GOOGLE_CLOUD_PROJECT=your-gcp-project-id
# GOOGLE_CLOUD_LOCATION=us-central1
```

Start with `MOCK_MODE=true` to explore the app without spending API credits. Switch to `false` when you're ready to generate real assets.
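For reference, a minimal sketch of how a boolean flag like this is typically parsed on the backend; the helper is hypothetical, not the actual `config.py`:

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Parse a boolean environment variable like MOCK_MODE=true/false."""
    return os.getenv(name, str(default)).strip().lower() in ("1", "true", "yes")

MOCK_MODE = env_flag("MOCK_MODE", default=True)
```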
- Open http://localhost:5173
- Create a new project or upload a PDF storybook
- In the Pipeline Editor, run each stage in order:
- Story Understanding → extracts pages, characters, and structure
- Sprite Generation → creates character PNG sprites
- Background Generation → generates video loops (watch Veo rate limits!)
- TTS Narration → generates audio for each page
- Click on any story node to edit its assets, script, and characters
- Add edges between nodes to create branching paths
- Switch to Theater Mode and select a starting node to play
```bash
# Install tools
brew install terraform gcloud

# Authenticate with Google Cloud
gcloud auth login
gcloud auth application-default login
gcloud config set project <name>

cd AMA

# Set your API key (never commit this to git)
export GEMINI_API_KEY="your-gemini-api-key"

# Deploy everything
./deploy.sh
```

The script automates:
- Push `GEMINI_API_KEY` to Google Secret Manager
- Run Terraform to enable GCP APIs and create Artifact Registry
- Build the React frontend (with Cloud Run URL baked in)
- Build and push the Docker image via Cloud Build
- Run full Terraform apply (Cloud Run, GCS bucket, IAM, service accounts)
- On first deploy: rebuild frontend with real URL and redeploy
When done, the script prints:

```
✅ Done!
App    : https://ama-api-xxxx.run.app
Health : https://ama-api-xxxx.run.app/api/health
```
| Resource | Details |
|---|---|
| Cloud Run (Gen 2) | 2 CPU, 4 GB RAM, 0–3 instances, GCS FUSE mount |
| GCS Bucket | Stores all project data and generated assets |
| Artifact Registry | Docker image storage (us-central1) |
| Secret Manager | gemini-api-key secret (never in env vars) |
| Cloud Build | Builds Docker image on e2-highcpu-8 |
| Region | us-central1 |
| GCP Project | geminiliveagent-489401 |
```bash
cd AMA/infra

# Initialize
terraform init
# Preview changes
terraform plan
# Apply
terraform apply
```

```bash
# View Cloud Run logs
gcloud run services logs read ama-api --region=us-central1

# Check health
curl https://your-cloud-run-url.run.app/api/health

# Update secret (rotate API key)
echo -n "new-key" | gcloud secrets versions add gemini-api-key --data-file=-
```

```
AMA/
├── backend/
│   ├── main.py              # FastAPI app entry point
│   ├── config.py            # Model names, env vars, paths
│   ├── storage.py           # File-based persistence (project data)
│   ├── jobs.py              # Async job queue for pipeline stages
│   ├── routes/
│   │   ├── projects.py      # Project CRUD
│   │   ├── pipeline.py      # Asset generation triggers
│   │   ├── live.py          # WebSocket: Gemini Live agent
│   │   ├── dream.py         # WebSocket: Dream node generation
│   │   ├── assets.py        # Asset library management
│   │   └── camera.py        # Camera stream subscription
│   └── pipeline/
│       ├── story.py         # Stage 1: PDF → story structure
│       ├── assets.py        # Stage 2: sprite generation
│       ├── background.py    # Stage 3: Veo video generation
│       └── tts.py           # Stage 4: narration audio
│
├── frontend/
│   └── src/
│       ├── pages/
│       │   ├── ProjectsPage.tsx   # Project list
│       │   ├── PipelinePage.tsx   # Node graph editor
│       │   └── TheaterPage.tsx    # Live story playback
│       └── components/
│           ├── pipeline/          # Node editor components
│           └── theater/           # Playback + live session
│
├── infra/                   # Terraform IaC (Cloud Run, GCS, IAM)
├── deploy.sh                # One-command deployment script
└── Dockerfile               # Multi-stage build (frontend + backend)
```
- Google Gemini Live API — real-time audio + vision + tool-calling
- Veo 3.1 — video background generation
- Gemini 2.5 Flash TTS — character voice narration
- React + Vite — frontend framework
- XYFlow — visual node graph editor
- Pixi.js — WebGL sprite rendering
- FastAPI — async Python backend
- Google Cloud Run — serverless container hosting
- Terraform — infrastructure as code