Inspiration

Most AI storytelling products still feel transactional. You type into a box, get text back, and the experience ends there. They do not see, hear, speak, or remember. They do not feel like living worlds.

SAGA started from a different question: what if a story engine could behave like a creative universe instead of a text utility? We wanted an experience where prose, illustrations, narration, music, cinematic clips, voice conversation, and persistent memory all belonged to the same artifact. That became the guiding idea behind SAGA: a story should feel alive.

What it does

SAGA is a living multimodal story engine built for the Gemini Live Agent Challenge 2026 in the Creative Storyteller category.

It creates a single interleaved manuscript flow that can include:

  • prose generated with Gemini
  • inline illustrations generated with Imagen 4
  • cinematic scene clips generated with Veo 2
  • narration generated with Gemini TTS
  • ambient score generated with Lyria 2
  • a persistent world atlas and character archive
  • a Gemini Live voice co-author that listens, responds, and can autonomously trigger the next story section

The core differentiator is that SAGA does not split these outputs into separate tabs or disconnected tools. Text, images, audio, and video appear in one story timeline. The user can type a premise, direct the story with text, or speak to Gemini Live. When the live co-author understands the user’s intent, it emits a GENERATING: [direction] instruction, which automatically launches the next story movement. That makes SAGA feel agentic rather than chat-based.

SAGA also remembers. A user can return later and resume the same world with their story state, characters, locations, manuscript, and selected media restored.
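The shape of that return-state can be sketched as a small session record; the field names below are illustrative, not the actual Firestore schema:

```python
from dataclasses import dataclass, field, asdict

# Hypothetical sketch of the return-state persisted per user/world.
@dataclass
class StorySession:
    world_name: str
    manuscript: list[str] = field(default_factory=list)       # prose sections
    characters: dict[str, str] = field(default_factory=dict)  # name -> notes
    locations: list[str] = field(default_factory=list)
    media_refs: list[str] = field(default_factory=list)       # Cloud Storage URIs

def to_document(session: StorySession) -> dict:
    """Serialize the session for a document store such as Firestore."""
    return asdict(session)

session = StorySession(world_name="Emberfall")
session.manuscript.append("Chapter I: The lanterns went out at dusk.")
doc = to_document(session)
assert doc["world_name"] == "Emberfall"
```

Storing media as URIs rather than inline bytes is what lets restoration skip regeneration: the manuscript reloads instantly and the heavy artifacts stream from Cloud Storage.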

How we built it

SAGA uses a layered Google AI and Google Cloud architecture.

Google AI models and APIs

  • Gemini 2.0 Flash as the primary story engine
  • Gemini 2.5 Flash as a fallback/supporting model path
  • Gemini Live API for bidirectional voice co-authoring
  • Imagen 4 (imagen-4.0-generate-001) for scene illustrations
  • Veo 2 (veo-2.0-generate-001) for cinematic clips
  • Gemini TTS (gemini-2.5-flash-preview-tts) for narration
  • Lyria 2 for ambient music generation

Google Cloud services

  • Cloud Run for backend and frontend hosting
  • Firestore for persistent sessions and return-state
  • Cloud Storage for generated media artifacts
  • Vertex AI for media-generation infrastructure paths
  • Secret Manager for API key management
  • Artifact Registry for container images

Application architecture

  • FastAPI backend with WebSocket streaming
  • Next.js 15 frontend for the cinematic manuscript experience
  • Google GenAI SDK (google-genai) as the primary Gemini integration layer
  • Google ADK wrapper to surface SAGA as a true agent with tools
  • Qdrant Cloud for vector memory and continuity
  • Zustand for client state
  • Framer Motion for interface transitions
  • WeasyPrint for Story Bible PDF export

The backend orchestrates story generation, media generation, character continuity, world extraction, and persistent session storage. The frontend renders everything inline in one manuscript instead of splitting the experience into separate pages.
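The orchestration pattern can be illustrated with a stdlib asyncio sketch, with stub generators standing in for the real Gemini and Imagen calls (all names here are hypothetical):

```python
import asyncio

# Stubs standing in for real model calls (Gemini for prose, Imagen for art).
async def generate_prose(direction: str) -> str:
    await asyncio.sleep(0)  # network call in the real system
    return f"Prose for: {direction}"

async def generate_illustration(direction: str) -> str:
    await asyncio.sleep(0)
    return f"image://{direction}"

async def next_movement(direction: str) -> dict:
    """Stream prose immediately; run media generation in the background
    so the manuscript never blocks on the slowest model."""
    media_task = asyncio.create_task(generate_illustration(direction))
    prose = await generate_prose(direction)  # reaches the client first
    illustration = await media_task          # attached inline when ready
    return {"prose": prose, "illustration": illustration}

movement = asyncio.run(next_movement("the drowned library"))
assert movement["prose"] == "Prose for: the drowned library"
```

In the real system the results are pushed to the client over the WebSocket as they complete, which is what makes the single timeline feel fluid rather than turn-based.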

Challenges we ran into

  • Getting the interleaved experience to feel fluid instead of turn-based required careful streaming decisions and background task orchestration.
  • Lyria 2 required a direct REST workaround because SDK/proto limitations blocked this generation workflow.
  • Gemini Live audio required careful PCM handling and scheduled browser playback to avoid broken or gapped voice output.
  • Media generation had to degrade gracefully so the story could continue even if one service failed or timed out.
  • Persistent world restoration needed to bring back not just text, but session settings, illustrations, narration, and music while avoiding unnecessary regeneration cost.
  • Deployment had a few real-world challenges around Cloud Run runtime configuration, service naming, and environment consistency across shells.
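The graceful-degradation pattern from the list above can be sketched as a timeout-plus-fallback wrapper; the stub calls below are hypothetical stand-ins for Imagen and the Pollinations.ai fallback path:

```python
import asyncio

async def with_fallback(primary, fallback=None, timeout: float = 30.0):
    """Run a media generator; on failure or timeout, fall back (or skip)
    so the story keeps moving. Illustrative sketch, not SAGA's code."""
    try:
        return await asyncio.wait_for(primary(), timeout)
    except Exception:
        return await fallback() if fallback else None

# Stubs: the primary image model fails, the fallback path succeeds.
async def imagen_call():
    raise RuntimeError("model timed out")

async def pollinations_call():
    return "image://fallback"

result = asyncio.run(with_fallback(imagen_call, pollinations_call, timeout=5.0))
assert result == "image://fallback"
```

Returning None when no fallback exists is the key design choice: a missing illustration degrades one section, while a raised exception would stall the whole manuscript.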

Accomplishments that we're proud of

  • We built a true multimodal manuscript where text, image, video, narration, and score can appear in one living flow.
  • Gemini Live acts as a voice co-author, not just speech-to-text. It listens, reasons, and autonomously triggers new story generation with GENERATING:.
  • SAGA persists a story world across sessions and brings the user back into that world with a cinematic welcome-back moment.
  • We created a Character Visual Bible system so recurring characters remain visually coherent across illustrations.
  • We built a live 3D world globe that evolves with the narrative.
  • We packaged the whole system as a Cloud Run deployable product with IaC, docs, architecture proof, and a polished submission narrative.
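The continuity idea behind the Character Visual Bible can be illustrated with a simple prompt composer (hypothetical names and fields; the real system tracks much richer state):

```python
# Hypothetical sketch: a visual bible stores canonical appearance notes
# per character, and every illustration prompt is composed from them so
# recurring characters stay visually coherent across images.
VISUAL_BIBLE = {
    "Mara": "silver braid, storm-grey coat, scar over left eyebrow",
}

def illustration_prompt(scene: str, characters: list[str],
                        bible: dict[str, str]) -> str:
    notes = "; ".join(f"{c}: {bible[c]}" for c in characters if c in bible)
    return f"{scene}. Character continuity: {notes}" if notes else scene

prompt = illustration_prompt(
    "Mara descends into the drowned library", ["Mara"], VISUAL_BIBLE
)
assert "silver braid" in prompt
```

Because the notes are injected into every prompt that mentions a character, each new illustration re-anchors to the same canonical description instead of drifting.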

What we learned

  • A multimodal experience feels much stronger when all outputs are treated as one artifact instead of separate tools.
  • Gemini Live becomes dramatically more compelling when framed as an agentic creative collaborator rather than a voice UI layer.
  • Consistency matters as much as generation quality; character continuity and world memory make the product feel authored instead of random.
  • Graceful degradation is essential in a multi-model system. A story engine has to keep moving even when one media path fails.
  • Submission quality is not just about code. Clear architecture, deployability, demo choreography, and proof artifacts matter just as much.

What's next for SAGA

  • Shared worlds with multiple collaborators directing the same chronicle
  • A richer universe library and publishable story vault
  • Mobile-first companion experiences for live co-authoring
  • Stronger asset editing workflows for regenerating specific scenes without breaking continuity
  • More advanced world simulation, relationship tracking, and story memory tooling
  • A marketplace for exportable stories, story bibles, and cinematic artifacts

Technologies Used

Core stack

  • FastAPI
  • Next.js 15
  • Google GenAI SDK (google-genai)
  • Google ADK
  • Terraform
  • Qdrant Cloud
  • Zustand
  • Framer Motion
  • WeasyPrint

Third-party integrations

  • Pollinations.ai — fallback image generation when Imagen fails
  • FingerprintJS concept — browser fingerprinting approach for no-login return-state restoration
  • Three.js r128 from cdnjs — 3D globe rendering

Category

Creative Storyteller

Hashtag

GeminiLiveAgentChallenge
