Inspiration

Most AI storytelling products still feel transactional. You type into a box, get text back, and the experience ends there. They do not see, hear, speak, or remember. They do not feel like living worlds.

SAGA started from a different question: what if a story engine could behave like a creative universe instead of a text utility? We wanted an experience where prose, illustrations, narration, music, cinematic clips, voice conversation, and persistent memory all belonged to the same artifact. That became the guiding idea behind SAGA: a story should feel alive.

What it does

SAGA is a living multimodal story engine built for the Gemini Live Agent Challenge 2026 in the Creative Storyteller category.

It creates a single interleaved manuscript flow that can include:

  • prose generated with Gemini
  • inline illustrations generated with Imagen 4
  • cinematic scene clips generated with Veo 2
  • narration generated with Gemini TTS
  • ambient score generated with Lyria 2
  • a persistent world atlas and character archive
  • a Gemini Live voice co-author that listens, responds, and can autonomously trigger the next story section

The core differentiator is that SAGA does not split these outputs into separate tabs or disconnected tools. Text, images, audio, and video appear in one story timeline. The user can type a premise, direct the story with text, or speak to Gemini Live. When the live co-author understands the user’s intent, it emits a GENERATING: [direction] instruction, which automatically launches the next story movement. That makes SAGA feel agentic rather than chat-based.

SAGA also remembers. A user can return later and resume the same world with their story state, characters, locations, manuscript, and selected media restored.
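The shape of that return-state can be sketched as a small session record; the field names below are illustrative, not the actual Firestore schema:

```python
from dataclasses import dataclass, field, asdict

# Hypothetical sketch of the return-state persisted per user/world.
@dataclass
class StorySession:
    world_name: str
    manuscript: list[str] = field(default_factory=list)       # prose sections
    characters: dict[str, str] = field(default_factory=dict)  # name -> notes
    locations: list[str] = field(default_factory=list)
    media_refs: list[str] = field(default_factory=list)       # Cloud Storage URIs

def to_document(session: StorySession) -> dict:
    """Serialize the session for a document store such as Firestore."""
    return asdict(session)

session = StorySession(world_name="Emberfall")
session.manuscript.append("Chapter I: The lanterns went out at dusk.")
doc = to_document(session)
assert doc["world_name"] == "Emberfall"
```

Storing media as URIs rather than inline bytes is what lets restoration skip regeneration: the manuscript reloads instantly and the heavy artifacts stream from Cloud Storage.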

How we built it

SAGA uses a layered Google AI and Google Cloud architecture.

Google AI models and APIs

  • Gemini 2.0 Flash as the primary story engine
  • Gemini 2.5 Flash as a fallback/supporting model path
  • Gemini Live API for bidirectional voice co-authoring
  • Imagen 4 (imagen-4.0-generate-001) for scene illustrations
  • Veo 2 (veo-2.0-generate-001) for cinematic clips
  • Gemini TTS (gemini-2.5-flash-preview-tts) for narration
  • Lyria 2 for ambient music generation

Google Cloud services

  • Cloud Run for backend and frontend hosting
  • Firestore for persistent sessions and return-state
  • Cloud Storage for generated media artifacts
  • Vertex AI for media-generation infrastructure paths
  • Secret Manager for API key management
  • Artifact Registry for container images

Application architecture

  • FastAPI backend with WebSocket streaming
  • Next.js 15 frontend for the cinematic manuscript experience
  • Google GenAI SDK (google-genai) as the primary Gemini integration layer
  • Google ADK wrapper to surface SAGA as a true agent with tools
  • Qdrant Cloud for vector memory and continuity
  • Zustand for client state
  • Framer Motion for interface transitions
  • WeasyPrint for Story Bible PDF export

The backend orchestrates story generation, media generation, character continuity, world extraction, and persistent session storage. The frontend renders everything inline in one manuscript instead of splitting the experience into separate pages.
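The orchestration pattern can be illustrated with a stdlib asyncio sketch, with stub generators standing in for the real Gemini and Imagen calls (all names here are hypothetical):

```python
import asyncio

# Stubs standing in for real model calls (Gemini for prose, Imagen for art).
async def generate_prose(direction: str) -> str:
    await asyncio.sleep(0)  # network call in the real system
    return f"Prose for: {direction}"

async def generate_illustration(direction: str) -> str:
    await asyncio.sleep(0)
    return f"image://{direction}"

async def next_movement(direction: str) -> dict:
    """Stream prose immediately; run media generation in the background
    so the manuscript never blocks on the slowest model."""
    media_task = asyncio.create_task(generate_illustration(direction))
    prose = await generate_prose(direction)  # reaches the client first
    illustration = await media_task          # attached inline when ready
    return {"prose": prose, "illustration": illustration}

movement = asyncio.run(next_movement("the drowned library"))
assert movement["prose"] == "Prose for: the drowned library"
```

In the real system the results are pushed to the client over the WebSocket as they complete, which is what makes the single timeline feel fluid rather than turn-based.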

Challenges we ran into

  • Getting the interleaved experience to feel fluid instead of turn-based required careful streaming decisions and background task orchestration.
  • Lyria 2 required a direct REST workaround because SDK/proto limitations blocked this generation workflow.
  • Gemini Live audio required careful PCM handling and scheduled browser playback to avoid broken or gapped voice output.
  • Media generation had to degrade gracefully so the story could continue even if one service failed or timed out.
  • Persistent world restoration needed to bring back not just text, but session settings, illustrations, narration, and music while avoiding unnecessary regeneration cost.
  • Deployment had a few real-world challenges around Cloud Run runtime configuration, service naming, and environment consistency across shells.
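The graceful-degradation pattern from the list above can be sketched as a timeout-plus-fallback wrapper; the stub calls below are hypothetical stand-ins for Imagen and the Pollinations.ai fallback path:

```python
import asyncio

async def with_fallback(primary, fallback=None, timeout: float = 30.0):
    """Run a media generator; on failure or timeout, fall back (or skip)
    so the story keeps moving. Illustrative sketch, not SAGA's code."""
    try:
        return await asyncio.wait_for(primary(), timeout)
    except Exception:
        return await fallback() if fallback else None

# Stubs: the primary image model fails, the fallback path succeeds.
async def imagen_call():
    raise RuntimeError("model timed out")

async def pollinations_call():
    return "image://fallback"

result = asyncio.run(with_fallback(imagen_call, pollinations_call, timeout=5.0))
assert result == "image://fallback"
```

Returning None when no fallback exists is the key design choice: a missing illustration degrades one section, while a raised exception would stall the whole manuscript.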

Accomplishments that we're proud of

  • We built a true multimodal manuscript where text, image, video, narration, and score can appear in one living flow.
  • Gemini Live acts as a voice co-author, not just speech-to-text. It listens, reasons, and autonomously triggers new story generation with GENERATING:.
  • SAGA persists a story world across sessions and brings the user back into that world with a cinematic welcome-back moment.
  • We created a Character Visual Bible system so recurring characters remain visually coherent across illustrations.
  • We built a live 3D world globe that evolves with the narrative.
  • We packaged the whole system as a Cloud Run deployable product with IaC, docs, architecture proof, and a polished submission narrative.
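The continuity idea behind the Character Visual Bible can be illustrated with a simple prompt composer (hypothetical names and fields; the real system tracks much richer state):

```python
# Hypothetical sketch: a visual bible stores canonical appearance notes
# per character, and every illustration prompt is composed from them so
# recurring characters stay visually coherent across images.
VISUAL_BIBLE = {
    "Mara": "silver braid, storm-grey coat, scar over left eyebrow",
}

def illustration_prompt(scene: str, characters: list[str],
                        bible: dict[str, str]) -> str:
    notes = "; ".join(f"{c}: {bible[c]}" for c in characters if c in bible)
    return f"{scene}. Character continuity: {notes}" if notes else scene

prompt = illustration_prompt(
    "Mara descends into the drowned library", ["Mara"], VISUAL_BIBLE
)
assert "silver braid" in prompt
```

Because the notes are injected into every prompt that mentions a character, each new illustration re-anchors to the same canonical description instead of drifting.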

What we learned

  • A multimodal experience feels much stronger when all outputs are treated as one artifact instead of separate tools.
  • Gemini Live becomes dramatically more compelling when framed as an agentic creative collaborator rather than a voice UI layer.
  • Consistency matters as much as generation quality; character continuity and world memory make the product feel authored instead of random.
  • Graceful degradation is essential in a multi-model system. A story engine has to keep moving even when one media path fails.
  • Submission quality is not just about code. Clear architecture, deployability, demo choreography, and proof artifacts matter just as much.

What's next for SAGA

  • Shared worlds with multiple collaborators directing the same chronicle
  • A richer universe library and publishable story vault
  • Mobile-first companion experiences for live co-authoring
  • Stronger asset editing workflows for regenerating specific scenes without breaking continuity
  • More advanced world simulation, relationship tracking, and story memory tooling
  • A marketplace for exportable stories, story bibles, and cinematic artifacts

Technologies Used

Core stack

  • FastAPI
  • Next.js 15
  • Google GenAI SDK (google-genai)
  • Google ADK
  • Terraform
  • Qdrant Cloud
  • Zustand
  • Framer Motion
  • WeasyPrint

Third-party integrations

  • Pollinations.ai — fallback image generation when Imagen fails
  • FingerprintJS concept — browser fingerprinting approach for no-login return-state restoration
  • Three.js r128 from cdnjs — 3D globe rendering

Category

Creative Storyteller

Hashtag

GeminiLiveAgentChallenge
