SceneNapse Studio

💡 Inspiration

Text-to-image prompting breaks down when results miss the mark: prompts are unstructured, edits are unpredictable, and “fixing lighting” often changes the subject or the mood entirely.

We wanted an agentic system that can disambiguate the components of a shot and prevent an LLM from rewriting components it shouldn’t.

To achieve this, we layered FIBO’s visual schema with cinematic shot structure from the ShotVL model, creating a framework where edits stay isolated and controllable.


⚙️ What It Does

Phase 1: Prompt Enhancement

A user’s prompt is transformed into a structured Scene Ontology with four independent “heads”:

  • Elements: who or what exists in the scene
  • Objects: appearance and attributes
  • Actions: motion and dynamics
  • Cinematography: shot size, camera angle, lighting, and mood
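The four-head split can be pictured as a small typed container, one field per head. This is an illustrative sketch only; the field names and value shapes are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class SceneOntology:
    """Hypothetical four-head scene structure; names are illustrative."""
    elements: list[str] = field(default_factory=list)              # who/what exists
    objects: dict[str, str] = field(default_factory=dict)          # appearance per element
    actions: list[str] = field(default_factory=list)               # motion and dynamics
    cinematography: dict[str, str] = field(default_factory=dict)   # shot size, angle, lighting, mood

# Example: each head can be edited independently without touching the others.
scene = SceneOntology(
    elements=["astronaut", "horse"],
    objects={"astronaut": "white suit, gold visor"},
    actions=["galloping across dunes"],
    cinematography={"shot_size": "wide", "lighting": "golden hour"},
)
```

Because each head is its own field, "fix the lighting" becomes an update to `cinematography` alone, leaving the subject untouched.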

A critic agent checks each head for “lane separation.” If one leaks information into another, it regenerates until clean.
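The critic loop can be sketched in a few lines. In SceneNapse the critic is an LLM agent; here a keyword matcher stands in for it, and the vocabulary lists and `regenerate` callback are purely illustrative assumptions.

```python
# Hypothetical lane-separation check: flag a head that "leaks" vocabulary
# belonging to another lane. Keyword lists stand in for the real LLM critic.
LANE_VOCAB = {
    "cinematography": {"close-up", "wide shot", "backlit", "low angle", "bokeh"},
    "actions": {"running", "jumping", "galloping"},
}

def leaked_lanes(head_name: str, text: str) -> set[str]:
    """Return the names of other lanes whose vocabulary appears in this head."""
    words = text.lower()
    return {
        lane for lane, vocab in LANE_VOCAB.items()
        if lane != head_name and any(term in words for term in vocab)
    }

def regenerate_until_clean(head_name, draft, regenerate, max_tries=3):
    """Re-run generation for a head until no other lane's content leaks in."""
    for _ in range(max_tries):
        if not leaked_lanes(head_name, draft):
            return draft
        draft = regenerate(head_name, draft)  # e.g. re-prompt the LLM for this head
    return draft
```

The `max_tries` bound keeps the loop from spinning forever on a head the critic can never satisfy.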

Phase 2: Generation & Validation

  • Generates multiple images using Nano Banana Pro
  • Scores them with JoyQuality, a SigLIP2-based pairwise encoder trained to assess aesthetic and technical quality; faster than ELO-style ranking
  • Uses VLM Guardrails (Gemini) to verify that the final image truly matches the structured prompt
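The pairwise best-pick step can be sketched as a linear scan that keeps the current winner, needing only n-1 comparisons rather than the full round-robin an ELO-style ranking would run. Here `prefer(a, b)` stands in for a JoyQuality-style pairwise judgment; the mock below replaces it with hidden numeric scores for illustration.

```python
def best_pick(candidates, prefer):
    """Select the best candidate using n-1 pairwise comparisons.

    `prefer(a, b)` should return True when `a` beats `b`; in SceneNapse this
    role is played by a pairwise quality encoder, mocked here with scores.
    """
    best = candidates[0]
    for challenger in candidates[1:]:
        if prefer(challenger, best):
            best = challenger
    return best

# Mock comparison: pretend each generated image carries a quality score.
images = [{"id": "img_a", "q": 0.61}, {"id": "img_b", "q": 0.87}, {"id": "img_c", "q": 0.74}]
winner = best_pick(images, lambda a, b: a["q"] > b["q"])
```

If the pairwise judge is transitive, the linear scan and a full tournament pick the same winner; the scan just gets there with far fewer model calls.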

Bonus: Integrates Freepik semantic search for visual references and supports voice control through Gemini’s Live API.


🛠️ How We Built It

  • Backend: Python + FastAPI for scene generation, refinement, and head-specific updates
  • Agentic logic: Implemented as DSPy pipelines with distinct signatures per head
  • Quality selection: Pairwise encoder scoring (JoyQuality) for efficient best-pick logic
  • Frontend: Next.js UI for visualization, structured editing, and streaming image generation
  • Multimodal enrichment: Gemini 3 Pro for grounding shot elements from images and audio

🧩 Challenges

  • Keeping lane separation strict required many iterations of prompt compilation via DSPy.
  • The refinement classifier, which routes a user's edit request to the right head, also demanded careful prompt compilation.
  • Maintaining speed and quality over the long term remains open; the project needs more time and data to self-refine.

🏆 Accomplishments

  • A working four-head cinematic schema that makes prompts modular and explainable
  • Real-time generation with pairwise quality evaluation
  • A full-stack prototype: UI + API + multimodal control loop

🧠 Lessons Learned

  • “Prompt drift” is a systems-design issue; structure, modularity, and guardrails matter as much as model performance
  • Pairwise-trained encoders are game changers for fast iteration
  • Multimodal models shine as validators and contextual enrichers, not just generators

🔮 What’s Next

  • Expand cinematography control with deeper ShotVL-style attributes (lens, framing, composition)
  • Add per-head failure explanations for transparency
  • Create reusable “shot templates” and personal style packs
  • Integrate storyboard and audio mood inputs for richer multimodal scenes

🧰 Built With

Python • FastAPI • DSPy • Next.js • TypeScript • Gemini (LLM + Image) • Gemini Live API • JoyQuality • SigLIP2 • FIBO Schema • Freepik API • SSE Streaming


🌐 Links

GitHub: https://github.com/dataphysician/scenenapse

