SceneNapse Studio
💡 Inspiration
Text-to-image prompting breaks down when results miss the mark: prompts are unstructured, edits are unpredictable, and “fixing the lighting” often changes the subject or the mood entirely.
We wanted an agentic system that can disambiguate the components of a shot and prevent an LLM from rewriting components it shouldn't.
To achieve this, we layered FIBO's visual schema with cinematic shot structure from the ShotVL model, creating a framework where edits stay isolated and controllable.
⚙️ What It Does
Phase 1: Prompt Enhancement
A user's prompt is transformed into a structured Scene Ontology with four independent “heads”:
- Elements: who or what exists in the scene
- Objects: appearance and attributes
- Actions: motion and dynamics
- Cinematography: shot size, camera angle, lighting, and mood
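The four heads above can be sketched as a plain data structure. This is an illustrative sketch, not the project's actual schema; the class and field names (`SceneOntology`, `head`, the example values) are assumptions for demonstration.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the four-head Scene Ontology; names are illustrative.
@dataclass
class SceneOntology:
    elements: list[str] = field(default_factory=list)             # who or what exists in the scene
    objects: dict[str, str] = field(default_factory=dict)         # appearance and attributes per element
    actions: list[str] = field(default_factory=list)              # motion and dynamics
    cinematography: dict[str, str] = field(default_factory=dict)  # shot size, angle, lighting, mood

    def head(self, name: str):
        """Read one head in isolation, so an edit can target a single lane."""
        return getattr(self, name)

scene = SceneOntology(
    elements=["astronaut"],
    objects={"astronaut": "weathered white suit, cracked visor"},
    actions=["walking slowly toward the camera"],
    cinematography={"shot_size": "wide", "angle": "low",
                    "lighting": "golden hour", "mood": "lonely"},
)
```

Keeping each head as its own field is what lets a "fix the lighting" edit touch `cinematography` without rewriting `elements` or `actions`.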
A critic agent checks each head for “lane separation.” If one head leaks information into another, the critic regenerates it until the lanes are clean.
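The regenerate-until-clean loop might look like the sketch below. The real critic is an LLM agent; here it is approximated with a keyword heuristic for one lane (cinematography terms leaking into another head), and the function names and term list are assumptions.

```python
# Illustrative lane-separation critic: flags cinematography vocabulary
# appearing in a non-cinematography head. The real system uses an LLM critic.
CINEMATOGRAPHY_TERMS = {"lighting", "angle", "close-up", "wide shot", "golden hour"}

def leaks_cinematography(head_text: str) -> bool:
    text = head_text.lower()
    return any(term in text for term in CINEMATOGRAPHY_TERMS)

def enforce_lane(generate, max_retries: int = 3) -> str:
    """Regenerate a head until the critic reports no leakage (or retries run out)."""
    for _ in range(max_retries):
        candidate = generate()
        if not leaks_cinematography(candidate):
            return candidate
    raise ValueError("head still leaks cinematography terms after retries")
```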
Phase 2: Generation & Validation
- Generates multiple images using Nano Banana Pro
- Scores them with JoyQuality, a SigLIP2-based pairwise encoder trained to assess aesthetic and technical quality; faster than ELO-style ranking
- Uses VLM Guardrails (Gemini) to verify that the final image truly matches the structured prompt
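Because a pairwise encoder answers "is A better than B?" in one forward pass, the best image can be found with a single linear sweep of n−1 comparisons, instead of the many round-robin pairings an ELO-style ranking needs. A minimal sketch, with a hypothetical `prefer` callable standing in for the JoyQuality model:

```python
# Best-pick selection with a pairwise quality model (stand-in scorer).
def pick_best(candidates, prefer):
    """prefer(a, b) -> True if a is at least as good as b (hypothetical scorer).
    Single-elimination sweep: n - 1 pairwise comparisons for n candidates."""
    best = candidates[0]
    for challenger in candidates[1:]:
        if not prefer(best, challenger):
            best = challenger
    return best
```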
Bonus: Integrates Freepik semantic search for visual references and supports voice control through Gemini's Live API.
🛠️ How We Built It
- Backend: Python + FastAPI for scene generation, refinement, and head-specific updates
- Agentic logic: Implemented as DSPy pipelines with a distinct signature per head
- Quality selection: Pairwise encoder scoring (JoyQuality) for efficient best-pick logic
- Frontend: Next.js UI for visualization, structured editing, and streaming image generation
- Multimodal enrichment: Gemini 3 Pro for grounding shot elements from images and audio
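The streaming path from backend to the Next.js UI uses Server-Sent Events. A minimal sketch of the frame format the frontend could consume; the event name and payload fields are illustrative, not the project's actual wire protocol:

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame, e.g. a generation-progress update
    streamed to the browser (which reads it via EventSource)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

frame = sse_event("image", {"index": 0, "url": "https://example.com/img0.png"})
```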
🧩 Challenges
- Keeping lane separation strict requires many iterations of prompt compilation in DSPy.
- The refinement classifier that maps a user's edit request to the right head also demands careful prompt compilation.
- Maintaining speed and quality over the long term (the project needs more time and data to self-refine).
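"Prompt compilation" here means searching for the prompt (in particular, its few-shot demos) that maximizes a metric such as lane separation, which DSPy automates. A toy sketch of the idea in plain Python; the function names and the exhaustive subset search are assumptions for illustration, not DSPy's actual optimizer:

```python
from itertools import combinations

# Toy metric-driven prompt compilation: pick the k-demo subset
# that a quality metric (e.g. a lane-separation score) rates highest.
def compile_prompt(demos, metric, k=2):
    """Try every k-demo subset; keep the one the metric scores highest."""
    best_set, best_score = None, float("-inf")
    for subset in combinations(demos, k):
        score = metric(subset)
        if score > best_score:
            best_set, best_score = subset, score
    return list(best_set)
```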
🏆 Accomplishments
- A working four-head cinematic schema that makes prompts modular and explainable
- Real-time generation with pairwise quality evaluation
- A full-stack prototype: UI + API + multimodal control loop
🧠 Lessons Learned
- “Prompt drift” is a systems-design issue; structure, modularity, and guardrails matter as much as model performance
- Pairwise-trained encoders are game changers for fast iteration
- Multimodal models shine as validators and contextual enrichers, not just generators
🔮 What's Next
- Expand cinematography control with deeper ShotVL-style attributes (lens, framing, composition)
- Add per-head failure explanations for transparency
- Create reusable âshot templatesâ and personal style packs
- Integrate storyboard and audio mood inputs for richer multimodal scenes
🧰 Built With
Python • FastAPI • DSPy • Next.js • TypeScript • Gemini (LLM + Image) • Gemini Live API • JoyQuality • SigLIP2 • FIBO Schema • Freepik API • SSE Streaming
🔗 Links
GitHub: https://github.com/dataphysician/scenenapse