SceneNapse Studio

💡 Inspiration

Text-to-image prompting breaks down when results miss the mark: prompts are unstructured, edits are unpredictable, and “fixing lighting” often changes the subject or the mood entirely.

We wanted an agentic system that can disambiguate the components of a shot and prevent an LLM from rewriting components it shouldn’t.

To achieve this, we layered FIBO’s visual schema with cinematic shot structure from the ShotVL model, creating a framework where edits stay isolated and controllable.


⚙️ What It Does

Phase 1: Prompt Enhancement

A user’s prompt is transformed into a structured Scene Ontology with four independent “heads”:

  • Elements: who or what exists in the scene
  • Objects: appearance and attributes
  • Actions: motion and dynamics
  • Cinematography: shot size, camera angle, lighting, and mood
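The four-head split can be pictured as a small typed container, one field per head. This is an illustrative sketch only; the field names and value shapes are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class SceneOntology:
    """Hypothetical four-head scene structure; names are illustrative."""
    elements: list[str] = field(default_factory=list)              # who/what exists
    objects: dict[str, str] = field(default_factory=dict)          # appearance per element
    actions: list[str] = field(default_factory=list)               # motion and dynamics
    cinematography: dict[str, str] = field(default_factory=dict)   # shot size, angle, lighting, mood

# Example: each head can be edited independently without touching the others.
scene = SceneOntology(
    elements=["astronaut", "horse"],
    objects={"astronaut": "white suit, gold visor"},
    actions=["galloping across dunes"],
    cinematography={"shot_size": "wide", "lighting": "golden hour"},
)
```

Because each head is its own field, "fix the lighting" becomes an update to `cinematography` alone, leaving the subject untouched.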

A critic agent checks each head for “lane separation.” If one leaks information into another, it regenerates until clean.
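The critic loop can be sketched in a few lines. In SceneNapse the critic is an LLM agent; here a keyword matcher stands in for it, and the vocabulary lists and `regenerate` callback are purely illustrative assumptions.

```python
# Hypothetical lane-separation check: flag a head that "leaks" vocabulary
# belonging to another lane. Keyword lists stand in for the real LLM critic.
LANE_VOCAB = {
    "cinematography": {"close-up", "wide shot", "backlit", "low angle", "bokeh"},
    "actions": {"running", "jumping", "galloping"},
}

def leaked_lanes(head_name: str, text: str) -> set[str]:
    """Return the names of other lanes whose vocabulary appears in this head."""
    words = text.lower()
    return {
        lane for lane, vocab in LANE_VOCAB.items()
        if lane != head_name and any(term in words for term in vocab)
    }

def regenerate_until_clean(head_name, draft, regenerate, max_tries=3):
    """Re-run generation for a head until no other lane's content leaks in."""
    for _ in range(max_tries):
        if not leaked_lanes(head_name, draft):
            return draft
        draft = regenerate(head_name, draft)  # e.g. re-prompt the LLM for this head
    return draft
```

The `max_tries` bound keeps the loop from spinning forever on a head the critic can never satisfy.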

Phase 2: Generation & Validation

  • Generates multiple images using Nano Banana Pro
  • Scores them with JoyQuality, a SigLIP2-based pairwise encoder trained to assess aesthetic and technical quality; faster than ELO-style ranking
  • Uses VLM Guardrails (Gemini) to verify that the final image truly matches the structured prompt
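The pairwise best-pick step can be sketched as a linear scan that keeps the current winner, needing only n-1 comparisons rather than the full round-robin an ELO-style ranking would run. Here `prefer(a, b)` stands in for a JoyQuality-style pairwise judgment; the mock below replaces it with hidden numeric scores for illustration.

```python
def best_pick(candidates, prefer):
    """Select the best candidate using n-1 pairwise comparisons.

    `prefer(a, b)` should return True when `a` beats `b`; in SceneNapse this
    role is played by a pairwise quality encoder, mocked here with scores.
    """
    best = candidates[0]
    for challenger in candidates[1:]:
        if prefer(challenger, best):
            best = challenger
    return best

# Mock comparison: pretend each generated image carries a quality score.
images = [{"id": "img_a", "q": 0.61}, {"id": "img_b", "q": 0.87}, {"id": "img_c", "q": 0.74}]
winner = best_pick(images, lambda a, b: a["q"] > b["q"])
```

If the pairwise judge is transitive, the linear scan and a full tournament pick the same winner; the scan just gets there with far fewer model calls.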

Bonus: Integrates Freepik semantic search for visual references and supports voice control through Gemini’s Live API.


🛠️ How We Built It

  • Backend: Python + FastAPI for scene generation, refinement, and head-specific updates
  • Agentic logic: Implemented as DSPy pipelines with distinct signatures per head
  • Quality selection: Pairwise encoder scoring (JoyQuality) for efficient best-pick logic
  • Frontend: Next.js UI for visualization, structured editing, and streaming image generation
  • Multimodal enrichment: Gemini 3 Pro for grounding shot elements from images and audio

🧩 Challenges

  • Keeping lane separation strict required many iterations of prompt compilation via DSPy.
  • The refinement classifier, which routes a user's edit request to the right head, also demanded careful prompt compilation.
  • Maintaining speed and quality over the long term remains open; the project needs more time and data to self-refine.

🏆 Accomplishments

  • A working four-head cinematic schema that makes prompts modular and explainable
  • Real-time generation with pairwise quality evaluation
  • A full-stack prototype: UI + API + multimodal control loop

🧠 Lessons Learned

  • “Prompt drift” is a systems-design issue; structure, modularity, and guardrails matter as much as model performance
  • Pairwise-trained encoders are game changers for fast iteration
  • Multimodal models shine as validators and contextual enrichers, not just generators

🔮 What’s Next

  • Expand cinematography control with deeper ShotVL-style attributes (lens, framing, composition)
  • Add per-head failure explanations for transparency
  • Create reusable “shot templates” and personal style packs
  • Integrate storyboard and audio mood inputs for richer multimodal scenes

🧰 Built With

Python • FastAPI • DSPy • Next.js • TypeScript • Gemini (LLM + Image) • Gemini Live API • JoyQuality • SigLIP2 • FIBO Schema • Freepik API • SSE Streaming


🌐 Links

GitHub: https://github.com/dataphysician/scenenapse

