Multi-head DSPy pipeline for cinematic scene composition and text-to-image generation with quality scoring and prompt alignment verification.
SceneNapse operates in two phases: Prompt Enhancement transforms text into structured scene descriptions, then Image Generation & Validation produces and verifies images against the prompt.
User prompts (with optional multimodal inputs) are decomposed into a structured Scene Ontology with four parallel heads: Objects, Actions, and Cinematography—all grounded in the Elements layer.
The enriched scene description drives image generation (Nano Banana Pro), with JoyQuality scoring candidates and VLM Guardrails verifying prompt alignment.
SceneNapse Studio transforms simple text prompts into structured, composable scene descriptions using a 3-stage DSPy pipeline. Each scene is decomposed into four independent "heads" (Elements, Objects, Actions, Cinematography) that can be refined individually or regenerated together.
This architecture enables:
- Precise control over visual elements, motion, and cinematography
- Targeted refinements that only regenerate affected components
- Lane separation preventing content leakage between semantic domains
- Quality assurance via JoyQuality scoring and VLM Guardrails
Each head uses standard cinematic terminology and owns its domain exclusively.
| Head | Purpose | Example Fields |
|---|---|---|
| Elements | WHO/WHAT exists in the scene | element_id, role, entity_type, importance |
| Objects | Visual appearance & physical details | description, shape_and_color, texture, pose |
| Actions | Physical motion & dynamics | action_class, stage_class, temporal_context |
| Cinematography | Camera, lighting, mood, style | shot_size, camera_angle, lighting, artistic_style |
| Category | Options |
|---|---|
| Shot Size | extreme_close_up, close_up, medium_close_up, medium, medium_long, long, extreme_long |
| Camera Angle | eye_level, low_angle, high_angle, dutch_angle, birds_eye, worms_eye |
| Lighting | soft natural daylight, harsh midday sun, golden hour warmth, cool moonlight, neon city lights |
| Artistic Style | photorealistic, hyperrealistic, impressionistic, noir, anime, surrealist, minimalist |
Each head owns its domain exclusively, preventing content from leaking into the wrong component:
| Content Type | Belongs In | NOT In |
|---|---|---|
| "woman in red dress" | Objects | Elements |
| "runs gracefully" | Actions | Objects |
| "dramatic lighting" | Cinematography | Actions |
| "romantic mood" | Cinematography | Objects |
The Critic validates lane separation and scores consistency (0.0-1.0). Scenes below the threshold (0.85) are automatically regenerated.
Pairwise preference finetuned encoder-only model for image quality scoring.
| Property | Value |
|---|---|
| GitHub | https://github.com/fpgaminer/joyquality |
| Model | fancyfeast/joyquality-siglip2-so400m-512-16-o8eg1n4c |
| Base | Google SigLIP2-so400m-patch14-384 vision encoder |
| Training | Pairwise preference finetuning on aesthetic/technical quality |
| Output | Quality score 0-1 (sigmoid normalized) |
JoyQuality uses a 400M parameter encoder-only architecture trained on pairwise human preferences to predict image quality without needing a text prompt. It evaluates:
- Aesthetic quality - composition, color harmony, visual appeal
- Technical quality - sharpness, noise, artifacts, exposure
# Score a single image
score = selector.score_image(pil_image) # Returns 0.0-1.0
# Select best from batch
best_image, best_idx, best_score = selector.select_best(images)
# Rank all candidates
ranked = selector.rank_images(images) # [(idx, score), ...] sorted descImages are scored as they stream in from generation, enabling early stopping when a high-quality candidate is found.
Multimodal LLM verification ensuring generated images align with the structured scene prompt.
Each component returns a binary score (0 = fail, 1 = pass):
| Component | Question | Checks |
|---|---|---|
| VerifyElements | Are all elements present? | Characters, objects, settings visible in image |
| VerifyObjects | Do descriptions match? | Visual appearance, colors, textures, pose accuracy |
| VerifyActions | Are actions visible? | Motion, dynamics, temporal context implied |
| VerifyCinematography | Does camera/lighting match? | Shot size, angle, lighting, composition, style |
- Component scores: 0 (fail) or 1 (pass) each
- Total score: 0-4 (sum of components)
- Pass criteria: All 4 components must pass (score = 4)
ImageVerificationResult(
passed=True, # All components passed
total_score=4, # 0-4
elements_score=1, # Binary: 0 or 1
objects_score=1,
actions_score=1,
cinematography_score=1,
missing_elements=[], # Element IDs not found
critical_issues=[], # Why components failed
suggestions=[], # How to improve
)When verification fails, the system provides:
- Critical issues: Which components failed and why
- Suggestions: Specific improvements (e.g., "Adjust camera angle to match specification")
- Missing elements: Which scene elements weren't rendered
| Technology | Purpose | Link |
|---|---|---|
| DSPy | Programming LLM modules (not prompting) | dspy.ai |
| JoyQuality | Pairwise preference encoder for image quality | GitHub HuggingFace |
| SigLIP2 | Google's vision encoder (JoyQuality base) | arXiv:2502.14786 |
| Nano Banana Pro | Streaming image generation | - |
| Gemini | LLM backend for DSPy and VLM Guardrails | Google AI |
| Gemini Live API | Real-time voice control with function calling | Live API Docs |
| Freepik | Reference image search | Freepik API |
- Python 3.10+
- Node.js 18+
- uv (Python package manager)
export GOOGLE_API_KEY="your-gemini-api-key"
export FAL_KEY="your-fal-api-key"
export FREEPIK_API_KEY="your-freepik-api-key" # Optional, for reference images# Clone the repository
git clone https://github.com/dataphysician/scenenapse.git
cd scenenapse
# Install Python dependencies
uv sync
# Install frontend dependencies
cd frontend && npm installBackend (FastAPI on port 8000):
uv run python main.pyFrontend (Next.js on port 3000):
cd frontend && npm run devThen open http://localhost:3000 in your browser.
Control SceneNapse with voice using the Gemini Live API:
# Requires pyaudio - install system dependencies first:
# Ubuntu/Debian: sudo apt-get install portaudio19-dev
# macOS: brew install portaudio
# Start the backend first, then run:
uv run python -m backend.voice_assistantVoice commands:
- "Search for images of [query]" - Find reference images
- "Select reference image [number]" - Pick a reference
- "Create a scene about [description]" - Generate scene
- "Refine the scene to [instruction]" - Modify scene
- "Generate images" - Create images from scene
- "Select image [number]" - Pick generated image
- "What's the status?" - Check current state
- "Clear everything" - Start over
Note: Use headphones to prevent echo feedback.
scenenapse/
├── api/ # FastAPI routes and state management
│ ├── main.py # App entry point with CORS config
│ ├── routes.py # API endpoints
│ ├── state.py # In-memory scene state
│ └── schemas.py # Pydantic models
├── backend/ # DSPy pipeline and signatures
│ ├── dspy_pipeline.py # 3-stage generation pipeline
│ ├── dspy_refinement.py # Smart refinement routing
│ ├── dspy_signatures.py # DSPy signatures for each head
│ ├── dspy_guardrails.py # VLM image-prompt alignment verification
│ ├── joy_quality.py # SigLIP2 pairwise preference quality scoring
│ └── voice_assistant.py # Gemini Live API voice control
├── frontend/ # Next.js UI
│ ├── app/page.tsx # Main application component
│ ├── app/api/ # Next.js API routes (proxy to backend)
│ └── lib/types.ts # TypeScript type definitions
├── main.py # Backend entry point
└── pyproject.toml # Python dependencies
| Endpoint | Method | Description |
|---|---|---|
/api/generate |
POST | Generate complete scene from prompt |
/api/refine |
POST | Refine existing scene with instruction |
/api/assemble |
POST | Validate manually-edited scene |
| Endpoint | Method | Description |
|---|---|---|
/api/generate-images-stream |
GET | Stream generated images (SSE) with JoyQuality scores |
/api/search-images |
POST | Search reference images via Freepik |
/api/select-reference |
POST | Select reference image for generation |
/api/select-generated |
POST | Select generated image |
| Endpoint | Method | Description |
|---|---|---|
/api/scene |
GET | Get current scene state |
/api/scene |
DELETE | Clear scene state |
/api/scene/{head} |
PUT | Update specific head (elements/objects/actions/cinematography) |
/api/scene/assembled |
GET | Get finalized scene with summary |

