Videre
Inspiration
Video editing today forces creators to choose between privacy, cost, and connectivity. Professional tools like Adobe Creative Cloud or DaVinci Resolve Studio are expensive, and cloud AI features require uploading sensitive footage or depend on stable internet.
We wanted to change that. Videre is built for:
- Independent creators who can’t afford enterprise subscriptions
- Professionals working with confidential content (medical, legal, corporate)
- Users in low-bandwidth or offline environments where cloud AI isn’t usable
Our goal was a desktop video editor that runs fully offline, uses local AI for transcription and retrieval, and lets you edit by simply selecting or deleting text in the transcript. We took inspiration from transcript-based editing tools but focused on making everything run on your own hardware—including Qualcomm Snapdragon NPUs—without ever sending data to the cloud.
What We Built
1. Transcript-based video editing
- Import videos and audio into a timeline with tracks and scrubbers
- Transcribe clips with word-level timestamps using Whisper (CPU/GPU or Qualcomm NPU)
- Edit via the transcript: delete or change text, and the corresponding video segments are cut or split automatically
- Sync transcript ↔ video: word-level timing powers precise edits and caption placement
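The core trick is mapping edited transcript text back onto Whisper's word-level timestamps. Below is a much-simplified sketch of that idea; Videre's real implementation does sequence alignment of edited vs. original tokens, while this toy version, with our own function name and gap heuristic, just drops words missing from the edited text and merges the surviving time ranges:

```python
def kept_segments(words, edited_text, gap=0.15):
    """words: [(word, start, end)] from Whisper; edited_text: the edited
    transcript. Returns merged (start, end) ranges for surviving words."""
    kept = set(edited_text.lower().split())
    ranges = []
    for word, start, end in words:
        if word.lower().strip(".,") not in kept:
            continue  # word was deleted in the transcript -> cut its span
        if ranges and start - ranges[-1][1] <= gap:
            ranges[-1] = (ranges[-1][0], end)  # close enough: extend segment
        else:
            ranges.append((start, end))       # otherwise start a new segment
    return ranges

words = [("Hello", 0.0, 0.4), ("um", 0.5, 0.7), ("world", 0.8, 1.2)]
print(kept_segments(words, "Hello world"))  # -> [(0.0, 0.4), (0.8, 1.2)]
```

Deleting the filler word "um" in the transcript removes its 0.5–0.7 s span, splitting the clip into two kept segments.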
2. AI-generated captions
- Generate burned-in captions from the transcript, synced to word timings
- Customize font, color, alignment, weight via the Text panel
- Caption track integrated with the Remotion composition for export
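To illustrate how word timings can drive caption placement, here is a hedged sketch of grouping word timestamps into caption cues (the chunking rule and function name are ours, not Videre's actual code; Videre renders its cues through the Remotion composition):

```python
def group_captions(words, max_words=6):
    """Group [(word, start, end)] timestamps into caption cues of
    (text, start, end), with at most max_words words per cue."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        # Cue spans from the first word's start to the last word's end
        cues.append((" ".join(w for w, _, _ in chunk), chunk[0][1], chunk[-1][2]))
    return cues

words = [("Hello", 0.0, 0.4), ("there,", 0.5, 0.8), ("world", 0.9, 1.3)]
print(group_captions(words, max_words=2))
# -> [('Hello there,', 0.0, 0.8), ('world', 0.9, 1.3)]
```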
3. B‑roll retrieval with SigLIP2
- Embed images and videos using `google/siglip2-base-patch16-224` (Hugging Face)
- Each image gets one embedding; each video is split into 4–5 time sections and embedded
- Text-based retrieval: describe what you want (e.g., "a skateboard", "food") and find matching assets from your `assets/` folder
- Auto-suggest from transcript: the retrieval panel can use the current clip transcript as a search query to suggest B‑roll
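A hedged sketch of the retrieval path: embed the query text with SigLIP2 and rank stored asset embeddings by cosine similarity. The model ID comes from the writeup, but the helper names, the ranking logic, and the asset path are our own illustration, not Videre's code, and the model calls assume a recent `transformers` release with SigLIP2 support:

```python
import numpy as np

def rank_assets(text_emb, asset_embs):
    """Rank asset IDs in asset_embs (dict: id -> vector) by cosine
    similarity against a query text embedding."""
    names = list(asset_embs)
    mat = np.stack([np.asarray(asset_embs[n], dtype=float) for n in names])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    q = np.asarray(text_emb, dtype=float)
    q = q / np.linalg.norm(q)
    scores = mat @ q
    return [(names[i], float(scores[i])) for i in np.argsort(-scores)]

if __name__ == "__main__":
    # Requires: pip install torch transformers pillow
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    model_id = "google/siglip2-base-patch16-224"  # model ID from the writeup
    model = AutoModel.from_pretrained(model_id)
    processor = AutoProcessor.from_pretrained(model_id)
    with torch.no_grad():
        image = Image.open("assets/skateboard.jpg")  # hypothetical asset path
        img = processor(images=image, return_tensors="pt")
        img_emb = model.get_image_features(**img)[0].numpy()
        txt = processor(text=["a skateboard"], padding="max_length",
                        max_length=64, return_tensors="pt")
        txt_emb = model.get_text_features(**txt)[0].numpy()
    print(rank_assets(txt_emb, {"assets/skateboard.jpg": img_emb}))
```

Because image and text share SigLIP2's joint embedding space, a plain dot product over normalized vectors is enough for ranking.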
4. On-device Whisper
- Legacy mode: `faster-whisper` or PyTorch Whisper (CPU/CUDA)
- Qualcomm NPU mode (default): Whisper-Small via pre-compiled ONNX/QNN models from Qualcomm AI Hub
- Runs through a Python subprocess; no external API calls
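A minimal sketch of the legacy CPU path using `faster-whisper` (the helper name and return shape are ours; the NPU mode swaps this runner for the ONNX/QNN Whisper-Small models from Qualcomm AI Hub):

```python
def flatten_words(segments):
    """Flatten Whisper segments (each carrying a .words list of objects
    with .word/.start/.end) into one [(word, start, end)] list."""
    return [(w.word.strip(), w.start, w.end)
            for seg in segments for w in (seg.words or [])]

if __name__ == "__main__":
    # Requires: pip install faster-whisper
    from faster_whisper import WhisperModel

    model = WhisperModel("small", device="cpu", compute_type="int8")
    # word_timestamps=True yields per-word timings, which power the
    # transcript-based editing and caption placement
    segments, _info = model.transcribe("clip.wav", word_timestamps=True)
    print(flatten_words(segments))
```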
5. Python labs for experimentation
- nexa-caption-lab: Nexa SDK transcription with optional NPU support
- nexa-video-context-lab: Qwen3‑VL for scene context. Find timestamp ranges by natural-language query (e.g., "person walks to whiteboard") and export the matching clips; usable as a standalone CLI
How We Built It
Architecture
- Frontend: React 19 + React Router, Tailwind, Radix UI, Framer Motion
- Desktop: Electron (we switched from Tauri to Electron for compatibility)
- Backend: Node.js/Express serving uploads, media, transcription, and retrieval
- Video rendering: Remotion + FFmpeg
- Python subprocess: Transcription (Whisper / NPU) and embeddings (SigLIP2) run as sidecar processes—no separate server
AI / ML stack
| Task | Model / System | Notes |
|---|---|---|
| Speech transcription | Whisper (openai/whisper-small) | faster-whisper or transformers fallback |
| NPU transcription | Whisper-Small ONNX/QNN (Qualcomm AI Hub) | Via nexa-caption-lab |
| Image/video embeddings | SigLIP2 (google/siglip2-base-patch16-224) | Hugging Face Transformers |
| Scene context (lab) | Qwen3‑VL‑4B (nexa-video-context-lab) | Standalone CLI only |
Data flow
- User imports media → stored in `out/<project-id>/`
- User transcribes selected clips → Whisper runs via Python subprocess, returns word timestamps
- User edits transcript → segments mapped to time ranges; timeline is cut/split
- User generates captions → Remotion composition renders with burned-in text
- User searches for B‑roll → SigLIP2 embeddings queried; matching assets from `assets/` returned
Challenges We Faced
Qualcomm NPU integration
Setting up ONNX/QNN models and nexa-caption-lab’s NPU path required careful handling of pre-compiled binaries, context binaries, and Python envs. We added a fallback to legacy Whisper when NPU isn’t available.

Transcript-to-segment alignment
Mapping edited transcript text back to Whisper’s word timestamps for cutting clips was non-trivial. We tokenize and align edited text with original words to produce segments with correct start/end times.

Video section embeddings
We split videos into temporal sections, sample frames per section, average their SigLIP2 embeddings, and store them with segment IDs like `video.mp4#0.0-4.0` so retrieval can target specific time ranges.

Cross-platform and ARM64
On Windows ARM64, Remotion’s default Chrome Headless Shell wasn’t usable; we added support for an x64 Chrome Headless Shell under emulation for rendering.
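The section-embedding scheme described above can be sketched as follows; the segment-ID format comes from the writeup, while the helper names and the 4-section default are our own illustration:

```python
import numpy as np

def section_ids(path, duration, n_sections=4):
    """Split a clip of the given duration (seconds) into equal time
    sections with IDs like 'video.mp4#0.0-4.0'."""
    step = duration / n_sections
    return [(f"{path}#{i * step:.1f}-{(i + 1) * step:.1f}", i * step, (i + 1) * step)
            for i in range(n_sections)]

def average_embedding(frame_embs):
    """Mean-pool per-frame SigLIP2 embeddings into one section vector,
    so each time range is searchable as a single unit."""
    return np.mean(np.stack(frame_embs), axis=0)

print(section_ids("video.mp4", 16.0)[0])  # -> ('video.mp4#0.0-4.0', 0.0, 4.0)
```

In the real pipeline, frames per section would be sampled (e.g., with OpenCV), embedded individually with SigLIP2, then averaged and stored under the section's segment ID.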
Tech Stack
- Frontend: React 19, React Router, Tailwind CSS, Radix UI, Framer Motion, Remotion
- Desktop: Electron
- Backend: Node.js, Express
- Video: FFmpeg, Remotion
- AI / ML: Hugging Face Transformers, SigLIP2, Whisper (faster-whisper / PyTorch), ONNX Runtime QNN (Qualcomm)
- Python: transformers, PyTorch, faster-whisper, opencv-python, nexa-caption-lab