Videre

Inspiration

Video editing today forces creators to choose between privacy, cost, and connectivity. Professional tools like Adobe Creative Cloud or DaVinci Resolve Studio are expensive, and cloud AI features require uploading sensitive footage or depend on stable internet.

We wanted to change that. Videre is built for:

  • Independent creators who can’t afford enterprise subscriptions
  • Professionals working with confidential content (medical, legal, corporate)
  • Users in low-bandwidth or offline environments where cloud AI isn’t usable

Our goal was a desktop video editor that runs fully offline, uses local AI for transcription and retrieval, and lets you edit by simply selecting or deleting text in the transcript. We took inspiration from transcript-based editing tools but focused on making everything run on your own hardware—including Qualcomm Snapdragon NPUs—without ever sending data to the cloud.


What We Built

1. Transcript-based video editing

  • Import videos and audio into a timeline with tracks and scrubbers
  • Transcribe clips with word-level timestamps using Whisper (CPU/GPU or Qualcomm NPU)
  • Edit video by editing the transcript: delete or change text, and the corresponding video segments are cut or split automatically
  • Sync transcript ↔ video: word-level timing powers precise edits and caption placement
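
The word-level timing above is what makes text edits translate into cuts. A minimal sketch of that idea, using a hypothetical word tuple shape rather than Videre's actual internal data structures:

```python
# Hedged sketch: turn word-level timestamps into timeline ranges when
# words are deleted from the transcript. The (text, start, end) tuple
# shape is illustrative, not Videre's actual internals.

def segments_to_keep(words, deleted):
    """words: list of (text, start, end); deleted: set of word indices.
    Returns merged (start, end) ranges covering the surviving words."""
    ranges = []
    prev_kept = None
    for i, (_, start, end) in enumerate(words):
        if i in deleted:
            prev_kept = None  # a deletion ends the current range
            continue
        if prev_kept is not None:
            ranges[-1][1] = end          # consecutive kept word: extend range
        else:
            ranges.append([start, end])  # start a new range after a cut
        prev_kept = i
    return [tuple(r) for r in ranges]

words = [("hello", 0.0, 0.4), ("um", 0.4, 0.6), ("world", 0.6, 1.0)]
print(segments_to_keep(words, deleted={1}))  # → [(0.0, 0.4), (0.6, 1.0)]
```

Deleting the filler word splits the clip into two kept ranges; the timeline cut falls exactly on the word boundaries Whisper reported.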

2. AI-generated captions

  • Generate burned-in captions from the transcript, synced to word timings
  • Customize font, color, alignment, weight via the Text panel
  • Caption track integrated with the Remotion composition for export
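
Caption generation amounts to grouping the transcript's timed words into short lines. A sketch of one possible chunking rule, capping words per line and line duration; the parameter names are illustrative, not Videre's actual caption settings:

```python
# Hedged sketch: group word timestamps into caption lines, capping the
# word count and duration per line. Not Videre's actual algorithm.

def chunk_captions(words, max_words=4, max_secs=3.0):
    """words: list of (text, start, end) -> list of (line, start, end)."""
    lines, cur = [], []
    for text, start, end in words:
        # Flush the current line if adding this word would exceed a cap.
        if cur and (len(cur) >= max_words or end - cur[0][1] > max_secs):
            lines.append((" ".join(w for w, _, _ in cur), cur[0][1], cur[-1][2]))
            cur = []
        cur.append((text, start, end))
    if cur:
        lines.append((" ".join(w for w, _, _ in cur), cur[0][1], cur[-1][2]))
    return lines

words = [("a", 0.0, 0.5), ("b", 0.5, 1.0), ("c", 1.0, 1.5),
         ("d", 1.5, 2.0), ("e", 2.0, 2.5)]
print(chunk_captions(words, max_words=2))
# → [('a b', 0.0, 1.0), ('c d', 1.0, 2.0), ('e', 2.0, 2.5)]
```

Each resulting (line, start, end) triple maps directly onto a caption item in the Remotion composition.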

3. B‑roll retrieval with SigLIP2

  • Embed images and videos using google/siglip2-base-patch16-224 (Hugging Face)
  • Each image gets one embedding; each video is split into 4–5 time sections and embedded
  • Text-based retrieval: describe what you want (e.g., "a skateboard", "food") and find matching assets from your assets/ folder
  • Auto-suggest from transcript: retrieval panel can use the current clip transcript as a search query to suggest B‑roll
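
Under the hood, retrieval is a nearest-neighbor search over embedding vectors. A minimal sketch with toy vectors and pure-Python cosine similarity; in Videre the embeddings would come from google/siglip2-base-patch16-224 via the Python subprocess, and the asset IDs below are hypothetical:

```python
# Hedged sketch of the retrieval step: rank assets by cosine similarity
# between a text-query embedding and precomputed asset embeddings.
# The 3-dim vectors are toys; real SigLIP2 embeddings are much wider.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, index, top_k=2):
    """index: dict of asset id -> embedding. Returns best-matching ids."""
    ranked = sorted(index, key=lambda k: cosine(query_vec, index[k]),
                    reverse=True)
    return ranked[:top_k]

index = {
    "assets/skateboard.jpg": [0.9, 0.1, 0.0],
    "assets/food.mp4#0.0-4.0": [0.1, 0.9, 0.1],  # per-section video embedding
    "assets/park.mp4#4.0-8.0": [0.7, 0.2, 0.1],
}
print(retrieve([1.0, 0.0, 0.0], index))
# → ['assets/skateboard.jpg', 'assets/park.mp4#4.0-8.0']
```

Because video sections are indexed individually, a match can point at a specific time range of a clip rather than the whole file.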

4. On-device Whisper

  • Legacy mode: faster-whisper or PyTorch Whisper (CPU/CUDA)
  • Qualcomm NPU mode (default): Whisper-Small via pre-compiled ONNX/QNN models from Qualcomm AI Hub
  • Runs through a Python subprocess; no external API calls

5. Python labs for experimentation

  • nexa-caption-lab: Nexa SDK transcription with optional NPU support
  • nexa-video-context-lab: Qwen3‑VL for scene context—find timestamp ranges by natural-language query (e.g., "person walks to whiteboard") and export the matching clips—usable as a standalone CLI

How We Built It

Architecture

  • Frontend: React 19 + React Router, Tailwind, Radix UI, Framer Motion
  • Desktop: Electron (we switched from Tauri to Electron for compatibility)
  • Backend: Node.js/Express serving uploads, media, transcription, and retrieval
  • Video rendering: Remotion + FFmpeg
  • Python subprocess: Transcription (Whisper / NPU) and embeddings (SigLIP2) run as sidecar processes—no separate server
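
The sidecar pattern is simple: the Node backend spawns a Python process and reads JSON from its stdout. A minimal sketch of that contract, demonstrated Python-to-Python with a stub payload; the real worker would run Whisper or SigLIP2, and the JSON field names here are hypothetical:

```python
# Hedged sketch of the sidecar pattern: spawn a worker process, read a
# JSON payload from stdout, parse it. The worker here is a stub that
# emits a Whisper-style word-timestamp record.
import json
import subprocess
import sys

worker_code = (
    "import json;"
    "print(json.dumps({'words': [{'text': 'hello', 'start': 0.0, 'end': 0.4}]}))"
)

proc = subprocess.run(
    [sys.executable, "-c", worker_code],
    capture_output=True, text=True, check=True,
)
result = json.loads(proc.stdout)
print(result["words"][0]["text"])  # → hello
```

Keeping the ML work in a subprocess means no long-running Python server, no open network port, and a crash in the model code cannot take down the editor.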

AI / ML stack

  Task                     | Model / System                             | Notes
  Speech transcription     | Whisper (openai/whisper-small)             | faster-whisper or transformers fallback
  NPU transcription        | Whisper-Small ONNX/QNN (Qualcomm AI Hub)   | via nexa-caption-lab
  Image/video embeddings   | SigLIP2 (google/siglip2-base-patch16-224)  | Hugging Face Transformers
  Scene context (lab)      | Qwen3‑VL‑4B                                | Standalone CLI (nexa-video-context-lab)

Data flow

  1. User imports media → stored in out/<project-id>/
  2. User transcribes selected clips → Whisper runs via Python subprocess, returns word timestamps
  3. User edits transcript → segments mapped to time ranges; timeline is cut/split
  4. User generates captions → Remotion composition renders with burned-in text
  5. User searches for B‑roll → SigLIP2 embeddings queried; matching assets from assets/ returned

Challenges We Faced

  • Qualcomm NPU integration
    Setting up ONNX/QNN models and nexa-caption-lab’s NPU path required careful handling of pre-compiled binaries, context binaries, and Python envs. We added a fallback to legacy Whisper when NPU isn’t available.

  • Transcript-to-segment alignment
    Mapping edited transcript text back to Whisper’s word timestamps for cutting clips was non-trivial. We tokenize and align edited text with original words to produce segments with correct start/end times.
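
One way to do this alignment is sequence matching between the original word tokens and the edited text, which is sketched below with the standard library's difflib; Videre's actual tokenizer and aligner may differ:

```python
# Hedged sketch: align edited transcript text back to Whisper's timed
# words with difflib.SequenceMatcher, keeping only words that survive
# the edit. Illustrative, not Videre's exact implementation.
import difflib

def align(original_words, edited_text):
    """original_words: list of (text, start, end). Returns surviving words."""
    orig_tokens = [w for w, _, _ in original_words]
    edit_tokens = edited_text.split()
    sm = difflib.SequenceMatcher(a=orig_tokens, b=edit_tokens)
    kept = []
    for tag, i1, i2, _, _ in sm.get_opcodes():
        if tag == "equal":  # these words survived the edit unchanged
            kept.extend(original_words[i1:i2])
    return kept

words = [("so", 0.0, 0.2), ("we", 0.2, 0.4), ("built", 0.4, 0.7), ("it", 0.7, 0.9)]
print(align(words, "we built it"))  # "so" was deleted in the editor
```

The kept words carry their original start/end times, so the surviving spans can be merged into cut segments directly.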

  • Video section embeddings
    We split videos into temporal sections, sample frames per section, average their SigLIP2 embeddings, and store them with segment IDs like video.mp4#0.0-4.0 so retrieval can target specific time ranges.
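
The sectioning and ID scheme can be sketched in a few lines; the frame embeddings here are stand-in vectors, since the real ones come from SigLIP2:

```python
# Hedged sketch of per-section video embeddings: split a clip's duration
# into equal sections keyed by "path#start-end" IDs, and average the
# per-frame embeddings within each section. Toy vectors stand in for
# real SigLIP2 frame embeddings.

def section_ids(path, duration, n_sections=4):
    """Build segment IDs like video.mp4#0.0-4.0 for each time section."""
    step = duration / n_sections
    return [f"{path}#{i * step:.1f}-{(i + 1) * step:.1f}"
            for i in range(n_sections)]

def mean_embedding(frame_embs):
    """Average a list of equal-length frame embedding vectors."""
    n = len(frame_embs)
    return [sum(col) / n for col in zip(*frame_embs)]

print(section_ids("video.mp4", 16.0))
# → ['video.mp4#0.0-4.0', 'video.mp4#4.0-8.0',
#    'video.mp4#8.0-12.0', 'video.mp4#12.0-16.0']
print(mean_embedding([[1.0, 0.0], [0.0, 1.0]]))  # → [0.5, 0.5]
```

Averaging frames within a section gives one vector per time range, so retrieval can score and return individual sections instead of whole videos.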

  • Cross-platform and ARM64
    On Windows ARM64, Remotion’s default Chrome Headless Shell wasn’t usable; we added support for an x64 Chrome Headless Shell under emulation for rendering.


Tech Stack

  • Frontend: React 19, React Router, Tailwind CSS, Radix UI, Framer Motion, Remotion
  • Desktop: Electron
  • Backend: Node.js, Express
  • Video: FFmpeg, Remotion
  • AI / ML: Hugging Face Transformers, SigLIP2, Whisper (faster-whisper / PyTorch), ONNX Runtime QNN (Qualcomm)
  • Python: transformers, PyTorch, faster-whisper, opencv-python, nexa-caption-lab
