Videre

Inspiration

Video editing today forces creators to choose between privacy, cost, and connectivity. Professional tools like Adobe Creative Cloud or DaVinci Resolve Studio are expensive, and cloud AI features require uploading sensitive footage or depend on stable internet.

We wanted to change that. Videre is built for:

  • Independent creators who can’t afford enterprise subscriptions
  • Professionals working with confidential content (medical, legal, corporate)
  • Users in low-bandwidth or offline environments where cloud AI isn’t usable

Our goal was a desktop video editor that runs fully offline, uses local AI for transcription and retrieval, and lets you edit by simply selecting or deleting text in the transcript. We took inspiration from transcript-based editing tools but focused on making everything run on your own hardware—including Qualcomm Snapdragon NPUs—without ever sending data to the cloud.


What We Built

1. Transcript-based video editing

  • Import videos and audio into a timeline with tracks and scrubbers
  • Transcribe clips with word-level timestamps using Whisper (CPU/GPU or Qualcomm NPU)
  • Edit video by editing the transcript: delete or change text, and the corresponding video segments are cut or split automatically
  • Sync transcript ↔ video: word-level timing powers precise edits and caption placement
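
The word-level timing above is what makes text edits translate into cuts. A minimal sketch of that idea, using a hypothetical word tuple shape rather than Videre's actual internal data structures:

```python
# Hedged sketch: turn word-level timestamps into timeline ranges when
# words are deleted from the transcript. The (text, start, end) tuple
# shape is illustrative, not Videre's actual internals.

def segments_to_keep(words, deleted):
    """words: list of (text, start, end); deleted: set of word indices.
    Returns merged (start, end) ranges covering the surviving words."""
    ranges = []
    prev_kept = None
    for i, (_, start, end) in enumerate(words):
        if i in deleted:
            prev_kept = None  # a deletion ends the current range
            continue
        if prev_kept is not None:
            ranges[-1][1] = end          # consecutive kept word: extend range
        else:
            ranges.append([start, end])  # start a new range after a cut
        prev_kept = i
    return [tuple(r) for r in ranges]

words = [("hello", 0.0, 0.4), ("um", 0.4, 0.6), ("world", 0.6, 1.0)]
print(segments_to_keep(words, deleted={1}))  # → [(0.0, 0.4), (0.6, 1.0)]
```

Deleting the filler word splits the clip into two kept ranges; the timeline cut falls exactly on the word boundaries Whisper reported.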

2. AI-generated captions

  • Generate burned-in captions from the transcript, synced to word timings
  • Customize font, color, alignment, weight via the Text panel
  • Caption track integrated with the Remotion composition for export
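
Caption generation amounts to grouping the transcript's timed words into short lines. A sketch of one possible chunking rule, capping words per line and line duration; the parameter names are illustrative, not Videre's actual caption settings:

```python
# Hedged sketch: group word timestamps into caption lines, capping the
# word count and duration per line. Not Videre's actual algorithm.

def chunk_captions(words, max_words=4, max_secs=3.0):
    """words: list of (text, start, end) -> list of (line, start, end)."""
    lines, cur = [], []
    for text, start, end in words:
        # Flush the current line if adding this word would exceed a cap.
        if cur and (len(cur) >= max_words or end - cur[0][1] > max_secs):
            lines.append((" ".join(w for w, _, _ in cur), cur[0][1], cur[-1][2]))
            cur = []
        cur.append((text, start, end))
    if cur:
        lines.append((" ".join(w for w, _, _ in cur), cur[0][1], cur[-1][2]))
    return lines

words = [("a", 0.0, 0.5), ("b", 0.5, 1.0), ("c", 1.0, 1.5),
         ("d", 1.5, 2.0), ("e", 2.0, 2.5)]
print(chunk_captions(words, max_words=2))
# → [('a b', 0.0, 1.0), ('c d', 1.0, 2.0), ('e', 2.0, 2.5)]
```

Each resulting (line, start, end) triple maps directly onto a caption item in the Remotion composition.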

3. B‑roll retrieval with SigLIP2

  • Embed images and videos using google/siglip2-base-patch16-224 (Hugging Face)
  • Each image gets one embedding; each video is split into 4–5 time sections and embedded
  • Text-based retrieval: describe what you want (e.g., "a skateboard", "food") and find matching assets from your assets/ folder
  • Auto-suggest from transcript: retrieval panel can use the current clip transcript as a search query to suggest B‑roll
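
Under the hood, retrieval is a nearest-neighbor search over embedding vectors. A minimal sketch with toy vectors and pure-Python cosine similarity; in Videre the embeddings would come from google/siglip2-base-patch16-224 via the Python subprocess, and the asset IDs below are hypothetical:

```python
# Hedged sketch of the retrieval step: rank assets by cosine similarity
# between a text-query embedding and precomputed asset embeddings.
# The 3-dim vectors are toys; real SigLIP2 embeddings are much wider.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, index, top_k=2):
    """index: dict of asset id -> embedding. Returns best-matching ids."""
    ranked = sorted(index, key=lambda k: cosine(query_vec, index[k]),
                    reverse=True)
    return ranked[:top_k]

index = {
    "assets/skateboard.jpg": [0.9, 0.1, 0.0],
    "assets/food.mp4#0.0-4.0": [0.1, 0.9, 0.1],  # per-section video embedding
    "assets/park.mp4#4.0-8.0": [0.7, 0.2, 0.1],
}
print(retrieve([1.0, 0.0, 0.0], index))
# → ['assets/skateboard.jpg', 'assets/park.mp4#4.0-8.0']
```

Because video sections are indexed individually, a match can point at a specific time range of a clip rather than the whole file.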

4. On-device Whisper

  • Legacy mode: faster-whisper or PyTorch Whisper (CPU/CUDA)
  • Qualcomm NPU mode (default): Whisper-Small via pre-compiled ONNX/QNN models from Qualcomm AI Hub
  • Runs through a Python subprocess; no external API calls

5. Python labs for experimentation

  • nexa-caption-lab: Nexa SDK transcription with optional NPU support
  • nexa-video-context-lab: Qwen3‑VL for scene context—find timestamp ranges by natural-language query (e.g., "person walks to whiteboard") and export the matching clips—usable as a standalone CLI

How We Built It

Architecture

  • Frontend: React 19 + React Router, Tailwind, Radix UI, Framer Motion
  • Desktop: Electron (we switched from Tauri to Electron for compatibility)
  • Backend: Node.js/Express serving uploads, media, transcription, and retrieval
  • Video rendering: Remotion + FFmpeg
  • Python subprocess: Transcription (Whisper / NPU) and embeddings (SigLIP2) run as sidecar processes—no separate server
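
The sidecar pattern is simple: the Node backend spawns a Python process and reads JSON from its stdout. A minimal sketch of that contract, demonstrated Python-to-Python with a stub payload; the real worker would run Whisper or SigLIP2, and the JSON field names here are hypothetical:

```python
# Hedged sketch of the sidecar pattern: spawn a worker process, read a
# JSON payload from stdout, parse it. The worker here is a stub that
# emits a Whisper-style word-timestamp record.
import json
import subprocess
import sys

worker_code = (
    "import json;"
    "print(json.dumps({'words': [{'text': 'hello', 'start': 0.0, 'end': 0.4}]}))"
)

proc = subprocess.run(
    [sys.executable, "-c", worker_code],
    capture_output=True, text=True, check=True,
)
result = json.loads(proc.stdout)
print(result["words"][0]["text"])  # → hello
```

Keeping the ML work in a subprocess means no long-running Python server, no open network port, and a crash in the model code cannot take down the editor.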

AI / ML stack

  Task                     | Model / System                             | Notes
  Speech transcription     | Whisper (openai/whisper-small)             | faster-whisper or transformers fallback
  NPU transcription        | Whisper-Small ONNX/QNN (Qualcomm AI Hub)   | via nexa-caption-lab
  Image/video embeddings   | SigLIP2 (google/siglip2-base-patch16-224)  | Hugging Face Transformers
  Scene context (lab)      | Qwen3‑VL‑4B                                | Standalone CLI (nexa-video-context-lab)

Data flow

  1. User imports media → stored in out/<project-id>/
  2. User transcribes selected clips → Whisper runs via Python subprocess, returns word timestamps
  3. User edits transcript → segments mapped to time ranges; timeline is cut/split
  4. User generates captions → Remotion composition renders with burned-in text
  5. User searches for B‑roll → SigLIP2 embeddings queried; matching assets from assets/ returned

Challenges We Faced

  • Qualcomm NPU integration
    Setting up ONNX/QNN models and nexa-caption-lab’s NPU path required careful handling of pre-compiled binaries, context binaries, and Python envs. We added a fallback to legacy Whisper when NPU isn’t available.

  • Transcript-to-segment alignment
    Mapping edited transcript text back to Whisper’s word timestamps for cutting clips was non-trivial. We tokenize and align edited text with original words to produce segments with correct start/end times.
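
One way to do this alignment is sequence matching between the original word tokens and the edited text, which is sketched below with the standard library's difflib; Videre's actual tokenizer and aligner may differ:

```python
# Hedged sketch: align edited transcript text back to Whisper's timed
# words with difflib.SequenceMatcher, keeping only words that survive
# the edit. Illustrative, not Videre's exact implementation.
import difflib

def align(original_words, edited_text):
    """original_words: list of (text, start, end). Returns surviving words."""
    orig_tokens = [w for w, _, _ in original_words]
    edit_tokens = edited_text.split()
    sm = difflib.SequenceMatcher(a=orig_tokens, b=edit_tokens)
    kept = []
    for tag, i1, i2, _, _ in sm.get_opcodes():
        if tag == "equal":  # these words survived the edit unchanged
            kept.extend(original_words[i1:i2])
    return kept

words = [("so", 0.0, 0.2), ("we", 0.2, 0.4), ("built", 0.4, 0.7), ("it", 0.7, 0.9)]
print(align(words, "we built it"))  # "so" was deleted in the editor
```

The kept words carry their original start/end times, so the surviving spans can be merged into cut segments directly.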

  • Video section embeddings
    We split videos into temporal sections, sample frames per section, average their SigLIP2 embeddings, and store them with segment IDs like video.mp4#0.0-4.0 so retrieval can target specific time ranges.
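
The sectioning and ID scheme can be sketched in a few lines; the frame embeddings here are stand-in vectors, since the real ones come from SigLIP2:

```python
# Hedged sketch of per-section video embeddings: split a clip's duration
# into equal sections keyed by "path#start-end" IDs, and average the
# per-frame embeddings within each section. Toy vectors stand in for
# real SigLIP2 frame embeddings.

def section_ids(path, duration, n_sections=4):
    """Build segment IDs like video.mp4#0.0-4.0 for each time section."""
    step = duration / n_sections
    return [f"{path}#{i * step:.1f}-{(i + 1) * step:.1f}"
            for i in range(n_sections)]

def mean_embedding(frame_embs):
    """Average a list of equal-length frame embedding vectors."""
    n = len(frame_embs)
    return [sum(col) / n for col in zip(*frame_embs)]

print(section_ids("video.mp4", 16.0))
# → ['video.mp4#0.0-4.0', 'video.mp4#4.0-8.0',
#    'video.mp4#8.0-12.0', 'video.mp4#12.0-16.0']
print(mean_embedding([[1.0, 0.0], [0.0, 1.0]]))  # → [0.5, 0.5]
```

Averaging frames within a section gives one vector per time range, so retrieval can score and return individual sections instead of whole videos.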

  • Cross-platform and ARM64
    On Windows ARM64, Remotion’s default Chrome Headless Shell wasn’t usable; we added support for an x64 Chrome Headless Shell under emulation for rendering.


Tech Stack

  • Frontend: React 19, React Router, Tailwind CSS, Radix UI, Framer Motion, Remotion
  • Desktop: Electron
  • Backend: Node.js, Express
  • Video: FFmpeg, Remotion
  • AI / ML: Hugging Face Transformers, SigLIP2, Whisper (faster-whisper / PyTorch), ONNX Runtime QNN (Qualcomm)
  • Python: transformers, PyTorch, faster-whisper, opencv-python, nexa-caption-lab
