Inspiration
We wanted to build a tool that lets anyone make precise, AI-powered edits to video.
Here is the workflow: import video, segment an object once, transform it, and propagate that change across every frame automatically.
The idea was to combine the precision of traditional video editing with the power of generative AI.
What it does
FrameShift is an AI-powered video editor that lets users upload a video, click on any object to segment it with SAM2, and apply edits (recolor, remove, blur, resize) via Cloudinary that are composited onto the original frame using the binary mask. Users can also describe edits in natural language through an AI chat: Gemini generates a preview on the current frame, and with one click the transformation is propagated across all frames using key-frame generation and RIFE neural interpolation for smooth temporal consistency.
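The propagation step boils down to frame-index bookkeeping: every Nth frame goes to the generative model as a key frame, and the frames between consecutive key frames are filled by interpolation. A minimal sketch of that split (the `keyframe_interval` parameter and function name are illustrative, not the actual FrameShift code):

```python
def plan_propagation(num_frames: int, keyframe_interval: int = 8):
    """Split frame indices into generative key frames and interpolated fills.

    Every `keyframe_interval`-th frame (plus the final frame) is sent to the
    generative model; the frames between consecutive key frames are produced
    by RIFE-style interpolation instead of a model call.
    """
    keyframes = list(range(0, num_frames, keyframe_interval))
    if keyframes[-1] != num_frames - 1:
        keyframes.append(num_frames - 1)  # always anchor the final frame
    fills = [i for i in range(num_frames) if i not in keyframes]
    return keyframes, fills
```

With an interval of 8, each pair of key frames leaves 7 frames for the interpolator to fill, which matches the ratio described below.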
How we built it
- Frontend: Next.js 16 with TypeScript, using a custom editor UI with a canvas renderer, timeline scrubber, and real-time frame preview. State management is handled through a single useEditorState hook with unified polling for all async backend operations.
- Backend: FastAPI (Python) orchestrating multiple AI services — YOLOv11 for object detection, SAM2 for point-click segmentation, Cloudinary for image transformations (recolor, background removal, upscale, restore), and Google Gemini (gemini-3.1-flash-image-preview) for generative edits.
- Frame interpolation: RIFE (Real-Time Intermediate Flow Estimation) running on MPS/CUDA to generate smooth intermediate frames between AI-transformed key frames.
- Infrastructure: Deployed on Vultr (compute servers and a K8s cluster), with the backend serving frames from a temp filesystem and Cloudinary handling heavy image processing.
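Because generated frames can come back at a different resolution than the source video, every frame is normalized to a common size before RIFE sees it. A minimal sketch of the dimension logic (the multiple-of-32 constraint reflects how many RIFE implementations pad their inputs — treat it as an assumption; the actual resize would go through Pillow or OpenCV):

```python
def normalized_size(ref_w: int, ref_h: int, multiple: int = 32) -> tuple[int, int]:
    """Size every frame is resized to before interpolation: the original
    video resolution, rounded down to a multiple of `multiple`."""
    rw = max(multiple, ref_w - ref_w % multiple)
    rh = max(multiple, ref_h - ref_h % multiple)
    return rw, rh

def needs_resize(frame_w: int, frame_h: int, ref_w: int, ref_h: int) -> bool:
    """True when a generated frame (e.g. a 1440p model output against a
    720p source) must be resized before it can be interpolated."""
    return (frame_w, frame_h) != normalized_size(ref_w, ref_h)
```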
Challenges we ran into
- Configuring Cloudinary APIs and transforming dynamic asset URLs: Cloudinary's eager transformations and URL-based pipeline made it tricky to chain operations like gen_recolor with mask compositing.
- Object detection and segmentation: Getting single-click segmentation working end-to-end required coordinating YOLOv11's bounding-box detection with SAM2's point-prompt segmentation.
- High-latency RIFE image stitching: RIFE's neural network uses a shared global model on MPS/GPU, which meant we couldn't parallelize interpolation segments without hitting tensor dimension mismatches. We had to run segments sequentially, and Gemini's output frames often came back at different resolutions (1440p vs 768p), causing RIFE to crash until we added automatic resize normalization.
- Configuring Vultr: Setting up a Kubernetes compute cluster backend on Vultr required managing PyTorch with MPS support, SAM2 model weights, and RIFE's vendored model files.
- Frontend polling lifecycle: With five concurrent async operations (detection, segmentation, editing, refinement, AI propagation), keeping the frontend in sync with the backend's job states was a constant challenge.
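The polling problem reduces to enforcing legal status transitions, so that a stale "processing" response can never overwrite a terminal state. A minimal sketch (the status names and transition table are illustrative assumptions, not the actual FrameShift schema):

```python
# Legal transitions for one async operation. Anything not listed is
# ignored, so a late "processing" poll cannot resurrect a finished job.
TRANSITIONS = {
    "idle": {"queued"},
    "queued": {"processing", "failed"},
    "processing": {"done", "failed"},
    "done": set(),          # terminal
    "failed": {"queued"},   # allow retry
}

class JobState:
    def __init__(self):
        self.status = "idle"

    def apply(self, polled_status: str) -> bool:
        """Apply a polled status; return True if the transition was legal."""
        if polled_status in TRANSITIONS[self.status]:
            self.status = polled_status
            return True
        return False
```

Running one state machine per operation keeps a stuck status in one pipeline from blocking the other four.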
Accomplishments that we're proud of
- Single-click object segmentation that instantly generates an editable mask, no manual selection tools needed.
- The AI chat-to-video pipeline: describe an edit in plain English, preview it
on one frame, then propagate it across the entire video with one button.
- RIFE interpolation producing temporally smooth results between AI-generated key frames: every Nth frame is transformed by Gemini, and RIFE fills the 7 frames in between.
- Real-time mask border rendering on an HTML canvas with edge detection, giving users immediate visual feedback on what's selected.
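The border rendering above comes down to a simple edge test on the binary mask: a pixel is on the border if it is inside the mask but has at least one 4-connected neighbor outside it. A sketch of that test (the real renderer does this per-pixel on an HTML canvas; here the mask is a nested Python list for illustration):

```python
def mask_border(mask: list[list[int]]) -> list[tuple[int, int]]:
    """Return (row, col) coordinates of mask pixels on the mask's edge.

    A pixel belongs to the border when it is 1 and any 4-connected
    neighbor is 0 (or lies off the image entirely).
    """
    h, w = len(mask), len(mask[0])
    border = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if not (0 <= nr < h and 0 <= nc < w) or not mask[nr][nc]:
                    border.append((r, c))
                    break
    return border
```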
What we learned
- Generative AI models don't guarantee that output dimensions match the input; always normalize before downstream processing.
- Polling-based architectures need careful state machine design; a single stuck "processing" status can brick the entire UI.
- Cloudinary's transformation pipeline is powerful but URL construction gets
complex fast when chaining multiple effects with masks.
- RIFE interpolation quality degrades significantly when key frames are too far apart.
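The URL-complexity lesson is easy to see in miniature: Cloudinary encodes each transformation step as a path segment, with parameters comma-joined within a step and steps slash-joined, so later steps operate on the output of earlier ones. A sketch of a chained-URL builder (the cloud name, public ID, and effect strings are placeholders; production code would go through the official cloudinary SDK rather than hand-building URLs):

```python
def cloudinary_url(cloud: str, public_id: str, steps: list[dict]) -> str:
    """Build a chained Cloudinary delivery URL.

    Each step is a dict of transformation params (e.g. {"e": "blur:300"});
    params within a step are comma-joined, steps are slash-joined.
    """
    chain = "/".join(
        ",".join(f"{k}_{v}" for k, v in sorted(step.items()))
        for step in steps
    )
    return f"https://res.cloudinary.com/{cloud}/image/upload/{chain}/{public_id}"
```

Two or three chained effects are readable; once masks and generative effects enter the chain, the strings get unwieldy fast, which is exactly the pain point above.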
What's next for FrameShift
- Real-time SAM2 mask propagation across frames using SAM2's video predictor instead of copying single-frame masks.
- Batch rendering pipeline with server-side FFmpeg for production-quality MP4 export with audio preservation.
- Multi-object tracking: edit different objects independently and propagate each edit separately.
- Cloud GPU acceleration for RIFE and Gemini processing to bring propagation
time from minutes to seconds.
Built With
- cloudinary
- fastapi
- ffmpeg
- gemini
- k8
- konva
- next
- postgresql
- python
- pytorch
- sam
- vultr
- yolo






