Inspiration

We wanted to build a tool that lets anyone make precise, AI-powered edits to video. Here is the workflow: import a video, segment an object once, transform it, and propagate that change across every frame automatically.
The idea was to combine the precision of traditional video editing with the power of generative AI.

What it does

FrameShift is an AI-powered video editor that lets users upload a video, click
on any object to segment it using SAM2, and apply edits (recolor, remove, blur, resize) via Cloudinary that are composited onto the original frame using the binary mask. Users can also describe edits in natural language through an AI chat: Gemini generates a preview on the current frame, and one click propagates the transformation across all frames using key-frame generation and RIFE neural interpolation for smooth temporal consistency.
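The compositing step above can be sketched as follows. This is a minimal illustration, not the app's actual implementation: frames are plain nested lists of RGB tuples and the mask is a 0/1 grid, whereas the real pipeline works on image buffers from Cloudinary and SAM2.

```python
# Minimal sketch: a binary mask gates an edited frame onto the original.
# The list-of-tuples frame representation is illustrative only.

def composite(original, edited, mask):
    """Copy edited pixels onto the original wherever the mask is 1."""
    return [
        [edited[y][x] if mask[y][x] else original[y][x]
         for x in range(len(original[0]))]
        for y in range(len(original))
    ]

# Tiny 2x2 example: only the masked pixel takes the edited value.
orig = [[(0, 0, 0), (0, 0, 0)], [(0, 0, 0), (0, 0, 0)]]
edit = [[(255, 0, 0), (255, 0, 0)], [(255, 0, 0), (255, 0, 0)]]
mask = [[1, 0], [0, 0]]
result = composite(orig, edit, mask)
```

The same gating idea applies per-frame once a mask has been propagated, which is why a single segmentation click is enough to drive the whole edit.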

How we built it

  • Frontend: Next.js 16 with TypeScript, using a custom editor UI with a canvas
    renderer, timeline scrubber, and real-time frame preview. State management is handled through a single useEditorState hook with unified polling for all async backend operations.
  • Backend: FastAPI (Python) orchestrating multiple AI services — YOLOv11 for object detection, SAM2 for point-click segmentation, Cloudinary for image
    transformations (recolor, background removal, upscale, restore), and Google Gemini (gemini-3.1-flash-image-preview) for generative edits.
  • Frame Interpolation: RIFE (Real-Time Intermediate Flow Estimation) running on MPS/CUDA for generating smooth intermediate frames between AI-transformed key
    frames.
  • Infrastructure: Deployed on Vultr (compute servers and a Kubernetes cluster) with the backend serving frames from a temp filesystem and Cloudinary handling heavy image processing.
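The key-frame scheduling that ties Gemini and RIFE together can be sketched like this. The helper name and the stride of 8 are assumptions for illustration (the writeup mentions "every Nth frame" with 7 interpolated frames in between, which implies N = 8):

```python
# Illustrative sketch of the key-frame schedule: Gemini transforms
# every Nth frame, and RIFE interpolates the frames in each gap.

def keyframe_schedule(total_frames, stride=8):
    """Return (key-frame indices, interpolated-frame count per gap)."""
    keys = list(range(0, total_frames, stride))
    if keys[-1] != total_frames - 1:
        keys.append(total_frames - 1)  # always transform the final frame
    gaps = [b - a - 1 for a, b in zip(keys, keys[1:])]
    return keys, gaps

keys, gaps = keyframe_schedule(25, stride=8)
# keys = [0, 8, 16, 24]; each gap leaves 7 frames for RIFE to fill
```

Keeping the schedule explicit like this also makes the cost trade-off visible: fewer key frames means fewer Gemini calls but longer gaps for RIFE to bridge.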

Challenges we ran into

  1. Configuring Cloudinary APIs and transforming dynamic asset URLs: Cloudinary's eager transformations and URL-based pipeline made it tricky to chain operations like gen_recolor with mask compositing.
  2. SAM2 segmentation with YOLOv11 detection: getting single-click segmentation working end-to-end required coordinating
    YOLOv11's bounding-box detection with SAM2's point-prompt segmentation.
  3. High-latency RIFE image stitching: RIFE's neural network uses a shared global model on MPS/GPU, which meant we couldn't
    parallelize interpolation segments without hitting tensor dimension mismatches, so we had to run segments sequentially. On top of that, Gemini's output frames often came back at different resolutions (1440p vs. 768p), crashing RIFE until we added automatic resize normalization.
  4. Configuring Vultr: Setting up a Kubernetes compute cluster backend on Vultr required managing PyTorch with MPS support, SAM2 model weights, and RIFE's vendored model files.
  5. Frontend polling lifecycle: with five concurrent async operations (detection, segmentation, editing, refinement, AI propagation), keeping the
    frontend in sync required careful state management.
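The resize normalization from challenge 3 amounts to picking one target resolution for a key-frame pair before handing them to RIFE. A minimal sketch, with two assumptions flagged in the comments: the function name is ours, and snapping to a multiple of 32 stands in for whatever divisibility the interpolation model actually needs.

```python
# Sketch of resize normalization before RIFE: choose one target size
# for a mismatched key-frame pair. Snapping to a multiple of 32 is an
# assumption here; RIFE-style models generally want dimensions the
# network can downsample cleanly.

def normalize_size(size_a, size_b, multiple=32):
    """Take the smaller width/height of the pair, rounded down to `multiple`."""
    w = min(size_a[0], size_b[0])
    h = min(size_a[1], size_b[1])
    return (w - w % multiple, h - h % multiple)

# e.g. Gemini returns a 1440p frame and a 768p frame for adjacent keys:
target = normalize_size((2560, 1440), (1366, 768))
```

Both frames are then resized to `target` before interpolation, so RIFE only ever sees tensors of matching dimensions.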

Accomplishments that we're proud of

  1. Single-click object segmentation that instantly generates an editable mask; no manual selection tools needed.
  2. The AI chat-to-video pipeline: describe an edit in plain English, preview it on one frame, then propagate it across the entire video with one button.
  3. RIFE interpolation producing temporally smooth results between AI-generated key frames: every Nth frame is transformed by Gemini, and RIFE fills the 7
    frames in between.
  4. Real-time mask border rendering on an HTML canvas with edge detection, giving users immediate visual feedback on what's selected.
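The mask-border rendering in point 4 boils down to a simple edge test: a mask pixel is on the border if any 4-neighbour lies outside the mask. The real renderer does this per frame on an HTML canvas; this pure-Python version just shows the test itself.

```python
# Sketch of the border pass behind the canvas overlay: a mask pixel is
# a border pixel if it is set and at least one 4-neighbour is not.

def border_pixels(mask):
    h, w = len(mask), len(mask[0])

    def outside(y, x):
        return y < 0 or y >= h or x < 0 or x >= w or not mask[y][x]

    return {
        (y, x)
        for y in range(h) for x in range(w)
        if mask[y][x] and any(
            outside(y + dy, x + dx)
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
        )
    }

# A solid 3x3 blob in a 5x5 grid: the 8 outer pixels are border,
# the centre is not.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
edges = border_pixels(mask)
```

Drawing only `edges` (rather than filling the whole mask) is what keeps the selection overlay readable on top of the video frame.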

What we learned

  • Generative AI models don't guarantee that output dimensions match the input;
    always normalize before downstream processing.
  • Polling-based architectures need careful state machine design; a single stuck "processing" status can brick the entire UI.
  • Cloudinary's transformation pipeline is powerful but URL construction gets complex fast when chaining multiple effects with masks.
  • RIFE interpolation quality degrades significantly when key frames are too far apart.
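The state-machine lesson can be sketched as a tiny polled-operation model in which "processing" always has a terminal exit, including a timeout, so a stuck backend status can never wedge the UI. All names here are illustrative; this is not the actual useEditorState implementation (which lives in TypeScript on the frontend).

```python
# Sketch of the polling lesson: every operation state has an explicit
# set of legal successors, and "processing" can always time out.

import time

VALID = {
    "idle":       {"processing"},
    "processing": {"done", "error", "timeout"},
    "done":       set(),
    "error":      set(),
    "timeout":    set(),
}

class PolledOp:
    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self.state = "idle"
        self.timeout_s = timeout_s
        self.clock = clock       # injectable for testing
        self.started = None

    def transition(self, new_state):
        if new_state not in VALID[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        if new_state == "processing":
            self.started = self.clock()

    def poll(self, backend_status):
        """Called on every poll tick; enforces the timeout escape hatch."""
        if self.state == "processing":
            if backend_status in ("done", "error"):
                self.transition(backend_status)
            elif self.clock() - self.started > self.timeout_s:
                self.transition("timeout")
        return self.state
```

With five such operations running concurrently, making every state's successors explicit is what prevents one stalled poll loop from bricking the editor.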

What's next for FrameShift

  • Real-time SAM2 mask propagation across frames using SAM2's video predictor
    instead of copying single-frame masks.
  • Batch rendering pipeline with server-side FFmpeg for production-quality MP4
    export with audio preservation.
  • Multi-object tracking: edit different objects independently and propagate each edit separately.
  • Cloud GPU acceleration for RIFE and Gemini processing to bring propagation time from minutes to seconds.
