Inspiration
Stories live and die on continuity—the small, human details that make a world feel real. But in modern production, scenes are shot out of order, characters change outfits, props move between takes, and editors are left hunting through hours of footage to answer simple questions: When did we last see this character? Who were they with? Why did that jacket suddenly change? We felt that pain—and realized computer vision could do more than detect faces; it could protect the narrative.
That’s what inspired CastThread: a character-aware continuity companion that follows recurring characters across scenes even when their appearance shifts, maps who shares the screen with whom over time, and surfaces the moments that break immersion. By turning raw video into a living character graph + continuity timeline, CastThread helps creators stay focused on storytelling—catching the subtle inconsistencies before the audience does.
What it does
CastThread automatically tracks recurring characters across an entire video, even when they change outfits, lighting, camera angle, or scene. It detects and re-identifies faces and people over time to assign consistent character IDs, then produces a timeline of where each character appears, how long they’re on screen, and which scenes they span. This turns raw footage into a structured “cast continuity” layer that editors and analysts can query.
On top of identity tracking, CastThread builds a character relationship graph by aggregating co-occurrences: who appears with whom, in which scenes, and in what order. The output is an interactive graph plus scene-level summaries that reveal narrative structure (e.g., recurring pairings, isolated characters, missing interactions) and make it easy to navigate to the exact timestamps behind each edge.
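As an illustration of the co-occurrence aggregation described above, here is a minimal sketch. The input format (per-character appearance intervals) and function name are assumptions for illustration, not CastThread's actual internals:

```python
from collections import defaultdict

def build_cooccurrence_graph(appearances):
    """appearances: list of (character_id, scene_id, start_s, end_s).
    Returns {(char_a, char_b): [(scene_id, overlap_seconds), ...]} —
    each edge records where and for how long two characters share the screen."""
    by_scene = defaultdict(list)
    for char, scene, start, end in appearances:
        by_scene[scene].append((char, start, end))

    edges = defaultdict(list)
    for scene, chars in by_scene.items():
        # Pairwise overlap within a scene becomes a weighted, timestamped edge.
        for i in range(len(chars)):
            for j in range(i + 1, len(chars)):
                a, sa, ea = chars[i]
                b, sb, eb = chars[j]
                overlap = min(ea, eb) - max(sa, sb)
                if a != b and overlap > 0:
                    edges[tuple(sorted((a, b)))].append((scene, overlap))
    return dict(edges)
```

Keeping the scene ID and overlap on every edge is what makes the graph navigable: each relationship can be traced back to the exact timestamps behind it.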
Finally, CastThread flags continuity issues by comparing expected presence and visual attributes across adjacent shots and sequences. It highlights anomalies such as a character disappearing mid-sequence, sudden wardrobe/prop changes, or inconsistent object states, and links each alert to the relevant frames and timestamps—providing actionable checkpoints for narrative editing and quality control.
How we built it
We designed CastThread as a modular video-understanding pipeline optimized for long-form continuity, where identities must persist across scene cuts and wardrobe changes. The architecture separates (1) detection + tracking, (2) identity association, and (3) narrative/continuity reasoning so each stage can be improved independently without breaking the rest. Videos are first segmented into shots/scenes, then processed at a sampled frame rate to balance cost and accuracy. We store all intermediate artifacts (tracks, embeddings, timestamps, scene IDs) in a structured “character memory” layer so later passes can re-link identities as new evidence appears.
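A minimal sketch of what that "character memory" layer might look like: tracklets carry their shot context and embedding, and identity assignment is deliberately mutable so later passes can re-link them. The class and field names here are illustrative assumptions, not the actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Tracklet:
    tracklet_id: int
    shot_id: int
    start_frame: int
    end_frame: int
    embedding: List[float]                # pooled appearance embedding
    character_id: Optional[str] = None    # assigned (and re-assignable) later

@dataclass
class CharacterMemory:
    tracklets: List[Tracklet] = field(default_factory=list)

    def add(self, t: Tracklet) -> None:
        self.tracklets.append(t)

    def relink(self, tracklet_id: int, character_id: str) -> None:
        # Later passes can re-assign identity as new evidence appears,
        # without re-running detection or tracking.
        for t in self.tracklets:
            if t.tracklet_id == tracklet_id:
                t.character_id = character_id
```

Separating stored evidence from identity decisions is what lets each pipeline stage improve independently.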
At the core, we combine person detection and multi-object tracking to produce stable tracklets within a shot, then run a re-identification (ReID) + face recognition ensemble to merge tracklets across scenes. To handle outfit changes, we weight identity decisions toward face embeddings when available, and fall back to body ReID + temporal co-occurrence priors (who tends to appear together, in what sequence) when faces are occluded. A lightweight graph builder then converts merged identities into a Character Co-Occurrence Graph (nodes = characters, edges = shared screen time with timestamps and scene boundaries), enabling queries like “who appears with whom” and “when do relationships emerge.”
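The face-weighted identity decision described above can be sketched as a blended similarity score. The weight value and function signature are assumptions for illustration; the real ensemble also folds in temporal co-occurrence priors:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def identity_score(face_a, face_b, body_a, body_b, w_face=0.7):
    """Blend face and body-ReID similarity between two tracklets.
    Lean on face embeddings when both tracklets have one (robust to outfit
    changes); otherwise fall back to body ReID alone. w_face is illustrative,
    not a tuned value."""
    if face_a is not None and face_b is not None:
        return w_face * cosine(face_a, face_b) + (1 - w_face) * cosine(body_a, body_b)
    return cosine(body_a, body_b)
```

Two tracklets are merged when this score clears a threshold; weighting faces over body appearance is what keeps identities stable through wardrobe changes.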
Continuity checking is implemented as a set of rules and anomaly detectors over the timeline and graph. We flag missing-character anomalies by learning expected presence patterns across adjacent scenes (e.g., conversation participants) and detecting abrupt dropouts. For prop/wardrobe jumps, we attach attribute tags to tracks (dominant colors, key accessories/props) and detect sudden changes that violate short-range continuity constraints within a narrative unit. Outputs are delivered as an editor-friendly report: per-scene character roster, relationship graph snapshots over time, and a ranked list of continuity warnings with the exact timestamps and supporting evidence frames.
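The missing-character rule is the simplest of these detectors, and gives a flavor of how the others work. This is a minimal sketch assuming scene-level rosters; the production rules also use shot boundaries and learned presence patterns to avoid flagging intentional cuts:

```python
def missing_character_alerts(scene_rosters):
    """scene_rosters: ordered list of (scene_id, set_of_character_ids).
    Flags characters present in the scenes on either side of scene i but
    absent from scene i itself — an abrupt-dropout heuristic."""
    alerts = []
    for i in range(1, len(scene_rosters) - 1):
        prev_id, prev = scene_rosters[i - 1]
        cur_id, cur = scene_rosters[i]
        nxt_id, nxt = scene_rosters[i + 1]
        for char in (prev & nxt) - cur:
            alerts.append({"character": char,
                           "missing_in": cur_id,
                           "seen_in": [prev_id, nxt_id]})
    return alerts
```

Each alert carries the scenes that serve as evidence, so the report can link straight to the frames an editor needs to review.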
Challenges we ran into
The biggest challenge was getting reliable video understanding in a hackathon timeframe. “Best computer vision” quickly turns into a trade-off between accuracy, speed, and cost: higher-quality models improved recognition and temporal consistency, but added latency and GPU load. We also ran into practical issues with real-world video inputs—variable frame rates, motion blur, low light, and rapid scene changes—where frame-by-frame predictions would flicker or contradict each other without additional smoothing or temporal logic.
On the engineering side, stitching the pipeline together was harder than expected. Extracting frames, batching inference, and reassembling results into a coherent timeline introduced edge cases (dropped frames, mismatched timestamps, and inconsistent clip boundaries). We spent a lot of time debugging “it works on one video but not another” problems, and had to make pragmatic choices about what to support well versus what to defer.
Finally, evaluation was non-trivial. Without a labeled dataset tailored to our demo videos, it was difficult to quantify improvements beyond qualitative checks. We relied on quick sanity tests, targeted example clips, and iterative tuning, which helped us ship a compelling demo—but we’d want more systematic benchmarking and broader test coverage to confidently claim robustness across diverse video types.
Accomplishments that we're proud of
We built an end-to-end pipeline that ingests raw video, segments it into meaningful moments, and produces structured outputs (events, entities, and timelines) that are easy to search, summarize, and reuse. The system is optimized for real-world footage—handling varied lighting, motion, and camera angles—so results stay consistent across different creators and content types.
On the technical side, we engineered a scalable inference stack that keeps latency low while preserving accuracy, with modular components for detection, tracking, and temporal reasoning that can be improved independently. We also focused on reliability: clear confidence scoring, debuggable intermediate outputs, and guardrails that reduce noisy or misleading interpretations.
From a product perspective, we turned complex CV into an intuitive experience: upload a video, get actionable understanding, and iterate quickly. The UX emphasizes speed-to-insight—surfacing the “why” behind outputs and enabling users to jump directly to key moments—so creators and teams can move from footage to decisions (or edits) in minutes, not hours.
What we learned
Building CastThread taught us that “character continuity” is less about perfect face recognition and more about robust identity over time. Outfit changes, lighting shifts, camera angles, and occlusions quickly break naive similarity matching, so we learned to treat identity as a multi-signal problem—combining face cues, body/pose, temporal proximity, and co-occurrence context. We also realized that continuity isn’t purely visual: scene boundaries and editing structure matter, so aligning detections to shots and sequences is essential to avoid false continuity “errors” that are really just intentional cuts.
We also learned that the most valuable output isn’t just a detection score—it’s editor-friendly insights. Relationship graphs become useful only when they’re explainable (why two identities were linked) and navigable (when they appeared together). For continuity flags, precision matters more than recall: too many false alarms erode trust, so we focused on making alerts actionable (e.g., “character missing between these two adjacent shots” with evidence frames) and tunable based on the production’s style.
Finally, we grew in how we scope ambitious vision projects for hackathons: start with a tight end-to-end loop, then iterate. Getting a simple pipeline working—track → cluster identities → build co-occurrence graph → generate continuity checks—helped us validate the concept quickly, then improve the weakest links. We’re leaving with a clearer roadmap: better cross-scene re-ID, stronger temporal reasoning, and a UX that lets creators confirm, correct, and teach the system over time.
What's next for CastThread
Next, we’ll turn CastThread from a strong demo into a reliable pipeline for real-world edits. Our immediate focus is improving identity persistence across scenes: combining face/body re-ID with voice cues (when available), temporal smoothing, and an “evidence trail” per match (why we think Character A in Scene 3 is the same as Scene 1). We’ll also harden the continuity engine by expanding beyond wardrobe to props, scene context, and shot boundaries—so the system can flag what changed, when it changed, and how confident it is, with links to the exact timestamps and frames.
From there, we’ll productize around the workflows editors actually use. We’ll ship a lightweight web app and NLE-friendly export (e.g., markers/EDL/JSON) that lets teams review character timelines, co-occurrence graphs, and continuity alerts in minutes—not hours. We’ll add human-in-the-loop corrections (one-click “same person / different person”) that retrain per-project, plus team collaboration features (notes, assignments, audit logs) to support episodic production.
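As a sketch of the kind of JSON export we have in mind, here is a hypothetical marker serializer. The schema is an illustrative assumption, not a standard EDL or any particular NLE's marker format:

```python
import json

def export_markers(alerts, fps=24):
    """Serialize continuity alerts into a frame-addressed marker list that an
    NLE import script could consume. Assumes each alert has a 'character' and
    a 'timestamp_s' field, matching the continuity engine's output sketch."""
    markers = [
        {
            "name": f"Continuity: {a['character']} missing",
            "frame": int(a["timestamp_s"] * fps),
            "note": a.get("note", ""),
        }
        for a in alerts
    ]
    return json.dumps({"markers": markers}, indent=2)
```

Emitting frame numbers rather than seconds keeps the export robust across timeline frame rates, as long as the project fps is passed in.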
Finally, we’ll validate with creators and studios by running pilots on short films, episodic content, and YouTube series, measuring time saved and error reduction. The roadmap extends to “story intelligence”: relationship graph evolution across episodes, automatic “previously on” candidate selection, and consistency checks against scripts/storyboards. The goal is a durable product that becomes a standard QA layer in post-production, not just a hackathon prototype.
Built With
- edl
- json