Inspiration

Gen Z friends don't text "let's hang out." They send each other reels. A rooftop bar, a restaurant, a club night, a recipe, a trip idea — forwarded with no caption, or maybe just "we should do this." The friend replies "yes omg" and then absolutely nothing happens.

That reel disappears into the group chat. Buried under memes and voice notes and people asking what time they're meeting. The plan everyone genuinely wanted just dies. Every time.

"The problem isn't that we don't know what we want to do. We literally send it to each other every day. The problem is nothing ever converts."

And it's not just event reels. Political content, brainrot, dark humour, aesthetic videos, recipes, music — everything a group sends each other is actually taste data. It's a portrait of who they are as a group. It's just sitting in Instagram DMs doing nothing. Instagram's own data shows Reels DMs are now the most common interaction on the platform. Every one of those sends is an expression of intent. Nobody is doing anything with it.

We built Sendit to capture that signal.


What it does

Sendit is a shared board where your friend group drops reels from anywhere — Instagram, TikTok, YouTube Shorts. One person creates a board, shares a join code to the group chat, and everyone joins in one tap. Then whenever anyone sees something they'd want to do together, they share it to the board instead of letting it die in the chat.

The AI doesn't just read hashtags. It watches the video — extracting frames, reading descriptions, identifying venues, classifying content — and projects each reel into a 50-dimensional semantic vector that captures what the content is actually about. As reels accumulate, Sendit builds a living taste profile for the group: their dominant activity types, food preferences, aesthetic vibe, price comfort, and location patterns.

When the group is ready, Sendit surfaces one specific, reasoned plan suggestion — not a list of options. "Three of you have been sending Japanese dinner reels for two weeks. You're all free Saturday at 7. Here's a spot in Shoreditch. Who's in?" The group votes In, Maybe, or Out with a live commitment tally.

The core flow: Drop a reel (2 taps) → AI reads it → Group taste builds → Plan suggested → Night happens.

We turn the reels your friends send each other into nights you'll actually remember.


How we built it

Sendit is a collaborative activity recommendation platform where friend groups share social media content (Instagram Reels, TikTok videos, YouTube Shorts), build a collective taste profile, and receive AI-generated activity suggestions. The system's core innovation is a multi-modal embedding-like vector system that uses Google Gemini's vision+language capabilities to convert video content into structured 50-dimensional semantic fingerprints — without any traditional ML embedding libraries.

System Architecture

+--------------------+       +---------------------+       +-------------------+
|   Mobile Client    | <---> |  FastAPI Backend    | <---> |     Supabase      |
|   (Expo / React    |       |   (Python 3.13)     |       |   (PostgreSQL +   |
|    Native)         |       |                     |       |    Auth + REST)   |
+--------------------+       +---------------------+       +-------------------+
                                      |
                              +-------+-------+
                              |               |
                     +--------v--+    +-------v--------+
                     |  Gemini   |    |  Gemini 2.5    |
                     |  2.5 Flash|    |  Flash Lite    |
                     |  (Suggest)|    |  (Classify)    |
                     +-----------+    +----------------+

  • Frontend: Expo React Native (iOS, Android, Web) with Expo Router, Zustand state management, and D3-force physics for the blob graph visualisation
  • Backend: FastAPI with async httpx, deployed on Vercel as serverless functions
  • Database: Supabase PostgreSQL with REST API and RLS policies
  • AI: Two Google Gemini models at different tiers for classification and recommendation

The Three-Stage AI Pipeline

Stage 1: Multi-Platform Content Extraction — Platform-specific scrapers for Instagram (Googlebot UA + Open Graph), TikTok (embedded JSON + yt-dlp), and YouTube Shorts (storyboard spec + yt-dlp). Each video yields 8 evenly-spaced frames via ffmpeg, encoded as base64 data URLs.
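The frame-sampling step can be sketched like this. It's a simplified version of the pipeline — the function names and exact ffmpeg flags here are illustrative, not our production code:

```python
import base64
import subprocess

def frame_timestamps(duration_s: float, n_frames: int = 8) -> list[float]:
    """Evenly spaced sample points that skip the clip's first and last instants."""
    step = duration_s / (n_frames + 1)
    return [round(step * (i + 1), 2) for i in range(n_frames)]

def extract_frame_as_data_url(video_path: str, ts: float) -> str:
    """Grab a single JPEG frame at `ts` seconds and wrap it in a base64 data URL."""
    cmd = [
        "ffmpeg", "-ss", str(ts), "-i", video_path,
        "-frames:v", "1", "-f", "image2pipe", "-vcodec", "mjpeg",
        "-loglevel", "error", "pipe:1",
    ]
    jpeg = subprocess.run(cmd, capture_output=True, check=True).stdout
    return "data:image/jpeg;base64," + base64.b64encode(jpeg).decode()
```

Sampling by timestamp rather than frame index is what lets the same code path handle direct Instagram URLs, yt-dlp downloads, and local files alike.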

Stage 2: 50-Noun Semantic Vector Classification — Rather than using sentence-transformers or a vector database, we use Gemini Flash Lite's multi-modal understanding to project video content into a fixed 50-dimensional semantic space: 50 human-interpretable nouns (Food, Travel, Party, Culture, Beach, Music, etc.), each rated on the discrete scale {0, 0.25, 0.5, 0.75, 1.0}. This functions as an embedding without any ML infrastructure. When a reel is detected as event-oriented, a second pass with Google Search grounding enriches it with ticket prices and times.
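Server-side, each raw Gemini response gets clamped and snapped onto the five allowed levels before it is stored. A minimal sketch of that normalization (only six of the 50 nouns shown, taken from the examples above; the helper name is illustrative):

```python
ALLOWED = (0.0, 0.25, 0.5, 0.75, 1.0)
NOUNS = ("Food", "Travel", "Party", "Culture", "Beach", "Music")  # 6 of the 50

def normalize_vector(raw: dict) -> dict[str, float]:
    """Snap each noun's score to the nearest allowed level.
    Missing, non-numeric, or out-of-range values fall back to 0."""
    vec = {}
    for noun in NOUNS:
        score = raw.get(noun, 0.0)
        try:
            score = float(score)
        except (TypeError, ValueError):
            score = 0.0
        score = min(max(score, 0.0), 1.0)                        # clamp to [0, 1]
        vec[noun] = min(ALLOWED, key=lambda a: abs(a - score))   # quantize
    return vec
```

Because every stored vector passes through this function, two slightly different LLM responses for the same video usually collapse onto the same quantized fingerprint.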

Stage 3: Context-Aware Recommendation — Gemini Flash builds a comprehensive prompt from the group's taste profile, swipe-weighted reel history (liked reels weighted higher), calendar availability, and a 14-day lookahead window. Returns specific, actionable suggestions with venue names, dates, prices, and booking links.
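The swipe weighting can be illustrated with a small helper that folds reel vectors into a group profile — the weight values below are placeholders, not our actual tuning:

```python
SWIPE_WEIGHTS = {"liked": 2.0, "neutral": 1.0, "disliked": 0.5}  # illustrative values

def weighted_profile(reels: list[dict]) -> dict[str, float]:
    """Swipe-weighted mean over reel vectors: liked reels pull the group
    profile harder than neutral ones; disliked reels barely register."""
    totals: dict[str, float] = {}
    total_w = 0.0
    for reel in reels:
        w = SWIPE_WEIGHTS.get(reel.get("swipe", "neutral"), 1.0)
        total_w += w
        for noun, score in reel["vector"].items():
            totals[noun] = totals.get(noun, 0.0) + w * score
    return {n: round(s / total_w, 3) for n, s in totals.items()} if total_w else {}
```

The resulting profile is what gets serialized into the Gemini Flash prompt alongside availability and the lookahead window.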


Challenges we ran into

  • Instagram scraping is hostile to automation. Instagram blocks most automated requests. We discovered that using a Googlebot User-Agent bypasses this and returns Open Graph metadata reliably — a non-obvious workaround that took trial and error.
  • TikTok embeds its data in deeply nested JavaScript. The __UNIVERSAL_DATA_FOR_REHYDRATION__ JSON blob buried in script tags was our only reliable extraction path. Parsing it required careful regex and fallback handling when the structure changed between video types.
  • Video frame extraction at scale. We needed to extract 8 representative frames from each video for Gemini's multi-modal classification. Getting ffmpeg to work reliably across platforms (Instagram direct URLs, yt-dlp downloaded files, YouTube storyboard sheets) required three different extraction strategies.
  • Gemini model deprecation mid-hackathon. We started with gemini-2.0-flash which returned 404 errors — it had been deprecated. Switching to gemini-2.5-flash and gemini-2.5-flash-lite fixed it, but cost us debugging time.
  • Making the embedding vector consistent. LLMs are non-deterministic. Two calls classifying the same restaurant video could return different rating values. We solved this by quantizing the 50-noun vector to only 5 allowed values ({0, 0.25, 0.5, 0.75, 1.0}) and enforcing normalization server-side, which dramatically improved consistency.
  • Swipe data not reaching the backend. Personalised suggestions were stuck loading because the frontend was storing swipe history locally but never passing liked/disliked reel IDs to the suggestion generation endpoint. Closing that gap between client-side state and server-side prompt building was a subtle but critical fix.
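The Googlebot workaround from the first bullet boils down to a single header. A minimal sketch (shown with stdlib urllib for self-containment — the real backend uses async httpx — and the regex assumes `property` precedes `content` in each tag):

```python
import re
import urllib.request

GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")

def parse_open_graph(html: str) -> dict[str, str]:
    """Pull og:* properties out of a page's meta tags."""
    pattern = r'<meta[^>]+property="og:([^"]+)"[^>]+content="([^"]*)"'
    return dict(re.findall(pattern, html))

def fetch_open_graph(url: str) -> dict[str, str]:
    """Fetch the page while identifying as Googlebot, then parse its OG tags."""
    req = urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_open_graph(resp.read().decode("utf-8", "replace"))
```

Instagram serves crawlers the Open Graph metadata it hides from ordinary bots, so the caption, thumbnail, and video URL all come back from this one request.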

Accomplishments that we're proud of

  • The 50-noun semantic vector system. Instead of adding sentence-transformers, FAISS, or a vector database, we designed a system where Gemini projects video content into a fixed 50-dimensional space using human-interpretable nouns (Food, Travel, Party, Culture, etc.). It works like an embedding without any ML infrastructure — just a carefully crafted prompt and a normalization function.
  • Multi-modal classification that actually watches the video. Our pipeline sends 8 extracted video frames alongside the text description to Gemini. The AI sees the visual content — not just metadata. This means a reel of a rooftop bar with no caption still gets classified correctly.
  • Cross-platform scraping from 3 platforms. Instagram Reels, TikTok, and YouTube Shorts all land on the same board and get processed identically. Each platform required a completely different extraction strategy, but the user experience is uniform.
  • Two-pass event enrichment with Google Search. When Gemini detects an event-oriented location, a second pass uses Google Search grounding to fetch real ticket prices and event times — turning a vague "this looks fun" reel into an actionable card with booking information.
  • End-to-end AI pipeline built by 4 people. From scraping to classification to taste profiling to suggestion generation — the entire loop works. A group can paste URLs, see them classified, build a taste profile, and receive specific activity suggestions with venue names, dates, and prices.
  • The blob graph. A D3-force physics simulation that visualises the group's content as organic, tappable bubbles clustered by classification. It makes the AI's understanding of the group feel tangible and alive.

What we learned

  • LLMs can replace traditional embeddings for structured classification. The conventional approach would be sentence-transformers + cosine similarity + a vector database. We discovered that a well-prompted multi-modal LLM can produce structured, interpretable vectors that serve the same purpose — with zero infrastructure overhead and the bonus of vision understanding.
  • Group taste is a fundamentally different signal. Every recommendation engine builds profiles for individuals. Building one for a friend group — the intersection of taste, not the union — produces a qualitatively different kind of suggestion. "You might like this" vs "three of you have been sending this exact type of content."
  • Quantization makes LLM outputs reliable. Constraining Gemini to 5 discrete rating levels instead of continuous floats made the classification system dramatically more consistent. The same video classified twice produces nearly identical vectors.
  • The two-model strategy matters for cost. Using gemini-2.5-flash-lite for high-volume per-reel classification and gemini-2.5-flash for low-volume creative suggestion generation was the right split — cheaper where it counts, more capable where it matters.
  • Scraping social platforms is an arms race. Every platform has different anti-bot measures, different data structures, and different failure modes. Building resilient scrapers with proper fallbacks is harder than the AI part.
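The intersection-vs-union distinction above is easy to show on toy vectors — this is an illustration of the idea, not our production profiler:

```python
def taste_intersection(profiles: list[dict[str, float]]) -> dict[str, float]:
    """Element-wise minimum over member vectors: a noun scores high only
    if every member rates it high — shared taste, not pooled taste."""
    nouns = set().union(*profiles)
    return {n: min(p.get(n, 0.0) for p in profiles) for n in nouns}

def taste_union(profiles: list[dict[str, float]]) -> dict[str, float]:
    """Element-wise maximum: anything anyone likes — noisier, less actionable."""
    nouns = set().union(*profiles)
    return {n: max(p.get(n, 0.0) for p in profiles) for n in nouns}
```

A suggestion built on the intersection is something the whole group will actually commit to; one built on the union just reproduces the noisiest member's feed.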

What's next for Sendit

  • Calendar integration — Each member links their Google Calendar privately. The app computes when everyone's free and only suggests plans within those windows. Nobody sees each other's events — just a free/busy mask.
  • Commitment nudges — If 3 of 5 members confirm and 2 haven't, those 2 get a private nudge from the app: "3 of your friends have already confirmed for Saturday." Social pressure without public embarrassment.
  • Memory pages — When a plan actually happens, Sendit creates an event page pre-filled with the reel that started it. Everyone drops photos and one-line memories. An AI-written chapter narrates the night. Over time, it becomes a scrollable timeline of every night the group had together.
  • Group manifesto — An AI-written character study of the group based on their taste profile. "The Chaotic Intellectuals. Equally likely to send you a political essay or a Shrek meme." The screenshot that gets forwarded to the group about itself.
  • Native share sheet — Sendit appears alongside WhatsApp in Instagram's share options. Two taps to share a reel to your board, identical to forwarding it to a friend. Zero new behaviour.
  • Venue monetisation — Once Sendit has taste graphs for enough groups, venues and promoters pay to be surfaced as suggestions to the right group at the right time. Not ads — native, high-intent referrals.

Built With

d3, expo, fastapi, ffmpeg, gemini, python, react-native, supabase, vercel, yt-dlp, zustand