Inspiration
Searching for files on your system is tiring. Spending 15 minutes digging through your Screenshots directory to find a screenshot of a QR code is beyond frustrating. Files often end up saved under completely non-descriptive names (external downloads, screenshots), and you never bother to rename them, until you need to find one a year down the line.
We wanted to build search that felt natural: something that searches for what you remember about a file, not its name or metadata. Modern workflows scatter information across text files, screenshots, voice notes, and videos, but the default search tools on Windows and Linux can't handle anything beyond basic keyword and file-type rules (and Spotlight search doesn't seem to work half the time).
You shouldn't have to meticulously reorganise your files after downloading meeting notes or a textbook just so the search app can find them. Sift was built to let the computer sift through the records for you as you go about your work.
What it actually is
Sift is a local-first multimodal memory engine that lets you search across text, images, audio, and video using natural language.
- Semantic Search: Describe what you're looking for ("that clip of someone dancing at a masquerade") and Sift retrieves relevant files.
- Unified Embedding Space: All modalities are mapped into a shared vector space, enabling true cross-modal retrieval.
- Audio Understanding: Finds audio by meaning via transcription and learned alignment.
- OCR & Visual Search: Extracts and understands text inside images.
- Result Bundling: Groups related results into coherent clusters instead of showing noisy flat lists.
- Local & Fast: Runs entirely on-device with zero cloud dependency.
How we built it
Sift is built around the idea of mapping every type of file into a common embedding space, so that text can be compared with audio, images, and video (and, of course, other text documents).
Core Model
We use Qwen3-VL-Embedding-2B as the backbone to embed text, images, and video (visual frames only) into a shared 2048-dimensional vector space. Since the Qwen model does not support audio natively, we designed a "bridge" that maps audio models into the same embedding space.
Whisper Chain: Use Whisper to transcribe the speech in an audio clip, then pass the transcript through Qwen's text encoder to get an accurate embedding for the clip's spoken-word content.
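The Whisper chain reduces to "transcribe, then embed like any other text". A minimal sketch, with the transcriber and text embedder passed in as callables (both are stand-ins here; with openai-whisper the transcriber would be `whisper.load_model(...).transcribe(path)["text"]`, and the embedder would be the Qwen3-VL text encoder):

```python
def embed_audio_speech(audio_path, transcribe, embed_text):
    """Map an audio clip into the shared text embedding space via its transcript.

    transcribe: callable(path) -> str, e.g. a Whisper wrapper (assumed).
    embed_text: callable(str) -> vector, e.g. the Qwen text encoder (assumed).
    """
    transcript = transcribe(audio_path)  # speech -> text
    return embed_text(transcript)        # text -> shared embedding space
```

Because the output lives in the same space as document embeddings, a text query can retrieve audio clips with no audio-specific search logic.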
Fine-tuned CLAP with projection: Built a projection layer that maps the CLAP model's output into the Qwen embedding space, trained on a subset of the AudioSetCaps dataset whose captions describe the nature of the audio rather than the speech content (complementing the Whisper chain).
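To illustrate the idea of the projection bridge, here is a deliberately simplified toy: fitting a single linear map from a "CLAP-like" space into a "Qwen-like" space by least squares. The real projection head was trained on AudioSetCaps pairs and the real dimensions are larger (CLAP's output projected into the 2048-dim Qwen space); the tiny noiseless random data below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
clap_dim, qwen_dim, n_pairs = 8, 16, 64   # toy sizes, not the real ones

A = rng.normal(size=(n_pairs, clap_dim))  # stand-in "CLAP audio embeddings"
W_true = rng.normal(size=(clap_dim, qwen_dim))
T = A @ W_true                            # stand-in "Qwen text embeddings" targets

# Fit W minimizing ||A W - T||^2 in closed form.
W, *_ = np.linalg.lstsq(A, T, rcond=None)

projected = A @ W                         # audio mapped into the shared space
```

In practice the trained head can be nonlinear and is optimized with gradient descent on caption/audio pairs, but the goal is the same: make projected audio vectors comparable to Qwen text vectors under cosine similarity.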
Other Pipelines
- OCR via EasyOCR → embedded as text
- Metadata generation for fallback retrieval
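The OCR pipeline follows the same "reduce to text, then embed" pattern as the Whisper chain. A minimal sketch with the OCR reader and embedder injected as callables (stand-ins; with EasyOCR the reader would be along the lines of `easyocr.Reader(["en"]).readtext(path, detail=0)`, which returns the recognized strings):

```python
def embed_image_text(image_path, read_text, embed_text):
    """OCR an image, then embed the recovered text in the shared space.

    read_text:  callable(path) -> list[str] of recognized fragments (assumed).
    embed_text: callable(str) -> vector in the shared space (assumed).
    """
    fragments = read_text(image_path)
    return embed_text(" ".join(fragments))
```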
Indexing System
- Background daemon that watches for content changes in files using BLAKE3 hashing
- Vector DB storage and retrieval via Qdrant
- File-type-based routing into specialized pipelines
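The daemon's core loop boils down to two decisions: has this file's content changed, and which pipeline should handle it? A minimal sketch, using stdlib `hashlib.blake2b` as a stand-in for BLAKE3 (the real thing needs the third-party `blake3` package); the route names are illustrative:

```python
import hashlib
from pathlib import Path

ROUTES = {".txt": "text", ".md": "text", ".png": "image", ".jpg": "image",
          ".wav": "audio", ".mp3": "audio", ".mp4": "video"}

def content_hash(path: Path) -> str:
    """Hash file contents in chunks so large files don't load into RAM."""
    h = hashlib.blake2b(digest_size=32)  # stand-in for BLAKE3
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_reindex(path: Path, seen: dict) -> bool:
    """Re-embed only when the content hash differs from the last indexed one."""
    digest = content_hash(path)
    if seen.get(path) == digest:
        return False
    seen[path] = digest
    return True

def route(path: Path) -> str:
    """Pick a pipeline by extension; unknown types get metadata-only indexing."""
    return ROUTES.get(path.suffix.lower(), "metadata-only")
```

Hashing content rather than trusting modification times means a renamed or touched-but-unchanged file never triggers a wasteful re-embedding pass.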
Search & Ranking
- Hybrid scoring combining:
  - embedding similarity
  - time-based proximity
  - filename similarity (Jaccard)
Interface
- Built a minimal PySide6 desktop UI (using the Qt framework)
- Inspired by Quake-style drop-down terminals
Challenges we ran into
Cross-modal alignment: Audio embeddings don't naturally align with vision-language models. Bridging CLAP to Qwen required designing and training a projection head that preserved semantic meaning and complemented the other pipelines.
Dataset building: Assembling the dataset by scraping YouTube links to download clips was painful, especially while dodging rate limits and anti-bot measures. We couldn't extract the entire dataset, but with enough tricks to avoid getting rate-limited or suspended we managed to grab a large subset.
Latency vs. quality tradeoffs: Running large models locally meant carefully managing RAM usage (especially since the app keeps a background daemon running continuously), batching, and caching to keep the system responsive.
Result relevance: Raw vector similarity alone produced noisy outputs. We had to design a hybrid ranking and bundling system to make results actually useful.
What's next for Sift
We aim for Sift to be a productivity tool for power users and enthusiasts who spend much of their time at a computer and can use the app to manage their files effectively.
- Better personalization: Learn user-specific relevance over time
- Optional cloud sync for cross-device memory (and monetisation 🤑)
- Plugin system for custom pipelines (e.g., PDFs, code, emails)
- Smarter and faster models (maybe adding a reranker to the search bundling pipeline)
