Sift is a high-performance local embedding and retrieval engine designed for instant multimodal semantic search across text, images, audio, and video.
The backbone of Sift is Qwen3-VL-Embedding-2B, which natively handles text, images, and video.
- Unified Vector Space: All modalities are mapped into a shared 2048-dimensional space.
Unified search for audio is achieved through a CLAP-to-Qwen adapter:
- Audio Backbone: `laion/clap-htsat-unfused` (frozen).
- Projection Head: A learned 2-layer MLP (512 → 2048) that aligns CLAP's audio embeddings with Qwen's vision-text space.
- Training: Optimized using Contrastive InfoNCE loss on the AudioSetCaps dataset to relate audio features and textual/visual concepts.
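The alignment objective above can be sketched in a few lines. This is a pure-Python toy of the InfoNCE loss over a batch of paired embeddings, not the project's actual training code; real dimensions (512 → 2048) and the PyTorch implementation are assumed to differ.

```python
import math

def info_nce_loss(audio_embs, text_embs, temperature=0.07):
    """Toy InfoNCE: audio_embs[i] and text_embs[i] form a positive pair;
    every other pairing in the batch acts as a negative."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    losses = []
    for i, a in enumerate(audio_embs):
        logits = [cosine(a, t) / temperature for t in text_embs]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])  # -log softmax of the positive
    return sum(losses) / len(losses)
```

Correctly aligned pairs drive the loss toward zero; mismatched pairs inflate it, which is what pushes the projection head to place audio next to matching text/visual concepts.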
- OCR Chain: Uses EasyOCR to extract text from images, which is then embedded via Qwen for semantic search.
- Transcription Chain: Uses Faster-Whisper to generate transcripts from audio, which are also embedded via Qwen to provide text-based retrieval of audio segments.
Sift includes a real-time filesystem monitor based on the watchdog library. It automatically detects new, modified, or moved files within your monitored directories and indexes them instantly.
- Initial Scan: On startup, the daemon performs a full scan to catch any changes that occurred while it was offline.
- Event-Driven: Uses OS-level file system events (`inotify` on Linux, `FSEvents` on macOS) for efficient, low-overhead monitoring.
- Hidden File Filtering: Automatically ignores hidden files and directories (e.g., those starting with `.`).
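The hidden-file filtering can be illustrated with a small path predicate. The helper name `is_hidden` is illustrative, not necessarily the daemon's actual function; the real check lives inside the watchdog event handler.

```python
from pathlib import Path

def is_hidden(path: str) -> bool:
    """True if any component of the path is a dot-prefixed (hidden) entry,
    so files inside hidden directories are skipped too."""
    return any(part.startswith(".") and part not in (".", "..")
               for part in Path(path).parts)
```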
Sift uses an incremental indexing strategy to minimize redundant processing and maximize performance:
- Change Detection: Uses BLAKE3 hashing to track file modifications and skip unchanged files.
- Vector Database: Integrated with Qdrant for high-speed similarity search and persistent storage.
- Pipelines: Files are automatically routed to appropriate processing chains based on MIME type and file extension.
- Source Mapping: Maintains strict mapping between source paths and vector IDs to allow for re-indexing and deletion.
- Multimodal Retrieval: Search by natural language to find related text, images, audio recordings, or video clips in a single query.
- Result Bundling: Groups similar snippets or related files using a hybrid scoring system that considers embedding similarity, temporal proximity, and filename Jaccard similarity.
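The hybrid bundling score can be sketched as a weighted sum of the three signals above. The weights and the exponential time decay below are illustrative assumptions, not the project's actual constants.

```python
import re

def filename_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase alphanumeric tokens of two filenames."""
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def bundle_score(emb_sim, time_gap_s, name_a, name_b,
                 w_emb=0.6, w_time=0.2, w_name=0.2, half_life_s=3600.0):
    """Hypothetical hybrid score combining embedding similarity,
    temporal proximity (exponential decay), and filename overlap."""
    time_score = 0.5 ** (time_gap_s / half_life_s)
    return (w_emb * emb_sim
            + w_time * time_score
            + w_name * filename_jaccard(name_a, name_b))
```

Two files captured minutes apart with near-identical names (`trip_photo_01.jpg`, `trip_photo_02.jpg`) score high on two of the three signals even when their embeddings differ, which is what pulls them into one bundle.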
Sift is configured via a `config.json` file located in your user's configuration directory:

- Linux: `~/.config/sift/config.json`
- macOS: `~/Library/Application Support/sift/config.json`
- Windows: `%APPDATA%\sift\config.json`
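Resolving that per-platform location is straightforward with the stdlib; the helper below is a sketch, not Sift's actual resolver (which may use a library such as `platformdirs`).

```python
import os
from pathlib import Path

def config_path(system: str) -> Path:
    """Return the platform-specific config.json location listed above.
    `system` is the value of platform.system()."""
    if system == "Windows":
        return Path(os.environ.get("APPDATA", "")) / "sift" / "config.json"
    if system == "Darwin":  # macOS
        return Path.home() / "Library" / "Application Support" / "sift" / "config.json"
    return Path.home() / ".config" / "sift" / "config.json"  # Linux and others
```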
You can specify which folders Sift should index by modifying the `monitored_directories` list in your `config.json`:
```json
{
  "monitored_directories": [
    "/home/user/Documents",
    "/home/user/Pictures",
    "/home/user/Videos/clips"
  ]
}
```

If no configuration exists, Sift creates this file by default and points it at a `trusted/` folder within the project directory (intended for simple tests).
- `src/daemon.py`: The unified entry point that preloads models, starts the indexer, and launches the UI.
- `src/indexer/`: Core indexing logic, file routing, and database interaction.
- `src/embed/`: Multimodal embedding pipelines (Qwen3-VL, CLAP, Whisper).
- `src/search/`: Similarity search engine and result bundling logic.
- `src/ui/`: PySide6-based desktop application.
- `models/`: Storage for pre-trained model weights.
- `tests/`: Comprehensive test suite for pipelines and components.
The project includes a futuristic, minimalist desktop application built with PySide6:
- Bundle-Centric Results: Search results are grouped into up to three top-ranked bundles plus a recognized-entities list.
- `Alt + Space`: Toggle the search bar visibility (global shortcut).
- `Esc`: Hide the search bar and clear the current query.
- `Enter`: Execute a search or open the currently selected file.
- Arrow Keys: Navigate through result bundles and file lists.
- Python 3.12 (managed via `uv`)
- Docker (required for running the Qdrant vector database)
Clone the repository and install dependencies using uv:

```shell
uv sync
```

Download the required pre-trained weights into the `models/` directory:

```shell
uv run python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='Qwen/Qwen3-VL-Embedding-2B', local_dir='models/Qwen3-VL-Embedding-2B')"
```

Run the Qdrant vector database locally using Docker:
```shell
docker pull qdrant/qdrant
mkdir -p qdrant_storage
docker run -p 6333:6333 -p 6334:6334 \
  -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
  qdrant/qdrant
```

Alternatively, use Podman if you prefer it on your system.
Install the following libraries to support the Qt-based UI on Linux systems:

```shell
sudo apt update
sudo apt install -y libgl1 libegl1 libdbus-1-3 libxkbcommon-x11-0 \
  libxcb-cursor0 libxcb-icccm4 libxcb-image0 libxcb-keysyms1 \
  libxcb-randr0 libxcb-render-util0 libxcb-shape0 libxcb-xfixes0 \
  libxcb-xinerama0 libx11-xcb1 libxrender1 libxi6 libxcomposite1 libxtst6
```

To perform a one-time indexing of files in your configured directories:

```shell
uv run python -m src.indexer.run_indexer
```

To run the main daemon, which preloads the shared Qwen model, performs the startup scan, watches for new and modified files, and then launches the desktop UI process:

```shell
uv run python -m src.daemon
```

To keep the daemon running in the background after you close your terminal, use `nohup`:

```shell
nohup uv run python -m src.daemon > daemon.log 2>&1 &
```

Current startup order for the unified daemon:
1. Preload the shared Qwen embedder once.
2. Perform the initial indexing scan of monitored directories.
3. Start the watchdog observer for new and modified files.
4. Launch the PySide6 UI process.
Notes:
- The unified daemon is intended to share the in-process Qwen singleton between indexing and search.
- On Linux/WSL, desktop UI visibility and global hotkey behavior may depend on the host desktop environment and current UI implementation.
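The startup sequence reduces to a simple orchestration skeleton; the function names below are placeholders, not the actual `src.daemon` API.

```python
def run_daemon(preload, scan, start_watcher, launch_ui):
    """Run the documented startup steps strictly in order."""
    preload()        # 1. load the shared Qwen embedder once
    scan()           # 2. initial indexing scan of monitored directories
    start_watcher()  # 3. watchdog observer for new/modified files
    launch_ui()      # 4. PySide6 UI process
```

Because the embedder is loaded before either consumer starts, the indexer and the search UI can share the same in-process singleton instead of loading the model twice.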
If you only want the legacy indexing-only watcher without the UI process attached:

```shell
uv run python -m src.indexer.daemon
```

To launch the interactive desktop application:

```shell
uv run python main.py
```

To run a simple CLI-based search:

```shell
uv run python main.py --cli
```

- Prepare Subset: `uv run python -m src.embed.train.prepare_subset` (prepares training data).
- Fetch Audio: `uv run python -m src.embed.train.fetch_yt_sample` (downloads training samples).
- Train Loop: `uv run python -m src.embed.train.train_loop` (starts the alignment training).


