Inspiration
I’m building an AI assistant for hands-on workers. A core capability is letting an agent highlight the exact object a user needs—right in their view—so guidance can be precise and truly hands-free.
What it does
Given a natural-language prompt from the AI agent (e.g., “highlight the Phillips screwdriver”), the system finds, labels, and tracks that object in real time, keeping a stable bounding box as it moves.
How we built it
- Agent: Gemini Live for intent + dialogue.
- Highlight tool (backend): YOLOv8 (candidates) → CLIP ViT-L/14 (semantic match) → OpenCV CSRT (smooth tracking).
- Frontend: React/TypeScript; getUserMedia capture, canvas frames, normalized coordinates; ~20 FPS updates.
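The backend chain above can be sketched end-to-end. This is a minimal, hedged sketch, not the exact implementation: the weights file (`yolov8n.pt`), the caption phrasing, and the helper names are illustrative, and the heavy imports are lazy so the scoring helpers stay pure Python.

```python
import math

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_candidates(text_vec, crop_vecs):
    """Index of the crop embedding closest to the text embedding."""
    sims = [cosine(text_vec, v) for v in crop_vecs]
    return max(range(len(sims)), key=sims.__getitem__)

def highlight(frame_bgr, prompt):
    """YOLOv8 candidates -> CLIP semantic match -> CSRT tracker (sketch)."""
    import torch, clip, cv2                 # pip: torch, clip, opencv-contrib-python
    from PIL import Image
    from ultralytics import YOLO

    # 1) Detector proposes candidate boxes (class labels are ignored).
    boxes = YOLO("yolov8n.pt")(frame_bgr)[0].boxes.xyxy.int().tolist()

    # 2) CLIP scores each crop against a caption-style prompt.
    model, preprocess = clip.load("ViT-L/14")
    with torch.no_grad():
        text_vec = model.encode_text(
            clip.tokenize([f"a photo of a {prompt}"]))[0].numpy()
        crop_vecs = []
        for x1, y1, x2, y2 in boxes:
            crop = Image.fromarray(frame_bgr[y1:y2, x1:x2, ::-1])  # BGR -> RGB
            crop_vecs.append(
                model.encode_image(preprocess(crop)[None])[0].numpy())

    # 3) Hand the winning box to CSRT for smooth per-frame tracking.
    x1, y1, x2, y2 = boxes[rank_candidates(text_vec, crop_vecs)]
    tracker = cv2.TrackerCSRT_create()
    tracker.init(frame_bgr, (x1, y1, x2 - x1, y2 - y1))  # (x, y, w, h)
    return tracker
```

After init, each incoming frame only needs the cheap `tracker.update(frame)` call; the detector and CLIP re-run only when a new prompt arrives or tracking is lost.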
Challenges we ran into
- Grounded SAM 2 was too slow for live tracking on my setup.
- Asking Gemini for boxes directly wasn’t accurate/stable enough.
- Fixed React stale-state issues and handled OpenCV tracker API differences.
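The OpenCV tracker API difference comes from the constructors moving under `cv2.legacy` in newer opencv-contrib builds. A small dispatch helper handles both layouts; it takes the module object as a parameter (an illustrative choice) so the lookup logic is testable without OpenCV installed.

```python
def create_csrt(cv2_mod):
    """Return a CSRT tracker instance across OpenCV API layouts.

    Older builds expose cv2.TrackerCSRT_create at top level; newer
    opencv-contrib builds move it under cv2.legacy.
    """
    if hasattr(cv2_mod, "TrackerCSRT_create"):
        return cv2_mod.TrackerCSRT_create()
    legacy = getattr(cv2_mod, "legacy", None)
    if legacy is not None and hasattr(legacy, "TrackerCSRT_create"):
        return legacy.TrackerCSRT_create()
    raise RuntimeError("CSRT tracker not found; install opencv-contrib-python")
```

Usage is just `tracker = create_csrt(cv2)` wherever a tracker is spawned.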
Accomplishments that we're proud of
- Generalized, prompt-based tracking that works on arbitrary objects, not just COCO classes. We haven't seen another system do live detection and tracking for open-vocabulary objects like this.
- Low-latency multi-object tracking with clean handoff from language → vision → tracker.
What we learned
Combining focused tools (LLM for intent, detector for speed, CLIP for meaning, tracker for motion) beats any single model. Small tricks—like expanding crops and using caption-style prompts—boost reliability.
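Both tricks are small enough to show concretely. A sketch, with the expansion factor and helper names as illustrative assumptions (the post doesn't state the exact values):

```python
def expand_box(box, frame_w, frame_h, scale=1.3):
    """Grow an (x1, y1, x2, y2) box around its center so the CLIP crop
    includes some surrounding context, clamped to the frame bounds.
    scale=1.3 is an illustrative default, not the project's exact value."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    return (max(0, int(cx - w / 2)), max(0, int(cy - h / 2)),
            min(frame_w, int(cx + w / 2)), min(frame_h, int(cy + h / 2)))

def caption_prompt(label):
    """Caption-style phrasing tends to match CLIP's training data better
    than a bare class label."""
    return f"a photo of a {label}"
```

Feeding `caption_prompt("Phillips screwdriver")` and the expanded crop into the CLIP scorer is what made matches reliable on small or partially occluded objects.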
What's next for R-Hat-Live
- Lower latency and stronger trackers (e.g., OSTrack).
- Deploy on my AR device so the agent can highlight objects in the user’s field of view, not just the camera feed.
- Add re-ID and depth cues for more robust, 3D-aware guidance.
Built With
- clip
- gemini
- opencv
- python
- react
- typescript
- yolov8