Inspiration

I’m building an AI assistant for hands-on workers. A core capability is letting an agent highlight the exact object a user needs—right in their view—so guidance can be precise and truly hands-free.

What it does

Given a natural-language prompt from the AI agent (e.g., “highlight the Phillips screwdriver”), the system finds, labels, and tracks that object in real time, keeping a stable bounding box as it moves.

How we built it

  • Agent: Gemini Live for intent + dialogue.
  • Highlight tool (backend): YOLOv8 (candidates) → CLIP ViT-L/14 (semantic match) → OpenCV CSRT (smooth tracking).
  • Frontend: React/TypeScript, getUserMedia, canvas frames, normalized coords; ~20 FPS updates.
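Because the frontend and backend exchange boxes in normalized (0-1) coordinates, each side converts to and from pixel space for its own frame size. A minimal sketch of that conversion (function names here are illustrative, not the project's actual API):

```python
def to_normalized(box, frame_w, frame_h):
    """Convert a pixel-space (x, y, w, h) box to 0-1 normalized coordinates."""
    x, y, w, h = box
    return (x / frame_w, y / frame_h, w / frame_w, h / frame_h)

def to_pixels(norm_box, frame_w, frame_h):
    """Convert a normalized (x, y, w, h) box back to integer pixel coordinates."""
    nx, ny, nw, nh = norm_box
    return (round(nx * frame_w), round(ny * frame_h),
            round(nw * frame_w), round(nh * frame_h))
```

Normalized coordinates let the tracker run on a downscaled frame while the canvas overlay draws at full display resolution.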

Challenges we ran into

  • Grounding SAM-2 was too slow for live tracking on my setup.
  • Asking Gemini for boxes directly wasn’t accurate/stable enough.
  • React stale-closure state bugs and OpenCV tracker API differences across versions both needed workarounds.

Accomplishments that we're proud of

  • Generalized, prompt-based tracking that works on arbitrary objects, not just COCO classes. I haven't seen another system do live detection and tracking for arbitrary, open-vocabulary object prompts.
  • Low-latency multi-object tracking with clean handoff from language → vision → tracker.

What we learned

Combining focused tools (LLM for intent, detector for speed, CLIP for meaning, tracker for motion) beats any single model. Small tricks—like expanding crops and using caption-style prompts—boost reliability.
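To make those two tricks concrete, here is a minimal sketch (the function names and the 20% margin are our illustrative choices, not the project's exact values): expanding the detector's crop gives CLIP some surrounding context, and wrapping the label in a caption-style template matches the caption text CLIP was trained on.

```python
def expand_crop(box, frame_w, frame_h, margin=0.2):
    """Grow an (x, y, w, h) box by `margin` of its size on each side, clamped to the frame.

    A little context around the object tends to help CLIP score the crop.
    """
    x, y, w, h = box
    dx, dy = w * margin, h * margin
    x0 = max(0, x - dx)
    y0 = max(0, y - dy)
    x1 = min(frame_w, x + w + dx)
    y1 = min(frame_h, y + h + dy)
    return (x0, y0, x1 - x0, y1 - y0)

def caption_prompt(label):
    """Caption-style text ('a photo of a ...') typically matches CLIP's training data
    better than a bare noun."""
    return f"a photo of a {label}"
```

The crop expansion also clamps to the frame, so boxes near an edge stay valid for the tracker.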

What's next for R-Hat-Live

  • Lower latency and stronger trackers (e.g., OSTrack).
  • Deploy on my AR device so the agent can highlight objects in the user’s field of view, not just the camera feed.
  • Add re-ID and depth cues for more robust, 3D-aware guidance.
