Inspiration
I’m building an AI assistant for hands-on workers. A core capability is letting an agent highlight the exact object a user needs—right in their view—so guidance can be precise and truly hands-free.
What it does
Given a natural-language prompt from the AI agent (e.g., “highlight the Phillips screwdriver”), the system finds, labels, and tracks that object in real time, keeping a stable bounding box as it moves.
How we built it
- Agent: Gemini Live for intent + dialogue.
- Highlight tool (backend): YOLOv8 (candidates) → CLIP ViT-L/14 (semantic match) → OpenCV CSRT (smooth tracking).
- Frontend: React/TypeScript; getUserMedia capture, canvas frames, normalized coordinates; ~20 FPS updates.
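The backend chain above can be sketched end-to-end. This is a minimal, hedged sketch, not the exact implementation: the weights file (`yolov8n.pt`), the caption phrasing, and the helper names are illustrative, and the heavy imports are lazy so the scoring helpers stay pure Python.

```python
import math

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_candidates(text_vec, crop_vecs):
    """Index of the crop embedding closest to the text embedding."""
    sims = [cosine(text_vec, v) for v in crop_vecs]
    return max(range(len(sims)), key=sims.__getitem__)

def highlight(frame_bgr, prompt):
    """YOLOv8 candidates -> CLIP semantic match -> CSRT tracker (sketch)."""
    import torch, clip, cv2                 # pip: torch, clip, opencv-contrib-python
    from PIL import Image
    from ultralytics import YOLO

    # 1) Detector proposes candidate boxes (class labels are ignored).
    boxes = YOLO("yolov8n.pt")(frame_bgr)[0].boxes.xyxy.int().tolist()

    # 2) CLIP scores each crop against a caption-style prompt.
    model, preprocess = clip.load("ViT-L/14")
    with torch.no_grad():
        text_vec = model.encode_text(
            clip.tokenize([f"a photo of a {prompt}"]))[0].numpy()
        crop_vecs = []
        for x1, y1, x2, y2 in boxes:
            crop = Image.fromarray(frame_bgr[y1:y2, x1:x2, ::-1])  # BGR -> RGB
            crop_vecs.append(
                model.encode_image(preprocess(crop)[None])[0].numpy())

    # 3) Hand the winning box to CSRT for smooth per-frame tracking.
    x1, y1, x2, y2 = boxes[rank_candidates(text_vec, crop_vecs)]
    tracker = cv2.TrackerCSRT_create()
    tracker.init(frame_bgr, (x1, y1, x2 - x1, y2 - y1))  # (x, y, w, h)
    return tracker
```

After init, each incoming frame only needs the cheap `tracker.update(frame)` call; the detector and CLIP re-run only when a new prompt arrives or tracking is lost.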
Challenges we ran into
- Grounded SAM 2 was too slow for live tracking on my setup.
- Asking Gemini for boxes directly wasn’t accurate/stable enough.
- Fixed React stale-state issues and handled OpenCV tracker API differences.
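The OpenCV tracker API difference comes from the constructors moving under `cv2.legacy` in newer opencv-contrib builds. A small dispatch helper handles both layouts; it takes the module object as a parameter (an illustrative choice) so the lookup logic is testable without OpenCV installed.

```python
def create_csrt(cv2_mod):
    """Return a CSRT tracker instance across OpenCV API layouts.

    Older builds expose cv2.TrackerCSRT_create at top level; newer
    opencv-contrib builds move it under cv2.legacy.
    """
    if hasattr(cv2_mod, "TrackerCSRT_create"):
        return cv2_mod.TrackerCSRT_create()
    legacy = getattr(cv2_mod, "legacy", None)
    if legacy is not None and hasattr(legacy, "TrackerCSRT_create"):
        return legacy.TrackerCSRT_create()
    raise RuntimeError("CSRT tracker not found; install opencv-contrib-python")
```

Usage is just `tracker = create_csrt(cv2)` wherever a tracker is spawned.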
Accomplishments that we're proud of
- Generalized, prompt-based tracking that works on arbitrary objects, not just COCO classes. We haven't seen another system do live detection and tracking for open-vocabulary objects like this.
- Low-latency multi-object tracking with clean handoff from language → vision → tracker.
What we learned
Combining focused tools (LLM for intent, detector for speed, CLIP for meaning, tracker for motion) beats any single model. Small tricks—like expanding crops and using caption-style prompts—boost reliability.
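Both tricks are small enough to show concretely. A sketch, with the expansion factor and helper names as illustrative assumptions (the post doesn't state the exact values):

```python
def expand_box(box, frame_w, frame_h, scale=1.3):
    """Grow an (x1, y1, x2, y2) box around its center so the CLIP crop
    includes some surrounding context, clamped to the frame bounds.
    scale=1.3 is an illustrative default, not the project's exact value."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    return (max(0, int(cx - w / 2)), max(0, int(cy - h / 2)),
            min(frame_w, int(cx + w / 2)), min(frame_h, int(cy + h / 2)))

def caption_prompt(label):
    """Caption-style phrasing tends to match CLIP's training data better
    than a bare class label."""
    return f"a photo of a {label}"
```

Feeding `caption_prompt("Phillips screwdriver")` and the expanded crop into the CLIP scorer is what made matches reliable on small or partially occluded objects.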
What's next for R-Hat-Live
- Lower latency and stronger trackers (e.g., OSTrack).
- Deploy on my AR device so the agent can highlight objects in the user’s field of view, not just the camera feed.
- Add re-ID and depth cues for more robust, 3D-aware guidance.
Built With
- clip
- gemini
- opencv
- python
- react
- typescript
- yolov8