Inspiration

Jack has some serious ADHD, and he often finds himself getting lost when reading the dense lines of logic found in proof-based mathematics. He often asks LLMs for help, but that bit of friction between looking down at notes and looking back up at LLMs is enough to disrupt his focus, especially when his roommates are playing "slap card" right outside his door. If only he could have an augmented reality interface to help him study!

What it does

GeminEye is an AR homework tutor that turns any camera into an intelligent study companion. Point your camera at handwritten notes, and the system detects the paper in real-time, segments it into regions (equations, diagrams, text blocks), and indexes everything with visual and text embeddings. Students can then ask questions grounded in their actual notes — by voice through the Gemini Live API, by pointing at a specific equation with their finger (point-to-ask; though this feature broke when, at the last minute, Jack realized he needed SSL encryption and a domain!), or by typing in a chat. The system renders answers as AR overlays directly on the paper: sticky-note "petals" anchored to the page, variable definition panels, and full LaTeX-rendered explanations — all perspective-matched to the camera view. We also added multi-device streaming, so a separate IoT-enabled device can serve as the camera while a laptop displays the AR experience.

How we built it

A Next.js 16 frontend proxies API calls to a FastAPI backend. The backend lazy-loads GPU models on demand — DocLayout-YOLO for layout detection, GOT-OCR2 for math/handwriting OCR, ColQwen2 for visual embeddings, and all-MiniLM-L6-v2 for text embeddings — totaling ~8.4 GB VRAM on the 24 GB L4 GPU that we rented with Google Cloud's free trial credits. (At the last minute… as you do everything… nice going, Jack.)
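The lazy-loading pattern looks roughly like this (a hypothetical sketch — the loader bodies below are placeholders, not our real code; only the model names come from the write-up). Each model is constructed on first request and cached, so VRAM is only claimed once an endpoint actually needs that model:

```python
from functools import lru_cache

# Placeholder loaders standing in for the real model constructors.
LOADERS = {
    "layout": lambda: "DocLayout-YOLO (placeholder)",
    "ocr": lambda: "GOT-OCR2 (placeholder)",
    "visual_embed": lambda: "ColQwen2 (placeholder)",
    "text_embed": lambda: "all-MiniLM-L6-v2 (placeholder)",
}

@lru_cache(maxsize=None)
def get_model(name: str):
    """Load `name` on first use; later calls return the cached object."""
    return LOADERS[name]()
```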

Paper detection uses DocAligner, an open-source ONNX heatmap model with a per-frame confidence decay that tries to prevent flicker from hand occlusion.

Segment search is a hybrid retrieval pipeline that fuses five weighted signals:

  1. ColQwen2 visual similarity (50% weight)
  2. ChromaDB dense text embeddings (15%)
  3. BM25 keyword search (15%)
  4. a NetworkX knowledge graph (10%)
  5. temporal decay (10%)

A confidence gate (threshold 0.15) prevents Gemini from hallucinating when no relevant content is found.
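The fusion and gating can be sketched as a weighted sum over normalized per-signal scores (the weights and 0.15 threshold are from above; the function names and data shapes here are illustrative, not our actual API):

```python
WEIGHTS = {
    "visual": 0.50,    # ColQwen2 visual similarity
    "dense": 0.15,     # ChromaDB text embeddings
    "bm25": 0.15,      # keyword match
    "graph": 0.10,     # NetworkX knowledge graph
    "temporal": 0.10,  # recency decay
}
CONFIDENCE_GATE = 0.15

def fuse(scores: dict) -> float:
    """Combine per-signal scores (each in [0, 1]) into one fused score."""
    return sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)

def retrieve(candidates: dict) -> list:
    """Return (segment, score) above the gate, best first.

    An empty result tells the caller not to hand Gemini any context,
    which is what prevents hallucinated answers.
    """
    ranked = sorted(
        ((seg, fuse(s)) for seg, s in candidates.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )
    return [(seg, sc) for seg, sc in ranked if sc >= CONFIDENCE_GATE]
```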

Hand tracking runs client-side via MediaPipe WASM; the model is small enough to run CPU-only, no GPU needed. The point-to-ask flow uses a 600ms dwell ring and a homography-based finger-to-page coordinate mapping… but due to Jack’s tendency to feature creep, it didn’t work during our demo :(
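The core of the finger-to-page mapping is just applying a 3×3 homography to the fingertip's camera-pixel coordinates (a minimal sketch; in practice the matrix would be estimated from the four detected paper corners, e.g. with OpenCV, rather than hard-coded):

```python
def apply_homography(H, x, y):
    """Map camera point (x, y) through a 3x3 homography H (nested lists)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    # Divide by w to go from projective back to Euclidean coordinates.
    return (u / w, v / w)

# Identity homography: camera coordinates pass through unchanged.
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```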

Gemini Live API integration provides real-time voice conversation through a WebSocket client. Gemini responses can contain structured [VDPAGE] and [LEPAGE] blocks that get parsed (with streaming-robust tag buffering across chunks) into variable-definition and long-explanation panels that appear in the user's AR display.
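The tricky part is that a tag like [VDPAGE] can be split across WebSocket chunks, so the parser buffers text and only emits a block once its closing tag has fully arrived. A minimal sketch of that buffering (the tag names are from the write-up; this parser is illustrative, not our exact implementation):

```python
import re

# Matches a complete [VDPAGE]...[/VDPAGE] or [LEPAGE]...[/LEPAGE] block.
BLOCK_RE = re.compile(r"\[(VDPAGE|LEPAGE)\](.*?)\[/\1\]", re.DOTALL)

class TagBuffer:
    def __init__(self):
        self.buf = ""

    def feed(self, chunk: str) -> list:
        """Append a streamed chunk; return any completed (tag, body) blocks."""
        self.buf += chunk
        matches = list(BLOCK_RE.finditer(self.buf))
        blocks = [(m.group(1), m.group(2)) for m in matches]
        if matches:
            # Keep only text after the last complete block, so a
            # half-received tag stays buffered for the next chunk.
            self.buf = self.buf[matches[-1].end():]
        return blocks
```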

We built iteratively through 5 integration phases — each phase copies and extends the previous phase's library files, keeping each phase self-contained and independently runnable. When moving fast with Antigravity, Claude, and Codex, we found that this is the best way to keep things relatively organized 🫨

Challenges we ran into

The Unity/Meta Quest pivot. We initially built for Meta Quest using Unity and a custom YOLOv8n paper detection model, but we kept hitting an issue where EMA temporal smoothing on the paper corners caused them to cross over each other when moving at different rates, creating a self-intersecting "bowtie" polygon.

Detection flicker from hand occlusion was also a constant issue. When a student's hand passes over the paper, DocAligner loses detection for a few frames, causing the entire AR overlay to vanish and reappear. We added some holdover code that maintains last-known corners for up to 30 frames with exponential confidence decay, and added hand-aware logic that doesn't count detection misses when a hand is visible (likely occlusion, not loss of paper).
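The holdover logic boils down to a small state machine (a hedged sketch: the 30-frame limit and hand-aware miss counting are from above, but the decay factor and class shape here are illustrative):

```python
MAX_MISSES = 30
DECAY = 0.9  # per-frame confidence multiplier (illustrative value)

class CornerHoldover:
    def __init__(self):
        self.corners = None
        self.confidence = 0.0
        self.misses = 0

    def update(self, detected_corners, hand_visible: bool):
        if detected_corners is not None:
            # Fresh detection: reset confidence and the miss counter.
            self.corners, self.confidence, self.misses = detected_corners, 1.0, 0
        elif hand_visible:
            # Likely occlusion, not loss of paper: decay but don't count a miss.
            self.confidence *= DECAY
        else:
            self.misses += 1
            self.confidence *= DECAY
            if self.misses > MAX_MISSES:
                self.corners, self.confidence = None, 0.0
        return self.corners, self.confidence
```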

Browser security and cross-device streaming. Getting camera streaming to work across devices surfaced several issues: HTTPS requirements for getUserMedia, Safari UUID generation differences, and WebSocket relay architecture for phone-to-laptop streaming. Each browser had its own quirks.

For our page memory, our first attempt at automatic page indexing and deduplicated storage had to be reverted. The deduplication logic was producing incorrect results, merging pages that shouldn't have been merged.

Accomplishments that we're proud of

The point-to-ask flow — pointing at an equation on paper, holding for 600ms while a dwell ring fills, and having Gemini explain exactly that equation using the actual handwritten context — was really exciting to see come to life. It marked the homography math, gesture state machine, segment hit-testing, and confidence gating finally coming together, and took a serious amount of debugging. Thank the heavens for LLMs.

We’re also proud of the quality-aware page upgrades using a Cobb-Douglas multiplicative quality model… pages progressively improve as students re-scan, with full version history preserved.

What we learned

We definitely learned to pivot faster. We spent two days fighting Unity/Meta Quest on-device integration before switching to web.

For document quality scoring, multiplicative scoring beats averaging… Our initial quality model averaged metrics, so a perfectly sharp but completely shadowed image scored "decent." The Cobb-Douglas multiplicative model

$$Q = S^{\alpha} \times L^{\beta} \times (1 - D)^{\gamma} \times C^{\delta}$$

ensures any single catastrophic dimension tanks the score — which matches human perception.
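As a function this is a one-liner (the exponents below default to 1.0 for illustration; our tuned values differ). A sharp but fully shadowed page (S=1, D=1) averages to a "decent" 0.75 under the old model but correctly scores 0 here:

```python
def quality(S, L, D, C, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Cobb-Douglas page quality.

    S = sharpness, L = lighting, D = shadow fraction, C = completeness,
    all in [0, 1]. Any catastrophic dimension drives the product to ~0.
    """
    return (S ** alpha) * (L ** beta) * ((1 - D) ** gamma) * (C ** delta)
```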

What's next for GeminEye

AR side pages still need to be integrated into the main project. We want two AR modals anchored beside the user’s real page: one explaining what the variables mean, and another on the right side surfacing the most important definitions.

Built With

  • agentdevelopmentkit
  • copious-amounts-of-coffee
  • fastapi
  • gemini
  • geminilive
  • meta
  • nextjs