Inspiration
Most UI code generators fall short because they read only HTML or only screenshots. Real interfaces need both structural and visual understanding. We built UIHarvest to close that gap: an agent that can see a webpage, navigate it, identify reusable UI patterns, and convert them into practical design memory for developers.
What it does
UIHarvest is a multimodal UI Navigator agent that:
- Crawls and navigates real websites
- Captures viewport screenshots and DOM context
- Uses Gemini vision to classify UI components (cards, nav, forms, buttons, sections, etc.)
- Deduplicates repeated patterns across scroll depth/viewports
- Extracts design tokens (colors, typography, spacing, radius, shadows)
- Produces structured outputs and reusable design memory artifacts for implementation workflows
In short: it turns any website into structured, reusable UI intelligence.
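To make "structured outputs" concrete, here is a minimal TypeScript sketch of what a harvested result could look like. The field names and sample values are illustrative assumptions for this write-up, not UIHarvest's actual schema.

```typescript
// Illustrative shapes only — not the real UIHarvest schema.
interface DesignTokens {
  colors: string[];        // e.g. ["#1a73e8", "#ffffff"]
  fontFamilies: string[];
  spacingScale: number[];  // px values
  borderRadii: number[];   // px values
  shadows: string[];       // CSS box-shadow strings
}

interface ComponentRecord {
  id: string;
  kind: "card" | "nav" | "form" | "button" | "section" | "other";
  confidence: number;      // classifier confidence, 0..1
  box: { x: number; y: number; width: number; height: number };
  screenshotPath: string;  // cropped image artifact
  domSnippet: string;      // surrounding DOM context
  instances: number;       // how many times the pattern repeats on the page
}

interface HarvestResult {
  url: string;
  components: ComponentRecord[];
  tokens: DesignTokens;
}

// A hypothetical harvested record for one card pattern:
const example: HarvestResult = {
  url: "https://example.com",
  components: [{
    id: "c1", kind: "card", confidence: 0.92,
    box: { x: 0, y: 120, width: 320, height: 200 },
    screenshotPath: "artifacts/c1.png",
    domSnippet: "<div class=\"card\">…</div>",
    instances: 3,
  }],
  tokens: {
    colors: ["#1a73e8"], fontFamilies: ["Inter"],
    spacingScale: [4, 8, 16], borderRadii: [8], shadows: [],
  },
};
```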
How we built it
Agent workflow
- Browser automation navigates the target site and captures screenshots.
- A DOM heuristic pass proposes candidate UI components.
- Candidate boxes are overlaid and sent to Gemini multimodal for semantic classification.
- The pipeline crops, labels, deduplicates, and groups repeated pattern instances.
- A memory generation phase creates implementation-friendly outputs.
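The stages above can be sketched as a typed pipeline skeleton. The stage bodies below are stubs and every function name is an illustrative assumption, not the actual UIHarvest code; in the real pipeline the stubs would call Playwright and the Gemini API.

```typescript
// Skeleton of the capture → propose → classify → dedupe workflow.
interface Box { x: number; y: number; width: number; height: number }
interface Candidate { box: Box; selector: string }
interface Labeled extends Candidate { label: string; confidence: number }

async function capturePage(url: string): Promise<{ screenshot: string; html: string }> {
  // Stub: the real pipeline navigates and screenshots with Playwright here.
  return { screenshot: "page.png", html: "<main>…</main>" };
}

function proposeCandidates(html: string): Candidate[] {
  // Stub for the DOM heuristic pass that proposes candidate components.
  return [{ box: { x: 0, y: 0, width: 300, height: 100 }, selector: "nav" }];
}

async function classify(cands: Candidate[]): Promise<Labeled[]> {
  // Stub for the Gemini multimodal call on overlaid candidate boxes.
  return cands.map(c => ({ ...c, label: "nav", confidence: 0.9 }));
}

function dedupe(labeled: Labeled[], minConfidence = 0.5): Labeled[] {
  // Simplified: the real dedup also does geometric and signature matching.
  return labeled.filter(l => l.confidence >= minConfidence);
}

async function harvest(url: string): Promise<Labeled[]> {
  const { html } = await capturePage(url);
  const labeled = await classify(proposeCandidates(html));
  return dedupe(labeled);
}
```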
Core stack
- TypeScript + Bun runtime
- Playwright for browser extraction
- Gemini via Google GenAI SDK (@google/genai)
- Express backend
- Cloud-native storage/state using Google Cloud services
Cloud architecture
- Cloud Run for backend hosting
- Firestore for job metadata/state
- Cloud Storage for artifacts/screenshots
- Cloud Build for containerized deployment pipeline
Challenges we ran into
Component over-detection: raw DOM candidates included wrappers/noise. Solved with containment deduplication, confidence thresholds, and Gemini semantic filtering.
Cross-viewport duplication: sticky/fixed elements and scroll overlap created repeats. Solved by hiding sticky elements on later passes, IoU deduplication, and pattern-signature matching.
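The geometric part of both fixes, containment and IoU deduplication, can be sketched as follows. The thresholds are illustrative assumptions, not the values used in UIHarvest.

```typescript
// Drop a candidate box when it heavily overlaps (high IoU with), or is
// almost entirely contained in, a box that was already kept.
interface Box { x: number; y: number; width: number; height: number }

function area(b: Box): number {
  return b.width * b.height;
}

function intersectionArea(a: Box, b: Box): number {
  const w = Math.min(a.x + a.width, b.x + b.width) - Math.max(a.x, b.x);
  const h = Math.min(a.y + a.height, b.y + b.height) - Math.max(a.y, b.y);
  return w > 0 && h > 0 ? w * h : 0;
}

// Intersection-over-union: 1 means identical boxes, 0 means disjoint.
function iou(a: Box, b: Box): number {
  const inter = intersectionArea(a, b);
  return inter / (area(a) + area(b) - inter);
}

// Fraction of `inner` covered by `outer` — near 1 means containment.
function containment(inner: Box, outer: Box): number {
  return intersectionArea(inner, outer) / area(inner);
}

function dedupeBoxes(boxes: Box[], iouThresh = 0.8, containThresh = 0.95): Box[] {
  const kept: Box[] = [];
  // Visit larger boxes first so a wrapper absorbs its contained duplicates.
  for (const b of [...boxes].sort((x, y) => area(y) - area(x))) {
    const dup = kept.some(k => iou(b, k) >= iouThresh || containment(b, k) >= containThresh);
    if (!dup) kept.push(b);
  }
  return kept;
}
```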
Noisy real-world pages: cookie popups, lazy-loading, and animation instability hurt extraction consistency. Solved with pre-extraction stabilization (overlay dismissal, lazy-load triggering, animation disabling).
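The stabilization payloads can be sketched as plain data: a CSS freeze injected before capture and a scroll plan that fires lazy loaders. In the real pipeline these would be applied via Playwright (e.g. `page.addStyleTag` / `page.evaluate`); the selectors and step size here are assumptions, not the actual configuration.

```typescript
// Injected before screenshots so transitions and animations can't move
// elements between the DOM pass and the capture.
const freezeCss = `
  *, *::before, *::after {
    animation: none !important;
    transition: none !important;
  }
`;

// Common consent/overlay dismiss targets (illustrative selectors only).
const overlaySelectors = [
  "[id*='cookie'] button",
  "[class*='consent'] button",
  "dialog[open] button",
];

// Scroll plan: step through the page viewport by viewport to trigger
// IntersectionObserver-based lazy loading, then return to the top so the
// first capture starts from a clean viewport.
function scrollPlan(pageHeight: number, viewport: number): number[] {
  const stops: number[] = [];
  for (let y = 0; y < pageHeight; y += viewport) stops.push(y);
  stops.push(0);
  return stops;
}
```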
Reliable structured outputs from vision: multimodal classification can vary. Solved with strict JSON prompting and defensive parsing/fallback strategies.
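The defensive-parsing idea looks roughly like this: the model is prompted for strict JSON, but the raw text may still arrive wrapped in markdown fences or trailing prose, so the parser recovers the JSON span or falls back to a safe default. In the real pipeline a zod schema validates the shape; the hand-rolled check below just keeps the sketch dependency-free, and the field names are illustrative.

```typescript
interface Classification { label: string; confidence: number }

// Strip markdown code fences if present, then parse the first {...} span.
function extractJson(raw: string): unknown {
  const unfenced = raw.replace(/`{3}(?:json)?/g, "");
  const start = unfenced.indexOf("{");
  const end = unfenced.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  try {
    return JSON.parse(unfenced.slice(start, end + 1));
  } catch {
    return null; // malformed JSON → caller falls back
  }
}

function parseClassification(raw: string): Classification {
  const fallback: Classification = { label: "unknown", confidence: 0 };
  const parsed = extractJson(raw) as Partial<Classification> | null;
  if (!parsed || typeof parsed.label !== "string" || typeof parsed.confidence !== "number") {
    return fallback;
  }
  return { label: parsed.label, confidence: parsed.confidence };
}
```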
What we learned
- Multimodal agents are strongest when combining visual evidence + structural context, not either one alone.
- Agent architecture quality (dedup, retries, error boundaries, grounding) matters as much as model quality.
- Google Cloud services made it straightforward to move from local prototype to reproducible hosted pipeline.
- Building an evaluation-friendly system (artifacts, architecture visibility, deterministic steps) is critical for hackathon judging.
Why this matters
UIHarvest helps teams reverse-engineer design systems, audit visual consistency, and accelerate implementation from existing interfaces. It demonstrates practical multimodal agency: understanding visual UI in context and producing executable, useful outputs.
Built With
- bun
- docker
- express.js
- gemini-multimodal-models
- google-cloud-build
- google-cloud-firestore
- google-cloud-run
- google-cloud-storage-(gcs)
- google-genai-sdk-(@google/genai)
- playwright
- sharp
- typescript
- zod