Inspiration

Most UI code generators fail because they only read HTML or only read screenshots. Real interfaces need both structure and visual understanding. We built UIHarvest to close that gap: an agent that can see a webpage, navigate it, identify reusable UI patterns, and convert them into practical design memory for developers.

What it does

UIHarvest is a multimodal UI Navigator agent that:

  • Crawls and navigates real websites
  • Captures viewport screenshots and DOM context
  • Uses Gemini vision to classify UI components (cards, nav, forms, buttons, sections, etc.)
  • Deduplicates repeated patterns across scroll depth/viewports
  • Extracts design tokens (colors, typography, spacing, radius, shadows)
  • Produces structured outputs and reusable design memory artifacts for implementation workflows

In short: it turns any website into structured, reusable UI intelligence.
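As a concrete illustration of what "reusable design memory" could look like, here is a sketch of one harvested component record. The field names and types are our illustrative assumptions, not UIHarvest's actual schema:

```typescript
// Hypothetical shape for a single harvested component record.
// All field names here are illustrative, not the project's real schema.
interface DesignTokenSet {
  colors: string[];        // extracted palette, e.g. ["#ffffff", "#0f172a"]
  fontFamilies: string[];  // typography tokens
  spacingPx: number[];     // observed spacing steps
  borderRadiusPx: number[];
}

interface ComponentMemory {
  kind: "card" | "nav" | "form" | "button" | "section";
  label: string;           // semantic label from the vision model
  screenshotPath: string;  // cropped screenshot artifact
  domSelector: string;     // CSS selector of the source node
  instances: number;       // deduplicated repeat count
  tokens: DesignTokenSet;
}

const example: ComponentMemory = {
  kind: "card",
  label: "pricing card",
  screenshotPath: "artifacts/pricing-card.png",
  domSelector: ".pricing > .card",
  instances: 3,
  tokens: {
    colors: ["#ffffff", "#0f172a"],
    fontFamilies: ["Inter"],
    spacingPx: [8, 16, 24],
    borderRadiusPx: [12],
  },
};
```

A record like this gives downstream implementation workflows both the visual evidence (screenshot crop) and the structural handle (selector, tokens) in one artifact.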

How we built it

Agent workflow

  1. Browser automation navigates the target site and captures screenshots.
  2. A DOM heuristic pass proposes candidate UI components.
  3. Candidate boxes are overlaid and sent to Gemini multimodal for semantic classification.
  4. The pipeline crops, labels, deduplicates, and groups repeated pattern instances.
  5. A memory generation phase creates implementation-friendly outputs.
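Stages 2-4 above compose into a simple sequential pipeline. A minimal sketch follows, with each stage's internals stubbed out; the function names and shapes are our assumptions for illustration, not the actual codebase:

```typescript
// Minimal pipeline sketch: stage internals are stubbed, names illustrative.
interface Candidate { selector: string; box: [number, number, number, number]; }
interface Classified extends Candidate { label: string; confidence: number; }

// Stage 2: DOM heuristics propose candidate boxes (stubbed on one tag).
function proposeCandidates(dom: string): Candidate[] {
  return dom.includes("<nav") ? [{ selector: "nav", box: [0, 0, 1280, 64] }] : [];
}

// Stage 3: the vision model classifies each candidate (stubbed with a fixed label).
function classify(c: Candidate): Classified {
  return { ...c, label: "nav", confidence: 0.9 };
}

// Stage 4: drop low-confidence results and repeated selectors.
function dedupe(items: Classified[], minConf = 0.5): Classified[] {
  const seen = new Set<string>();
  return items.filter((i) => {
    if (i.confidence < minConf || seen.has(i.selector)) return false;
    seen.add(i.selector);
    return true;
  });
}

// Stages 2-4 chained over one captured page snapshot.
function runPipeline(dom: string): Classified[] {
  return dedupe(proposeCandidates(dom).map(classify));
}
```

In the real system, stage 2 would read a Playwright-captured DOM and stage 3 would call Gemini; the point of the sketch is the shape of the hand-offs between stages.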

Core stack

  • TypeScript + Bun runtime
  • Playwright for browser extraction
  • Gemini via Google GenAI SDK (@google/genai)
  • Express backend
  • Cloud-native storage/state using Google Cloud services

Cloud architecture

  • Cloud Run for backend hosting
  • Firestore for job metadata/state
  • Cloud Storage for artifacts/screenshots
  • Cloud Build for containerized deployment pipeline

Challenges we ran into

  • Component over-detection: raw DOM candidates included wrappers/noise. Solved with containment deduplication, confidence thresholds, and Gemini semantic filtering.

  • Cross-viewport duplication: sticky/fixed elements and scroll overlap created repeats. Solved with sticky-element hiding on later passes, IoU deduplication, and pattern-signature matching.

  • Noisy real-world pages: cookie popups, lazy-loading, and animation instability hurt extraction consistency. Solved with pre-extraction stabilization (overlay dismissal, lazy-load triggering, animation disabling).

  • Reliable structured outputs from vision: multimodal classification can vary. Solved with strict JSON prompting and defensive parsing/fallback strategies.
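The containment and IoU deduplication mentioned above reduce to plain axis-aligned box math. A self-contained sketch (the 0.95 and 0.8 thresholds are illustrative defaults, not the project's tuned values):

```typescript
type Box = { x: number; y: number; w: number; h: number };

// Area of the intersection of two axis-aligned boxes.
function intersection(a: Box, b: Box): number {
  const w = Math.min(a.x + a.w, b.x + b.w) - Math.max(a.x, b.x);
  const h = Math.min(a.y + a.h, b.y + b.h) - Math.max(a.y, b.y);
  return Math.max(0, w) * Math.max(0, h);
}

// Intersection-over-union: 1.0 means identical boxes, 0 means disjoint.
function iou(a: Box, b: Box): number {
  const inter = intersection(a, b);
  return inter / (a.w * a.h + b.w * b.h - inter);
}

// True when `inner` is (almost) fully inside `outer` — used to drop
// wrapper elements that merely enclose a real component.
function contained(inner: Box, outer: Box, thresh = 0.95): boolean {
  return intersection(inner, outer) / (inner.w * inner.h) >= thresh;
}

// Keep the first box of every near-duplicate group (IoU above threshold),
// e.g. the same hero section captured in two overlapping scroll passes.
function dedupeByIoU(boxes: Box[], thresh = 0.8): Box[] {
  const kept: Box[] = [];
  for (const b of boxes) {
    if (!kept.some((k) => iou(k, b) >= thresh)) kept.push(b);
  }
  return kept;
}
```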
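The "strict JSON prompting plus defensive parsing" strategy can be sketched as follows. The project lists Zod for schema validation; to stay self-contained this sketch uses only plain TypeScript checks, and the function names are our assumptions:

```typescript
interface Classification { label: string; confidence: number; }

// Vision models often wrap JSON in markdown fences or surround it with
// prose. Extract the outermost {...} span before parsing.
function extractJson(raw: string): string | null {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  return start >= 0 && end > start ? raw.slice(start, end + 1) : null;
}

// Parse defensively: malformed JSON or a wrong shape yields null
// (so the caller can retry or skip) instead of crashing the pipeline.
function parseClassification(raw: string): Classification | null {
  const body = extractJson(raw);
  if (!body) return null;
  try {
    const v = JSON.parse(body);
    if (typeof v.label === "string" && typeof v.confidence === "number") {
      return { label: v.label, confidence: v.confidence };
    }
  } catch {
    // malformed JSON → treat as an unparseable response
  }
  return null;
}
```

In production, the plain type checks would be replaced by a Zod schema's `safeParse`, which gives the same null-on-failure behavior with richer error reporting.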

What we learned

  • Multimodal agents are strongest when combining visual evidence + structural context, not either one alone.
  • Agent architecture quality (dedup, retries, error boundaries, grounding) matters as much as model quality.
  • Google Cloud services made it straightforward to move from local prototype to reproducible hosted pipeline.
  • Building an evaluation-friendly system (artifacts, architecture visibility, deterministic steps) is critical for hackathon judging.

Why this matters

UIHarvest helps teams reverse-engineer design systems, audit visual consistency, and accelerate implementation from existing interfaces. It demonstrates practical multimodal agency: understanding visual UI in context and producing executable, useful outputs.

Built With

  • Bun
  • Docker
  • Express.js
  • Gemini multimodal models
  • Google Cloud Build
  • Google Cloud Firestore
  • Google Cloud Run
  • Google Cloud Storage (GCS)
  • Google GenAI SDK (@google/genai)
  • Playwright
  • sharp
  • TypeScript
  • Zod