Inspiration
Most UI code generators fall short because they read only HTML or only screenshots. Real interfaces need both structural and visual understanding. We built UIHarvest to close that gap: an agent that can see a webpage, navigate it, identify reusable UI patterns, and convert them into practical design memory for developers.
What it does
UIHarvest is a multimodal UI Navigator agent that:
- Crawls and navigates real websites
- Captures viewport screenshots and DOM context
- Uses Gemini vision to classify UI components (cards, nav, forms, buttons, sections, etc.)
- Deduplicates repeated patterns across scroll depth/viewports
- Extracts design tokens (colors, typography, spacing, radius, shadows)
- Produces structured outputs and reusable design memory artifacts for implementation workflows
In short: it turns any website into structured, reusable UI intelligence.
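To make "structured outputs" concrete, here is a minimal TypeScript sketch of what a harvested result could look like. The field names and sample values are illustrative assumptions for this write-up, not UIHarvest's actual schema.

```typescript
// Illustrative shapes only — not the real UIHarvest schema.
interface DesignTokens {
  colors: string[];        // e.g. ["#1a73e8", "#ffffff"]
  fontFamilies: string[];
  spacingScale: number[];  // px values
  borderRadii: number[];   // px values
  shadows: string[];       // CSS box-shadow strings
}

interface ComponentRecord {
  id: string;
  kind: "card" | "nav" | "form" | "button" | "section" | "other";
  confidence: number;      // classifier confidence, 0..1
  box: { x: number; y: number; width: number; height: number };
  screenshotPath: string;  // cropped image artifact
  domSnippet: string;      // surrounding DOM context
  instances: number;       // how many times the pattern repeats on the page
}

interface HarvestResult {
  url: string;
  components: ComponentRecord[];
  tokens: DesignTokens;
}

// A hypothetical harvested record for one card pattern:
const example: HarvestResult = {
  url: "https://example.com",
  components: [{
    id: "c1", kind: "card", confidence: 0.92,
    box: { x: 0, y: 120, width: 320, height: 200 },
    screenshotPath: "artifacts/c1.png",
    domSnippet: "<div class=\"card\">…</div>",
    instances: 3,
  }],
  tokens: {
    colors: ["#1a73e8"], fontFamilies: ["Inter"],
    spacingScale: [4, 8, 16], borderRadii: [8], shadows: [],
  },
};
```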
How we built it
Agent workflow
- Browser automation navigates the target site and captures screenshots.
- A DOM heuristic pass proposes candidate UI components.
- Candidate boxes are overlaid and sent to Gemini multimodal for semantic classification.
- The pipeline crops, labels, deduplicates, and groups repeated pattern instances.
- A memory generation phase creates implementation-friendly outputs.
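The stages above can be sketched as a typed pipeline skeleton. The stage bodies below are stubs and every function name is an illustrative assumption, not the actual UIHarvest code; in the real pipeline the stubs would call Playwright and the Gemini API.

```typescript
// Skeleton of the capture → propose → classify → dedupe workflow.
interface Box { x: number; y: number; width: number; height: number }
interface Candidate { box: Box; selector: string }
interface Labeled extends Candidate { label: string; confidence: number }

async function capturePage(url: string): Promise<{ screenshot: string; html: string }> {
  // Stub: the real pipeline navigates and screenshots with Playwright here.
  return { screenshot: "page.png", html: "<main>…</main>" };
}

function proposeCandidates(html: string): Candidate[] {
  // Stub for the DOM heuristic pass that proposes candidate components.
  return [{ box: { x: 0, y: 0, width: 300, height: 100 }, selector: "nav" }];
}

async function classify(cands: Candidate[]): Promise<Labeled[]> {
  // Stub for the Gemini multimodal call on overlaid candidate boxes.
  return cands.map(c => ({ ...c, label: "nav", confidence: 0.9 }));
}

function dedupe(labeled: Labeled[], minConfidence = 0.5): Labeled[] {
  // Simplified: the real dedup also does geometric and signature matching.
  return labeled.filter(l => l.confidence >= minConfidence);
}

async function harvest(url: string): Promise<Labeled[]> {
  const { html } = await capturePage(url);
  const labeled = await classify(proposeCandidates(html));
  return dedupe(labeled);
}
```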
Core stack
- TypeScript + Bun runtime
- Playwright for browser extraction
- Gemini via Google GenAI SDK (@google/genai)
- Express backend
- Cloud-native storage/state using Google Cloud services
Cloud architecture
- Cloud Run for backend hosting
- Firestore for job metadata/state
- Cloud Storage for artifacts/screenshots
- Cloud Build for containerized deployment pipeline
Challenges we ran into
Component over-detection: raw DOM candidates included wrappers/noise. Solved with containment deduplication, confidence thresholds, and Gemini semantic filtering.
Cross-viewport duplication: sticky/fixed elements and scroll overlap created repeats. Solved by hiding sticky elements on later passes, IoU deduplication, and pattern-signature matching.
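The geometric part of both fixes, containment and IoU deduplication, can be sketched as follows. The thresholds are illustrative assumptions, not the values used in UIHarvest.

```typescript
// Drop a candidate box when it heavily overlaps (high IoU with), or is
// almost entirely contained in, a box that was already kept.
interface Box { x: number; y: number; width: number; height: number }

function area(b: Box): number {
  return b.width * b.height;
}

function intersectionArea(a: Box, b: Box): number {
  const w = Math.min(a.x + a.width, b.x + b.width) - Math.max(a.x, b.x);
  const h = Math.min(a.y + a.height, b.y + b.height) - Math.max(a.y, b.y);
  return w > 0 && h > 0 ? w * h : 0;
}

// Intersection-over-union: 1 means identical boxes, 0 means disjoint.
function iou(a: Box, b: Box): number {
  const inter = intersectionArea(a, b);
  return inter / (area(a) + area(b) - inter);
}

// Fraction of `inner` covered by `outer` — near 1 means containment.
function containment(inner: Box, outer: Box): number {
  return intersectionArea(inner, outer) / area(inner);
}

function dedupeBoxes(boxes: Box[], iouThresh = 0.8, containThresh = 0.95): Box[] {
  const kept: Box[] = [];
  // Visit larger boxes first so a wrapper absorbs its contained duplicates.
  for (const b of [...boxes].sort((x, y) => area(y) - area(x))) {
    const dup = kept.some(k => iou(b, k) >= iouThresh || containment(b, k) >= containThresh);
    if (!dup) kept.push(b);
  }
  return kept;
}
```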
Noisy real-world pages: cookie popups, lazy-loading, and animation instability hurt extraction consistency. Solved with pre-extraction stabilization (overlay dismissal, lazy-load triggering, animation disabling).
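The stabilization payloads can be sketched as plain data: a CSS freeze injected before capture and a scroll plan that fires lazy loaders. In the real pipeline these would be applied via Playwright (e.g. `page.addStyleTag` / `page.evaluate`); the selectors and step size here are assumptions, not the actual configuration.

```typescript
// Injected before screenshots so transitions and animations can't move
// elements between the DOM pass and the capture.
const freezeCss = `
  *, *::before, *::after {
    animation: none !important;
    transition: none !important;
  }
`;

// Common consent/overlay dismiss targets (illustrative selectors only).
const overlaySelectors = [
  "[id*='cookie'] button",
  "[class*='consent'] button",
  "dialog[open] button",
];

// Scroll plan: step through the page viewport by viewport to trigger
// IntersectionObserver-based lazy loading, then return to the top so the
// first capture starts from a clean viewport.
function scrollPlan(pageHeight: number, viewport: number): number[] {
  const stops: number[] = [];
  for (let y = 0; y < pageHeight; y += viewport) stops.push(y);
  stops.push(0);
  return stops;
}
```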
Reliable structured outputs from vision: multimodal classification can vary. Solved with strict JSON prompting and defensive parsing/fallback strategies.
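The defensive-parsing idea looks roughly like this: the model is prompted for strict JSON, but the raw text may still arrive wrapped in markdown fences or trailing prose, so the parser recovers the JSON span or falls back to a safe default. In the real pipeline a zod schema validates the shape; the hand-rolled check below just keeps the sketch dependency-free, and the field names are illustrative.

```typescript
interface Classification { label: string; confidence: number }

// Strip markdown code fences if present, then parse the first {...} span.
function extractJson(raw: string): unknown {
  const unfenced = raw.replace(/`{3}(?:json)?/g, "");
  const start = unfenced.indexOf("{");
  const end = unfenced.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  try {
    return JSON.parse(unfenced.slice(start, end + 1));
  } catch {
    return null; // malformed JSON → caller falls back
  }
}

function parseClassification(raw: string): Classification {
  const fallback: Classification = { label: "unknown", confidence: 0 };
  const parsed = extractJson(raw) as Partial<Classification> | null;
  if (!parsed || typeof parsed.label !== "string" || typeof parsed.confidence !== "number") {
    return fallback;
  }
  return { label: parsed.label, confidence: parsed.confidence };
}
```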
What we learned
- Multimodal agents are strongest when combining visual evidence + structural context, not either one alone.
- Agent architecture quality (dedup, retries, error boundaries, grounding) matters as much as model quality.
- Google Cloud services made it straightforward to move from local prototype to reproducible hosted pipeline.
- Building an evaluation-friendly system (artifacts, architecture visibility, deterministic steps) is critical for hackathon judging.
Why this matters
UIHarvest helps teams reverse-engineer design systems, audit visual consistency, and accelerate implementation from existing interfaces. It demonstrates practical multimodal agency: understanding visual UI in context and producing executable, useful outputs.
Built With
- bun
- docker
- express.js
- gemini-multimodal-models
- google-cloud-build
- google-cloud-firestore
- google-cloud-run
- google-cloud-storage-(gcs)
- google-genai-sdk-(@google/genai)
- playwright
- sharp
- typescript
- zod