Inspiration
The modern web is powerful but still demands manual navigation. Even with search engines and chatbots, users have to click, scroll, open pages, and extract information themselves.
We asked a simple question:
What if an AI agent could see a website like a human and operate it for you?
Inspired by advances in multimodal AI and agentic systems, we built Universal Web Navigator (UWN) — an AI agent that can observe web pages visually and autonomously navigate them to complete tasks.
This idea aligns strongly with the vision of agent-based computing powered by Google Gemini and Vertex AI.
What It Does
Universal Web Navigator is a live multimodal AI agent that can:
1. Visually Understand Websites
The agent captures real-time screenshots of webpages and sends them to Gemini 3 Flash for multimodal reasoning.
This allows the AI to:
- Understand page layout
- Identify buttons, links, and forms
- Interpret visual context
Just like a human would.
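The screenshot-plus-instruction turn sent to the model can be sketched as a multimodal payload in the shape `generateContent` expects from the `@google/genai` SDK. The helper name `buildVisionTurn` and the prompt wording are illustrative, not the project's actual code:

```typescript
// Sketch: packaging a screenshot + instruction as one multimodal turn
// in the content shape used by @google/genai's generateContent.
// buildVisionTurn and the prompt text are illustrative assumptions.

interface Part {
  text?: string;
  inlineData?: { mimeType: string; data: string };
}

interface Content {
  role: "user";
  parts: Part[];
}

function buildVisionTurn(screenshotBase64: string, goal: string): Content {
  return {
    role: "user",
    parts: [
      // The raw PNG screenshot, base64-encoded, as inline image data.
      { inlineData: { mimeType: "image/png", data: screenshotBase64 } },
      // The textual instruction the model reasons over alongside the image.
      {
        text:
          `You are controlling a web browser. Current goal: ${goal}. ` +
          `Describe the next action to take on the page shown.`,
      },
    ],
  };
}
```

Pairing the image part with a goal-bearing text part in the same turn is what lets the model ground its action choice in the pixels it is shown.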
2. Perform Autonomous Web Actions
The AI does not just chat — it acts.
It can:
- Click buttons
- Type into forms
- Navigate links
- Scroll pages
- Extract information
All driven by AI reasoning based on visual input.
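The action set above can be modeled as a small discriminated union dispatched against a browser driver. The `Action` shape and `BrowserPage` interface here are assumptions; a real implementation would back `BrowserPage` with Puppeteer or Playwright:

```typescript
// Sketch: dispatching model-chosen actions to a browser driver.
// Action and BrowserPage are illustrative shapes, not the project's API.

type Action =
  | { kind: "click"; selector: string }
  | { kind: "type"; selector: string; text: string }
  | { kind: "navigate"; url: string }
  | { kind: "scroll"; deltaY: number }
  | { kind: "extract"; selector: string };

interface BrowserPage {
  click(selector: string): Promise<void>;
  type(selector: string, text: string): Promise<void>;
  goto(url: string): Promise<void>;
  scrollBy(deltaY: number): Promise<void>;
  textContent(selector: string): Promise<string>;
}

// Executes one action; returns extracted text for "extract", else null.
async function execute(page: BrowserPage, action: Action): Promise<string | null> {
  switch (action.kind) {
    case "click":    await page.click(action.selector); return null;
    case "type":     await page.type(action.selector, action.text); return null;
    case "navigate": await page.goto(action.url); return null;
    case "scroll":   await page.scrollBy(action.deltaY); return null;
    case "extract":  return page.textContent(action.selector);
  }
}
```

Keeping the action vocabulary closed makes the model's output easy to validate before anything touches the page.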
3. Execute User Missions
Users give the agent a goal, for example:
- Find the cheapest flight
- Extract product prices
- Navigate to a company careers page
- Collect contact information
The AI then plans the steps and performs the navigation automatically.
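The planning-and-navigation cycle can be sketched as an observe → reason → act loop. Here `observe`, `reason`, and `act` are placeholders standing in for screenshot capture, a Gemini call, and browser automation respectively:

```typescript
// Sketch of the observe → reason → act mission loop, under the
// assumption that the three stages are injected as async callbacks.

interface Step {
  action: string;
  done: boolean;
}

async function runMission(
  goal: string,
  observe: () => Promise<string>,                        // capture screenshot
  reason: (shot: string, goal: string) => Promise<Step>, // ask the model
  act: (action: string) => Promise<void>,                // drive the browser
  maxSteps = 20,
): Promise<number> {
  for (let step = 0; step < maxSteps; step++) {
    const screenshot = await observe();
    const decision = await reason(screenshot, goal);
    if (decision.done) return step; // mission complete
    await act(decision.action);
  }
  throw new Error(`Mission not completed within ${maxSteps} steps`);
}
```

The step cap is a safety valve so a confused model cannot loop on a page forever.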
4. Run Fully on Google Cloud
The entire application is cloud-native and deployed on Google Cloud infrastructure, ensuring scalability and real-time operation.
Live deployment:
- Development: https://ais-dev-ogzzs5mayebfedrbqu3mj4-21713248537.asia-southeast1.run.app
- Production: https://ais-pre-ogzzs5mayebfedrbqu3mj4-21713248537.asia-southeast1.run.app
The .run.app domain confirms the app is deployed on Google Cloud Run.
How We Built It
The system combines AI reasoning, browser automation, and cloud infrastructure.
Frontend
- React
- Vite
- TypeScript
- Interactive mission dashboard
The interface allows users to:
- Define navigation goals
- Monitor AI actions
- View mission progress
Backend
Node.js + Express powers the automation layer.
This layer handles:
- Screenshot capture
- Browser control
- Action execution
- Error handling
AI Reasoning Engine
The core intelligence is powered by Google Gemini 3 Flash through the @google/genai SDK.
Capabilities used:
- Multimodal reasoning
- Screenshot interpretation
- Action planning
- Context-aware navigation
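Structured output from the model still needs defensive parsing: models often wrap JSON in Markdown code fences. A minimal sketch, assuming a prompt contract that asks for a `PlannedAction` object (the field names here are illustrative):

```typescript
// Sketch: stripping optional ```json fences before parsing the
// model's planned action. PlannedAction is an assumed prompt contract.

interface PlannedAction {
  action: string;
  target?: string;
  reasoning?: string;
}

function parsePlannedAction(raw: string): PlannedAction {
  // Remove a leading ```json (or bare ```) fence and a trailing ``` fence.
  const cleaned = raw
    .trim()
    .replace(/^```(?:json)?\s*/i, "")
    .replace(/\s*```$/, "");
  const parsed = JSON.parse(cleaned) as PlannedAction;
  if (typeof parsed.action !== "string") {
    throw new Error("Model response missing 'action' field");
  }
  return parsed;
}
```

Validating the parsed object before execution keeps a malformed model reply from turning into a malformed browser action.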
Google Cloud Infrastructure
The project is fully integrated with Google Cloud services.
Key services used:
Google Cloud Run
- Serverless container hosting
- Scalable backend infrastructure
- Runs the browser automation agent
Vertex AI (Gemini)
- Multimodal AI reasoning
- Decision making
- Task planning
Google Search Grounding
Used to:
- Verify information
- Provide real-time context
- Improve agent accuracy
Challenges We Ran Into
1. Translating Visual Context into Actions
Websites are visually complex.
The challenge was enabling the AI to interpret screenshots and translate them into precise browser actions.
This required carefully structuring prompts and reasoning loops.
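One way to structure such a per-step prompt is to pin the model to a constrained action vocabulary and feed back the action history. The wording and action list below are illustrative, not the project's actual prompt:

```typescript
// Sketch: a per-step prompt that constrains the model to a fixed
// action vocabulary. The wording is an assumption for illustration.

function buildStepPrompt(goal: string, history: string[]): string {
  return [
    "You see a screenshot of the current web page.",
    `Overall goal: ${goal}`,
    history.length
      ? `Actions taken so far:\n- ${history.join("\n- ")}`
      : "No actions taken yet.",
    'Respond with JSON only: {"action": one of [click, type, navigate, scroll, extract, done],',
    ' "target": CSS selector or URL, "reasoning": one sentence}.',
  ].join("\n");
}
```

Including the running history is what keeps the loop from re-issuing the same click when the page does not visibly change.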
2. Real-Time Agent Control
Building a live agent that observes → reasons → acts required synchronizing:
- screenshot capture
- AI reasoning
- browser automation
all without introducing noticeable latency.
3. Handling Website Variability
Every website has a different structure, layout, and UI patterns.
The system needed to be generic enough to work across any website, not just predefined ones.
4. Cloud Deployment
Running browser automation in a serverless cloud environment required containerizing the application and optimizing it for Google Cloud Run's stateless architecture.
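A container for this kind of workload might look roughly like the sketch below. The base image, package list, and Puppeteer environment variable are assumptions about a typical Debian-based Node setup; the project's actual Dockerfile may differ:

```dockerfile
# Illustrative Dockerfile for headless Chromium on Cloud Run (assumed
# Debian-based Node image + Puppeteer; not the project's actual file).
FROM node:20-slim

# Chromium and its shared-library dependencies for headless operation.
RUN apt-get update && apt-get install -y --no-install-recommends \
    chromium fonts-liberation libnss3 libatk-bridge2.0-0 libgbm1 \
 && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .

# Point Puppeteer at the system Chromium instead of downloading one.
ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium

# Cloud Run injects PORT; the server must listen on it.
CMD ["node", "server.js"]
```

Because Cloud Run instances are stateless, each browser session has to be fully reconstructable from the mission state rather than from local disk.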
Accomplishments That We're Proud Of
1. True Multimodal AI Agent
Unlike traditional chatbots, UWN is a live AI agent that can see and interact with websites.
2. Fully Cloud-Native Deployment
The system runs completely on Google Cloud infrastructure, demonstrating scalable AI agent architecture.
3. Real-World Use Case
This technology can power:
- AI personal assistants
- automated research tools
- workflow automation
- accessibility solutions
4. Seamless AI + Browser Automation
We successfully integrated:
- Gemini multimodal reasoning
- automated browser navigation
- real-time task execution
into a unified agent system.
What We Learned
During development we gained insights into:
- Designing multimodal AI workflows
- Building agentic systems that plan and execute tasks
- Deploying AI applications on Google Cloud serverless architecture
- Managing real-time AI reasoning loops
We also learned that visual understanding dramatically improves AI navigation capabilities compared to text-only approaches.
What's Next for Universal Web Navigator
We plan to expand UWN into a full autonomous web assistant.
Future improvements include:
Real-Time Agent Streaming
Users will be able to watch the AI navigate websites live through a video stream.
Voice Interaction
Integration of real-time speech input and output, allowing users to control the agent conversationally.
Memory and Task History
Using Firebase and Firestore to store:
- user missions
- agent decisions
- learning history
Advanced Agent Planning
Integrating multi-step planning agents capable of completing complex workflows across multiple websites.
Enterprise Automation
UWN could become a platform for:
- automated data collection
- digital workflow automation
- AI-driven business research