Inspiration

The modern web is powerful but still requires manual navigation. Even with search engines and chatbots, users have to click, scroll, open pages, and extract information by hand.

We asked a simple question:

What if an AI agent could see a website like a human and operate it for you?

Inspired by recent advances in multimodal AI and agentic systems, we built Universal Web Navigator (UWN) — an AI agent that can observe web pages visually and autonomously navigate them to complete tasks.

This idea aligns strongly with the vision of agent-based computing powered by Google Gemini and Vertex AI.


What It Does

Universal Web Navigator is a live multimodal AI agent that can:

1. Visually Understand Websites

The agent captures real-time screenshots of webpages and sends them to Gemini 3 Flash for multimodal reasoning.

This allows the AI to:

  • Understand page layout
  • Identify buttons, links, and forms
  • Interpret visual context

Just like a human would.
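As a sketch of this step, the backend can ask the model to describe the visible elements as JSON and validate the reply before anything reaches the browser. The element shape and function names below are illustrative, not the exact schema the system uses:

```typescript
// Hypothetical shape of the structured reply we ask Gemini to return
// after analysing a screenshot (all names here are illustrative).
interface PageElement {
  kind: "button" | "link" | "form" | "text";
  label: string;
  // Approximate bounding box in screenshot pixels.
  box: { x: number; y: number; width: number; height: number };
}

// Parse and validate the model's JSON reply, dropping malformed entries
// so a bad generation never reaches the browser layer.
function parsePageElements(raw: string): PageElement[] {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return [];
  }
  if (!Array.isArray(data)) return [];
  return data.filter(
    (e): e is PageElement =>
      ["button", "link", "form", "text"].includes(e?.kind) &&
      typeof e?.label === "string" &&
      typeof e?.box?.x === "number"
  );
}
```

Validating the model's output at this boundary is what keeps an occasional malformed generation from turning into a wrong click.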


2. Perform Autonomous Web Actions

The AI does not just chat — it acts.

It can:

  • Click buttons
  • Type into forms
  • Navigate links
  • Scroll pages
  • Extract information

All driven by AI reasoning based on visual input.
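A minimal sketch of how those actions can be dispatched: the reasoning step emits one typed action, and a small executor maps it onto the browser layer. The action vocabulary and the `Browser` interface below are assumptions for illustration; in the real system the interface would wrap a headless-browser library rather than the in-memory stub used in the test.

```typescript
// Illustrative action vocabulary the agent's reasoning step can emit.
type AgentAction =
  | { type: "click"; selector: string }
  | { type: "type"; selector: string; text: string }
  | { type: "navigate"; url: string }
  | { type: "scroll"; deltaY: number }
  | { type: "extract"; selector: string };

// Minimal browser interface (hypothetical); a real implementation would
// delegate to a browser-automation library.
interface Browser {
  click(sel: string): void;
  type(sel: string, text: string): void;
  goto(url: string): void;
  scrollBy(dy: number): void;
  textOf(sel: string): string;
}

// Dispatch one AI-chosen action to the browser.
// Returns extracted text for "extract", null otherwise.
function execute(action: AgentAction, browser: Browser): string | null {
  switch (action.type) {
    case "click":    browser.click(action.selector); return null;
    case "type":     browser.type(action.selector, action.text); return null;
    case "navigate": browser.goto(action.url); return null;
    case "scroll":   browser.scrollBy(action.deltaY); return null;
    case "extract":  return browser.textOf(action.selector);
  }
}
```

Keeping the action set closed like this makes the model's choices easy to validate and log.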


3. Execute User Missions

Users give the agent a goal, for example:

  • Find the cheapest flight
  • Extract product prices
  • Navigate to a company careers page
  • Collect contact information

The AI then plans the steps and performs the navigation automatically.
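The planning flow above can be sketched as a bounded observe → reason → act loop. Function names here are illustrative: `reason` stands in for the Gemini call and `act` for the browser layer, and the real loop is asynchronous.

```typescript
// One observe -> reason -> act iteration, repeated until the model
// reports the mission complete or a step budget runs out.
// Returns the number of actions taken before completion.
function runMission(
  goal: string,
  observe: () => string,
  reason: (goal: string, screenshot: string) => { action: string; done: boolean },
  act: (action: string) => void,
  maxSteps = 10
): number {
  for (let step = 0; step < maxSteps; step++) {
    const screenshot = observe();       // capture the current page state
    const decision = reason(goal, screenshot); // ask the model what to do
    if (decision.done) return step;     // mission accomplished
    act(decision.action);               // execute the chosen action
  }
  return maxSteps; // budget exhausted without completion
}
```

The step budget matters in practice: it bounds cost and guarantees the agent cannot loop forever on a page it misreads.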


4. Run Fully on Google Cloud

The entire application is cloud-native and deployed on Google Cloud infrastructure, ensuring scalability and real-time operation.

Live deployment:

Development: https://ais-dev-ogzzs5mayebfedrbqu3mj4-21713248537.asia-southeast1.run.app

Production: https://ais-pre-ogzzs5mayebfedrbqu3mj4-21713248537.asia-southeast1.run.app

The .run.app domains confirm that the services are deployed on Google Cloud Run.


How We Built It

The system combines AI reasoning, browser automation, and cloud infrastructure.

Frontend

  • React
  • Vite
  • TypeScript
  • Interactive mission dashboard

The interface allows users to:

  • Define navigation goals
  • Monitor AI actions
  • View mission progress

Backend

Node.js + Express powers the automation layer.

This layer handles:

  • Screenshot capture
  • Browser control
  • Action execution
  • Error handling
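Browser steps fail intermittently (slow loads, elements detaching mid-action), so each action is worth wrapping in a retry. A minimal synchronous sketch of that error-handling pattern; the real version would be async and add a backoff delay between attempts:

```typescript
// Retry a flaky operation a few times before giving up.
// Rethrows the last error if every attempt fails.
function withRetry<T>(fn: () => T, attempts = 3): T {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return fn();
    } catch (err) {
      lastErr = err; // remember the failure and try again
    }
  }
  throw lastErr;
}
```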

AI Reasoning Engine

The core intelligence is powered by Google Gemini 3 Flash through the @google/genai SDK.

Capabilities used:

  • Multimodal reasoning
  • Screenshot interpretation
  • Action planning
  • Context-aware navigation

Google Cloud Infrastructure

The project is fully integrated with Google Cloud services.

Key services used:

Google Cloud Run

  • Serverless container hosting
  • Scalable backend infrastructure
  • Runs the browser automation agent

Vertex AI (Gemini)

  • Multimodal AI reasoning
  • Decision making
  • Task planning

Google Search Grounding

Used to:

  • Verify information
  • Provide real-time context
  • Improve agent accuracy

Challenges We Ran Into

1. Translating Visual Context into Actions

Websites are visually complex.

The challenge was enabling the AI to interpret screenshots and translate them into precise browser actions.

This required carefully structuring prompts and reasoning loops.
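For reference, a multimodal request to Gemini through @google/genai pairs an inline image part with a text instruction in the `contents` payload. The prompt wording and the action vocabulary below are illustrative, not the exact production prompt:

```typescript
// Build the `contents` payload for a Gemini call via @google/genai:
// one inline screenshot part plus the mission instruction.
function buildContents(goal: string, screenshotBase64: string) {
  return [
    {
      role: "user",
      parts: [
        // The screenshot, inlined as base64-encoded PNG data.
        { inlineData: { mimeType: "image/png", data: screenshotBase64 } },
        // The instruction, constraining the reply to one JSON action.
        {
          text:
            `Mission: ${goal}\n` +
            "Look at the screenshot and reply with exactly one JSON action: " +
            '{"type": "click" | "type" | "navigate" | "scroll" | "extract", ...}',
        },
      ],
    },
  ];
}
```

Constraining the reply to a single JSON action per turn is what makes the reasoning loop parseable and keeps one bad generation from derailing a whole mission.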


2. Real-Time Agent Control

Building a live agent that observes → reasons → acts required synchronizing:

  • screenshot capture
  • AI reasoning
  • browser automation

without introducing noticeable latency.


3. Handling Website Variability

Every website has its own structure, layout, and UI patterns.

The system needed to be generic enough to work across any website, not just predefined ones.


4. Cloud Deployment

Running browser automation in a serverless cloud environment required containerizing the application and optimizing it for Google Cloud Run's stateless architecture.


Accomplishments That We're Proud Of

1. True Multimodal AI Agent

Unlike traditional chatbots, UWN is a live AI agent that can see and interact with websites.


2. Fully Cloud-Native Deployment

The system runs completely on Google Cloud infrastructure, demonstrating scalable AI agent architecture.


3. Real-World Use Case

This technology can power:

  • AI personal assistants
  • automated research tools
  • workflow automation
  • accessibility solutions

4. Seamless AI + Browser Automation

We successfully integrated:

  • Gemini multimodal reasoning
  • automated browser navigation
  • real-time task execution

into a unified agent system.


What We Learned

During development we gained insights into:

  • Designing multimodal AI workflows
  • Building agentic systems that plan and execute tasks
  • Deploying AI applications on Google Cloud serverless architecture
  • Managing real-time AI reasoning loops

We also learned that visual understanding dramatically improves AI navigation capabilities compared to text-only approaches.


What's Next for Universal Web Navigator

We plan to expand UWN into a full autonomous web assistant.

Future improvements include:

Real-Time Agent Streaming

Users will be able to watch the AI navigate websites live through a video stream.


Voice Interaction

Integration of real-time speech input and output, allowing users to control the agent conversationally.


Memory and Task History

Using Firebase and Firestore to store:

  • user missions
  • agent decisions
  • learning history

Advanced Agent Planning

Integrating multi-step planning agents capable of completing complex workflows across multiple websites.


Enterprise Automation

UWN could become a platform for:

  • automated data collection
  • digital workflow automation
  • AI-driven business research