Inspiration

The modern web is powerful but still requires manual navigation. Even with search engines and chatbots, users have to click, scroll, open pages, and extract information by hand.

We asked a simple question:

What if an AI agent could see a website like a human and operate it for you?

Inspired by recent advances in multimodal AI and agentic systems, we built Universal Web Navigator (UWN) — an AI agent that can observe web pages visually and autonomously navigate them to complete tasks.

This idea aligns strongly with the vision of agent-based computing powered by Google Gemini and Vertex AI.


What It Does

Universal Web Navigator is a live multimodal AI agent that can:

1. Visually Understand Websites

The agent captures real-time screenshots of webpages and sends them to Gemini 3 Flash for multimodal reasoning.

This allows the AI to:

  • Understand page layout
  • Identify buttons, links, and forms
  • Interpret visual context

Just like a human would.
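As a sketch of this step, the backend can ask the model to describe the visible elements as JSON and validate the reply before anything reaches the browser. The element shape and function names below are illustrative, not the exact schema the system uses:

```typescript
// Hypothetical shape of the structured reply we ask Gemini to return
// after analysing a screenshot (all names here are illustrative).
interface PageElement {
  kind: "button" | "link" | "form" | "text";
  label: string;
  // Approximate bounding box in screenshot pixels.
  box: { x: number; y: number; width: number; height: number };
}

// Parse and validate the model's JSON reply, dropping malformed entries
// so a bad generation never reaches the browser layer.
function parsePageElements(raw: string): PageElement[] {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return [];
  }
  if (!Array.isArray(data)) return [];
  return data.filter(
    (e): e is PageElement =>
      ["button", "link", "form", "text"].includes(e?.kind) &&
      typeof e?.label === "string" &&
      typeof e?.box?.x === "number"
  );
}
```

Validating the model's output at this boundary is what keeps an occasional malformed generation from turning into a wrong click.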


2. Perform Autonomous Web Actions

The AI does not just chat — it acts.

It can:

  • Click buttons
  • Type into forms
  • Navigate links
  • Scroll pages
  • Extract information

All driven by AI reasoning based on visual input.
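A minimal sketch of how those actions can be dispatched: the reasoning step emits one typed action, and a small executor maps it onto the browser layer. The action vocabulary and the `Browser` interface below are assumptions for illustration; in the real system the interface would wrap a headless-browser library rather than the in-memory stub used in the test.

```typescript
// Illustrative action vocabulary the agent's reasoning step can emit.
type AgentAction =
  | { type: "click"; selector: string }
  | { type: "type"; selector: string; text: string }
  | { type: "navigate"; url: string }
  | { type: "scroll"; deltaY: number }
  | { type: "extract"; selector: string };

// Minimal browser interface (hypothetical); a real implementation would
// delegate to a browser-automation library.
interface Browser {
  click(sel: string): void;
  type(sel: string, text: string): void;
  goto(url: string): void;
  scrollBy(dy: number): void;
  textOf(sel: string): string;
}

// Dispatch one AI-chosen action to the browser.
// Returns extracted text for "extract", null otherwise.
function execute(action: AgentAction, browser: Browser): string | null {
  switch (action.type) {
    case "click":    browser.click(action.selector); return null;
    case "type":     browser.type(action.selector, action.text); return null;
    case "navigate": browser.goto(action.url); return null;
    case "scroll":   browser.scrollBy(action.deltaY); return null;
    case "extract":  return browser.textOf(action.selector);
  }
}
```

Keeping the action set closed like this makes the model's choices easy to validate and log.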


3. Execute User Missions

Users give the agent a goal, for example:

  • Find the cheapest flight
  • Extract product prices
  • Navigate to a company careers page
  • Collect contact information

The AI then plans the steps and performs the navigation automatically.
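The planning flow above can be sketched as a bounded observe → reason → act loop. Function names here are illustrative: `reason` stands in for the Gemini call and `act` for the browser layer, and the real loop is asynchronous.

```typescript
// One observe -> reason -> act iteration, repeated until the model
// reports the mission complete or a step budget runs out.
// Returns the number of actions taken before completion.
function runMission(
  goal: string,
  observe: () => string,
  reason: (goal: string, screenshot: string) => { action: string; done: boolean },
  act: (action: string) => void,
  maxSteps = 10
): number {
  for (let step = 0; step < maxSteps; step++) {
    const screenshot = observe();       // capture the current page state
    const decision = reason(goal, screenshot); // ask the model what to do
    if (decision.done) return step;     // mission accomplished
    act(decision.action);               // execute the chosen action
  }
  return maxSteps; // budget exhausted without completion
}
```

The step budget matters in practice: it bounds cost and guarantees the agent cannot loop forever on a page it misreads.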


4. Run Fully on Google Cloud

The entire application is cloud-native and deployed on Google Cloud infrastructure, ensuring scalability and real-time operation.

Live deployment:

Development: https://ais-dev-ogzzs5mayebfedrbqu3mj4-21713248537.asia-southeast1.run.app

Production: https://ais-pre-ogzzs5mayebfedrbqu3mj4-21713248537.asia-southeast1.run.app

The .run.app domains confirm that the services are deployed on Google Cloud Run.


How We Built It

The system combines AI reasoning, browser automation, and cloud infrastructure.

Frontend

  • React
  • Vite
  • TypeScript
  • Interactive mission dashboard

The interface allows users to:

  • Define navigation goals
  • Monitor AI actions
  • View mission progress

Backend

Node.js + Express powers the automation layer.

This layer handles:

  • Screenshot capture
  • Browser control
  • Action execution
  • Error handling
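Browser steps fail intermittently (slow loads, elements detaching mid-action), so each action is worth wrapping in a retry. A minimal synchronous sketch of that error-handling pattern; the real version would be async and add a backoff delay between attempts:

```typescript
// Retry a flaky operation a few times before giving up.
// Rethrows the last error if every attempt fails.
function withRetry<T>(fn: () => T, attempts = 3): T {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return fn();
    } catch (err) {
      lastErr = err; // remember the failure and try again
    }
  }
  throw lastErr;
}
```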

AI Reasoning Engine

The core intelligence is powered by Google Gemini 3 Flash through the @google/genai SDK.

Capabilities used:

  • Multimodal reasoning
  • Screenshot interpretation
  • Action planning
  • Context-aware navigation

Google Cloud Infrastructure

The project is fully integrated with Google Cloud services.

Key services used:

Google Cloud Run

  • Serverless container hosting
  • Scalable backend infrastructure
  • Runs the browser automation agent

Vertex AI (Gemini)

  • Multimodal AI reasoning
  • Decision making
  • Task planning

Google Search Grounding

Used to:

  • Verify information
  • Provide real-time context
  • Improve agent accuracy

Challenges We Ran Into

1. Translating Visual Context into Actions

Websites are visually complex.

The challenge was enabling the AI to interpret screenshots and translate them into precise browser actions.

This required carefully structuring prompts and reasoning loops.
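For reference, a multimodal request to Gemini through @google/genai pairs an inline image part with a text instruction in the `contents` payload. The prompt wording and the action vocabulary below are illustrative, not the exact production prompt:

```typescript
// Build the `contents` payload for a Gemini call via @google/genai:
// one inline screenshot part plus the mission instruction.
function buildContents(goal: string, screenshotBase64: string) {
  return [
    {
      role: "user",
      parts: [
        // The screenshot, inlined as base64-encoded PNG data.
        { inlineData: { mimeType: "image/png", data: screenshotBase64 } },
        // The instruction, constraining the reply to one JSON action.
        {
          text:
            `Mission: ${goal}\n` +
            "Look at the screenshot and reply with exactly one JSON action: " +
            '{"type": "click" | "type" | "navigate" | "scroll" | "extract", ...}',
        },
      ],
    },
  ];
}
```

Constraining the reply to a single JSON action per turn is what makes the reasoning loop parseable and keeps one bad generation from derailing a whole mission.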


2. Real-Time Agent Control

Building a live agent that observes → reasons → acts required synchronizing:

  • screenshot capture
  • AI reasoning
  • browser automation

without introducing noticeable latency.


3. Handling Website Variability

Every website has its own structure, layout, and UI patterns.

The system needed to be generic enough to work across any website, not just predefined ones.


4. Cloud Deployment

Running browser automation in a serverless cloud environment required containerizing the application and optimizing it for Google Cloud Run's stateless architecture.


Accomplishments That We're Proud Of

1. True Multimodal AI Agent

Unlike traditional chatbots, UWN is a live AI agent that can see and interact with websites.


2. Fully Cloud-Native Deployment

The system runs completely on Google Cloud infrastructure, demonstrating scalable AI agent architecture.


3. Real-World Use Case

This technology can power:

  • AI personal assistants
  • automated research tools
  • workflow automation
  • accessibility solutions

4. Seamless AI + Browser Automation

We successfully integrated:

  • Gemini multimodal reasoning
  • automated browser navigation
  • real-time task execution

into a unified agent system.


What We Learned

During development we gained insights into:

  • Designing multimodal AI workflows
  • Building agentic systems that plan and execute tasks
  • Deploying AI applications on Google Cloud serverless architecture
  • Managing real-time AI reasoning loops

We also learned that visual understanding dramatically improves AI navigation capabilities compared to text-only approaches.


What's Next for Universal Web Navigator

We plan to expand UWN into a full autonomous web assistant.

Future improvements include:

Real-Time Agent Streaming

Users will be able to watch the AI navigate websites live through a video stream.


Voice Interaction

Integration of real-time speech input and output, allowing users to control the agent conversationally.


Memory and Task History

Using Firebase and Firestore to store:

  • user missions
  • agent decisions
  • learning history

Advanced Agent Planning

Integrating multi-step planning agents capable of completing complex workflows across multiple websites.


Enterprise Automation

UWN could become a platform for:

  • automated data collection
  • digital workflow automation
  • AI-driven business research