Gemini Live Agent Challenge — UI Navigator category
Watches your active browser tab via screenshots, uses Gemini 2.0 Flash multimodal vision to infer your intent, and surfaces relevant information before you ask — as floating cards in the bottom-right corner of every page.
Demo scenario: User opens the USCIS biometrics rescheduling page → the assistant surfaces community insights about wait times. User selects "San Francisco" → the assistant surfaces nearby points of interest and practical logistics.
The extension runs entirely passively — no button to click, no query to type.
- On every tab switch, page load, or significant DOM mutation, `background.js` captures a screenshot of the active tab
- It bundles the screenshot with the page URL, title, optional geolocation, and up to 5 recent browsing history entries, then POSTs to `/analyze`
- Vision (`vision.py`): Gemini 2.0 Flash sees the screenshot and returns `{ intent, page_state, confidence }`
- If `confidence >= 0.3`, the ADK agent (`agent.py`) uses `google_search` to find relevant information and returns up to 3 cards
- Cards are forwarded to `content.js` and rendered as dismissible tiles bottom-right. Dismissed cards don't reappear for that URL
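The backend half of the flow above can be sketched as a small gate around two calls. This is only an illustration of the documented behaviour, not the code in `main.py`: `analyze`, `infer_intent`, and `find_cards` are hypothetical names standing in for the real `vision.py` and `agent.py` entry points, whose signatures are not shown in this README.

```python
from typing import Callable

CONFIDENCE_THRESHOLD = 0.3  # from this README: the agent runs only when confidence >= 0.3
MAX_CARDS = 3               # from this README: at most 3 cards are returned


def analyze(
    payload: dict,
    infer_intent: Callable[[dict], dict],
    find_cards: Callable[[dict, dict], list],
) -> dict:
    """Run vision first, then call the agent only above the threshold.

    infer_intent and find_cards are hypothetical stand-ins for the
    vision and agent steps; they are passed in so the gate can be
    exercised without calling Gemini.
    """
    vision = infer_intent(payload)  # expected keys: intent, page_state, confidence
    cards = []
    if vision["confidence"] >= CONFIDENCE_THRESHOLD:
        # Truncate defensively in case the agent returns more than 3 cards.
        cards = find_cards(payload, vision)[:MAX_CARDS]
    return {**vision, "cards": cards}
```

Passing the two steps in as callables keeps the threshold logic testable on its own; the low-confidence path returns an empty `cards` list without ever invoking the agent.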
Triggers that fire an analysis:
- Tab activated or switched
- Page load completes (`status === "complete"`)
- DOM mutations settle (debounced 1.5s), which catches form interactions and SPA navigation
- Alarm poll (every ~1 minute)
Skipped automatically: chrome:// pages, chrome-extension:// pages, and pages where the screenshot hash hasn't changed since the last send.
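The skip rules can be illustrated with a short sketch. The actual check lives in the extension's JavaScript; the Python below only mirrors the logic for readability. `ScreenshotGate` is a hypothetical name, and both the use of SHA-256 and the single shared "last hash" are assumptions, since this README does not say which hash is used or whether it is tracked per tab.

```python
import hashlib
from typing import Optional

# URL schemes the README lists as skipped automatically.
SKIPPED_PREFIXES = ("chrome://", "chrome-extension://")


class ScreenshotGate:
    """Decides whether a capture is worth sending to the backend."""

    def __init__(self) -> None:
        # Hash of the last screenshot that was actually sent.
        self._last_hash: Optional[str] = None

    def should_send(self, url: str, screenshot: bytes) -> bool:
        if url.startswith(SKIPPED_PREFIXES):
            return False
        # Assumption: SHA-256 over the raw image bytes.
        digest = hashlib.sha256(screenshot).hexdigest()
        if digest == self._last_hash:
            return False  # nothing changed on screen since the last send
        self._last_hash = digest
        return True
```

Skipped pages return early without touching the stored hash, so a detour through `chrome://extensions/` does not affect the comparison for the next real page.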
git clone https://github.com/likhitjuttada/intentions.ai.git
cd intentions.ai
Open extension/background.js and set line 4:
const BACKEND_URL = "https://YOUR_CLOUD_RUN_URL";
Then:
- Open `chrome://extensions/`
- Enable Developer mode (top-right toggle)
- Click Load unpacked → select the `extension/` folder
- Any time you change `BACKEND_URL`, click the reload icon on the extension card for the change to take effect
Navigate to any public webpage — tiles appear bottom-right within ~5 seconds if Gemini infers your intent with confidence >= 0.3. On blank tabs, chrome:// pages, or ambiguous pages, nothing is shown.
cd backend
python -m venv .venv
# Windows:
.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate
pip install -r requirements.txt
echo "GOOGLE_API_KEY=your-key-here" > .env
uvicorn main:app --reload --port 8080
Replace YOUR_CLOUD_RUN_URL with your deployed service URL in all commands below.
curl https://YOUR_CLOUD_RUN_URL/health
Expected:
{"status":"ok"}
This sends a 1×1 PNG with a realistic URL. Vision will return low confidence on a blank image, but it confirms the full pipeline (API key → vision → confidence check → response) is working:
curl -s -X POST https://YOUR_CLOUD_RUN_URL/analyze \
-H "Content-Type: application/json" \
-d '{
"screenshot": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk+M9QDwADhgGAWjR9awAAAABJRU5ErkJggg==",
"url": "https://my.uscis.gov/appointments/biometrics",
"title": "USCIS - Reschedule Biometrics Appointment"
}'
Expected shape (confidence may vary):
{
"intent": "...",
"page_state": "...",
"confidence": 0.3,
"cards": []
}
curl -s -X POST https://YOUR_CLOUD_RUN_URL/analyze \
-H "Content-Type: application/json" \
-d '{
"screenshot": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk+M9QDwADhgGAWjR9awAAAABJRU5ErkJggg==",
"url": "https://my.uscis.gov/appointments/biometrics",
"title": "USCIS - Reschedule Biometrics Appointment",
"geolocation": { "lat": 37.7749, "lng": -122.4194 },
"recent_history": [
{ "url": "https://reddit.com/r/USCIS", "title": "USCIS wait times", "visitCount": 3 }
]
}'
Send a blank URL with no title. Confidence should come back below 0.3, and cards should be [] with the agent never called:
curl -s -X POST https://YOUR_CLOUD_RUN_URL/analyze \
-H "Content-Type: application/json" \
-d '{
"screenshot": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk+M9QDwADhgGAWjR9awAAAABJRU5ErkJggg==",
"url": "about:blank",
"title": ""
}'
Expected: "cards": []
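The same checks can be scripted instead of pasted into a shell. The following is a minimal sketch using only the standard library; `build_payload`, `post_analyze`, and `check_low_confidence` are hypothetical helper names, and the request fields are taken from the curl examples above.

```python
import json
import urllib.request

# 1x1 PNG from the curl examples above.
BLANK_PNG = (
    "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk"
    "+M9QDwADhgGAWjR9awAAAABJRU5ErkJggg=="
)


def build_payload(url, title, geolocation=None, recent_history=None):
    """Build an /analyze request body; the optional fields are left out when unset."""
    payload = {"screenshot": BLANK_PNG, "url": url, "title": title}
    if geolocation is not None:
        payload["geolocation"] = geolocation
    if recent_history is not None:
        payload["recent_history"] = recent_history
    return payload


def post_analyze(base_url, payload, timeout=30):
    """POST the payload to <base_url>/analyze and return the decoded JSON body."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def check_low_confidence(base_url):
    """Mirror the about:blank test: a blank page should yield no cards."""
    result = post_analyze(base_url, build_payload("about:blank", ""))
    assert result["cards"] == [], result
    return result
```

Calling `check_low_confidence("https://YOUR_CLOUD_RUN_URL")` after a deploy exercises the low-confidence path end to end. Confidence values come from the model, so treat a failed assertion as a prompt to inspect the response rather than as proof of a bug.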
- Load the extension with `BACKEND_URL` pointing to your Cloud Run URL
- Open DevTools → Application → Service Workers (or click the "service worker" link in `chrome://extensions/`)
- Navigate to `https://my.uscis.gov` or any complex public page
- Within ~5 seconds, check the Console tab in the service worker DevTools; you should see a successful POST to `/analyze`
- Tiles appear bottom-right if the agent returns cards
- Google Cloud SDK installed and authenticated
- A GCP project with billing enabled
- Gemini API key stored in Secret Manager as `google-api-key`
gcloud config set project YOUR_PROJECT_ID
# Enable required APIs
gcloud services enable run.googleapis.com cloudbuild.googleapis.com \
containerregistry.googleapis.com secretmanager.googleapis.com
# Store API key
echo -n "YOUR_GEMINI_API_KEY" | gcloud secrets create google-api-key --data-file=-
# Grant the default compute service account access to the secret
PROJECT_NUMBER=$(gcloud projects describe $(gcloud config get-value project) --format='value(projectNumber)')
gcloud secrets add-iam-policy-binding google-api-key \
--member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
--role="roles/secretmanager.secretAccessor"
gcloud builds submit --config cloudbuild.yaml \
--substitutions=COMMIT_SHA=$(git rev-parse --short HEAD)
gcloud run services describe ui-navigator-backend \
--region us-central1 \
--format 'value(status.url)'
Set this URL as BACKEND_URL in extension/background.js and reload the extension.
`/health` response:
{"status": "ok"}
`/analyze` request:
{
"screenshot": "<base64 PNG>",
"url": "https://my.uscis.gov/...",
"title": "USCIS - Reschedule Appointment",
"geolocation": { "lat": 37.7749, "lng": -122.4194 },
"recent_history": [
{ "url": "https://reddit.com/r/USCIS", "title": "USCIS tips", "visitCount": 2 }
]
}
geolocation and recent_history are optional; omitting them degrades card relevance but doesn't break anything.
Response:
{
"intent": "rescheduling a USCIS biometrics appointment",
"page_state": "User is on the appointment selection form, no city selected yet",
"confidence": 0.92,
"cards": [
{
"id": "uscis-reschedule-tips",
"title": "USCIS Biometrics Rescheduling Tips",
"summary": "Reddit users report slots open Tues mornings. Bring original appointment notice + ID.",
"icon": "📋",
"link": "https://reddit.com/r/USCIS"
}
]
}
If confidence < 0.3, cards is [] and the ADK agent is never called.
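A client or test can check a response against the documented shape. The sketch below uses plain dataclasses; `Card`, `AnalyzeResponse`, and `parse_response` are hypothetical names, and the field list is taken from the example response above rather than from the backend source, so the real service may return additional fields.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Card:
    id: str
    title: str
    summary: str
    icon: str
    link: str


@dataclass
class AnalyzeResponse:
    intent: str
    page_state: str
    confidence: float
    cards: List[Card] = field(default_factory=list)


def parse_response(data: dict) -> AnalyzeResponse:
    """Parse an /analyze response and enforce the rules stated in this README."""
    # Card(**c) raises TypeError if a card has missing or unexpected keys.
    cards = [Card(**c) for c in data.get("cards", [])]
    response = AnalyzeResponse(
        intent=data["intent"],
        page_state=data["page_state"],
        confidence=float(data["confidence"]),
        cards=cards,
    )
    if response.confidence < 0.3 and response.cards:
        raise ValueError("cards must be empty when confidence < 0.3")
    if len(response.cards) > 3:
        raise ValueError("at most 3 cards are expected")
    return response
```

The two `ValueError` checks encode only what the README states: no cards below the 0.3 threshold, and no more than 3 cards per response.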
| Layer | Technology |
|---|---|
| Chrome Extension | Manifest V3, vanilla JS |
| Backend | Python FastAPI on Cloud Run |
| Vision | Gemini 2.0 Flash (multimodal — screenshot → intent) |
| Agent | Google ADK (google-adk) with google_search tool |
| CI/CD | Cloud Build (cloudbuild.yaml) |
| Secrets | Google Secret Manager |
intentions.ai/
├── extension/
│ ├── manifest.json # MV3 manifest — permissions: tabs, scripting, history, alarms, geolocation
│ ├── background.js # service worker — screenshot capture, history, backend calls
│ ├── content.js # overlay tile system — render, dismiss, DOM mutation detection
│ └── overlay.css # tile styles
├── backend/
│ ├── main.py # FastAPI /health + /analyze endpoints
│ ├── vision.py # Gemini 2.0 Flash: screenshot → { intent, page_state, confidence }
│ ├── agent.py # ADK root agent: intent → google_search → cards[]
│ └── requirements.txt
├── Dockerfile
├── cloudbuild.yaml
└── README.md
- Gemini multimodal: `vision.py` sends the raw PNG screenshot to `gemini-2.0-flash`; intent is inferred visually, not from DOM scraping
- Google ADK: `agent.py` uses `google-adk` `Agent` + `Runner` with the built-in `google_search` tool
- Google Cloud Run: `Dockerfile` + `cloudbuild.yaml` deploy to Cloud Run in `us-central1`
- Secret Manager: API key injected at runtime via `--update-secrets`, never baked into the image
