Inspiration
It started with a frustrating afternoon on a government website.
I was trying to reschedule a biometrics appointment — a process that sounds simple. Meanwhile, I had four or five Reddit tabs open in the background, manually piecing together what other people had figured out: which office was closest to my location, what to bring, and whether rescheduling would require me to go through additional steps.
That disconnect struck me. I had all this context — where I was in the process, what I'd already researched, what I was trying to accomplish — and none of it was visible to my browser. Every tool I used was reactive: I had to ask it something before it could help me.
What if the browser already knew what I was trying to do, and just... told me what I needed to know?
That's intentions.ai — a proactive browser assistant that watches what you're doing and surfaces relevant information before you ask.
What I Built
intentions.ai is a Chrome extension paired with a Python backend deployed on Google Cloud Run.
Every few seconds (or whenever the page changes), the extension captures a screenshot of your active tab and sends it to the backend. There, Gemini 2.0 Flash analyzes the image to infer your intent — not by parsing DOM text, but by seeing the page the same way you do. It notices what form you're filling out, what's selected, what error is showing.
That intent gets handed to a Google ADK agent, which decides what to search for and returns up to three contextual info cards. These appear as a floating overlay in the bottom-right corner of your browser — unobtrusive, dismissable, and updated as your context changes.
To make the suggestions even sharper, the extension also pulls your recent browsing history (configurable lookback window, defaults to 1 hour) and sends it alongside the screenshot. If you spent the last hour reading Reddit threads about USCIS wait times before landing on the appointment page, the agent knows — and tailors its searches accordingly.
Stack:
- Chrome Extension (Manifest V3, vanilla JS) — screenshot capture, history access, overlay tiles
- FastAPI backend on Google Cloud Run
- Gemini 2.0 Flash — multimodal vision: screenshot → intent + page state
- Google ADK (`google-adk`) — root agent with `google_search` tool → contextual cards
- Cloud Build + `cloudbuild.yaml` — automated CI/CD pipeline
- Secret Manager — secure API key injection at deploy time
How I Built It
I worked from the outside in:
1. Extension first. I wired up chrome.tabs.captureVisibleTab() to send screenshots to a local backend endpoint. Before any AI was involved, I validated that the extension↔backend contract worked — the right payload shape, the right CORS headers, the overlay rendering correctly.
2. Vision layer. Once the plumbing was solid, I integrated Gemini multimodal. The key insight here was to treat the screenshot as the ground truth — not the DOM, not the URL, but what a human would actually see. The prompt asks Gemini to describe the user's goal and the current page state, and return structured JSON with a confidence score.
3. Agent layer. With intent in hand, the ADK agent takes over. I gave it a single pre-wired tool — google_search — and let it decide what to search for based on the intent and page state. No hard-coded query templates. The agent reasons about what would be useful and acts accordingly.
4. History context. The last piece was feeding in recent browsing history via chrome.history. This gave the agent a research trail to work with — turning a one-shot intent inference into something that understands the broader arc of what you've been investigating.
5. Deploy. Everything runs on Cloud Run via a single gcloud builds submit command. The API key lives in Secret Manager and gets injected at runtime via --update-secrets — no secrets baked into the image or the build environment.
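Step 2's structured-output contract can be illustrated like this — the prompt wording and the field names (`intent`, `page_state`, `confidence`) are my illustration of the idea, not the project's exact schema:

```python
import json

# Hypothetical vision prompt asking Gemini for structured intent JSON.
INTENT_PROMPT = (
    "Look at this screenshot. Describe what the user is trying to accomplish "
    "and the current state of the page. Respond with JSON only, shaped as: "
    '{"intent": "...", "page_state": "...", "confidence": 0.0}'
)

# The kind of response the prompt asks for, using the biometrics example:
sample_response = json.loads(
    '{"intent": "reschedule a USCIS biometrics appointment",'
    ' "page_state": "appointment form open, office field selected",'
    ' "confidence": 0.85}'
)
```

Asking for a confidence score alongside the intent gives the backend a cheap way to skip low-confidence frames instead of running searches on them.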
Challenges
The multimodal API changed under me. The google.generativeai package I started with was deprecated by the time I integrated ADK — ADK itself depends on google-genai, the newer SDK. Switching mid-build meant adapting the generate_content call to use genai.Client and the new types.Content/Part/Blob shapes.
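The post-migration call shape looks roughly like this — a sketch assuming the newer `google-genai` SDK, with an illustrative prompt and model name:

```python
def infer_intent(png_bytes: bytes, api_key: str) -> str:
    """Send a screenshot to Gemini using the newer google-genai SDK.

    Sketch only: requires `pip install google-genai`, and the prompt
    text here is illustrative.
    """
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=api_key)
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[
            types.Content(
                role="user",
                parts=[
                    # Text part: the instruction to the model.
                    types.Part(text="Describe the user's goal and page state as JSON."),
                    # Inline image part: the raw screenshot bytes.
                    types.Part(inline_data=types.Blob(mime_type="image/png", data=png_bytes)),
                ],
            )
        ],
    )
    return response.text
```

The main differences from the deprecated `google.generativeai` package: a `Client` object replaces module-level configuration, and the multimodal payload is built from explicit `types.Content`/`Part`/`Blob` objects rather than loose dicts.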
Service worker permissions are strict. Chrome MV3 service workers silently fail if you use an API without declaring its permission. chrome.alarms crashing at startup with a cryptic "Status code 15" error took a minute to trace back to a missing "alarms" entry in the manifest — not an error message you'd immediately associate with a permissions issue.
History access requires careful scoping. chrome.history.search() returns everything, including chrome:// internal pages and the current tab itself. Filtering, deduplicating, and capping the result before it hits the backend payload was important both for cleanliness and for keeping the agent prompt focused.
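The cleanup lives in the extension's JavaScript, but the logic is simple enough to sketch in Python — the cap of 25 entries is an assumed value:

```python
def clean_history(entries: list[str], current_url: str, cap: int = 25) -> list[str]:
    """Filter, dedupe, and cap a raw browsing-history list before sending it."""
    seen = set()
    cleaned = []
    for url in entries:
        # Drop internal pages and the tab the user is currently on.
        if url.startswith("chrome://") or url == current_url:
            continue
        # Drop duplicates while preserving first-seen order.
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
        if len(cleaned) == cap:
            break
    return cleaned
```

Capping the list matters as much as filtering it: every extra URL is more tokens in the agent prompt and more ways for the search to drift off topic.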
Debouncing is harder than it looks. A page load, a DOM mutation, a tab switch, and a 5-second poll can all fire within the same second. Getting the debounce logic right — minimum gap between sends, hash-based visual change detection, mutation observer settle time — was the difference between a demo that feels snappy and one that hammers the backend and returns stale results.
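The two core rules — a minimum gap between sends and hash-based change detection — can be sketched like this (the extension does this in JavaScript; Python here for illustration, and the 3-second gap is an assumed value):

```python
import hashlib
import time

class Debouncer:
    """Decide whether a captured frame is worth sending to the backend."""

    def __init__(self, min_gap: float = 3.0, clock=time.monotonic):
        self.min_gap = min_gap
        self.clock = clock          # injectable for testing
        self.last_sent = float("-inf")
        self.last_hash = None

    def should_send(self, screenshot: bytes) -> bool:
        now = self.clock()
        if now - self.last_sent < self.min_gap:
            return False            # too soon after the last send
        digest = hashlib.sha256(screenshot).hexdigest()
        if digest == self.last_hash:
            return False            # page looks unchanged; skip
        self.last_sent = now
        self.last_hash = digest
        return True
```

Checking the gap before the hash means a burst of events within the same second collapses to at most one send, and a visually identical repaint after the gap still gets skipped.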
Secret injection in Cloud Build is subtle. Passing a secret via secretEnv into a gcloud run deploy --set-env-vars arg doesn't work — the value isn't shell-expanded inside a YAML args array, so Cloud Run ends up with the literal string $GOOGLE_API_KEY. The correct approach is --update-secrets, which tells Cloud Run to pull the secret from Secret Manager directly at container startup. A small config difference with a completely silent failure mode.
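The working step looks roughly like this — service, image, region, and secret names are illustrative, not the project's actual `cloudbuild.yaml`:

```yaml
# --update-secrets tells Cloud Run to read the value from Secret Manager
# at container startup, so no shell expansion is needed in the args array.
- name: gcr.io/google.com/cloudsdktool/cloud-sdk
  entrypoint: gcloud
  args:
    - run
    - deploy
    - intentions-backend
    - --image=gcr.io/$PROJECT_ID/intentions-backend
    - --region=us-central1
    - --update-secrets=GOOGLE_API_KEY=google-api-key:latest
```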
ADK agent output isn't always clean JSON. The agent is instructed to return a bare JSON array, and usually does — but occasionally wraps it in markdown fences or prepends a sentence of prose. Relying on json.loads() directly on the raw output meant any deviation caused a 502. The fix was to use a regex to extract the first [...] block from anywhere in the response, then parse that — making the pipeline tolerant of imperfect model output without masking real errors.
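A sketch of that tolerant parse — pull the bracketed block out of the raw output, whether or not it arrives wrapped in fences or prose, then hand it to `json.loads()`:

```python
import json
import re

def extract_cards(raw: str) -> list:
    """Extract and parse the first [...] block from agent output."""
    match = re.search(r"\[.*\]", raw, re.DOTALL)
    if not match:
        # No array anywhere in the output: surface a real error
        # instead of silently returning nothing.
        raise ValueError("no JSON array found in agent output")
    return json.loads(match.group(0))
```

A malformed array inside the brackets still raises `json.JSONDecodeError`, so genuine model failures surface instead of being masked.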
What I Learned
The most interesting thing I learned is how much context is already available in the browser that never gets used. The URL, the page title, the screenshot, the browsing history, the geolocation — all of it is sitting there, and most tools ignore it entirely.
Gemini's multimodal capability was genuinely surprising in how well it handles ambiguous UI. A government form with unlabeled dropdowns and fine print — the kind of thing that would be hard to parse programmatically — gets read correctly from a screenshot, the same way a human would read it.
I also came away with a much better intuition for where to let the agent reason vs. where to hard-code behavior. Early on I was tempted to pre-wire specific search queries for specific page types. The right call was to give the agent good context and a good tool, and get out of the way. The searches it generates are more nuanced than anything I would have written by hand.
What's next for intentions.ai
- Add security features that block screenshots of sensitive web pages from being sent to the API
- Right now the ever-changing tiles on the right can be slightly annoying — the overlay needs a better UX
Built With
- fastapi
- google-adk
- javascript
- python