Inspiration

Every year, millions of first-generation immigrants and low-literacy adults in the United States face a quiet, invisible barrier: government portals and job applications written in bureaucratic English, with no guidance, no context, and no patience for confusion. I thought about what it feels like to stare at a field labeled "Primary Domicile" or "EIN/Employer Tax ID" when English isn't your first language — when the stakes are high (a job, a benefit, a lifeline) and the form offers nothing but a blinking cursor. A single confusing field is enough to make someone close the tab and give up on something they genuinely need. The inspiration for GuideAlong was simple: what if someone could sit beside that person — see exactly what they see, speak their language, and explain each step in plain, patient words? Not a chatbot they have to type into. Not a help article buried three clicks away. A real-time guide that watches alongside them and speaks up exactly when they need it. When I discovered the Gemini Live API's ability to combine real-time audio conversation with multimodal vision, I realized the technology had finally caught up to that idea.

What it does

GuideAlong is a real-time voice and vision accessibility agent that helps low-literacy adults and first-generation immigrants navigate online forms without typing a single query. Here's what happens when a user activates GuideAlong:

It watches their screen. Using Gemini's multimodal vision, GuideAlong continuously interprets the user's screen (reading field labels, detecting error messages, and understanding the page context visually) without any DOM access or browser extension. It works on any website, just as a human helper would.

It listens and speaks. Through the Gemini Live API, GuideAlong maintains a persistent bidirectional audio stream. Users speak naturally, whether asking questions, giving commands, or simply describing their confusion, and GuideAlong responds in spoken plain language, in real time.

It detects their language automatically. From the user's first spoken words, GuideAlong identifies their language and switches all guidance to match. No settings, no dropdowns. The demo supports English and Spanish.

It speaks up before being asked. When Gemini's vision detects confusion signals (a user hovering on the same field for more than 8 seconds, a visible error message, or prolonged inactivity), GuideAlong proactively offers an explanation without waiting to be prompted.

It can fill fields on command. Users can say "write my name as Maria García" and GuideAlong executes the action, turning voice intent into form completion.

The result is a fully working end-to-end demo: a Spanish-speaking user named Maria completes a realistic six-field job application form in her own language, guided in real time by an agent that sees exactly what she sees.

How I built it

GuideAlong is built on a four-agent architecture orchestrated by Google's ADK (Agent Development Kit), with Gemini at the core of every intelligent decision.

The Agent Layer (ADK)

Four specialized agents handle distinct responsibilities:

ScreenReaderAgent sends JPEG screenshots to Gemini Vision and extracts structured context: the active field, its plain-English label, any visible errors, and the page's primary language.

GuidanceAgent takes that field context and generates a 2–3 sentence plain-language explanation, warm in tone and specific to what's on screen.

LanguageAgent detects the user's language from their first utterance and routes all GuidanceAgent output through Gemini for translation.

ActionAgent listens for explicit voice commands and executes form-fill actions via JavaScript injection in the browser.
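As a rough illustration of the structured context ScreenReaderAgent extracts, the sketch below parses a JSON reply into a small dataclass. The JSON shape, the key names (`active_field`, `plain_label`, `errors`, `page_language`), and the `parse_field_context` helper are all assumptions made for this example; the project's actual schema is not shown in this writeup.

```python
import json
from dataclasses import dataclass, field


@dataclass
class FieldContext:
    """Hypothetical shape of the context extracted from one screenshot."""
    active_field: str                 # label exactly as it appears on screen
    plain_label: str                  # plain-English version of that label
    errors: list[str] = field(default_factory=list)  # visible error messages
    page_language: str = "en"         # primary language of the page


def parse_field_context(raw_reply: str) -> FieldContext:
    """Parse a JSON reply (assumed format) from the vision model."""
    data = json.loads(raw_reply)
    return FieldContext(
        active_field=data["active_field"],
        plain_label=data["plain_label"],
        errors=list(data.get("errors", [])),
        page_language=data.get("page_language", "en"),
    )


# Example reply, hand-written for illustration rather than captured from Gemini.
reply = (
    '{"active_field": "EIN/Employer Tax ID", '
    '"plain_label": "Tax number of your employer", '
    '"errors": [], "page_language": "en"}'
)
context = parse_field_context(reply)
```

A typed object like this gives GuidanceAgent a fixed set of fields to work from, and a missing required key fails loudly at parse time instead of producing vague guidance later.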

The Multimodal Pipeline

Screen frames are captured at ~1 fps via the browser's MediaDevices API and passed through a pixel-diff service that forwards a frame only when the screen has changed by more than 5%, which keeps Gemini vision calls lean and latency low. The Gemini Live API maintains a persistent WebSocket for bidirectional audio, streaming the user's voice in and GuideAlong's spoken guidance out.
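The pixel-diff gate described above could look roughly like the following sketch. It assumes frames arrive as equal-sized grayscale byte buffers, that "changed by more than 5%" means the fraction of pixels that moved beyond a small tolerance, and that each frame is compared against the last frame actually forwarded. The names `changed_fraction` and `FrameGate` and the tolerance value are hypothetical, not taken from the project.

```python
from typing import Optional


def changed_fraction(prev: bytes, curr: bytes, pixel_tolerance: int = 16) -> float:
    """Fraction of pixels whose grayscale value moved by more than pixel_tolerance."""
    if len(prev) != len(curr):
        raise ValueError("frames must have the same dimensions")
    if not curr:
        return 0.0
    changed = sum(1 for a, b in zip(prev, curr) if abs(a - b) > pixel_tolerance)
    return changed / len(curr)


class FrameGate:
    """Forwards a frame only when it differs enough from the last forwarded one."""

    def __init__(self, threshold: float = 0.05) -> None:
        self.threshold = threshold            # 5% figure from the writeup
        self._last_sent: Optional[bytes] = None

    def should_forward(self, frame: bytes) -> bool:
        # Always forward the first frame; afterwards require a large enough change.
        if self._last_sent is None or changed_fraction(self._last_sent, frame) > self.threshold:
            self._last_sent = frame
            return True
        return False
```

Comparing against the last forwarded frame, rather than the immediately preceding one, means a slow series of small changes still triggers a vision call once the accumulated difference crosses the threshold.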

The Stack

Google Gemini Live API (gemini-2.5-flash-native-audio-preview-12-2025) for the bidirectional voice WebSocket
Gemini Vision (gemini-2.5-flash) for screen analysis, guidance generation, and translation
Google ADK for agent orchestration
Python FastAPI backend on Google Cloud Run
React + TypeScript + Tailwind CSS frontend
Terraform for infrastructure as code
Google Secret Manager for credentials

I built a mock six-field job application portal specifically for the demo so that the judging video would run reliably: real government portals change layout frequently and use bot-detection measures that would make a live demo unpredictable.

Challenges I ran into

Building a real-time audio + vision system introduced a class of problems I hadn't fully anticipated, most of them at the intersection of latency and audio state management.

The audio bleed problem. The most stubborn challenge was audio feedback between GuideAlong's TTS output and the Gemini Live API's listening input. When GuideAlong spoke guidance aloud, Gemini would pick up its own voice through the microphone stream, interpret it as a new user utterance, and trigger a new guidance response, which would then be picked up again. The agent was effectively talking to itself in a loop. Solving this required implementing a strict "speaking" state lock: while GuideAlong's TTS is active, the Live API input stream is gated closed. I also added a short cooldown buffer after each TTS completion before reopening the mic, to absorb any trailing audio artifacts.

TTS firing frequency on a single field. Early versions of the proactive interruption logic were too eager. The inactivity timer would fire and trigger guidance, and the user would hear it and pause to process. The timer interpreted that pause as more inactivity and triggered another guidance message before they'd finished listening to the first. This created an overwhelming, spammy experience. I solved it by treating an active TTS playback as equivalent to user activity: the inactivity clock resets the moment GuideAlong starts speaking and doesn't restart until both the audio has finished and a grace period has elapsed.

Gemini Live API WebSocket stability. Under sustained use, the WebSocket connection would occasionally drop without a clean error signal. I implemented reconnection logic with exponential backoff and a session state cache so that when reconnection occurred, GuideAlong could resume with the correct language and page context rather than starting cold.

Screen interpretation confidence. Gemini vision occasionally struggled with low-contrast form fields or non-standard UI layouts. Rather than letting it guess and hallucinate a field label, I explicitly instructed the model to surface its uncertainty in the response, which GuideAlong would then convert into a graceful fallback: "I'm having trouble seeing that field clearly. Can you tell me what it's asking for?"
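The speaking-state lock, the post-speech mic cooldown, and the inactivity-clock reset described in this section can be sketched together as one small state tracker. This is a minimal illustration under stated assumptions, not the project's code: the `AudioState` class, its method names, and the default durations (0.5 s cooldown, 3 s grace period, 8 s inactivity threshold) are placeholders chosen for the example.

```python
from typing import Optional


class AudioState:
    """Tracks the speaking lock, mic cooldown, and inactivity clock.

    All times are seconds supplied by the caller (e.g. time.monotonic()),
    which keeps the logic deterministic and easy to test.
    """

    def __init__(self, cooldown_s: float = 0.5, grace_s: float = 3.0,
                 inactivity_s: float = 8.0) -> None:
        self.cooldown_s = cooldown_s      # mic stays closed this long after speech ends
        self.grace_s = grace_s            # quiet time before the inactivity clock restarts
        self.inactivity_s = inactivity_s  # idle time that triggers proactive guidance
        self.speaking = False
        self._speech_ended_at: Optional[float] = None
        self._last_activity_at = 0.0

    def start_speaking(self, now: float) -> None:
        self.speaking = True
        self._last_activity_at = now      # speaking counts as activity

    def finish_speaking(self, now: float) -> None:
        self.speaking = False
        self._speech_ended_at = now
        # The inactivity clock restarts only once the grace period has elapsed.
        self._last_activity_at = now + self.grace_s

    def user_activity(self, now: float) -> None:
        # Never move the clock backwards into an ongoing grace period.
        self._last_activity_at = max(self._last_activity_at, now)

    def mic_open(self, now: float) -> bool:
        """True when microphone audio may be forwarded to the Live API."""
        if self.speaking:
            return False
        if self._speech_ended_at is None:
            return True
        return now - self._speech_ended_at >= self.cooldown_s

    def should_offer_help(self, now: float) -> bool:
        """True when the user has been idle long enough to warrant guidance."""
        if self.speaking:
            return False
        return now - self._last_activity_at >= self.inactivity_s
```

In use, the audio loop would check `mic_open(now)` before forwarding each microphone chunk, and the proactive-guidance loop would poll `should_offer_help(now)`. Keeping both decisions in one object makes the listening and speaking boundaries explicit instead of spreading them across callbacks.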

Accomplishments that I'm proud of

Technically, the accomplishment I'm most proud of is the audio state management system. The bleed problem was genuinely hard: it required reasoning carefully about the timing boundaries between listening, processing, and speaking states in a persistent real-time stream. Getting it right made the difference between a broken demo and a fluid, natural experience.

I'm also proud of the "no DOM access" architectural commitment. It would have been much easier to scrape form field metadata directly from the DOM. Instead, Gemini reads the screen visually, the same way a human helper would, which makes GuideAlong generalizable to any website, not just ones with clean HTML structure. That constraint made the engineering harder and the product more powerful.

Finally, completing this as a solo build within a part-time sprint, with a fully deployed Cloud Run backend, IaC scripts, architecture documentation, and a polished demo, is something I'm genuinely proud of.

What I learned

Real-time audio systems have their own physics. Building with the Gemini Live API taught me that latency, state, and timing aren't just performance considerations; they're user experience decisions. The audio bleed problem wasn't a bug; it was a consequence of treating a continuous stream as if it were a request/response system. Understanding the difference changed how I thought about the entire architecture.

Proactive UX is harder than reactive UX. It's relatively straightforward to answer a question. It's much harder to know when to speak up without being asked, and to do it without being annoying, intrusive, or wrong. The inactivity timer tuning alone went through five iterations before it felt natural.

The constraint is the feature. The UI Navigator category's requirement that Gemini interpret the screen visually, without DOM access, initially felt like a handicap. It turned out to be the reason GuideAlong works on any website, not just ones I've pre-integrated with. The hardest constraint produced the most generalized solution.

What's next for GuideAlong

The current demo covers English and Spanish on a mock job application portal. The vision is much larger.

Expanding language support. Gemini's multilingual capabilities mean GuideAlong could realistically support dozens of languages with minimal additional engineering. The next priority languages, based on US immigrant population data, are Mandarin, Vietnamese, Arabic, and Haitian Creole.

Real portal support. The plan is to work with community organizations and legal aid nonprofits to map and test GuideAlong on the actual portals their clients struggle with most: USCIS benefit applications, state unemployment systems, Medicaid enrollment forms, and public housing applications.

Partnerships with digital equity organizations. GuideAlong is most powerful when deployed through organizations that already have trust with underserved communities: libraries, immigrant services nonprofits, and workforce development programs. I'd like to build a version of GuideAlong that these organizations can deploy for their clients with minimal technical setup.

Accessibility beyond language. The same architecture that helps users facing a language barrier could serve people with cognitive disabilities, low vision, or motor impairments. The voice-first, vision-grounded design generalizes naturally to those groups.

The long-term goal is a GuideAlong that any person, anywhere, can activate on any screen and immediately have a patient, knowledgeable guide in their language, seeing exactly what they see, ready to help them through whatever form stands between them and something they need.
