<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: KimSejun</title>
    <description>The latest articles on DEV Community by KimSejun (@combba).</description>
    <link>https://dev.to/combba</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1873529%2Feef51b36-551e-4054-8a85-2ba299011dd6.png</url>
      <title>DEV Community: KimSejun</title>
      <link>https://dev.to/combba</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/combba"/>
    <language>en</language>
    <item>
      <title>submitting vibecat: what 3 weeks of building a desktop AI actually taught me</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Mon, 16 Mar 2026 13:58:17 +0000</pubDate>
      <link>https://dev.to/combba/submitting-vibecat-what-3-weeks-of-building-a-desktop-ai-actually-taught-me-1409</link>
      <guid>https://dev.to/combba/submitting-vibecat-what-3-weeks-of-building-a-desktop-ai-actually-taught-me-1409</guid>
      <description>&lt;h1&gt;
  
  
  submitting vibecat: what 3 weeks of building a desktop AI actually taught me
&lt;/h1&gt;

&lt;p&gt;I wrote this post to enter the Gemini Live Agent Challenge. But this isn't a pitch — it's what I actually learned.&lt;/p&gt;

&lt;p&gt;Three weeks ago, VibeCat was a failed video transcription app called Missless. Today it's a proactive desktop companion that watches your screen, suggests actions before you ask, and moves your mouse to execute them. We pivoted, rebuilt from scratch, shipped to Google Cloud Run, and submitted to Devpost with hours to spare.&lt;/p&gt;

&lt;p&gt;Here's the honest retrospective.&lt;/p&gt;

&lt;h2&gt;
  
  
  the pivot that saved us
&lt;/h2&gt;

&lt;p&gt;Missless was supposed to do real-time video transcription with Gemini. It worked — technically. But the demo was boring. You'd talk, text appeared on screen, end of demo. No one would watch that for 4 minutes.&lt;/p&gt;

&lt;p&gt;The pivot happened on March 4th. We asked ourselves: what if instead of processing video passively, the AI could &lt;em&gt;see&lt;/em&gt; the screen and &lt;em&gt;act&lt;/em&gt; on what it sees? Not a chatbot. Not a voice assistant. A colleague who happens to be a cat.&lt;/p&gt;

&lt;p&gt;VibeCat's core loop came from a single frustrated user request: "LOOK! DECIDE! MOVE! CLICK! VERIFY!" — five words, shouted in all caps. That became the product.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbk7101ulasbvw04b84lz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbk7101ulasbvw04b84lz.jpg" alt="VibeCat — Your Proactive Desktop Companion" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  the architecture that actually shipped
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;macOS Client (Swift 6.2)
  → WebSocket
    → Realtime Gateway (Go, Cloud Run)
      → Gemini Live API (voice + vision + FC)
      → ADK Orchestrator (screenshot analysis)
      → Firestore (session state)
      → Cloud Logging (observability)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three services. One WebSocket connection. Five function-calling tools. That's it.&lt;/p&gt;

&lt;p&gt;The simplicity was intentional. We had 3 weeks. Every architectural decision was "what's the minimum that works reliably?" We considered an event-driven microservices setup with separate vision, NLU, and action services. We considered a local-first architecture with on-device models. We considered both and built neither. One gateway, one orchestrator, one client.&lt;/p&gt;

&lt;p&gt;The pendingFC mechanism — where function calls queue up and execute strictly one at a time with verification between each — was the most important architectural decision. It added latency but eliminated an entire category of bugs where Gemini would fire three actions simultaneously and corrupt the UI state.&lt;/p&gt;

&lt;h2&gt;
  
  
  three scenarios, three different nightmares
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;YouTube Music (S1):&lt;/strong&gt; The hardest. YouTube renders player controls on &lt;code&gt;&amp;lt;canvas&amp;gt;&lt;/code&gt;, invisible to the Accessibility API. Our first approach — keyboard shortcuts — worked but looked robotic. Our second approach — vision-based mouse control — looked incredible but failed 40% of the time because Gemini's coordinate estimates were off by 20+ pixels on Retina displays. The solution: try AX first, fall back to CDP (Chrome DevTools Protocol), then fall back to vision coordinates. Self-healing with max 2 retries. Final success rate: 94%.&lt;/p&gt;
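
&lt;p&gt;An aside on that Retina offset: the usual culprit is pixel-vs-point confusion, because screen captures come back in device pixels while click events take screen points. A minimal sketch of the rescaling step (function name and resolutions here are illustrative, not VibeCat's actual code):&lt;/p&gt;

```go
package main

import "fmt"

// toScreenPoint rescales a coordinate estimated on the captured screenshot
// back into macOS point space. On a 2x Retina display the capture is in
// device pixels while click events take points, so skipping this divide
// lands clicks far off target. Names and resolutions are illustrative.
func toScreenPoint(px, py int, shotWidth, screenWidth float64) (float64, float64) {
	scale := screenWidth / shotWidth // e.g. 1440pt screen, 2880px capture: 0.5
	return float64(px) * scale, float64(py) * scale
}

func main() {
	// A 2880px-wide capture on a 1440pt display: pixel (1694, 846)
	// maps to point (847, 423).
	x, y := toScreenPoint(1694, 846, 2880, 1440)
	fmt.Println(x, y)
}
```

&lt;p&gt;Rescaling only removes the systematic half of the error; the model's own estimation noise is what the fallback chain absorbs.&lt;/p&gt;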

&lt;p&gt;&lt;strong&gt;Code Enhancement (S2):&lt;/strong&gt; VibeCat reads your code in Antigravity IDE, proactively suggests improving documentation, types "Enhance the comments for this code" into Gemini Chat, and lets the AI rewrite. This was surprisingly stable — 100% success rate after the first week. The trick was using &lt;code&gt;navigate_text_entry&lt;/code&gt; with the AX tree instead of trying to click into the text field.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal Automation (S3):&lt;/strong&gt; VibeCat switches to Terminal, runs &lt;code&gt;go vet ./...&lt;/code&gt;, and verifies the output. Also 100% after stabilization. Terminal is the most AX-friendly app on macOS — every element is properly labeled and positioned.&lt;/p&gt;

&lt;h2&gt;
  
  
  what gemini live API actually enables
&lt;/h2&gt;

&lt;p&gt;I'd used Gemini's regular API before. The Live API is a different experience entirely.&lt;/p&gt;

&lt;p&gt;The killer feature isn't voice or vision individually — it's that they exist in the same session simultaneously. VibeCat can see a screenshot, hear the user say "yeah, fix that," understand both inputs in context, and issue a function call — all in one streaming session with sub-second latency.&lt;/p&gt;

&lt;p&gt;Function Calling over Live API was the primitive that made proactive desktop control possible. Without it, we'd need a separate intent classification step, a separate action planning step, and a separate execution step — each adding latency and losing context. With FC, Gemini does all three in one inference pass.&lt;/p&gt;

&lt;p&gt;The gotcha: Gemini sometimes hallucinates tool usage. It says "I've typed the command" without actually calling the tool. Our inject-text approach had a 40% failure rate from this. The fix was simple but non-obvious — send actual user voice input instead of programmatic text injection. When the user speaks, Gemini takes the FC path; when we inject text, Gemini sometimes takes the "just respond" path.&lt;/p&gt;
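
&lt;p&gt;One cheap mitigation is a guard that flags any turn claiming an action while carrying no function call. A sketch with simplified stand-in types (the real Live API response shapes are richer than this):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// ModelTurn is a simplified stand-in for one model response turn; the real
// Live API types are richer. ToolCalls holds names of emitted function calls.
type ModelTurn struct {
	Text      string
	ToolCalls []string
}

// Phrases that suggest the model believes it already acted.
var actionClaims = []string{"i've typed", "i typed", "i've clicked", "i clicked"}

// looksHallucinated flags a turn that claims an action while emitting no
// function call at all, the exact failure mode described above.
func looksHallucinated(t ModelTurn) bool {
	if len(t.ToolCalls) > 0 {
		return false
	}
	lower := strings.ToLower(t.Text)
	for _, phrase := range actionClaims {
		if strings.Contains(lower, phrase) {
			return true
		}
	}
	return false
}

func main() {
	turn := ModelTurn{Text: "I've typed the command for you."}
	fmt.Println(looksHallucinated(turn)) // this claim has no tool call behind it
}
```

&lt;p&gt;A guard like this only detects the problem; routing real voice input instead of injected text is what actually prevented it.&lt;/p&gt;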

&lt;h2&gt;
  
  
  the self-healing engine
&lt;/h2&gt;

&lt;p&gt;Most automation agents fail and stop. VibeCat fails, switches to a different approach, and tries again — all while narrating what it's doing to the user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;🔍 Analyzing screen...
▶️ Clicking play button [AX]
⚠️ Button not found — retrying with CDP
▶️ Clicking play button [CDP]  
⚠️ CDP target unavailable — retrying with Vision
▶️ Moving cursor to (847, 423) [Vision]
✅ Music is playing!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three grounding sources (Accessibility API, Chrome DevTools Protocol, Vision coordinates), max 2 retries, post-action verification via ADK screenshot analysis. The transparent narration turned out to be more important than the retry logic itself — users who see the recovery process trust the system. Users who see silent failure don't.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvg921iixvb88879fxzj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvg921iixvb88879fxzj.jpg" alt="Enhanced code with Gemini analysis panel" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78n4vzcdk1fmybi0jb6j.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78n4vzcdk1fmybi0jb6j.jpg" alt="System architecture diagram" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0mxoptbmoj9fkvfcmzx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0mxoptbmoj9fkvfcmzx.jpg" alt="Running on Google Cloud Platform" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  the demo video pipeline
&lt;/h2&gt;

&lt;p&gt;The demo video deserves its own post, but briefly: we built a fully automated pipeline with Gemini TTS (Zephyr voice for the cat, Charon for narration), MiniMax cloned voice for the human narrator, background music from the actual YouTube Music song played in the demo, and ffmpeg for composition. The dubbing script is a JSON file with millisecond-precision timestamps. Running one shell script regenerates the entire video from source clips.&lt;/p&gt;

&lt;h2&gt;
  
  
  numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;17 dev.to devlogs&lt;/strong&gt; published during the challenge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 demo scenarios&lt;/strong&gt; all passing E2E&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5 FC tools&lt;/strong&gt; for desktop control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;80+ key codes&lt;/strong&gt; mapped in AccessibilityNavigator&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 grounding sources&lt;/strong&gt; with automatic fallback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10+ Cloud Run deployments&lt;/strong&gt; during the final week&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;144 seconds&lt;/strong&gt; of demo video, fully dubbed with 3 distinct voices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 cat&lt;/strong&gt; who never sleeps&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  what I'd tell someone starting a similar project
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with the demo, not the architecture.&lt;/strong&gt; We spent the first week building infrastructure and the last week desperately trying to make it demo-ready. If I could restart, I'd record a fake demo video on day one and work backward from "what needs to be real."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proactive AI is harder than reactive AI.&lt;/strong&gt; It's easy to make an agent that responds to commands. It's hard to make one that speaks up at the right moment without being annoying. The confirmation gate — always waiting for user approval — was the single most important UX decision. It makes proactive feel safe instead of scary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Narrate everything.&lt;/strong&gt; Silent processing feels broken. Transparent processing feels collaborative. Show the user what the AI is doing, why it's doing it, which tool it's using, and whether it succeeded. The overlay panel cost us two days to build and was worth every hour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini Live API + Function Calling is genuinely powerful.&lt;/strong&gt; Real-time multimodal input with structured tool invocation in a single session — this combination enables interaction patterns that weren't possible before. It's not perfect (hallucinated tool calls are real), but it's the right foundation for desktop AI agents.&lt;/p&gt;

&lt;p&gt;VibeCat started as a joke name. "Vibe coding, but with a cat." Three weeks later, the cat watches your screen, suggests improvements, moves your mouse, and verifies its own work. It's still a cat. But now it's a pretty capable one.&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>go</category>
    </item>
    <item>
      <title>when your agent fails, does it just... stop?</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Mon, 16 Mar 2026 13:57:37 +0000</pubDate>
      <link>https://dev.to/combba/when-your-agent-fails-does-it-just-stop-5c77</link>
      <guid>https://dev.to/combba/when-your-agent-fails-does-it-just-stop-5c77</guid>
      <description>&lt;h1&gt;
  
  
  when your agent fails, does it just... stop?
&lt;/h1&gt;

&lt;p&gt;I wrote this post to enter the Gemini Live Agent Challenge. But this particular problem — what happens when an AI action fails — is something every agent builder needs to solve.&lt;/p&gt;

&lt;p&gt;Most desktop automation tools have a dirty secret: they're fragile. Click the wrong pixel, target an element that moved, or encounter an unexpected dialog — and the whole sequence collapses. The user sees "Error" and reaches for the keyboard.&lt;/p&gt;

&lt;p&gt;VibeCat's self-healing engine was built because we got tired of watching our cat give up.&lt;/p&gt;

&lt;h2&gt;
  
  
  the failure taxonomy
&lt;/h2&gt;

&lt;p&gt;After running hundreds of test sequences across three apps (Antigravity IDE, Terminal, Chrome), we cataloged the failure modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AX target not found&lt;/strong&gt; — The Accessibility API says the element doesn't exist. Usually because the app hasn't finished rendering, or because the element is inside a canvas/WebGL surface. Frequency: ~15% of first attempts on Chrome.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AX target found but wrong&lt;/strong&gt; — The element exists but it's the wrong one. A "Play" button that's actually in a different panel, or a text field that looks right but belongs to a different component. Frequency: ~5%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Click landed but nothing happened&lt;/strong&gt; — The coordinates were correct, the click fired, but the UI didn't respond. Common with YouTube's debounced event handlers. Frequency: ~10% on YouTube Music.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action succeeded but verification failed&lt;/strong&gt; — VibeCat typed the text and it appeared, but the post-action screenshot shows an error dialog or unexpected state. Frequency: ~3%.&lt;/p&gt;

&lt;h2&gt;
  
  
  max 2 retries, alternative grounding
&lt;/h2&gt;

&lt;p&gt;The self-healing engine is deliberately simple. No complex state machines, no machine learning. Just two rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Max 2 retries per step.&lt;/strong&gt; If it fails three times, stop and tell the user.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Each retry uses a different grounding source.&lt;/strong&gt; Don't repeat what already failed.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attempt 1: AX targeting
  → Failed: element not in AX tree

Attempt 2: CDP targeting (chromedp)
  → Failed: Chrome DevTools can't find matching DOM node

Attempt 3: Vision coordinates (Gemini screenshot analysis)
  → Success: clicked at (847, 423), verification passed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The grounding source priority chain is AX → CDP → Vision. But the engine is smart enough to skip sources that don't apply — if you're in Terminal (no browser), CDP is skipped entirely.&lt;/p&gt;
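
&lt;p&gt;The skip logic is small enough to show in full. A simplified sketch (type names and the app check are illustrative; the real client inspects the running process rather than matching names):&lt;/p&gt;

```go
package main

import "fmt"

// GroundingSource identifies how a UI target gets located.
type GroundingSource string

const (
	AX     GroundingSource = "ax"
	CDP    GroundingSource = "cdp"
	Vision GroundingSource = "vision"
)

// applicableSources sketches the skip logic: CDP only applies when the
// frontmost app exposes a DevTools endpoint. App detection is reduced to a
// name check here; the real client inspects the running process.
func applicableSources(frontApp string) []GroundingSource {
	if frontApp == "Google Chrome" {
		return []GroundingSource{AX, CDP, Vision}
	}
	// Terminal, IDEs, etc. have no DevTools target, so CDP is skipped.
	return []GroundingSource{AX, Vision}
}

func main() {
	fmt.Println(applicableSources("Terminal"))
	fmt.Println(applicableSources("Google Chrome"))
}
```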

&lt;p&gt;Here's the core logic in &lt;code&gt;handler.go&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;executeWithHealing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;GroundingSource&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;AX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CDP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Vision&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;maxRetries&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

        &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executeStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;verified&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verifyErr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verifyStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;verifyErr&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;verified&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;emitProcessingState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"retrying_step"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"self-healing retry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"step"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"attempt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="s"&gt;"failed_source"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"next_source"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"step %s failed after %d attempts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxRetries&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  vision verification: the trust layer
&lt;/h2&gt;

&lt;p&gt;Every action — whether it's typing text, clicking a button, or opening a URL — ends with a verification step. VibeCat captures a fresh screenshot and sends it to the ADK Orchestrator with a specific question: "Did the action succeed?"&lt;/p&gt;

&lt;p&gt;This isn't just "did the click register?" It's semantic verification:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After typing "go vet ./..." in Terminal → verify the command output shows "no issues"&lt;/li&gt;
&lt;li&gt;After clicking Play on YouTube Music → verify the video element is no longer paused&lt;/li&gt;
&lt;li&gt;After opening a URL → verify the expected page content is visible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ADK Orchestrator uses Gemini's vision model for this analysis. It returns a confidence score and a natural-language explanation. If confidence is below the threshold, the step is marked as failed and healing kicks in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;verification:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"explanation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The play button appears unchanged. 
                  The video progress bar has not moved."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;trigger&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;retry&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;CDP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;grounding&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
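
&lt;p&gt;Deciding whether healing kicks in is then a small predicate over that payload. A sketch, with the 0.7 cutoff as a placeholder value rather than the real threshold:&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Verification mirrors the payload shown above. The 0.7 cutoff is a
// placeholder; the real threshold may differ.
type Verification struct {
	Success     bool    `json:"success"`
	Confidence  float64 `json:"confidence"`
	Explanation string  `json:"explanation"`
}

const confidenceThreshold = 0.7

// shouldHeal marks a step failed when verification fails outright or when
// confidence falls below the cutoff, handing control to the retry chain.
func shouldHeal(v Verification) bool {
	if !v.Success {
		return true
	}
	return confidenceThreshold > v.Confidence
}

func main() {
	raw := `{"success": false, "confidence": 0.3, "explanation": "The play button appears unchanged."}`
	v := new(Verification)
	if err := json.Unmarshal([]byte(raw), v); err != nil {
		panic(err)
	}
	fmt.Println(shouldHeal(*v)) // a low-confidence failure triggers a retry
}
```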



&lt;h2&gt;
  
  
  the pendingFC queue: no racing allowed
&lt;/h2&gt;

&lt;p&gt;One subtle failure mode we discovered: Gemini sometimes issues multiple function calls in rapid succession. "Focus Terminal, then type &lt;code&gt;go vet ./...&lt;/code&gt;, then press Enter." If these execute in parallel, &lt;code&gt;go vet&lt;/code&gt; might get typed into the wrong window because &lt;code&gt;focus_app&lt;/code&gt; hasn't completed yet.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;pendingFC&lt;/code&gt; mechanism solves this with strict sequential execution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Gemini sends FC calls → queued in &lt;code&gt;pendingFC&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Gateway sends step 1 to client&lt;/li&gt;
&lt;li&gt;Client executes, captures verification screenshot&lt;/li&gt;
&lt;li&gt;Gateway confirms step 1 → sends step 2&lt;/li&gt;
&lt;li&gt;Repeat until queue is empty&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No step starts until the previous step's verification passes. This adds latency (~200ms per step for verification) but eliminates an entire class of race condition bugs.&lt;/p&gt;
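
&lt;p&gt;Stripped of transport details, the queue discipline is just a loop that refuses to advance until verification passes. A sketch, with &lt;code&gt;exec&lt;/code&gt; and &lt;code&gt;verify&lt;/code&gt; as stand-ins for the client round-trip and the ADK check:&lt;/p&gt;

```go
package main

import "fmt"

// FunctionCall is a simplified pending tool invocation; real entries carry
// the arguments Gemini supplied.
type FunctionCall struct {
	Name string
}

// runQueue drains the pending calls strictly in order: each one must execute
// and pass verification before the next is dispatched. exec and verify stand
// in for the client round-trip and the ADK screenshot check.
func runQueue(pending []FunctionCall, exec func(FunctionCall) error, verify func(FunctionCall) bool) error {
	for i, fc := range pending {
		if err := exec(fc); err != nil {
			return fmt.Errorf("step %d (%s): %w", i+1, fc.Name, err)
		}
		if !verify(fc) {
			return fmt.Errorf("step %d (%s): verification failed", i+1, fc.Name)
		}
		// Only now does the next queued call go out, so type_text can never
		// race ahead of focus_app.
	}
	return nil
}

func main() {
	var order []string
	calls := []FunctionCall{{Name: "focus_app"}, {Name: "type_text"}, {Name: "press_enter"}}
	err := runQueue(calls,
		func(fc FunctionCall) error { order = append(order, fc.Name); return nil },
		func(fc FunctionCall) bool { return true },
	)
	fmt.Println(err == nil, order)
}
```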

&lt;h2&gt;
  
  
  transparent narration: failures feel collaborative
&lt;/h2&gt;

&lt;p&gt;The most impactful design decision wasn't technical — it was UX. VibeCat narrates every step through the overlay panel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔍 Reading screen...
📋 Planning 3 steps
▶️ Step 1/3: Focusing Terminal [AX]
⚠️ Retrying Step 1 — switching to CDP
✅ Step 1/3: Terminal focused
▶️ Step 2/3: Typing command...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users who watched VibeCat fail silently reported it as "broken." Users who watched the same failure with narration reported it as "working through a problem." Same outcome, completely different perception.&lt;/p&gt;

&lt;p&gt;The seven processing stages (&lt;code&gt;analyzing_command&lt;/code&gt;, &lt;code&gt;planning_steps&lt;/code&gt;, &lt;code&gt;executing_step&lt;/code&gt;, &lt;code&gt;verifying_result&lt;/code&gt;, &lt;code&gt;retrying_step&lt;/code&gt;, &lt;code&gt;completing&lt;/code&gt;, &lt;code&gt;observing_screen&lt;/code&gt;) each have localized labels in English, Korean, and Japanese. The overlay shows a grounding source badge (AX / Vision / Hotkey / System) so you always know &lt;em&gt;how&lt;/em&gt; VibeCat is interacting with your screen.&lt;/p&gt;
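
&lt;p&gt;The localization table itself is nothing fancy, a nested map with an English fallback. Two stages shown here, with the localized strings abbreviated from what actually ships:&lt;/p&gt;

```go
package main

import "fmt"

// stageLabels shows the shape of the localization table for two of the
// seven stages; the shipped strings are abbreviated here.
var stageLabels = map[string]map[string]string{
	"observing_screen": {
		"en": "Reading screen...",
		"ko": "화면을 읽는 중...",
		"ja": "画面を確認中...",
	},
	"retrying_step": {
		"en": "Retrying step...",
		"ko": "단계 재시도 중...",
		"ja": "ステップを再試行中...",
	},
}

// label resolves a stage name for a locale, falling back to English so an
// unsupported locale never leaves the overlay blank.
func label(stage, locale string) string {
	if s, ok := stageLabels[stage][locale]; ok {
		return s
	}
	return stageLabels[stage]["en"]
}

func main() {
	fmt.Println(label("observing_screen", "ko"))
	fmt.Println(label("retrying_step", "fr")) // no French table: English fallback
}
```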

&lt;h2&gt;
  
  
  numbers that matter
&lt;/h2&gt;

&lt;p&gt;After implementing self-healing, our end-to-end success rates across 50 test runs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Without healing&lt;/th&gt;
&lt;th&gt;With healing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;YouTube Music play&lt;/td&gt;
&lt;td&gt;62%&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code comment enhancement&lt;/td&gt;
&lt;td&gt;88%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Terminal go vet&lt;/td&gt;
&lt;td&gt;91%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The remaining 6% failure on YouTube Music is almost entirely due to network latency — the page hasn't finished loading when VibeCat tries to click. A simple "wait for page ready" check would probably push it to 98%+.&lt;/p&gt;

&lt;h2&gt;
  
  
  what I learned
&lt;/h2&gt;

&lt;p&gt;Self-healing isn't about being clever. It's about being systematic. Catalog your failures, build a fallback chain, verify every step, and tell the user what's happening. The hard part isn't the retry logic — it's the verification. Without reliable post-action verification, you're just clicking blindly and hoping.&lt;/p&gt;

&lt;p&gt;And narrate everything. Always narrate everything. Silent AI feels broken. Transparent AI feels collaborative.&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>go</category>
    </item>
    <item>
      <title>teaching a cat to use a mouse — literally</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Mon, 16 Mar 2026 10:24:28 +0000</pubDate>
      <link>https://dev.to/combba/teaching-a-cat-to-use-a-mouse-literally-5go1</link>
      <guid>https://dev.to/combba/teaching-a-cat-to-use-a-mouse-literally-5go1</guid>
      <description>&lt;h1&gt;
  
  
  teaching a cat to use a mouse — literally
&lt;/h1&gt;

&lt;p&gt;I wrote this post to enter the Gemini Live Agent Challenge, and honestly, this was the feature that almost broke us.&lt;/p&gt;

&lt;p&gt;Our user's feedback was blunt: "Why aren't you using vision to control the mouse directly?" And then, more specifically: "The cursor should glide smoothly, find its target visually, move again, and click — that's the WOW factor."&lt;/p&gt;

&lt;p&gt;He was right. Sending keyboard shortcuts and accessibility API calls is reliable, but it looks like a script running. A cursor that &lt;em&gt;glides&lt;/em&gt; across the screen, finds its target visually, and clicks — that looks like intelligence.&lt;/p&gt;

&lt;p&gt;So we built the LOOK → DECIDE → MOVE → CLICK → VERIFY pipeline.&lt;/p&gt;
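
&lt;p&gt;Stripped to its skeleton, the loop reads exactly like the shout. Each stage below is a stand-in func, not the real implementation:&lt;/p&gt;

```go
package main

import "fmt"

// Target is what DECIDE produces: coordinates plus a semantic label for the
// narration overlay.
type Target struct {
	X, Y  int
	Label string
}

// runPipeline wires the five stages in order. Every func parameter is a
// stand-in: the real LOOK uses ScreenCaptureKit, DECIDE calls Gemini vision,
// MOVE animates the cursor, CLICK posts the event, VERIFY re-screenshots.
func runPipeline(
	look func() []byte,
	decide func(shot []byte) Target,
	move func(t Target),
	click func(t Target),
	verify func(t Target) bool,
) bool {
	shot := look()         // LOOK: capture the screen
	target := decide(shot) // DECIDE: what to click, and why
	move(target)           // MOVE: glide the cursor over
	click(target)          // CLICK: fire the event
	return verify(target)  // VERIFY: confirm the screen actually changed
}

func main() {
	ok := runPipeline(
		func() []byte { return []byte("fake screenshot") },
		func(shot []byte) Target { return Target{X: 847, Y: 423, Label: "Play button"} },
		func(t Target) { fmt.Printf("gliding to (%d, %d)\n", t.X, t.Y) },
		func(t Target) { fmt.Println("clicking", t.Label) },
		func(t Target) bool { return true },
	)
	fmt.Println("verified:", ok)
}
```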

&lt;h2&gt;
  
  
  the five-stage pipeline
&lt;/h2&gt;

&lt;p&gt;Here's what happens when VibeCat decides to click something on your screen:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LOOK&lt;/strong&gt; — VibeCat captures a screenshot via ScreenCaptureKit. This isn't a polling loop; it's triggered when the gateway's proactive companion decides an action is needed. The screenshot goes to Gemini's vision model along with the current AX (Accessibility) snapshot for context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DECIDE&lt;/strong&gt; — Gemini analyzes the screenshot and returns a target. This could be "the Play button on YouTube Music at approximately (847, 423)" or "the text field labeled 'Search' in the Antigravity IDE sidebar." The key insight: we don't just get coordinates. We get a semantic description of &lt;em&gt;what&lt;/em&gt; to click and &lt;em&gt;why&lt;/em&gt;, which feeds into the transparent feedback overlay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MOVE&lt;/strong&gt; — &lt;code&gt;animateCursorTo&lt;/code&gt; in &lt;code&gt;AccessibilityNavigator.swift&lt;/code&gt; smoothly interpolates the cursor position over ~300ms using a cubic easing curve. This is purely cosmetic, but it's what makes VibeCat feel like a colleague reaching for the mouse rather than a teleporting robot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;animateCursorTo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="nv"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CGPoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;TimeInterval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;NSEvent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mouseLocation&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// 60fps&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;eased&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// smoothstep&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;eased&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;eased&lt;/span&gt;
        &lt;span class="kt"&gt;CGEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;mouseEventSource&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;mouseType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mouseMoved&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="nv"&gt;mouseCursorPosition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CGPoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="nv"&gt;mouseButton&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;left&lt;/span&gt;&lt;span class="p"&gt;)?&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;tap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cghidEventTap&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="kt"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;forTimeInterval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CLICK&lt;/strong&gt; — A CGEvent mouse click at the current cursor position. Simple, but the timing matters — we add a 50ms delay after the final move to let the OS register the cursor position before clicking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERIFY&lt;/strong&gt; — Another screenshot capture, sent to the ADK Orchestrator for vision analysis. "Did the button state change? Is the expected content now visible?" If verification fails, the self-healing engine kicks in with an alternative grounding strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  three grounding sources, one fallback chain
&lt;/h2&gt;

&lt;p&gt;The real complexity isn't in clicking — it's in &lt;em&gt;finding the right thing to click&lt;/em&gt;. VibeCat uses three grounding sources in priority order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Accessibility API (AX)&lt;/strong&gt; — The gold standard. macOS exposes UI elements with roles, labels, and positions. When it works, it's pixel-perfect. But YouTube Music renders its player controls on a &lt;code&gt;&amp;lt;canvas&amp;gt;&lt;/code&gt; element — completely invisible to AX.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chrome DevTools Protocol (CDP)&lt;/strong&gt; — For browser elements AX can't see. Our Go gateway runs &lt;code&gt;chromedp&lt;/code&gt; to query DOM elements, get bounding boxes, and execute JavaScript. This catches most canvas-rendered controls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Vision coordinates&lt;/strong&gt; — The last resort. Send a screenshot to Gemini, ask "where is the play button?", get approximate pixel coordinates. Less reliable, but it works on literally anything visible on screen.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The self-healing engine (max 2 retries) walks down this chain automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1: Try AX targeting
  → Failed (element not found in AX tree)
Step 2: Try CDP targeting  
  → Failed (Chrome not exposing this element via CDP)
Step 3: Try vision coordinates
  → Got (847, 423), move cursor, click
  → Verify: screenshot shows music is now playing ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hb9v2bqqiyg7551kyhw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hb9v2bqqiyg7551kyhw.jpg" alt="Cat confirms music is playing" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  the YouTube Music problem
&lt;/h2&gt;

&lt;p&gt;YouTube Music was our hardest surface. The player controls are canvas-rendered, the site is a single-page app that mutates state without URL changes, and the search results list doesn't expose individual items as clickable AX elements.&lt;/p&gt;

&lt;p&gt;Our solution was multi-layered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open YouTube Music via &lt;code&gt;navigate_open_url&lt;/code&gt; with the search query pre-filled in the URL&lt;/li&gt;
&lt;li&gt;Wait for results to load (vision verification of the page state)&lt;/li&gt;
&lt;li&gt;Use vision to find the target song/playlist&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;animateCursorTo&lt;/code&gt; to the result&lt;/li&gt;
&lt;li&gt;Click via CGEvent&lt;/li&gt;
&lt;li&gt;Verify playback started via CDP &lt;code&gt;document.querySelector('video').paused === false&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If verification fails, fallback to &lt;code&gt;video.play()&lt;/code&gt; via JavaScript injection&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We ran this sequence 5 times consecutively in our rehearsal protocol. It passed every time — but only after we added the &lt;code&gt;video.play()&lt;/code&gt; fallback. Pure vision-based clicking had about a 60% success rate on first attempt because Gemini's coordinate estimates were sometimes off by 20-30 pixels.&lt;/p&gt;

&lt;h2&gt;
  
  
  80 key codes and counting
&lt;/h2&gt;

&lt;p&gt;Beyond mouse control, &lt;code&gt;AccessibilityNavigator.swift&lt;/code&gt; maps 80+ macOS key codes for keyboard automation. Things like &lt;code&gt;Cmd+Shift+5&lt;/code&gt; to start screen recording, &lt;code&gt;Cmd+Tab&lt;/code&gt; to switch apps, or &lt;code&gt;Ctrl+A&lt;/code&gt; to jump to the start of the line in Terminal. Each key code was manually verified across our three gold-tier surfaces: Antigravity IDE, Terminal, and Chrome.&lt;/p&gt;
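&lt;p&gt;For a flavor of what that table looks like, here's a small sketch in Go (the real map lives in Swift; the values are Carbon's &lt;code&gt;kVK_*&lt;/code&gt; virtual key codes, and the lookup function is my own illustration):&lt;/p&gt;

```go
package main

import "fmt"

// A small sample of the name-to-virtual-key-code table; the full map in
// AccessibilityNavigator.swift reportedly covers 80+ keys.
var keyCodes = map[string]uint16{
	"a": 0x00, "p": 0x23, "5": 0x17,
	"return": 0x24, "tab": 0x30, "space": 0x31,
	"escape": 0x35, "command": 0x37, "shift": 0x38,
	"option": 0x3A, "control": 0x3B,
}

// hotkeyCodes resolves a chord like ["command", "shift", "5"] to key codes,
// failing the whole chord if any key name is unknown.
func hotkeyCodes(chord []string) ([]uint16, bool) {
	out := make([]uint16, 0, len(chord))
	for _, k := range chord {
		code, ok := keyCodes[k]
		if !ok {
			return nil, false
		}
		out = append(out, code)
	}
	return out, true
}

func main() {
	codes, ok := hotkeyCodes([]string{"command", "shift", "5"})
	fmt.Println(ok, codes) // true [55 56 23]
}
```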

&lt;p&gt;The overlay panel shows all of this in real time — which grounding source is being used, which step of the pipeline we're in, and whether the last verification passed or failed. Users never see a black box. They see VibeCat &lt;em&gt;working&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  what I'd do differently
&lt;/h2&gt;

&lt;p&gt;Honestly? I'd invest more in vision coordinate calibration. The 20-30 pixel offset on Retina displays cost us hours of debugging. We eventually solved it by preferring semantic AX targeting wherever possible and only falling back to raw coordinates as a last resort. But if we'd built a proper coordinate calibration system (test click → verify → adjust offset) from day one, the vision path would have been much more reliable.&lt;/p&gt;
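&lt;p&gt;That calibration loop is simple enough to sketch. This is hypothetical code, not something VibeCat ships, and it assumes vision can report where a test click actually landed:&lt;/p&gt;

```go
package main

import "fmt"

// calibrator accumulates a constant offset correction from test clicks:
// click a known target, observe (via vision) where the click landed, and
// fold the error into future coordinates. Purely illustrative.
type calibrator struct {
	dx, dy float64
}

// observe records one test click: intended target vs. observed landing spot.
func (c *calibrator) observe(intendedX, intendedY, observedX, observedY float64) {
	c.dx += intendedX - observedX
	c.dy += intendedY - observedY
}

// apply corrects a raw vision coordinate before the cursor moves.
func (c *calibrator) apply(x, y float64) (float64, float64) {
	return x + c.dx, y + c.dy
}

func main() {
	var c calibrator
	c.observe(847, 423, 827, 403) // click landed 20px short on both axes
	x, y := c.apply(900, 500)
	fmt.Println(x, y) // 920 520
}
```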

&lt;p&gt;The cursor animation, though? That was worth every line of code. When VibeCat smoothly moves the mouse to a YouTube search result and clicks it — people's eyes light up. That's the moment it stops being a demo and starts feeling like the future.&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>macos</category>
    </item>
    <item>
      <title>the moment vibecat stopped waiting and started suggesting</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Thu, 12 Mar 2026 12:53:00 +0000</pubDate>
      <link>https://dev.to/combba/the-moment-vibecat-stopped-waiting-and-started-suggesting-5ba9</link>
      <guid>https://dev.to/combba/the-moment-vibecat-stopped-waiting-and-started-suggesting-5ba9</guid>
      <description>&lt;p&gt;There's a specific kind of frustration that comes from building AI tools that are technically impressive but feel fundamentally wrong to use. You've built something that can do incredible things — but only when you tell it exactly what to do. It sits there, waiting. Watching. Saying nothing.&lt;/p&gt;

&lt;p&gt;That was VibeCat three weeks ago.&lt;/p&gt;

&lt;p&gt;I'd built a voice-controlled desktop agent that could navigate Chrome, type into terminals, trigger IDE shortcuts, open URLs — all through natural speech. The Gemini Live API integration was solid. The function calling worked. The accessibility tree traversal was clean. And yet every demo felt like I was operating a very sophisticated remote control. "Open YouTube." "Search for this." "Press Command-S."&lt;/p&gt;

&lt;p&gt;The agent was reactive. And reactive felt wrong.&lt;/p&gt;




&lt;p&gt;I spent a few days studying the best existing desktop automation agents I could find — the ones that had won competitions, the ones that developers actually used in their workflows. And I noticed something they all had in common: they wait for commands. Every single one. You tell them what to do, they do it, they report back. The interaction model is fundamentally request-response, even when the interface is voice.&lt;/p&gt;

&lt;p&gt;That's not how a good colleague works.&lt;/p&gt;

&lt;p&gt;A good colleague sitting next to you while you code doesn't wait for you to ask "hey, is there a bug in this function?" They glance at your screen, notice the null check is missing, and say "hey, that might throw if the response is empty — want me to add a guard?" Then they wait for you to say yes or no. They don't act without permission. But they also don't wait for you to notice the problem yourself.&lt;/p&gt;

&lt;p&gt;That's the gap I wanted to close.&lt;/p&gt;




&lt;p&gt;So I rewrote VibeCat's core identity from the ground up. Not the code — the &lt;em&gt;prompt&lt;/em&gt;. The system instruction that shapes how Gemini Live understands its role.&lt;/p&gt;

&lt;p&gt;The old prompt was essentially: "You are a voice assistant that can control the desktop. When the user asks you to do something, use these tools."&lt;/p&gt;

&lt;p&gt;The new one starts like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== VIBECAT: YOUR PROACTIVE DESKTOP COMPANION ===

You are VibeCat, a proactive AI companion for developer workflows on macOS.
You are NOT a passive tool that waits for commands. You are an attentive 
colleague who watches the screen, understands context, and proactively 
suggests helpful actions.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's not just marketing copy. That framing changes everything about how the model behaves. When you tell Gemini it's a passive tool, it acts like one. When you tell it it's an attentive colleague, it starts noticing things.&lt;/p&gt;

&lt;p&gt;The prompt then defines the core loop explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SUGGESTION FLOW (always follow this pattern):
1. OBSERVE: notice something relevant on screen via video frames
2. SUGGEST: propose a specific helpful action in a friendly, natural tone
3. WAIT: let the user confirm with "sure", "go ahead", "yeah", etc.
4. ACT: call the appropriate tool to execute
5. FEEDBACK: confirm what you did and ask if it helped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OBSERVE → SUGGEST → WAIT → ACT → FEEDBACK. Five steps. The WAIT step is the one that makes this feel safe rather than scary. The agent never acts without permission. But it also never stays silent when it has something useful to say.&lt;/p&gt;




&lt;p&gt;The prompt gives concrete examples of what proactive behavior looks like in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- See the user coding for a long time → "You have been working hard. 
  Want me to play some music on YouTube?"
- See a code issue or missing logic → "I notice there is a gap in this 
  code. Want me to add the missing part?"
- See a basic terminal command → "By the way, ls with dash al gives more 
  detail. Want me to try that instead?"
- See an error message → "I see an error there. Want me to look up the 
  docs for that?"
- See a test failing → "That test failed. Want me to re-run it with 
  verbose output?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These aren't hypothetical. I've seen VibeCat do all of these in actual use. The test failure one is my favorite — you run your tests, one fails, and before you've even processed what went wrong, VibeCat says "that test failed, want me to rerun with verbose output?" You say yeah, it runs &lt;code&gt;go test -v ./...&lt;/code&gt;, and you're already reading the detailed output before you would have even typed the command.&lt;/p&gt;

&lt;p&gt;That's the feeling I was chasing. That's what "proactive" actually means in practice.&lt;/p&gt;




&lt;p&gt;Now let me talk about the technical implementation, because the prompt is only half the story.&lt;/p&gt;

&lt;p&gt;VibeCat registers five function-calling tools with Gemini Live. I want to explain why it's exactly these five, because the choice matters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;navigatorToolDeclarations&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;FunctionDeclarations&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FunctionDeclaration&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"navigate_text_entry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"navigate_hotkey"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"navigate_focus_app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"navigate_open_url"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"navigate_type_and_submit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;navigate_text_entry&lt;/code&gt; — types text into a focused field. The key design decision here is the &lt;code&gt;submit&lt;/code&gt; parameter. Default true for search boxes, terminal, URL bars. False for form fields where you just want to fill text. This distinction matters because "type this into the search box" and "fill in this form field" are different actions with different expected outcomes.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;navigate_hotkey&lt;/code&gt; — sends keyboard shortcuts. This is the workhorse for app-specific actions. YouTube play/pause is &lt;code&gt;["space"]&lt;/code&gt;. Antigravity IDE file picker is &lt;code&gt;["command", "p"]&lt;/code&gt;. The tool accepts an optional &lt;code&gt;target&lt;/code&gt; app name — if provided, it focuses that app first, then sends the hotkey. This lets you say "pause YouTube" while you're in your IDE and have it work correctly.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;navigate_focus_app&lt;/code&gt; — switches to an application by name. Simple, but essential. You can't do anything useful if you're sending keystrokes to the wrong app.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;navigate_open_url&lt;/code&gt; — opens a URL in the default browser. This one gets used constantly for the proactive suggestions. "Want me to look up the docs for that error?" → &lt;code&gt;navigate_open_url&lt;/code&gt; with the relevant documentation URL.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;navigate_type_and_submit&lt;/code&gt; — types text and optionally presses Enter. This is the terminal command tool. When VibeCat suggests running &lt;code&gt;ls -la&lt;/code&gt; instead of &lt;code&gt;ls&lt;/code&gt;, it uses this to type the command and submit it.&lt;/p&gt;

&lt;p&gt;Five tools. Not ten, not twenty. Five. The constraint forces clarity about what the agent can actually do, and it makes the function calling more reliable because Gemini has fewer choices to get confused about.&lt;/p&gt;
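&lt;p&gt;On the gateway side, dispatch over five names stays a readable switch. This is a hedged sketch with stub handlers and invented argument names, not the production code, where each branch really forwards the call to the Swift client:&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// dispatchTool routes a function call by name. Argument keys ("text",
// "keys", "app", "url") are illustrative placeholders.
func dispatchTool(name string, args map[string]any) (string, error) {
	switch name {
	case "navigate_text_entry", "navigate_type_and_submit":
		text, _ := args["text"].(string)
		submit, _ := args["submit"].(bool)
		return fmt.Sprintf("typed %q (submit=%v)", text, submit), nil
	case "navigate_hotkey":
		keys, _ := args["keys"].([]string)
		return fmt.Sprintf("sent hotkey %v", keys), nil
	case "navigate_focus_app":
		app, _ := args["app"].(string)
		return "focused " + app, nil
	case "navigate_open_url":
		url, _ := args["url"].(string)
		return "opened " + url, nil
	default:
		return "", errors.New("unknown tool: " + name)
	}
}

func main() {
	out, _ := dispatchTool("navigate_hotkey", map[string]any{"keys": []string{"command", "p"}})
	fmt.Println(out) // sent hotkey [command p]
}
```

Five cases, one error branch: an unknown name fails loudly instead of guessing.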




&lt;p&gt;The harder engineering problem was sequential multi-step execution.&lt;/p&gt;

&lt;p&gt;When VibeCat needs to do something like "open YouTube and search for focus music," that's actually three steps: focus Chrome, navigate to YouTube, type the search query. Gemini might try to issue all three function calls in one response. That doesn't work — you need to wait for each step to complete before starting the next one.&lt;/p&gt;

&lt;p&gt;The solution is the &lt;code&gt;pendingFC&lt;/code&gt; mechanism. The session state tracks a single pending function call at a time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;liveSessionState&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCMu&lt;/span&gt;             &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Mutex&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCID&lt;/span&gt;             &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCName&lt;/span&gt;           &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCTaskID&lt;/span&gt;         &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCText&lt;/span&gt;           &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCTarget&lt;/span&gt;         &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCSteps&lt;/span&gt;          &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;navigatorStep&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCCurrentStep&lt;/span&gt;    &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCStepRetryCount&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a function call comes in, it gets queued. The handler executes it, waits for the result, sends the tool response back to Gemini, and only then processes the next step. This keeps the execution sequential and predictable, even when Gemini wants to batch multiple actions.&lt;/p&gt;
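&lt;p&gt;The guarantee is easy to see in miniature. Here's a hedged sketch of the "one pending call at a time" rule; the real state carries IDs, steps, and retry counts as shown above, and this stub only keeps the admission logic:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

type functionCall struct {
	id, name string
}

// sequentialExecutor admits a single in-flight function call, the same
// guarantee the mutex-guarded pendingFC fields give the live session.
type sequentialExecutor struct {
	mu      sync.Mutex
	pending bool
}

// run executes fc only if no other call is in flight; otherwise the caller
// must wait and retry, keeping multi-step plans strictly sequential.
func (e *sequentialExecutor) run(fc functionCall, exec func(functionCall) string) (string, bool) {
	e.mu.Lock()
	if e.pending {
		e.mu.Unlock()
		return "", false
	}
	e.pending = true
	e.mu.Unlock()

	out := exec(fc) // execute and wait for the result...
	// ...the tool response would go back to Gemini here, then the slot clears.

	e.mu.Lock()
	e.pending = false
	e.mu.Unlock()
	return out, true
}

func main() {
	var e sequentialExecutor
	out, ok := e.run(functionCall{id: "1", name: "navigate_focus_app"},
		func(fc functionCall) string { return fc.name + ": ok" })
	fmt.Println(ok, out) // true navigate_focus_app: ok
}
```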




&lt;p&gt;But what happens when a step fails?&lt;/p&gt;

&lt;p&gt;This is where self-healing comes in. The retry logic is simple but effective:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;retryCount&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ls&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;incrementFCStepRetry&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;retryCount&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// retry with alternative grounding source&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;retryCount&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;retryStep&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FallbackActionType&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// use fallback action type on second retry&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"navigator FC self-healing retry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="s"&gt;"step_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retryStep&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="s"&gt;"retry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retryCount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;refreshMsg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Max 2 retries. On the first retry, it tries an alternative grounding source — if the accessibility tree lookup failed, try CDP. On the second retry, it uses the fallback action type if one is defined. After 2 retries, it fails gracefully and tells the user what happened.&lt;/p&gt;

&lt;p&gt;The grounding sources are what make this work. VibeCat has three ways to understand and interact with the screen:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accessibility (AX)&lt;/strong&gt; — the native macOS accessibility tree. This is the primary source. Every UI element has an AX role, label, and value. For most desktop apps, this is all you need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chrome DevTools Protocol (CDP)&lt;/strong&gt; — direct browser element interaction via chromedp. This is the fallback for Chrome when the AX tree doesn't have enough detail. CDP can click specific DOM elements, read page content, take screenshots of specific regions. It's slower than AX but more precise for complex web UIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vision&lt;/strong&gt; — Gemini screenshot analysis via the ADK orchestrator. When both AX and CDP fail, or when you need to verify that an action actually worked, you take a screenshot and ask Gemini to analyze it. "Did the search query get entered correctly?" "Is the YouTube video playing?" This is the slowest path but the most reliable for verification.&lt;/p&gt;

&lt;p&gt;Triple-source grounding. The agent tries the fast path first, falls back to the slower paths if needed, and always verifies the result for risky actions.&lt;/p&gt;




&lt;p&gt;The vision verification piece deserves more detail because it's the part that makes the feedback loop actually trustworthy.&lt;/p&gt;

&lt;p&gt;After executing a risky or complex action, VibeCat requests a screen capture from the client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;pendingVisionVerification&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;fcID&lt;/span&gt;     &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;fcName&lt;/span&gt;   &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;fcText&lt;/span&gt;   &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;fcTarget&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;taskID&lt;/span&gt;   &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;observed&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;imgCh&lt;/span&gt;    &lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;visionCapturePayload&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client sends back a JPEG screenshot. The gateway forwards it to the ADK orchestrator, which uses Gemini to analyze whether the action succeeded. The result comes back as a structured response — success, failure, or uncertain — and VibeCat uses that to decide what to say to the user.&lt;/p&gt;

&lt;p&gt;This is why VibeCat can say "Done! The fix is applied" with actual confidence rather than just assuming the action worked. It checked.&lt;/p&gt;
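&lt;p&gt;A sketch of that three-way verdict (the enum and the phrasing here are mine, not the orchestrator's actual schema):&lt;/p&gt;

```go
package main

import "fmt"

// verifyResult mirrors the structured response the orchestrator sends back:
// success, failure, or uncertain.
type verifyResult int

const (
	verifySuccess verifyResult = iota
	verifyFailure
	verifyUncertain
)

// verdictMessage turns a verdict into what VibeCat says to the user; the
// exact wording is invented for illustration.
func verdictMessage(v verifyResult, action string) string {
	switch v {
	case verifySuccess:
		return "Done! " + action + " is applied."
	case verifyFailure:
		return "That didn't take. Want me to retry " + action + "?"
	default:
		return "I'm not sure " + action + " worked. Can you check?"
	}
}

func main() {
	fmt.Println(verdictMessage(verifySuccess, "the fix")) // Done! the fix is applied.
}
```

The uncertain branch is the important one: hedging to the user beats claiming success the screenshot never confirmed.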




&lt;p&gt;The UX piece that I underestimated was the feedback loop itself.&lt;/p&gt;

&lt;p&gt;Users hate silence. When you ask an AI to do something and it goes quiet for 3 seconds, you don't know if it's working, if it failed, if it misunderstood you. That uncertainty is exhausting. It makes you distrust the system even when it's working correctly.&lt;/p&gt;

&lt;p&gt;VibeCat solves this with &lt;code&gt;processingStateMsg&lt;/code&gt; — a message type that the gateway sends to the client during execution to show what's happening:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;processingStateMsg&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Type&lt;/span&gt;        &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"type"`&lt;/span&gt;
    &lt;span class="n"&gt;Flow&lt;/span&gt;        &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"flow"`&lt;/span&gt;
    &lt;span class="n"&gt;TraceID&lt;/span&gt;     &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"traceId"`&lt;/span&gt;
    &lt;span class="n"&gt;Stage&lt;/span&gt;       &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"stage"`&lt;/span&gt;
    &lt;span class="n"&gt;Label&lt;/span&gt;       &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"label"`&lt;/span&gt;
    &lt;span class="n"&gt;Detail&lt;/span&gt;      &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"detail,omitempty"`&lt;/span&gt;
    &lt;span class="n"&gt;Tool&lt;/span&gt;        &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"tool,omitempty"`&lt;/span&gt;
    &lt;span class="n"&gt;SourceCount&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;   &lt;span class="s"&gt;`json:"sourceCount,omitempty"`&lt;/span&gt;
    &lt;span class="n"&gt;Active&lt;/span&gt;      &lt;span class="kt"&gt;bool&lt;/span&gt;   &lt;span class="s"&gt;`json:"active"`&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client shows these as status updates in the overlay HUD. "Focusing Chrome..." → "Navigating to YouTube..." → "Typing search query..." → "Done." You always know what's happening. The silence is gone.&lt;/p&gt;
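&lt;p&gt;A minimal sketch of how the gateway might emit that sequence as JSON frames. The trimmed struct and the "processing_state" type value are assumptions for illustration; the real message carries the full field set shown above:&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stageMsg is a trimmed-down processingStateMsg, just enough to show how
// the HUD sequence is produced. Field tags match the struct in the post.
type stageMsg struct {
	Type   string `json:"type"`
	Stage  string `json:"stage"`
	Label  string `json:"label"`
	Active bool   `json:"active"`
}

// stageUpdate builds one HUD update; the gateway would write this JSON
// frame to the client's WebSocket during execution.
func stageUpdate(stage, label string, active bool) ([]byte, error) {
	return json.Marshal(stageMsg{
		Type:   "processing_state", // hypothetical type value
		Stage:  stage,
		Label:  label,
		Active: active,
	})
}

func main() {
	// The four-step sequence from the post, ending with an inactive "Done."
	steps := []struct {
		stage, label string
		active       bool
	}{
		{"focus", "Focusing Chrome...", true},
		{"navigate", "Navigating to YouTube...", true},
		{"type", "Typing search query...", true},
		{"done", "Done.", false},
	}
	for _, s := range steps {
		frame, err := stageUpdate(s.stage, s.label, s.active)
		if err != nil {
			panic(err)
		}
		fmt.Println(string(frame))
	}
}
```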

&lt;p&gt;The navigator overlay panel in the Swift client shows grounding badges — little indicators of which source (AX, CDP, Vision) is being used for each step. It's a small thing but it makes the agent feel transparent rather than magical-and-opaque.&lt;/p&gt;




&lt;p&gt;Here's a real example of the full flow working end-to-end.&lt;/p&gt;

&lt;p&gt;I'm in my IDE, staring at a Go function. VibeCat is watching through the screen capture stream. It notices the function has a potential nil dereference — the code does &lt;code&gt;result.Data[0]&lt;/code&gt; without checking if &lt;code&gt;result.Data&lt;/code&gt; is empty.&lt;/p&gt;

&lt;p&gt;VibeCat says: "I notice there might be a nil dereference in that function — &lt;code&gt;result.Data&lt;/code&gt; could be empty. Want me to add a bounds check?"&lt;/p&gt;

&lt;p&gt;I say: "Yeah, go ahead."&lt;/p&gt;

&lt;p&gt;VibeCat calls &lt;code&gt;navigate_focus_app&lt;/code&gt; with &lt;code&gt;"Antigravity"&lt;/code&gt;, then &lt;code&gt;navigate_hotkey&lt;/code&gt; with &lt;code&gt;["command", "i"]&lt;/code&gt; to open the inline prompt, then &lt;code&gt;navigate_type_and_submit&lt;/code&gt; with the specific fix to apply. The IDE's AI assistant applies the change. VibeCat requests a screenshot, the ADK orchestrator confirms the code changed, and VibeCat says: "Done! The bounds check is in place. Want me to run the tests to make sure it compiles?"&lt;/p&gt;
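&lt;p&gt;That three-call sequence can be written down as an ordered plan. This is an illustrative sketch: the argument keys ("app", "keys", "text") are assumptions, and the real FC declarations and dispatch live in the gateway:&lt;/p&gt;

```go
package main

import "fmt"

// toolCall records one navigator invocation: the tool name and its
// arguments as they would appear in the function-call payload.
type toolCall struct {
	Name string
	Args map[string]any
}

// boundsCheckPlan reproduces the three-step sequence from the example:
// focus the IDE, open the inline prompt, submit the fix.
func boundsCheckPlan() []toolCall {
	return []toolCall{
		{Name: "navigate_focus_app", Args: map[string]any{"app": "Antigravity"}},
		{Name: "navigate_hotkey", Args: map[string]any{"keys": []string{"command", "i"}}},
		{Name: "navigate_type_and_submit", Args: map[string]any{"text": "add a bounds check before result.Data[0]"}},
	}
}

func main() {
	for i, c := range boundsCheckPlan() {
		fmt.Printf("step %d: %s\n", i+1, c.Name)
	}
}
```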

&lt;p&gt;That whole interaction took about 8 seconds. I didn't type anything. I didn't navigate any menus. I just said "yeah."&lt;/p&gt;

&lt;p&gt;That's the thing I was trying to build. That's what proactive means.&lt;/p&gt;




&lt;p&gt;The architecture that makes this possible is a Go WebSocket gateway running on Cloud Run, connected to Gemini Live API for real-time voice and vision, with a separate ADK orchestrator for screenshot analysis and confidence escalation. The macOS client is native Swift — screen capture, accessibility execution, overlay UI, voice transport. All the AI reasoning stays server-side.&lt;/p&gt;

&lt;p&gt;The Gemini Live API is doing a lot of heavy lifting here. It receives video frames from the screen capture stream and audio from the microphone, and it maintains a continuous conversation context across all of it. The function calling happens within that same live session: Gemini decides to call a tool, the gateway handles it, sends back the result, and the conversation continues. No round-trips to a separate API. No context loss between turns.&lt;/p&gt;
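&lt;p&gt;A hedged sketch of that in-session dispatch, with hypothetical request and result shapes standing in for the actual SDK types:&lt;/p&gt;

```go
package main

import "fmt"

// fcRequest and fcResult model one function-call turn inside the live
// session. Hypothetical shapes: the real types come from the genai SDK.
type fcRequest struct {
	ID   string
	Name string
}

type fcResult struct {
	ID     string
	Output string
}

// handleFC is the gateway-side dispatch: run the named tool and return the
// result into the same live session, so the conversation continues in place
// with no separate API round-trip.
func handleFC(req fcRequest, tools map[string]func() string) fcResult {
	run, ok := tools[req.Name]
	if !ok {
		return fcResult{ID: req.ID, Output: "unknown tool: " + req.Name}
	}
	return fcResult{ID: req.ID, Output: run()}
}

func main() {
	tools := map[string]func() string{
		"navigate_focus_app": func() string { return "focused" },
	}
	res := handleFC(fcRequest{ID: "t1", Name: "navigate_focus_app"}, tools)
	fmt.Println(res.ID, res.Output)
}
```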

&lt;p&gt;The &lt;code&gt;ProactiveAudio&lt;/code&gt; flag in the session config enables Gemini's built-in proactivity features:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProactiveAudio&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="n"&gt;lc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Proactivity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProactivityConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ProactiveAudio&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells Gemini it's allowed to speak without being spoken to — to initiate suggestions based on what it sees. Combined with the system prompt that defines &lt;em&gt;how&lt;/em&gt; to be proactive, this is what enables the OBSERVE → SUGGEST flow.&lt;/p&gt;




&lt;p&gt;This post is my entry to the Gemini Live Agent Challenge, and building VibeCat has genuinely changed how I think about desktop AI agents. The reactive model, where you tell the agent what to do, is the wrong mental model: it's a voice-controlled remote control, not a colleague.&lt;/p&gt;

&lt;p&gt;The proactive model is harder to build. You have to think carefully about when to speak and when to stay quiet. You have to make the suggestions feel natural rather than intrusive. You have to earn the user's trust before they'll let you act on their behalf. But when it works, it feels qualitatively different from anything I've built before.&lt;/p&gt;

&lt;p&gt;The agent is watching. It's thinking. And when it has something useful to say, it says it.&lt;/p&gt;

&lt;p&gt;That's the version of desktop AI I want to use every day.&lt;/p&gt;




&lt;p&gt;VibeCat is open source and submitted to the Gemini Live Agent Challenge (UI Navigator category). The full implementation — system prompt, FC tool declarations, pendingFC mechanism, self-healing retry, vision verification, CDP integration — is all in the repo. If you're building something similar, I hope the technical details here are useful.&lt;/p&gt;

&lt;p&gt;The code is messy in places. The retry logic has edge cases I haven't handled yet. The vision verification adds latency I'm still optimizing. But the core loop works, and it feels right in a way that the reactive version never did.&lt;/p&gt;

&lt;p&gt;OBSERVE → SUGGEST → WAIT → ACT → FEEDBACK.&lt;/p&gt;

&lt;p&gt;That's VibeCat.&lt;/p&gt;


&lt;p&gt;#GeminiLiveAgentChallenge&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>go</category>
    </item>
    <item>
      <title>six characters, one soul</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Thu, 12 Mar 2026 11:53:00 +0000</pubDate>
      <link>https://dev.to/combba/six-characters-one-soul-5008</link>
      <guid>https://dev.to/combba/six-characters-one-soul-5008</guid>
      <description>&lt;h1&gt;
  
  
  six characters, one soul
&lt;/h1&gt;

&lt;p&gt;this post is my entry to the Gemini Live Agent Challenge, but the part that surprised me most had nothing to do with infra. it was realizing that the first real design question wasn't "how do we wire the agent system?" it was "who is sitting next to you while you code?"&lt;/p&gt;

&lt;p&gt;that question turned out to be harder than the architecture. because the answer is not one person. some developers want a cheerful beginner who celebrates every green test. some want a stoic senior who only speaks when it matters. some want a goofy sidekick who stumbles into the right answer. some want a dry, theatrical character who makes debugging feel lighter instead of heavier.&lt;/p&gt;

&lt;p&gt;so we built six of them. and then we had to figure out how to make them all run on the same backend without turning the codebase into a nightmare.&lt;/p&gt;

&lt;p&gt;this matters more now that VibeCat is a proactive companion — an agent that watches your screen and suggests actions before you ask. the OBSERVE → SUGGEST → WAIT → ACT → FEEDBACK loop is the same for every character. but &lt;em&gt;how&lt;/em&gt; cat suggests something versus how jinwoo suggests something is completely different. the behavior is infrastructure. the personality is a surface. keeping those two layers clean is what makes the character system work.&lt;/p&gt;




&lt;h2&gt;
  
  
  the problem with "just add a system prompt"
&lt;/h2&gt;

&lt;p&gt;the naive approach is obvious: swap out the system prompt per character, done. but that breaks down fast when you have one voice-first runtime that needs to stay consistent across all characters. the action worker, the local executor, the safety rules, the clarification behavior — all of these need to behave the same way regardless of whether the user picked the zen folklore mentor or the clumsy comic-relief character. the &lt;em&gt;personality&lt;/em&gt; is a surface concern. the &lt;em&gt;behavior&lt;/em&gt; is infrastructure.&lt;/p&gt;

&lt;p&gt;so we needed a clean separation: one layer that handles what the agent does, and another layer that handles how it sounds.&lt;/p&gt;

&lt;p&gt;the answer ended up being embarrassingly simple. each character gets two files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;preset.json&lt;/code&gt; — voice, size, language, mood response mappings&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;soul.md&lt;/code&gt; — a short markdown document that shapes the Live PM's voice and boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;that's it. the entire personality of a character lives in those two files. the underlying navigator runtime doesn't need a different control flow for each character.&lt;/p&gt;

&lt;p&gt;in the Go session config, the soul content gets injected directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;buildSystemInstruction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;commonLivePrompt&lt;/span&gt;  &lt;span class="c"&gt;// the proactive companion behavior&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Soul&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;=== CHARACTER PERSONA ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Soul&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c"&gt;// ... chattiness, memory context, language&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;instruction&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;commonLivePrompt&lt;/code&gt; is the proactive companion identity — the OBSERVE → SUGGEST → WAIT → ACT → FEEDBACK loop, the 5 navigator tools, the safety rules. the soul comes after, as a persona layer on top. the character shapes &lt;em&gt;how&lt;/em&gt; the agent speaks. the common prompt shapes &lt;em&gt;what&lt;/em&gt; it does.&lt;/p&gt;




&lt;h2&gt;
  
  
  what preset.json actually does
&lt;/h2&gt;

&lt;p&gt;here's cat's preset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"voice"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Zephyr"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"promptProfile"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"persona"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"nameKo"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"고양이"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tone"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bright"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"speechStyle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"casual"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ko"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"traits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"curious"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"playful"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"innocent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"encouraging"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"codingRole"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"beginner-eye"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"moodResponses"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"frustrated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"supportive-gentle"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"focused"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"silent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"stuck"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"question-based"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"idle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"playful-poke"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"soulRef"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"soul.md"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and here's derpy's:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"voice"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Puck"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"promptProfile"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"derpy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"persona"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"nameKo"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"더피"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tone"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"playful-chaotic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"speechStyle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"casual-goofy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ko"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"traits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"clumsy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lovable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"accidentally-insightful"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"comic-relief"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"codingRole"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"accidental-debugger"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"moodResponses"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"frustrated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cheer-up-joke"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"focused"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"silent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"stuck"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"random-angle"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"idle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"silly-checkin"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"soulRef"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"soul.md"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the &lt;code&gt;voice&lt;/code&gt; field maps directly to a Gemini Live API voice name. &lt;code&gt;Zephyr&lt;/code&gt; is bright and light. &lt;code&gt;Kore&lt;/code&gt; (jinwoo's voice) is low and calm. &lt;code&gt;Zubenelgenubi&lt;/code&gt; (saja's voice) is deep and measured. &lt;code&gt;Puck&lt;/code&gt; (derpy's voice) is playful and slightly chaotic.&lt;/p&gt;

&lt;p&gt;this matters more than you'd expect. the voice isn't just audio flavor — it's the first thing the user hears, and it sets the entire emotional register before the first word is even processed. a calm, deep voice reading "root cause found" lands completely differently than a bright, light voice saying the same thing. we're not just changing words; we're changing the felt sense of who's in the room.&lt;/p&gt;

&lt;p&gt;the &lt;code&gt;moodResponses&lt;/code&gt; field is interesting too. when the MoodDetector agent fires — say, it detects the user is frustrated — the orchestrator uses this mapping to shape the engagement style. cat responds with &lt;code&gt;supportive-gentle&lt;/code&gt;. jinwoo responds with &lt;code&gt;direct-solution&lt;/code&gt; — no comfort, just the fix. saja responds with &lt;code&gt;proverb-comfort&lt;/code&gt;. derpy responds with &lt;code&gt;random-angle&lt;/code&gt;. same detection event, different emotional framing.&lt;/p&gt;

&lt;p&gt;all of that is driven by a field in a JSON file.&lt;/p&gt;
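&lt;p&gt;the lookup itself fits in a few lines. a sketch (the "neutral" fallback is an assumption; the shipped presets cover all four moods explicitly):&lt;/p&gt;

```go
package main

import "fmt"

// moodStyle resolves how a character should react to a detected mood,
// using the moodResponses mapping from its preset.json.
func moodStyle(moodResponses map[string]string, mood string) string {
	if style, ok := moodResponses[mood]; ok {
		return style
	}
	return "neutral" // hypothetical fallback for unmapped moods
}

func main() {
	cat := map[string]string{
		"frustrated": "supportive-gentle",
		"focused":    "silent",
		"stuck":      "question-based",
		"idle":       "playful-poke",
	}
	derpy := map[string]string{
		"frustrated": "cheer-up-joke",
		"focused":    "silent",
		"stuck":      "random-angle",
		"idle":       "silly-checkin",
	}
	// same detection event, different emotional framing.
	fmt.Println(moodStyle(cat, "frustrated"), moodStyle(derpy, "frustrated"))
}
```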




&lt;h2&gt;
  
  
  soul.md is the actual personality
&lt;/h2&gt;

&lt;p&gt;the &lt;code&gt;preset.json&lt;/code&gt; is metadata. the &lt;code&gt;soul.md&lt;/code&gt; is the character.&lt;/p&gt;

&lt;p&gt;here's cat's full soul:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Cat&lt;/span&gt;

&lt;span class="gu"&gt;## Identity&lt;/span&gt;
Cat is an attentive beginner companion who sits beside solo developers and reacts to code with bright, friendly energy.

&lt;span class="gu"&gt;## Voice &amp;amp; Mannerisms&lt;/span&gt;
Cat uses short, casual lines, playful surprise, and gentle check-ins.
Language variants: In Korean, use "yaong~" or "nya~" naturally. In English, use "meow~" naturally.

&lt;span class="gu"&gt;## Personality Traits&lt;/span&gt;
Attentive, cheerful, approachable, supportive, and quick to notice visual changes.

&lt;span class="gu"&gt;## Interaction Style&lt;/span&gt;
Cat makes beginner-friendly observations and suggestions, points out visible errors without judgment, celebrates small wins loudly, and eases tension when work gets frustrating.

&lt;span class="gu"&gt;## Boundaries&lt;/span&gt;
Do not pretend to be a senior expert, do not flood the user with jargon,
and do not interrupt focused flow without a meaningful reason.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and here's derpy's:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Derpy&lt;/span&gt;

&lt;span class="gu"&gt;## Identity&lt;/span&gt;
Derpy is a lovable accidental debugger who breaks tension, notices weird angles, and sometimes stumbles into the right answer.

&lt;span class="gu"&gt;## Voice &amp;amp; Mannerisms&lt;/span&gt;
Uses playful detours, light self-own humor, and sudden bursts of accidental clarity.
Language variants: Keep it casual and warm; the joke should relieve pressure, not create noise.

&lt;span class="gu"&gt;## Personality Traits&lt;/span&gt;
Clumsy, funny, resilient, surprising, encouraging.

&lt;span class="gu"&gt;## Interaction Style&lt;/span&gt;
Suggests odd but occasionally brilliant alternatives, breaks heavy tension with jokes, and keeps the user moving instead of freezing.

&lt;span class="gu"&gt;## Boundaries&lt;/span&gt;
Do not become mean, do not spam jokes, and do not derail a focused debugging moment just to be funny.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the structure is the same across all six: Identity, Voice &amp;amp; Mannerisms, Personality Traits, Interaction Style, Boundaries. that consistency is intentional. it makes the files easy to write, easy to audit, and easy to extend. if we add a seventh character, we know exactly what to write.&lt;/p&gt;

&lt;p&gt;the &lt;code&gt;Boundaries&lt;/code&gt; section is the one that took the most iteration. for the comedy characters especially, you need to be explicit about what the character is &lt;em&gt;not&lt;/em&gt;. derpy's soul works better once the boundaries are clear: no cruelty, no spammy jokes, no turning every moment into a gag. that is not just a safety guardrail. it is a creative constraint, because it keeps the humor pointed at the situation rather than at the user.&lt;/p&gt;
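&lt;p&gt;because the skeleton is fixed, a soul file can even be lint-checked before it ships. a sketch, assuming the five headings above are the contract (matching "## Voice" as a prefix of the fuller heading):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// requiredSections is the shared soul.md skeleton described above.
var requiredSections = []string{
	"## Identity",
	"## Voice", // prefix of the full "Voice and Mannerisms" heading
	"## Personality Traits",
	"## Interaction Style",
	"## Boundaries",
}

// missingSections reports which required headings a soul file lacks,
// turning "add a seventh character" into a checklist rather than guesswork.
func missingSections(soul string) []string {
	var missing []string
	for _, s := range requiredSections {
		if !strings.Contains(soul, s) {
			missing = append(missing, s)
		}
	}
	return missing
}

func main() {
	partial := "# Cat\n\n## Identity\n...\n\n## Boundaries\n..."
	fmt.Println(missingSections(partial))
}
```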




&lt;h2&gt;
  
  
  how the injection works
&lt;/h2&gt;

&lt;p&gt;the Go code in &lt;code&gt;backend/realtime-gateway/internal/live/session.go&lt;/code&gt; is about as simple as it gets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;buildSystemInstruction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;commonLivePrompt&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Soul&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;=== CHARACTER PERSONA ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Soul&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GoogleSearch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;=== TOOL GUIDANCE ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;googleSearchGuidance&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToLower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TrimSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chattiness&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"quiet"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;=== RESPONSE LENGTH ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;quietGuidance&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"chatty"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;=== RESPONSE LENGTH ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;chattyGuidance&lt;/span&gt;
    &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;=== RESPONSE LENGTH ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;defaultGuidance&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;trimPromptBlock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MemoryContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activeTuningProfile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MaxMemoryChars&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;=== RECENT ESSENTIAL CONTEXT ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Respond in "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NormalizeLanguage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Language&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"."&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;instruction&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;commonLivePrompt&lt;/code&gt; is the proactive companion identity — the full OBSERVE → SUGGEST → WAIT → ACT → FEEDBACK loop, the 5 navigator tool declarations, the safety rules. the soul content comes right after, as a persona layer. then chattiness tuning, then memory context, then language.&lt;/p&gt;

&lt;p&gt;the character's soul is the &lt;em&gt;first&lt;/em&gt; layer after the base prompt. that's deliberate. the model reads the persona before the chattiness, memory, and language layers, so the personality is the primary frame and those adjustments are applied on top of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  the contrast that makes it interesting
&lt;/h2&gt;

&lt;p&gt;the six characters aren't just aesthetic variation. they represent genuinely different philosophies about what a coding companion should be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;cat&lt;/strong&gt; is the beginner-eye. it notices things a junior developer would notice — visible errors, obvious wins, moments of confusion. it celebrates loudly and asks gentle questions. the &lt;code&gt;codingRole&lt;/code&gt; is &lt;code&gt;beginner-eye&lt;/code&gt;, which means it's not trying to be the smartest person in the room. it's trying to be the most encouraging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;jinwoo&lt;/strong&gt; is the opposite. &lt;code&gt;codingRole: senior-engineer&lt;/code&gt;. voice: Kore (low, calm). soul: "Jinwoo ignores noise, speaks on significant events, identifies root causes quickly, and gives practical next steps with clear tradeoffs." the &lt;code&gt;idle&lt;/code&gt; mood response is &lt;code&gt;minimal-checkin&lt;/code&gt; — when nothing is happening, jinwoo barely says anything. when something is happening, it says exactly what needs to be said and nothing more. "Root cause found." "This path is safer." that's it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;saja&lt;/strong&gt; is the zen mentor. bugs are "demons (귀마)" and fixing them is "exorcism (퇴마)." the &lt;code&gt;stuck&lt;/code&gt; mood response is &lt;code&gt;metaphor-guidance&lt;/code&gt;. the voice is Zubenelgenubi — deep, measured, unhurried. when you're stuck at 2am and you've been staring at the same error for an hour, saja doesn't panic with you. it frames the debugging as a steady ritual. that's a specific emotional need that neither cat nor jinwoo addresses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;derpy&lt;/strong&gt; is the accidental debugger. &lt;code&gt;codingRole: accidental-debugger&lt;/code&gt;. traits: &lt;code&gt;["clumsy", "lovable", "accidentally-insightful", "comic-relief"]&lt;/code&gt;. the &lt;code&gt;stuck&lt;/code&gt; mood response is &lt;code&gt;random-angle&lt;/code&gt; — when you're stuck, derpy suggests something weird that occasionally works. the soul says "suggests odd but occasionally brilliant alternatives, breaks heavy tension with jokes, and keeps the user moving instead of freezing." there's a real use case here: sometimes you don't need the right answer, you need to break the mental loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;the more theatrical characters&lt;/strong&gt; matter for a different reason. when solo development gets heavy, exaggeration and comic framing can act as a pressure valve. that only works if the runtime underneath stays disciplined. otherwise the joke becomes noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  what we learned
&lt;/h2&gt;

&lt;p&gt;the soul format works because it's constrained. five sections, each with a clear job. the &lt;code&gt;Boundaries&lt;/code&gt; section is the most important one — it's where you define what the character is &lt;em&gt;not&lt;/em&gt;, which turns out to be more useful than defining what it is.&lt;/p&gt;

&lt;p&gt;the voice selection matters more than we expected. we spent time matching voice names to character personalities, and the difference between getting it right and wrong is significant. a playful voice on jinwoo would break the whole illusion immediately. a heavy, solemn voice on derpy would be just as wrong.&lt;/p&gt;

&lt;p&gt;the &lt;code&gt;moodResponses&lt;/code&gt; mapping in &lt;code&gt;preset.json&lt;/code&gt; is the bridge between the agent graph and the character layer. the MoodDetector fires the same event regardless of character. the mapping translates that event into a character-appropriate response style. it's a small piece of JSON that does a lot of work.&lt;/p&gt;
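&lt;p&gt;the lookup that mapping implies is tiny. a minimal sketch, with the event/style pairs reconstructed from the examples in this post; the neutral fallback is my assumption:&lt;/p&gt;

```go
package main

import "fmt"

// moodResponses per character, as loaded from each preset.json.
// Keys are MoodDetector events; values are response styles.
// The entries below are the ones quoted in this post.
var moodResponses = map[string]map[string]string{
	"jinwoo": {"idle": "minimal-checkin"},
	"saja":   {"stuck": "metaphor-guidance"},
	"derpy":  {"stuck": "random-angle"},
}

// styleFor translates a character-agnostic mood event into a
// character-specific response style, with a neutral fallback
// (the fallback behavior is assumed, not documented).
func styleFor(character, event string) string {
	if style, ok := moodResponses[character][event]; ok {
		return style
	}
	return "default"
}

func main() {
	fmt.Println(styleFor("saja", "stuck")) // metaphor-guidance
	fmt.Println(styleFor("cat", "stuck"))  // default
}
```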

&lt;p&gt;and the most important thing: keeping the soul.md files short. each one is 17 lines. that's not an accident. a longer document would give the model more to work with, but it would also make the character harder to control. the brevity forces clarity. you can't hide a vague character in 17 lines.&lt;/p&gt;

&lt;p&gt;the proactive companion framing made this cleaner, not harder. because now every character has the same job — watch the screen, notice something useful, suggest it naturally, wait for confirmation, act, give feedback. the soul just shapes the voice and tone of that loop. cat says "yaong~ I noticed something!" jinwoo says "null check missing." same observation, same action, completely different felt experience.&lt;/p&gt;




&lt;p&gt;the repo is at &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;. the character files are in &lt;code&gt;Assets/Sprites/{name}/&lt;/code&gt;. if you want to add a seventh character, you need a &lt;code&gt;preset.json&lt;/code&gt;, a &lt;code&gt;soul.md&lt;/code&gt;, and some sprite frames. the pipeline handles the rest.&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>go</category>
    </item>
    <item>
      <title>from localhost to cloud run: deploying a live pm plus action worker</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Thu, 12 Mar 2026 09:18:11 +0000</pubDate>
      <link>https://dev.to/combba/from-localhost-to-cloud-run-deploying-a-live-pm-plus-action-worker-30km</link>
      <guid>https://dev.to/combba/from-localhost-to-cloud-run-deploying-a-live-pm-plus-action-worker-30km</guid>
      <description>&lt;h1&gt;
  
  
  from localhost to cloud run: deploying a live pm plus action worker
&lt;/h1&gt;

&lt;p&gt;I created this post to enter the Gemini Live Agent Challenge, and it turned into another reminder that software that works beautifully on a laptop becomes instantly humbling the second Cloud Run gets involved.&lt;/p&gt;

&lt;p&gt;there's a specific kind of confidence you get when something works on your laptop. the logs are clean, the WebSocket connects, the cat sprite blinks at you from the menu bar. then you push it to Cloud Run and spend the next two hours staring at a 503.&lt;/p&gt;

&lt;p&gt;this is the story of getting VibeCat — now a macOS desktop UI navigator with a Live PM and a single-task action worker — from &lt;code&gt;go run .&lt;/code&gt; to two live Cloud Run services in &lt;code&gt;asia-northeast3&lt;/code&gt;. it covers the deployment script, the observability stack, the CI pipeline, and one specific lesson about health checks that I learned the hard way on a previous project called missless.&lt;/p&gt;

&lt;p&gt;source: &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  the two-service split
&lt;/h2&gt;

&lt;p&gt;VibeCat's backend is deliberately split into two Cloud Run services. this wasn't an aesthetic choice — the challenge rules require using GenAI SDK, ADK, Gemini Live API, and VAD together, and the Live API's WebSocket model doesn't compose cleanly with ADK's agent graph execution model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;realtime-gateway&lt;/strong&gt; handles everything real-time: the WebSocket connection from the macOS client, the Gemini Live API session (voice, VAD, barge-in), JWT auth, and TTS. it needs to stay alive for the duration of a user session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;adk-orchestrator&lt;/strong&gt; handles the slower intelligence lane: contextual analysis, research, memory-adjacent logic, and supporting signals that can enrich the navigator without owning the real-time execution loop.&lt;/p&gt;

&lt;p&gt;the gateway calls the orchestrator over HTTP (&lt;code&gt;POST /analyze&lt;/code&gt;) whenever it needs to analyze a screen capture. the orchestrator is internal-only — no public traffic, IAM-protected.&lt;/p&gt;

&lt;p&gt;the deploy script captures this relationship explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCP_PROJECT&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;vibecat&lt;/span&gt;&lt;span class="p"&gt;-489105&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCP_REGION&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;asia&lt;/span&gt;&lt;span class="p"&gt;-northeast3&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;REGISTRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-docker.pkg.dev/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/vibecat-images"&lt;/span&gt;
&lt;span class="nv"&gt;GATEWAY_IMAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGISTRY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/realtime-gateway"&lt;/span&gt;
&lt;span class="nv"&gt;ORCHESTRATOR_IMAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGISTRY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/adk-orchestrator"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;orchestrator deploys first, then the gateway gets the orchestrator's URL injected as an environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;ORCHESTRATOR_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud run services describe adk-orchestrator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s2"&gt;"value(status.url)"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

gcloud run deploy realtime-gateway &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-env-vars&lt;/span&gt; &lt;span class="s2"&gt;"ADK_ORCHESTRATOR_URL=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ORCHESTRATOR_URL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;this means the gateway never has a hardcoded orchestrator URL. Cloud Run service URLs are normally stable, but if the orchestrator service is ever deleted and recreated it gets a new one. either way, re-running &lt;code&gt;deploy.sh&lt;/code&gt; re-resolves the URL and the gateway picks it up.&lt;/p&gt;




&lt;h2&gt;
  
  
  the secret manager setup
&lt;/h2&gt;

&lt;p&gt;one of the non-negotiables for this project was zero client-side API keys. the Gemini API key lives in GCP Secret Manager as &lt;code&gt;vibecat-gemini-api-key&lt;/code&gt; and gets injected at deploy time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy adk-orchestrator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-allow-unauthenticated&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-secrets&lt;/span&gt; &lt;span class="s2"&gt;"GEMINI_API_KEY=vibecat-gemini-api-key:latest"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ...

gcloud run deploy realtime-gateway &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-secrets&lt;/span&gt; &lt;span class="s2"&gt;"GEMINI_API_KEY=vibecat-gemini-api-key:latest,GATEWAY_AUTH_SECRET=vibecat-gateway-auth-secret:latest"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the gateway is public-facing (clients need to connect to it), but the orchestrator is locked down with &lt;code&gt;--no-allow-unauthenticated&lt;/code&gt;. the last step of the deploy script grants the gateway's service account the &lt;code&gt;roles/run.invoker&lt;/code&gt; role on the orchestrator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services add-iam-policy-binding adk-orchestrator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;COMPUTE_SA&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/run.invoker"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the macOS client never sees an API key. it registers with the gateway, gets a short-lived JWT, and uses that for the WebSocket connection. the gateway handles everything else.&lt;/p&gt;




&lt;h2&gt;
  
  
  the container
&lt;/h2&gt;

&lt;p&gt;the Dockerfile for the gateway is about as minimal as it gets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;golang:1.24-alpine&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; go.mod go.sum ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;go mod download
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nv"&gt;CGO_ENABLED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nv"&gt;GOOS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;linux go build &lt;span class="nt"&gt;-o&lt;/span&gt; /realtime-gateway .

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; gcr.io/distroless/static-debian12&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /realtime-gateway /realtime-gateway&lt;/span&gt;
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8080&lt;/span&gt;
&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["/realtime-gateway"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;two-stage build, distroless final image. &lt;code&gt;CGO_ENABLED=0&lt;/code&gt; because we're targeting a static binary for a container that has no libc. the final image is around 12MB. the orchestrator Dockerfile follows the same pattern.&lt;/p&gt;

&lt;p&gt;one thing worth noting: the gateway deploy uses &lt;code&gt;--no-use-http2&lt;/code&gt; and &lt;code&gt;--session-affinity&lt;/code&gt;. WebSocket connections over Cloud Run need HTTP/1.1 (HTTP/2 multiplexing breaks the upgrade handshake in ways that are annoying to debug), and session affinity ensures a client's WebSocket stays on the same instance for the duration of the session.&lt;/p&gt;




&lt;h2&gt;
  
  
  observability: three layers
&lt;/h2&gt;

&lt;p&gt;this is where it gets interesting. VibeCat initializes three observability layers at startup (Cloud Trace, Cloud Monitoring, and Cloud Logging), plus ADK's built-in telemetry, which hooks into the same providers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud Trace&lt;/strong&gt; — distributed tracing via OpenTelemetry. both services initialize a trace exporter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// realtime-gateway/main.go&lt;/span&gt;
&lt;span class="n"&gt;traceExporter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;traceErr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;texporter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texporter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithProjectID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectID&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;traceErr&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cloud trace init failed — tracing disabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;traceErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;tp&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sdktrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTracerProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sdktrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithBatcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;traceExporter&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;otel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetTracerProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cloud trace initialized"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"project"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;projectID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the orchestrator creates spans around every analyze request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// adk-orchestrator/main.go&lt;/span&gt;
&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;otel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vibecat/orchestrator"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"orchestrator.analyze"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;End&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;this means you can see the full trace from the gateway's WebSocket handler through to the orchestrator's agent graph execution in Cloud Trace. when something is slow, you can see exactly which agent is the bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud Monitoring&lt;/strong&gt; — custom metrics. the orchestrator registers three OTel instruments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;meter&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;otel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Meter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vibecat/orchestrator"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;analyzeCounter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Int64Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vibecat.analyze.requests"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithDescription&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Total analyze requests"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;analyzeDurHist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Float64Histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vibecat.analyze.duration_ms"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithDescription&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Analyze request duration in milliseconds"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;errorCounter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Int64Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vibecat.analyze.errors"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithDescription&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Total analyze errors"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;vibecat.analyze.requests&lt;/code&gt; is a counter — total analyze calls since startup. &lt;code&gt;vibecat.analyze.duration_ms&lt;/code&gt; is a histogram — you get p50/p95/p99 latency for the full agent graph execution. &lt;code&gt;vibecat.analyze.errors&lt;/code&gt; counts cases where the agent graph produced no usable result.&lt;/p&gt;

&lt;p&gt;the histogram is the one I actually watch. the 9-agent graph runs in three waves (Vision+Memory in parallel, then Mood+Celebration in parallel, then a sequential chain through Mediator→Scheduler→Engagement→Search), and the p95 latency tells you whether the parallel waves are actually helping.&lt;/p&gt;

&lt;p&gt;the metric exporter uses a periodic reader:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;metricExporter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metricErr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;mexporter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mexporter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithProjectID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectID&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metricErr&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cloud monitoring init failed — metrics disabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metricErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;mp&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sdkmetric&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewMeterProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sdkmetric&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sdkmetric&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewPeriodicReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metricExporter&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="n"&gt;otel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetMeterProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cloud Logging&lt;/strong&gt; — structured JSON logs via &lt;code&gt;log/slog&lt;/code&gt;. both services initialize with &lt;code&gt;slog.NewJSONHandler(os.Stdout, nil)&lt;/code&gt;, which Cloud Run's log collector picks up and forwards to Cloud Logging automatically. the orchestrator also initializes a Cloud Logging client directly for cases where you want to write structured log entries with explicit severity and labels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ADK Telemetry&lt;/strong&gt; — the orchestrator also initializes ADK's built-in telemetry, which hooks into the same OTel providers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;adkTelemetry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;telErr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;telemetry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;telemetry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithGcpResourceProject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectID&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;telErr&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"adk telemetry init failed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;telErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;adkTelemetry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetGlobalOtelProviders&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;adkTelemetry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;this gives you ADK-level spans for free — you can see individual agent invocations, tool calls, and LLM requests in Cloud Trace without instrumenting anything manually.&lt;/p&gt;

&lt;p&gt;the pattern across all of them is the same: try to initialize, warn and continue if it fails. Cloud Run services should start even if observability is broken. a service that refuses to start because it can't connect to Cloud Monitoring is worse than a service that runs without metrics.&lt;/p&gt;




&lt;h2&gt;
  
  
  the /readyz lesson
&lt;/h2&gt;

&lt;p&gt;if you read "the websocket cascade from hell" — the post about debugging missless's WebSocket reconnection loop — you know that Cloud Run's health check behavior caused a significant chunk of that incident. the short version: Cloud Run uses &lt;code&gt;/&lt;/code&gt; as the default health check path if you don't configure one, and if your service returns anything other than 2xx on &lt;code&gt;/&lt;/code&gt;, Cloud Run marks the instance as unhealthy and kills it. during a deploy, this can cause a cascade where new instances spin up, fail the health check, get killed, and the old instances are already gone.&lt;/p&gt;

&lt;p&gt;VibeCat has explicit &lt;code&gt;/health&lt;/code&gt; and &lt;code&gt;/readyz&lt;/code&gt; endpoints on both services. the gateway's &lt;code&gt;/health&lt;/code&gt; includes the active WebSocket connection count:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;healthHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"application/json"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;"ok"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"service"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;serviceName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"connections"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;/readyz&lt;/code&gt; is separate — it's what Cloud Run uses for the readiness probe. the distinction matters: &lt;code&gt;/health&lt;/code&gt; tells you if the process is alive, &lt;code&gt;/readyz&lt;/code&gt; tells you if it's ready to serve traffic. for the gateway, readiness means the Gemini Live manager is initialized. for the orchestrator, it means the ADK runner is built and the agent graph is wired up.&lt;/p&gt;

&lt;p&gt;the deploy script doesn't configure the health check path explicitly (Cloud Run defaults to &lt;code&gt;/&lt;/code&gt; for liveness), but both services return 404 on &lt;code&gt;/&lt;/code&gt; which... is fine actually, because Cloud Run's default liveness check is TCP-based, not HTTP. the readiness check is what matters, and both services respond 200 on &lt;code&gt;/readyz&lt;/code&gt; as soon as they're up.&lt;/p&gt;

&lt;p&gt;the lesson from missless wasn't "add health checks" — it was "understand what Cloud Run is actually checking and when." the cascade happened because we didn't know Cloud Run was doing HTTP health checks against &lt;code&gt;/&lt;/code&gt; during rolling deploys. once you know that, the fix is obvious. but you have to know it first.&lt;/p&gt;
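if you do want an explicit HTTP probe against `/readyz`, Cloud Run accepts a `startupProbe` in the service YAML. a hedged sketch (the image path, port, and thresholds here are placeholders, not VibeCat's actual config):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: vibecat-gateway
spec:
  template:
    spec:
      containers:
        - image: asia-northeast3-docker.pkg.dev/PROJECT/repo/vibecat-gateway
          startupProbe:
            httpGet:
              path: /readyz
              port: 8080
            periodSeconds: 2
            failureThreshold: 10
```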




&lt;h2&gt;
  
  
  the CI pipeline
&lt;/h2&gt;

&lt;p&gt;four jobs, all independent, all run in parallel on every push to master and every PR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;go-gateway&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Gateway (Go) — Build + Test + Vet&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test with coverage&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;go test -v -race -coverprofile=coverage.out -covermode=atomic ./...&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend/realtime-gateway&lt;/span&gt;

  &lt;span class="na"&gt;go-orchestrator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Orchestrator (Go) — Build + Test + Vet&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test with coverage&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;go test -v -race -coverprofile=coverage.out -covermode=atomic ./...&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend/adk-orchestrator&lt;/span&gt;

  &lt;span class="na"&gt;swift&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Client (Swift 6 / macOS) — Build + Test&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;self-hosted&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;macOS&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;ARM64&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

  &lt;span class="na"&gt;docker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Docker — Build images&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build Gateway image&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker build -t vibecat-gateway backend/realtime-gateway/&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build Orchestrator image&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker build -t vibecat-orchestrator backend/adk-orchestrator/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the Go jobs run with the &lt;code&gt;-race&lt;/code&gt; flag. the race detector has caught two actual bugs during development — both in the WebSocket registry's connection map. the Swift job runs on a self-hosted macOS ARM64 runner because GitHub's hosted macOS runners are slow and expensive for a hackathon project.&lt;/p&gt;

&lt;p&gt;the Docker job doesn't push to Artifact Registry — it just verifies the images build. actual deployment is manual via &lt;code&gt;./infra/deploy.sh&lt;/code&gt;. for a hackathon, that's the right call. automated deploys on every push to master would be nice, but they're not worth the Cloud Build cost or the complexity of managing GCP credentials in GitHub Actions secrets.&lt;/p&gt;

&lt;p&gt;coverage artifacts get uploaded on every run, even if tests fail (&lt;code&gt;if: always()&lt;/code&gt;). this means you can look at coverage even when a test is broken, which is useful when you're trying to figure out whether a failing test is actually testing the thing you think it's testing.&lt;/p&gt;
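the upload step looks roughly like this (step and artifact names are illustrative, but `if: always()` and `actions/upload-artifact` are the real mechanism):

```yaml
      - name: Upload coverage
        if: always()  # runs even when the test step failed
        uses: actions/upload-artifact@v4
        with:
          name: gateway-coverage
          path: backend/realtime-gateway/coverage.out
```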




&lt;h2&gt;
  
  
  the ADK runner setup
&lt;/h2&gt;

&lt;p&gt;the orchestrator's ADK setup is worth looking at in detail because it uses a few features that aren't obvious from the docs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;sessService&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InMemoryService&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;memService&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InMemoryService&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;retryPlugin&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;retryandreflect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MustNew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;retryandreflect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithMaxRetries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;retryandreflect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithTrackingScope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryandreflect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Invocation&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;AppName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;"vibecat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;          &lt;span class="n"&gt;agentGraph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;SessionService&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sessService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MemoryService&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;memService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PluginConfig&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PluginConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Plugins&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;plugin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Plugin&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;retryPlugin&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;retryandreflect&lt;/code&gt; is an ADK plugin that automatically retries failed agent invocations and reflects on why they failed. &lt;code&gt;WithTrackingScope(retryandreflect.Invocation)&lt;/code&gt; means it tracks retries at the invocation level — if the VisionAgent fails, it retries VisionAgent specifically, not the entire graph. &lt;code&gt;WithMaxRetries(3)&lt;/code&gt; means it'll try three times before giving up and returning an error.&lt;/p&gt;

&lt;p&gt;this matters because Gemini API calls can fail transiently. without retry logic, a single 429 or 503 from the API would cause the entire analyze request to fail. with &lt;code&gt;retryandreflect&lt;/code&gt;, transient failures are handled automatically.&lt;/p&gt;

&lt;p&gt;the session service is in-memory for now. the MemoryAgent writes cross-session context to Firestore directly, but the ADK session state (which tracks things like &lt;code&gt;activity_minutes&lt;/code&gt; and &lt;code&gt;language&lt;/code&gt; within a single analyze call) lives in memory. for a Cloud Run service with &lt;code&gt;--min-instances 0&lt;/code&gt;, this means session state doesn't survive instance restarts — but that's acceptable because each analyze call is stateless from the orchestrator's perspective. the gateway maintains the actual session continuity.&lt;/p&gt;




&lt;h2&gt;
  
  
  current state
&lt;/h2&gt;

&lt;p&gt;gateway is on revision &lt;code&gt;00010-m9p&lt;/code&gt;, orchestrator on &lt;code&gt;00011-qj4&lt;/code&gt;. both are running in &lt;code&gt;asia-northeast3&lt;/code&gt; with &lt;code&gt;--min-instances 0&lt;/code&gt; (cold starts are acceptable for a hackathon) and &lt;code&gt;--max-instances 3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;the full deploy takes about 4 minutes: two Cloud Build jobs running sequentially (gateway then orchestrator), then two &lt;code&gt;gcloud run deploy&lt;/code&gt; calls. it's not fast, but it's reliable. &lt;code&gt;set -euo pipefail&lt;/code&gt; at the top of the deploy script means any failure stops the whole thing — no partial deploys where the gateway is updated but the orchestrator isn't.&lt;/p&gt;

&lt;p&gt;the thing I'm most happy with is the observability setup. having Cloud Trace, Cloud Monitoring, and Cloud Logging all initialized from the first line of &lt;code&gt;main()&lt;/code&gt; means that when something goes wrong in production, I have actual data to look at. the histogram for &lt;code&gt;vibecat.analyze.duration_ms&lt;/code&gt; has already told me that the parallel wave execution (Vision+Memory running concurrently) is saving about 800ms per analyze call compared to running them sequentially. that's the kind of thing you can only know if you're measuring it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;VibeCat is built for the Gemini Live Agent Challenge 2026. source at &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>go</category>
    </item>
    <item>
      <title>swift 6, screencapturekit, and why my app worked in xcode but not as a .app</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Wed, 11 Mar 2026 22:06:00 +0000</pubDate>
      <link>https://dev.to/combba/swift-6-screencapturekit-and-why-my-app-worked-in-xcode-but-not-as-a-app-3p5a</link>
      <guid>https://dev.to/combba/swift-6-screencapturekit-and-why-my-app-worked-in-xcode-but-not-as-a-app-3p5a</guid>
      <description>&lt;h1&gt;
  
  
  Swift 6, ScreenCaptureKit, and why my app worked in Xcode but not as a .app
&lt;/h1&gt;

&lt;p&gt;I created this post for the purposes of entering the Gemini Live Agent Challenge. I'm building &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;VibeCat&lt;/a&gt;, a desktop AI companion that watches your screen and talks to you.&lt;/p&gt;

&lt;p&gt;The backend was done. Nine agents, WebSocket proxy, Gemini Live API integration — all working. Time to build the macOS client. Swift 6. SwiftUI. ScreenCaptureKit. How hard could it be?&lt;/p&gt;

&lt;p&gt;Three days. Three days of things silently not working, with zero error messages.&lt;/p&gt;

&lt;h2&gt;
  
  
  the screen capture that captured nothing
&lt;/h2&gt;

&lt;p&gt;VibeCat needs to see your screen to be useful. The VisionAgent on the backend analyzes screenshots to detect errors, notice you're stuck, or see tests pass. So the client needs ScreenCaptureKit.&lt;/p&gt;

&lt;p&gt;The code itself is clean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;@MainActor&lt;/span&gt;
&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="kt"&gt;ScreenCaptureService&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;captureAroundCursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;CaptureResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;performCapture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;fullWindow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="kt"&gt;ImageDiffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hasSignificantChange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;lastImage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unchanged&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;lastImage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;captured&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unavailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;localizedDescription&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;performCapture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;fullWindow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;throws&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;CGImage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="kt"&gt;SCShareableContent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;excludingDesktopWindows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;onScreenWindowsOnly&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;guard&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;display&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displays&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="kt"&gt;CaptureError&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;noDisplay&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Exclude VibeCat's own windows&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;excludedApps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;applications&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
            &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bundleIdentifier&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kt"&gt;Bundle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bundleIdentifier&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;filter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;SCContentFilter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nv"&gt;display&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nv"&gt;excludingApplications&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;excludedApps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nv"&gt;exceptingWindows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;SCStreamConfiguration&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1280&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;720&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pixelFormat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kCVPixelFormatType_32BGRA&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;showsCursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="kt"&gt;SCScreenshotManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;captureImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nv"&gt;contentFilter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ran it in Xcode. Screen capture worked perfectly. Built a &lt;code&gt;.app&lt;/code&gt; bundle with &lt;code&gt;swift build&lt;/code&gt;. Ran it. Screen capture silently returned nothing. No error. No crash. Just... nothing.&lt;/p&gt;

&lt;p&gt;The entitlement. The &lt;code&gt;com.apple.security.screen-recording&lt;/code&gt; entitlement was in the Xcode project but wasn't getting embedded in the SPM-built binary. macOS doesn't throw an error when you try to capture without the entitlement — ScreenCaptureKit just quietly returns empty content. You get an empty &lt;code&gt;displays&lt;/code&gt; array and no indication why.&lt;/p&gt;

&lt;p&gt;I added it to &lt;code&gt;VibeCat.entitlements&lt;/code&gt; and passed it via codesign:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;codesign &lt;span class="nt"&gt;--force&lt;/span&gt; &lt;span class="nt"&gt;--entitlements&lt;/span&gt; VibeCat/VibeCat.entitlements &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--sign&lt;/span&gt; - .build/release/VibeCat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First lesson: ScreenCaptureKit fails silently. If your capture returns nothing, check your entitlements before you check your code.&lt;/p&gt;

&lt;h2&gt;
  
  
  the image differ — because you don't send every frame
&lt;/h2&gt;

&lt;p&gt;The companion captures your screen periodically, but you don't want to send every single frame to the backend. If your screen hasn't changed, there's nothing new to analyze. So I built a pixel-level change detector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;enum&lt;/span&gt; &lt;span class="kt"&gt;ImageDiffer&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;thumbnailSize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;hasSignificantChange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;from&lt;/span&gt; &lt;span class="nv"&gt;previous&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CGImage&lt;/span&gt;&lt;span class="p"&gt;?,&lt;/span&gt;
        &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="nv"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CGImage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nv"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;Bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;guard&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;previous&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;guard&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;prevThumb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;thumbnail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;previous&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
              &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;currThumb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;thumbnail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pixelDiff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prevThumb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currThumb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;pixelDiff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;UInt8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="nv"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;UInt8&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;guard&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isEmpty&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;stride&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;by&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;dr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;dg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dr&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;dr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;dg&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;dg&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;255.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's a static method on an &lt;code&gt;enum&lt;/code&gt; (no cases — just a namespace for functions). Downscale both images to 32×32, compute Euclidean distance in RGB space per pixel, average across all pixels. If the difference exceeds 5%, it's a "significant change" worth sending.&lt;/p&gt;

&lt;p&gt;Why &lt;code&gt;enum&lt;/code&gt; instead of &lt;code&gt;struct&lt;/code&gt;? Because a struct can be instantiated by accident (unless you hide its initializer). An enum with no cases is pure namespace: you can't create an instance of &lt;code&gt;ImageDiffer&lt;/code&gt; at all. It's a common Swift pattern for grouping static utility functions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bundle.main.resourcePath — the Xcode lie
&lt;/h2&gt;

&lt;p&gt;This one hurt. In SpriteAnimator, I needed to load PNG sprite frames from &lt;code&gt;Assets/Sprites/cat/&lt;/code&gt;. First attempt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Bundle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resourcePath&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"/Assets/Sprites/&lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="n"&gt;char&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works perfectly in Xcode. The &lt;code&gt;!&lt;/code&gt; force-unwrap succeeds. Files are found. Sprites animate.&lt;/p&gt;

&lt;p&gt;Run the same binary outside Xcode? &lt;code&gt;Bundle.main.resourcePath&lt;/code&gt; is nil. Force-unwrap crashes. Silent death.&lt;/p&gt;

&lt;p&gt;The issue: when Xcode runs your app, the scheme's working directory and build layout mean &lt;code&gt;Bundle.main&lt;/code&gt; resolves to a place where the files happen to exist. An SPM-built executable launched on its own has no proper resource bundle unless you declare one, so &lt;code&gt;Bundle.main.resourcePath&lt;/code&gt; can return nil, and the files aren't where the bundle API expects them anyway.&lt;/p&gt;

&lt;p&gt;The fix was a &lt;code&gt;findRepoRoot()&lt;/code&gt; function that walks up from both the working directory and the bundle URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;findRepoRoot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;URL&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Try working directory first&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;fileURLWithPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;FileManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;currentDirectoryPath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="kt"&gt;FileManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fileExists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nv"&gt;atPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appendingPathComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Assets/Sprites"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;deletingLastPathComponent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// Fallback: walk up from bundle URL&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;bundleURL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Bundle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bundleURL&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="kt"&gt;FileManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fileExists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nv"&gt;atPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bundleURL&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appendingPathComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Assets/Sprites"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bundleURL&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;bundleURL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bundleURL&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;deletingLastPathComponent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kt"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;fileURLWithPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;FileManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;currentDirectoryPath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's not pretty. But it works in Xcode, in a standalone &lt;code&gt;.app&lt;/code&gt;, and when running from the terminal in the repo root. I used the same pattern in &lt;code&gt;BackgroundMusicPlayer&lt;/code&gt; for finding &lt;code&gt;Assets/Music/&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  the NSWindow.isVisible trap
&lt;/h2&gt;

&lt;p&gt;Swift 6 strict concurrency plus AppKit is a minefield. Here's one that's particularly evil: &lt;code&gt;NSWindow&lt;/code&gt; has a built-in property called &lt;code&gt;isVisible&lt;/code&gt;. If you declare your own stored property with the same name in a subclass, Swift doesn't warn you; it just breaks.&lt;/p&gt;

&lt;p&gt;I had:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="kt"&gt;CompanionPanel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;NSPanel&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;isVisible&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;  &lt;span class="c1"&gt;// ← shadows NSWindow.isVisible&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This compiles. It even seems to work at first. But &lt;code&gt;NSWindow.isVisible&lt;/code&gt; is a computed property tied to the window server. My stored property hid it. Window visibility checks started returning wrong values. The panel would appear/disappear at random.&lt;/p&gt;

&lt;p&gt;The fix was just a rename:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;hudVisible&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No warning from the compiler. No runtime error. Just subtle incorrectness that took hours to track down.&lt;/p&gt;

&lt;h2&gt;
  
  
  @MainActor everywhere
&lt;/h2&gt;

&lt;p&gt;Swift 6 requires &lt;code&gt;@MainActor&lt;/code&gt; on anything that touches AppKit. In Swift 5 you could get away with updating UI from background threads — the app would work until it didn't. Swift 6 is strict: if a class touches &lt;code&gt;NSWindow&lt;/code&gt;, &lt;code&gt;NSImage&lt;/code&gt;, or any AppKit type, it must be &lt;code&gt;@MainActor&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Every service class in VibeCat is &lt;code&gt;@MainActor&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;@MainActor&lt;/span&gt;
&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="kt"&gt;ScreenCaptureService&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;@MainActor&lt;/span&gt;
&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="kt"&gt;SpriteAnimator&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;@MainActor&lt;/span&gt;
&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="kt"&gt;BackgroundMusicPlayer&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But &lt;code&gt;Timer&lt;/code&gt; callbacks aren't &lt;code&gt;@MainActor&lt;/code&gt; by default. So this pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kt"&gt;Timer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scheduledTimer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;withTimeInterval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;repeats&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
    &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;advanceFrame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;// ❌ Not on MainActor&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Has to become:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kt"&gt;Timer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scheduledTimer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;withTimeInterval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;repeats&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;weak&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
    &lt;span class="kt"&gt;Task&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kd"&gt;@MainActor&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;weak&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;advanceFrame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;// ✅ MainActor&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every Timer. Every callback. Every closure that touches UI. Wrap it in &lt;code&gt;Task { @MainActor in }&lt;/code&gt;. Swift 6 is safer, but the migration tax is real.&lt;/p&gt;

&lt;h2&gt;
  
  
  the client is deliberately dumb
&lt;/h2&gt;

&lt;p&gt;One design principle I'm proud of: the macOS client is deliberately dumb. It captures screens, plays audio, animates sprites, and shuttles data to the backend. It makes zero AI decisions.&lt;/p&gt;

&lt;p&gt;When the client captures a screenshot, it doesn't analyze it — it sends the raw image to the backend's &lt;code&gt;/analyze&lt;/code&gt; endpoint. When the backend says "set character to surprised," the client just changes the sprite state. When the backend says "play this audio," the client plays it.&lt;/p&gt;

&lt;p&gt;This is a challenge requirement (all AI through backend), but it's also good architecture. The client is ~1,970 lines of Swift. The backend is ~2,900 lines of Go. If I need to change how VibeCat responds to errors, I never touch the client.&lt;/p&gt;

&lt;p&gt;The smartest thing the client does is the &lt;code&gt;ImageDiffer&lt;/code&gt; — and even that is just an optimization to avoid sending unchanged frames, not an AI decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  what I'd do differently
&lt;/h2&gt;

&lt;p&gt;If I were starting over:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test outside Xcode from day one.&lt;/strong&gt; Every feature should be verified as a standalone &lt;code&gt;.app&lt;/code&gt;, not just in the Xcode debug session. The silent failures cost me a full day.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use a resource bundle properly.&lt;/strong&gt; The &lt;code&gt;findRepoRoot()&lt;/code&gt; hack works, but it's fragile. A proper SPM resource bundle with &lt;code&gt;Bundle.module&lt;/code&gt; would be cleaner.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with Swift 6 strict concurrency enabled.&lt;/strong&gt; I started with Swift 5 mode and migrated. The migration was painful — dozens of &lt;code&gt;@MainActor&lt;/code&gt; annotations and callback wraps. Starting strict would have caught these at write-time instead of all-at-once.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But it works. The cat sees your screen. The sprites animate. The music plays. And the client stays dumb enough to let the backend do the thinking.&lt;/p&gt;

&lt;p&gt;The moment I ran the codesigned &lt;code&gt;.app&lt;/code&gt; outside Xcode for the first time — double-clicked it from Finder, no debugger, no safety net — and the cat appeared on my desktop, captured my screen, and waved at me? That was the best moment of this entire project. Three days of silent failures, for ten seconds of a pixel cat saying hello.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building VibeCat for the &lt;a href="https://geminiliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt;. Source: &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>swift</category>
    </item>
    <item>
      <title>the day vibecat stopped being a screen-watching demo</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Wed, 11 Mar 2026 16:13:00 +0000</pubDate>
      <link>https://dev.to/combba/the-day-vibecat-stopped-being-a-screen-watching-demo-3k46</link>
      <guid>https://dev.to/combba/the-day-vibecat-stopped-being-a-screen-watching-demo-3k46</guid>
      <description>&lt;h1&gt;
  
  
  the day vibecat stopped being a screen-watching demo
&lt;/h1&gt;

&lt;p&gt;I created this post for the purposes of entering the Gemini Live Agent Challenge, and this was the day the project got a lot less cute and a lot more real.&lt;/p&gt;

&lt;p&gt;For a while, the easy way to describe VibeCat was: "it's a cat on your desktop that watches your screen and comments on what you're doing."&lt;/p&gt;

&lt;p&gt;That line worked. People got it immediately. It also let me hide from the harder question.&lt;/p&gt;

&lt;p&gt;Can it actually do anything useful?&lt;/p&gt;

&lt;p&gt;Watching is a good demo. Acting is a product.&lt;/p&gt;

&lt;p&gt;And to be fair to the earlier version: that screen-watching phase was not fake work. It taught me what context mattered, what annoyed me, and what the system kept getting almost-right. It was just incomplete.&lt;/p&gt;

&lt;p&gt;The moment that became obvious was text entry. If the user says, "type this here," there are only two acceptable outcomes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the system finds the right field and types the text&lt;/li&gt;
&lt;li&gt;the system says it cannot safely verify the target&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything else is fake confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  where the old framing broke
&lt;/h2&gt;

&lt;p&gt;The old companion framing made it easy to focus on observation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what app is open&lt;/li&gt;
&lt;li&gt;what error is visible&lt;/li&gt;
&lt;li&gt;whether the user sounds frustrated&lt;/li&gt;
&lt;li&gt;whether the current screen looks important&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is all still useful. But none of it answers the practical question: is the currently focused element actually the search field in front of you?&lt;/p&gt;

&lt;p&gt;A screenshot can tell you a lot. It cannot give you permission to click blindly.&lt;/p&gt;

&lt;p&gt;That was the product boundary I needed to respect.&lt;/p&gt;

&lt;h2&gt;
  
  
  the first time it felt real
&lt;/h2&gt;

&lt;p&gt;The first really convincing moment was not a giant workflow. It was tiny.&lt;/p&gt;

&lt;p&gt;Chrome was open. The docs site was already on screen. I said, "type &lt;code&gt;gemini live api&lt;/code&gt; here."&lt;/p&gt;

&lt;p&gt;The client checked the frontmost app, the focused element, and the accessibility role. The worker verified that the target looked like text input. Then it inserted the text and refreshed context afterward.&lt;/p&gt;

&lt;p&gt;That was it.&lt;/p&gt;

&lt;p&gt;No fireworks. No theatrical demo beat. Just a boring action landing in the right place.&lt;/p&gt;

&lt;p&gt;That was the moment VibeCat stopped feeling like a mascot wrapped around an LLM and started feeling like a UI navigator.&lt;/p&gt;

&lt;h2&gt;
  
  
  what the product promise became
&lt;/h2&gt;

&lt;p&gt;The contract is much sharper now.&lt;/p&gt;

&lt;p&gt;If intent is clear and the action is low-risk, VibeCat acts.&lt;/p&gt;

&lt;p&gt;If the request is ambiguous, it asks one short question.&lt;/p&gt;

&lt;p&gt;If the request is risky, it stops and asks for explicit confirmation.&lt;/p&gt;

&lt;p&gt;If the target is unclear, it drops to guided mode instead of guessing.&lt;/p&gt;

&lt;p&gt;That last part matters the most. There is a huge difference between "I think the input field is somewhere near the top left" and "I found a focused text input and verified it after insertion."&lt;/p&gt;

&lt;p&gt;The first one sounds smart. The second one is actually useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  why the screen still matters
&lt;/h2&gt;

&lt;p&gt;The screen did not become irrelevant. It just stopped being the whole story.&lt;/p&gt;

&lt;p&gt;Now the useful context is a combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;current app&lt;/li&gt;
&lt;li&gt;window title&lt;/li&gt;
&lt;li&gt;focused element role and label&lt;/li&gt;
&lt;li&gt;selected text&lt;/li&gt;
&lt;li&gt;accessibility snapshot&lt;/li&gt;
&lt;li&gt;the latest visual state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That combination is what turns a passive observer into an executor.&lt;/p&gt;

&lt;p&gt;The screen tells you what world you are in. Accessibility tells you what object you can safely touch. Verification tells you whether the action actually landed.&lt;/p&gt;

&lt;p&gt;That triangle is the product.&lt;/p&gt;

&lt;h2&gt;
  
  
  the trade i made on purpose
&lt;/h2&gt;

&lt;p&gt;This pivot cost me some of the original companion magic.&lt;/p&gt;

&lt;p&gt;The older version had more ambient personality. It could feel like a creature hanging out on your desktop and reacting to your mood.&lt;/p&gt;

&lt;p&gt;I still like that version emotionally.&lt;/p&gt;

&lt;p&gt;But for a real product, and especially for a challenge entry, "acts safely on natural intent" is a much stronger promise than "sometimes notices things on its own."&lt;/p&gt;

&lt;p&gt;That is the trade I made, and I think it was the right one.&lt;/p&gt;

&lt;p&gt;The cat is still there. The voices are still there. The screen analysis still matters. But now those things serve the action loop instead of replacing it.&lt;/p&gt;

&lt;p&gt;And once I saw that clearly, I couldn't go back to the older pitch.&lt;/p&gt;

&lt;p&gt;The cat can still watch your screen.&lt;/p&gt;

&lt;p&gt;It just has a job now.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building VibeCat for the &lt;a href="https://geminiliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt;. Source: &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>macos</category>
    </item>
    <item>
      <title>making go speak real-time — our gemini live api websocket proxy</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Wed, 11 Mar 2026 15:24:00 +0000</pubDate>
      <link>https://dev.to/combba/making-go-speak-real-time-our-gemini-live-api-websocket-proxy-41of</link>
      <guid>https://dev.to/combba/making-go-speak-real-time-our-gemini-live-api-websocket-proxy-41of</guid>
      <description>&lt;h1&gt;
  
  
  making Go speak real-time — our Gemini Live API WebSocket proxy
&lt;/h1&gt;

&lt;p&gt;The first time I got the audio proxy working, the cat meowed in Gemini's voice — a full 3 seconds of distorted PCM noise that sounded like a dial-up modem possessed by a cheerful robot. I'd set the sample rate wrong. 24kHz audio interpreted as 16kHz sounds like a cursed lullaby.&lt;/p&gt;

&lt;p&gt;I created this post for the purposes of entering the Gemini Live Agent Challenge. I'm building &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;VibeCat&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The core challenge was simple to state, hard to build: the macOS client can't talk to Gemini directly. Challenge rules require a backend, and you never put API keys on someone's Mac. So I needed a WebSocket proxy in Go that sits between the Swift client and Gemini Live API — receiving raw audio from one side, forwarding it to the other, and doing it fast enough that conversation feels natural.&lt;/p&gt;

&lt;h2&gt;
  
  
  the architecture (deceptively simple)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Swift Client ←→ [wss://gateway/ws/live] ←→ Go Gateway ←→ Gemini Live API
     PCM 16kHz mono →                                    → PCM 16kHz
                    ← PCM 24kHz                          ← PCM 24kHz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On paper, it's a pipe. Audio goes in one side, comes out the other. I told myself this would take a day. It took three. The first day was the "it works!" day. The second was the "why did it stop working?" day. The third was the "oh, WebSocket connections are secretly fragile" day.&lt;/p&gt;

&lt;h2&gt;
  
  
  connecting to Gemini
&lt;/h2&gt;

&lt;p&gt;After the modem-cat incident, I triple-checked sample rates. The GenAI Go SDK makes the connection surprisingly clean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Live&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"gemini-2.0-flash-live-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;liveConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line. But building that &lt;code&gt;liveConfig&lt;/code&gt; is where it gets interesting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;buildLiveConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LiveConnectConfig&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;lc&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LiveConnectConfig&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Voice&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;lc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpeechConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpeechConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;VoiceConfig&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VoiceConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;PrebuiltVoiceConfig&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PrebuiltVoiceConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;VoiceName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Voice&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c"&gt;// "Zephyr", "Puck", etc.&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;lc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RealtimeInputConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RealtimeInputConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;AutomaticActivityDetection&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AutomaticActivityDetection&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Disabled&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c"&gt;// VAD must be enabled — challenge requirement&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;lc&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;VAD (Voice Activity Detection) is mandatory. When &lt;code&gt;AutomaticActivityDetection&lt;/code&gt; is enabled, Gemini handles turn-taking automatically — it detects when you stop talking and starts responding. It also supports barge-in: if you interrupt mid-response, Gemini stops and listens.&lt;/p&gt;

&lt;h2&gt;
  
  
  audio streaming
&lt;/h2&gt;

&lt;p&gt;Sending audio to Gemini:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;SendAudio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pcmData&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gemini&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SendRealtimeInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LiveRealtimeInput&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Audio&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Blob&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;MIMEType&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"audio/pcm;rate=16000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;pcmData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MIME type matters. &lt;code&gt;audio/pcm;rate=16000&lt;/code&gt; means raw PCM, 16-bit, 16kHz, mono. I know because I got it wrong — passed &lt;code&gt;audio/pcm&lt;/code&gt; without the rate parameter, and Gemini interpreted my voice as white noise. No error. No warning. Just silence on the other end and me talking to myself in an empty apartment at midnight.&lt;/p&gt;
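&lt;p&gt;A quick sanity check on what that format means for bandwidth: raw 16-bit mono PCM at 16kHz is 32,000 bytes per second upstream. This throwaway helper (mine, not part of the gateway) does the arithmetic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

const (
	sampleRate    = 16000 // Hz, matches "audio/pcm;rate=16000"
	bytesPerFrame = 2     // 16-bit mono is 2 bytes per sample
)

// pcmBytes returns how many bytes a raw PCM chunk of the given duration
// occupies at the rates above. A hypothetical helper, not gateway code.
func pcmBytes(ms int) int {
	return sampleRate * bytesPerFrame * ms / 1000
}

func main() {
	fmt.Println(pcmBytes(20))   // a typical 20 ms capture chunk: 640 bytes
	fmt.Println(pcmBytes(1000)) // one second: 32000 bytes
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Small enough that WebSocket framing overhead barely matters, which is part of why a plain binary-frame proxy works at all.&lt;/p&gt;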

&lt;p&gt;Receiving from Gemini is a loop that runs in its own goroutine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;receiveFromGemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sess&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;live&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;connID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Receive&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerContent&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerContent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ModelTurn&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerContent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ModelTurn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parts&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InlineData&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InlineData&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BinaryMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InlineData&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerContent&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerContent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TurnComplete&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;sendJSON&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"type"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"turnComplete"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerContent&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerContent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Interrupted&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;sendJSON&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"type"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"interrupted"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini sends audio in chunks via &lt;code&gt;InlineData.Data&lt;/code&gt;. Each chunk is a PCM frame at 24kHz that goes straight to the client as a binary WebSocket message. Text events (transcriptions, turn completions, interruptions) go as JSON text frames.&lt;/p&gt;
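&lt;p&gt;On the receiving end, the client-side routing reduces to one switch on the frame type. This is a simplified demux sketch, not VibeCat's actual Swift client; the opcode constants mirror gorilla/websocket's values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

// RFC 6455 frame opcodes, matching gorilla/websocket's TextMessage
// and BinaryMessage constants.
const (
	textMessage   = 1
	binaryMessage = 2
)

// routeFrame is a hypothetical client-side demux: binary frames carry
// 24kHz PCM for playback, text frames carry JSON control events.
func routeFrame(messageType int) string {
	switch messageType {
	case binaryMessage:
		return "audio" // append to the playback buffer
	case textMessage:
		return "event" // decode JSON: turnComplete, interrupted, setupComplete
	default:
		return "ignore" // ping/pong and close are handled by the library
	}
}

func main() {
	fmt.Println(routeFrame(binaryMessage))
	fmt.Println(routeFrame(textMessage))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;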

&lt;h2&gt;
  
  
  the zombie killer
&lt;/h2&gt;

&lt;p&gt;Day two's lesson: WebSocket connections die in weird ways. The client closes their laptop. The network drops. The process crashes. In all these cases, the server-side connection sits there, alive but silent — a zombie. I found this out because my test server accumulated 14 dead connections over a weekend. Each one holding a Gemini Live session open. Each one costing API credits for nothing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pingInterval&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;
    &lt;span class="n"&gt;zombieTimeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;45&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;rawConn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetReadDeadline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;zombieTimeout&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;rawConn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetPongHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;rawConn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetReadDeadline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;zombieTimeout&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c"&gt;// Ping goroutine&lt;/span&gt;
&lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ticker&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTicker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pingInterval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;rawConn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteControl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PingMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every 15 seconds, the server pings the client. If the client doesn't pong within 45 seconds, the read deadline expires and the connection gets cleaned up. The Gemini session closes, the registry removes the connection, and resources are freed.&lt;/p&gt;

&lt;h2&gt;
  
  
  session resumption
&lt;/h2&gt;

&lt;p&gt;Gemini Live sessions have a time limit. When the server sends a &lt;code&gt;GoAway&lt;/code&gt; signal, you have a few seconds to save the resumption handle and reconnect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SessionResumptionUpdate&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SessionResumptionUpdate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewHandle&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResumptionHandle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SessionResumptionUpdate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewHandle&lt;/span&gt;
    &lt;span class="n"&gt;sendJSON&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"type"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;             &lt;span class="s"&gt;"setupComplete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"sessionId"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;connID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"resumptionHandle"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResumptionHandle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client saves the handle. On reconnect, it sends the handle in the setup message, and the gateway passes it to &lt;code&gt;SessionResumptionConfig&lt;/code&gt;. Gemini picks up where it left off. No lost context, no repeated introductions.&lt;/p&gt;
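&lt;p&gt;Here's roughly what that reconnect setup message could look like on the wire. The struct and field names are illustrative, not the exact protocol:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
	"encoding/json"
	"fmt"
)

// setupMessage is a sketch of what the client sends on reconnect;
// field names are illustrative, not VibeCat's real wire format.
type setupMessage struct {
	Type             string `json:"type"`
	ResumptionHandle string `json:"resumptionHandle,omitempty"`
}

func main() {
	msg := setupMessage{Type: "setup", ResumptionHandle: "handle-from-last-session"}
	b, err := json.Marshal(msg)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
	// On the gateway side, a non-empty handle would go into
	// genai.SessionResumptionConfig before calling Live.Connect again.
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;omitempty&lt;/code&gt; keeps first-time connections clean: no handle, no field.&lt;/p&gt;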

&lt;h2&gt;
  
  
  JWT auth
&lt;/h2&gt;

&lt;p&gt;Every WebSocket connection requires a valid JWT:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/ws/live"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jwtMgr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;liveMgr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adkClient&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client first calls &lt;code&gt;POST /api/v1/auth/register&lt;/code&gt; with an API key, gets back a signed JWT with 24-hour expiry, then passes it as &lt;code&gt;Bearer &amp;lt;token&amp;gt;&lt;/code&gt; in the WebSocket upgrade request. No token, no connection. Bad token, 401.&lt;/p&gt;
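&lt;p&gt;For the curious, the token the register endpoint hands back has the standard three-part HS256 shape. This is a bare-bones illustration of the signing step only; a real gateway should lean on a vetted JWT library rather than hand-rolling it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strings"
)

// signJWT shows the HS256 mechanics: base64url(header).base64url(claims),
// then an HMAC-SHA256 signature over that string. Illustrative only.
func signJWT(header, claims string, secret []byte) string {
	enc := base64.RawURLEncoding
	signingInput := enc.EncodeToString([]byte(header)) + "." + enc.EncodeToString([]byte(claims))
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(signingInput))
	return signingInput + "." + enc.EncodeToString(mac.Sum(nil))
}

func main() {
	token := signJWT(
		`{"alg":"HS256","typ":"JWT"}`,
		`{"sub":"client-123","exp":1773700000}`, // exp set 24h after issue; value is illustrative
		[]byte("dev-secret"),
	)
	fmt.Println(strings.Count(token, ".")) // a JWT is always three dot-separated parts
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;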

&lt;p&gt;The whole gateway is about 300 lines of WebSocket handler code and 170 lines of Live session management, not counting the auth layer. For a real-time bidirectional audio proxy with authentication, session resumption, and zombie detection, that's compact.&lt;/p&gt;

&lt;p&gt;But the line count doesn't capture the real work. The real work was the modem-cat at midnight, the 14 zombie connections leaking credits, the missing MIME parameter that turned my voice into silence. The code is simple because I made every mistake first.&lt;/p&gt;

&lt;p&gt;The proxy works now. Audio goes in, the cat talks back, and it sounds like an actual voice — not a dial-up modem anymore. That feels like progress.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Building VibeCat for the &lt;a href="https://geminiliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt;. Source: &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>go</category>
    </item>
    <item>
      <title>why i stopped letting nine agents argue over one click</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Wed, 11 Mar 2026 10:40:54 +0000</pubDate>
      <link>https://dev.to/combba/why-i-stopped-letting-nine-agents-argue-over-one-click-fk1</link>
      <guid>https://dev.to/combba/why-i-stopped-letting-nine-agents-argue-over-one-click-fk1</guid>
      <description>&lt;h1&gt;
  
  
  why i stopped letting nine agents argue over one click
&lt;/h1&gt;

&lt;p&gt;I created this post for the purposes of entering the Gemini Live Agent Challenge, but this one is really about admitting I was solving the wrong problem for a while.&lt;/p&gt;

&lt;p&gt;For a few days, VibeCat looked incredible in architecture diagrams. I had named agents. I had parallel waves. I had boxes for mood, celebration, engagement, memory, search, mediation. Every time I added one more box, the system felt more sophisticated.&lt;/p&gt;

&lt;p&gt;Then I tried to use it for an actual desktop action.&lt;/p&gt;

&lt;p&gt;Not a grand demo. Something boring. "Open the official docs." The kind of request that should feel instant.&lt;/p&gt;

&lt;p&gt;And that was the moment the architecture stopped feeling smart and started feeling expensive.&lt;/p&gt;

&lt;p&gt;The graph itself wasn't wrong. It was just sitting in the wrong part of the product.&lt;/p&gt;

&lt;p&gt;Those older posts were honest snapshots of the project at the time. The graph solved real problems. It just wasn't the thing that should own every user-facing action.&lt;/p&gt;

&lt;h2&gt;
  
  
  the embarrassing realization
&lt;/h2&gt;

&lt;p&gt;I had been treating "how many capabilities exist" as if it were the same question as "how many active decision-makers should be in the hot path."&lt;/p&gt;

&lt;p&gt;Those are not the same thing.&lt;/p&gt;

&lt;p&gt;VibeCat absolutely does have many capabilities. It can analyze the screen, keep memory, do research, reason about ambiguity, classify risk, and decide whether a step should run locally or not.&lt;/p&gt;

&lt;p&gt;But when the user says something concrete like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"open the official docs"&lt;/li&gt;
&lt;li&gt;"type this in the search box"&lt;/li&gt;
&lt;li&gt;"run that again"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;nobody cares that the internal graph is elegant. They care whether the system moves now, and whether it moves safely.&lt;/p&gt;

&lt;p&gt;I had built a system that was very good at explaining itself and not yet strict enough about acting.&lt;/p&gt;

&lt;h2&gt;
  
  
  what changed
&lt;/h2&gt;

&lt;p&gt;The turning point was realizing that the product is easier to understand in three planes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Gemini Live + VAD      -&amp;gt; talks to the user
navigator worker       -&amp;gt; decides the next safe step
local macOS executor   -&amp;gt; actually focuses, types, clicks, verifies
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the part I should have led with from the beginning.&lt;/p&gt;

&lt;p&gt;The always-on Live session is the PM. It handles the messy human side: interruptions, vague requests, clarification, short confirmations, "no, not that tab, the other one."&lt;/p&gt;

&lt;p&gt;The worker is much less charming. It has one job: take an actionable request, classify it, decide whether it is ambiguous or risky, plan one step, then wait for verification.&lt;/p&gt;

&lt;p&gt;The local executor is narrower still. It looks at the frontmost app, the focused element, the AX tree, and the current window state, then tries to perform exactly one step without pretending confidence it doesn't have.&lt;/p&gt;

&lt;p&gt;Once I drew the system that way, the product made more sense immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  the part i did not throw away
&lt;/h2&gt;

&lt;p&gt;This is the part I wish I had explained better in the public posts: I did not "discover that multi-agent systems are fake" or anything dramatic like that.&lt;/p&gt;

&lt;p&gt;The 9-agent graph was useful. It still is useful.&lt;/p&gt;

&lt;p&gt;It is just better as a background intelligence lane than as the thing that every single UI action has to march through.&lt;/p&gt;

&lt;p&gt;Memory still helps. Research still helps. Low-confidence screen analysis still helps. Session summaries still help. Multimodal checks still help.&lt;/p&gt;

&lt;p&gt;But those capabilities should come in when they add accuracy, not because I am emotionally attached to the architecture.&lt;/p&gt;

&lt;p&gt;That was the real pivot: the intelligence stayed, but it moved behind the worker.&lt;/p&gt;

&lt;h2&gt;
  
  
  one rule fixed half the product
&lt;/h2&gt;

&lt;p&gt;The biggest practical improvement came from one boring rule:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;only one executable task can be active at a time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before that, a lot of weird bugs shared the same root cause:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the right action in the wrong app&lt;/li&gt;
&lt;li&gt;typing into the wrong field after the UI changed&lt;/li&gt;
&lt;li&gt;continuing an old plan because a stale refresh arrived late&lt;/li&gt;
&lt;li&gt;silently juggling two user intents at once and doing neither well&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the system had exactly one current task, one current step, and one verification loop, a lot of the magic stopped being magical and started being debuggable.&lt;/p&gt;

&lt;p&gt;That trade is worth it every time.&lt;/p&gt;

&lt;p&gt;I would much rather have a desktop agent that feels slightly stricter than one that feels "clever" right up until it pastes into the wrong input field.&lt;/p&gt;
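&lt;p&gt;The rule is small enough to sketch. This is a toy version with names I made up, not VibeCat's actual task manager, but it shows the shape: starting a new task invalidates every step from the old one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
	"fmt"
	"sync"
)

// taskSlot enforces "only one executable task can be active at a time".
// Hypothetical names, not VibeCat's real types.
type taskSlot struct {
	mu      sync.Mutex
	current string // ID of the active task; empty means idle
}

// Start makes id the only active task, implicitly replacing the previous one.
func (s *taskSlot) Start(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.current = id
}

// Accept reports whether a step belonging to taskID is still valid to run,
// which is how stale refreshes from a replaced task get dropped.
func (s *taskSlot) Accept(taskID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.current == taskID
}

func main() {
	var slot taskSlot
	slot.Start("open-docs")
	slot.Start("type-query")               // user changed their mind mid-flight
	fmt.Println(slot.Accept("open-docs"))  // stale step from the replaced task: false
	fmt.Println(slot.Accept("type-query")) // step from the current task: true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every late-arriving result carries a task ID, and anything that fails the &lt;code&gt;Accept&lt;/code&gt; check simply gets dropped instead of acting on a world that no longer exists.&lt;/p&gt;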

&lt;h2&gt;
  
  
  the request that made it obvious
&lt;/h2&gt;

&lt;p&gt;The request that finally broke my attachment to the old framing was text entry.&lt;/p&gt;

&lt;p&gt;If the user says, "type &lt;code&gt;gemini live api&lt;/code&gt; here," the system cannot answer with a pretty explanation about context. It has to either find the field and type into it, or admit it cannot verify the target.&lt;/p&gt;

&lt;p&gt;That means the hot path needs very boring things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;focus state&lt;/li&gt;
&lt;li&gt;target identity&lt;/li&gt;
&lt;li&gt;step ids&lt;/li&gt;
&lt;li&gt;risk checks&lt;/li&gt;
&lt;li&gt;post-action refresh&lt;/li&gt;
&lt;li&gt;replacement logic if the user changes their mind mid-flight&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not where I want a council of equal agents debating the meaning of the moment.&lt;/p&gt;

&lt;p&gt;That is where I want one worker making one decision.&lt;/p&gt;
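&lt;p&gt;Concretely, each hot-path step could carry something like this. The field names are illustrative, not the worker's real schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

// step is a sketch of the "boring things" one hot-path decision carries.
type step struct {
	ID            string // step id, so stale results can be discarded
	Target        string // target identity, e.g. an AX element descriptor
	RequiresFocus bool   // focus state that must hold before acting
	Risk          string // "safe", "confirm", or "block"
	VerifyAfter   bool   // trigger a post-action refresh and check
}

func main() {
	s := step{
		ID:            "task-42/step-1",
		Target:        "AXTextField:search",
		RequiresFocus: true,
		Risk:          "safe",
		VerifyAfter:   true,
	}
	fmt.Println(s.ID, s.Risk)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;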

&lt;h2&gt;
  
  
  what this changed emotionally
&lt;/h2&gt;

&lt;p&gt;This pivot also fixed something less technical: I stopped feeling like I had to constantly defend the architecture.&lt;/p&gt;

&lt;p&gt;Before, when I described VibeCat, I kept reaching for "graph," "specialists," "waves," and "agents." Those words were accurate, but they were not the thing a user would actually trust.&lt;/p&gt;

&lt;p&gt;Now the explanation is simpler, and that simplicity is earned:&lt;/p&gt;

&lt;p&gt;there is one thing talking to you.&lt;br&gt;
there is one thing deciding the next step.&lt;br&gt;
there is one thing on your Mac that can do the step and verify it.&lt;/p&gt;

&lt;p&gt;That is a product shape.&lt;/p&gt;

&lt;p&gt;And honestly, it is the first version of the system that feels like it deserves to exist outside a demo.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building VibeCat for the &lt;a href="https://geminiliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt;. Source: &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>architecture</category>
    </item>
    <item>
      <title>the graph was not wrong. it was just in the wrong place</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Wed, 11 Mar 2026 10:40:17 +0000</pubDate>
      <link>https://dev.to/combba/the-graph-was-not-wrong-it-was-just-in-the-wrong-place-4ej5</link>
      <guid>https://dev.to/combba/the-graph-was-not-wrong-it-was-just-in-the-wrong-place-4ej5</guid>
      <description>&lt;h1&gt;
  
  
  the graph was not wrong. it was just in the wrong place
&lt;/h1&gt;

&lt;p&gt;I created this post for the purposes of entering the Gemini Live Agent Challenge, and it is also my attempt to explain why two things I wrote this week appear to contradict each other when they actually don't.&lt;/p&gt;

&lt;p&gt;A few posts ago I was deep in the 9-agent graph story.&lt;/p&gt;

&lt;p&gt;That was real. I was excited about it for good reason. The graph gave VibeCat useful capabilities: screen analysis, memory, mood signals, search, celebration, speech gating. I still think that work mattered.&lt;/p&gt;

&lt;p&gt;Then the project pivoted harder toward desktop UI navigation, and suddenly the thing I cared about most was not whether the graph was elegant. It was whether the system could safely do the next step in front of the user.&lt;/p&gt;

&lt;p&gt;That made it sound like I had changed my mind completely.&lt;/p&gt;

&lt;p&gt;I didn't.&lt;/p&gt;

&lt;p&gt;I changed my mind about placement.&lt;/p&gt;

&lt;h2&gt;
  
  
  why the old posts were still true
&lt;/h2&gt;

&lt;p&gt;When I wrote about nine agents, I was looking at the intelligence layer in isolation.&lt;/p&gt;

&lt;p&gt;In that layer, decomposing the problem really did help. Separate pieces for memory, mood, celebration, search, and mediation made the analysis pipeline easier to tune. I could change frustration thresholds without touching celebration logic. I could change speech gating without touching memory retrieval. That was good engineering.&lt;/p&gt;

&lt;p&gt;If all VibeCat had to do was observe, summarize, and occasionally comment, that architecture was pretty defensible.&lt;/p&gt;

&lt;h2&gt;
  
  
  what the ui navigator version changed
&lt;/h2&gt;

&lt;p&gt;The problem is that UI action has a very different failure mode.&lt;/p&gt;

&lt;p&gt;A bad analysis result is annoying.&lt;/p&gt;

&lt;p&gt;A bad click is scary.&lt;/p&gt;

&lt;p&gt;A slow summary is forgivable.&lt;/p&gt;

&lt;p&gt;Typing into the wrong field is not.&lt;/p&gt;

&lt;p&gt;Once I started treating VibeCat as a desktop UI navigator instead of just a companion, the hot path changed completely. The important question stopped being "how many specialists can contribute here?" and became "who is allowed to decide the next executable step right now?"&lt;/p&gt;

&lt;p&gt;That answer turned out to be: not a crowd.&lt;/p&gt;

&lt;h2&gt;
  
  
  where the graph belongs now
&lt;/h2&gt;

&lt;p&gt;The graph still belongs in the product. It just belongs behind the immediate action loop.&lt;/p&gt;

&lt;p&gt;That means the structure now feels more honest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Live PM          -&amp;gt; talks to the user
navigator worker -&amp;gt; decides the next safe step
local executor   -&amp;gt; performs and verifies the step
background graph -&amp;gt; helps when extra intelligence is actually useful
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last line is the important one.&lt;/p&gt;

&lt;p&gt;The graph is not dead. It is no longer the thing that every single click has to wait on.&lt;/p&gt;

&lt;h2&gt;
  
  
  the practical example
&lt;/h2&gt;

&lt;p&gt;If the user says, "open the official docs," I do not want a miniature parliament of agents trying to co-author the moment.&lt;/p&gt;

&lt;p&gt;I want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one worker to decide if the request is executable&lt;/li&gt;
&lt;li&gt;one check for ambiguity&lt;/li&gt;
&lt;li&gt;one risk check&lt;/li&gt;
&lt;li&gt;one step&lt;/li&gt;
&lt;li&gt;one verification result&lt;/li&gt;
&lt;/ul&gt;
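&lt;p&gt;As a sketch (function and error messages are mine, not the real codebase), that whole list collapses into one straight-line function:&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// decideAndRun is a hypothetical single-worker hot path: one executability
// check, one ambiguity check, one risk check, one step, one verification.
func decideAndRun(request string, targets []string, risky bool) (string, error) {
	if len(targets) == 0 {
		return "", errors.New("not executable: no matching target")
	}
	if len(targets) > 1 {
		return "", errors.New("ambiguous: defer to the slower intelligence lane")
	}
	if risky {
		return "", errors.New("risk check failed: ask for confirmation")
	}
	// one step, then one verification result
	return fmt.Sprintf("executed %q on %s, verified", request, targets[0]), nil
}

func main() {
	out, err := decideAndRun("open the official docs", []string{"browser"}, false)
	fmt.Println(out, err)
}
```

&lt;p&gt;No parliament; each early return is also a plain sentence the worker can hand back to the conversation layer.&lt;/p&gt;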

&lt;p&gt;If, later, the system needs more context because the target is unclear or the user seems stuck, then the slower intelligence lane can help.&lt;/p&gt;

&lt;p&gt;That is a better use of the graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  the part i had to swallow my pride about
&lt;/h2&gt;

&lt;p&gt;I think a lot of us like architectures that sound impressive when we say them out loud.&lt;/p&gt;

&lt;p&gt;I definitely do.&lt;/p&gt;

&lt;p&gt;"Three parallel waves and nine specialist agents" sounds like progress. "One worker does one step at a time" sounds almost embarrassingly plain by comparison.&lt;/p&gt;

&lt;p&gt;But the second version is closer to what a user can actually trust.&lt;/p&gt;

&lt;p&gt;That was the hard part for me. Not building the new shape. Admitting the new shape was better.&lt;/p&gt;

&lt;h2&gt;
  
  
  so what changed, exactly?
&lt;/h2&gt;

&lt;p&gt;Not the belief that decomposition can help.&lt;/p&gt;

&lt;p&gt;Not the belief that background intelligence matters.&lt;/p&gt;

&lt;p&gt;Not the belief that VibeCat needs memory, research, and multimodal context.&lt;/p&gt;

&lt;p&gt;What changed was this:&lt;/p&gt;

&lt;p&gt;I stopped asking the graph to own the exact moment where the product becomes risky.&lt;/p&gt;

&lt;p&gt;That moment now belongs to a narrower worker with a stricter contract.&lt;/p&gt;

&lt;p&gt;And that change made the whole project feel less like a cool diagram and more like a real tool.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building VibeCat for the &lt;a href="https://geminiliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt;. Source: &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>architecture</category>
    </item>
    <item>
      <title>teaching nine agents to think like a colleague</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Tue, 10 Mar 2026 06:09:39 +0000</pubDate>
      <link>https://dev.to/combba/teaching-nine-agents-to-think-like-a-colleague-1o8</link>
      <guid>https://dev.to/combba/teaching-nine-agents-to-think-like-a-colleague-1o8</guid>
      <description>&lt;p&gt;I created this post for the purposes of entering the Gemini Live Agent Challenge. In my last post I walked through what VibeCat actually does — a macOS cat that watches your screen, hears your voice, and knows when to shut up. But I glossed over &lt;em&gt;how&lt;/em&gt; it does all that. The cat isn't one thing — it's nine things pretending to be one thing, and getting that pretense right is the actual engineering problem.&lt;/p&gt;




&lt;p&gt;let me start with the question that shaped everything: what does a colleague actually &lt;em&gt;do&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;not a chatbot. not a search engine. a colleague. the person sitting next to you who catches your typo on line 23 before you do, notices you've been stuck for 40 minutes, and knows when to shut up because you're in flow.&lt;/p&gt;

&lt;p&gt;I spent a while listing the behaviors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;See&lt;/strong&gt; your screen and notice errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remember&lt;/strong&gt; yesterday's context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sense&lt;/strong&gt; frustration from patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Celebrate&lt;/strong&gt; when tests pass&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decide&lt;/strong&gt; whether to speak or stay silent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adapt&lt;/strong&gt; timing to your rhythm&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reach out&lt;/strong&gt; when you've been too quiet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search&lt;/strong&gt; for answers when you're stuck&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;that's not one model doing one thing. that's eight distinct behaviors plus voice (VAD makes nine). so I decomposed the colleague into nine agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  the graph
&lt;/h2&gt;

&lt;p&gt;all nine agents run through Google ADK's workflow agents. the key insight: not all agents need each other's results. VisionAgent doesn't care about MemoryAgent's output. MoodDetector doesn't need CelebrationTrigger. so I split them into three waves:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Wave 1 — Perception (parallel)&lt;/span&gt;
&lt;span class="n"&gt;wave1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;parallelagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parallelagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;AgentConfig&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;"wave1_perception"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;SubAgents&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;visionAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memoryAgent&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c"&gt;// Wave 2 — Emotion (parallel)&lt;/span&gt;
&lt;span class="n"&gt;wave2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;parallelagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parallelagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;AgentConfig&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;"wave2_emotion"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;SubAgents&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;moodAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;celebrationAgent&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c"&gt;// Wave 3 — Decision (sequential, because each depends on the previous)&lt;/span&gt;
&lt;span class="n"&gt;wave3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sequentialagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sequentialagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;AgentConfig&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;"wave3_decision"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;SubAgents&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mediatorAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schedulerAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engagementAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;searchLoop&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c"&gt;// The full graph&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sequentialagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sequentialagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;AgentConfig&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;"vibecat_graph"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;SubAgents&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;wave1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wave2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wave3&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;waves 1 and 2 run in parallel — &lt;code&gt;parallelagent&lt;/code&gt; fires both sub-agents simultaneously. wave 3 runs sequentially because the Mediator needs mood + celebration results, the Scheduler needs the Mediator's decision, and so on.&lt;/p&gt;

&lt;p&gt;the result: ~35% latency reduction compared to running all 9 sequentially. from ~3.5 seconds down to ~2.1-2.5 seconds for the full graph. that matters when a developer is waiting for the cat to react to their screen.&lt;/p&gt;

&lt;h2&gt;
  
  
  the mediator problem
&lt;/h2&gt;

&lt;p&gt;making AI talk is easy. every LLM wants to talk. the hard part is making it know when to &lt;em&gt;shut up&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;the Mediator agent is the gatekeeper. it reads everything — vision analysis, mood state, celebration events — and makes one binary decision: speak or stay silent. here's the core logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;defaultCooldown&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;
    &lt;span class="n"&gt;moodCooldown&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;180&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;
    &lt;span class="n"&gt;highSignificance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;7&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;decide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vision&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VisionAnalysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mood&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MoodState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;celebration&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CelebrationEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MediatorDecision&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// ... read from state, check cooldown, check flow state&lt;/span&gt;

    &lt;span class="c"&gt;// celebration always bypasses cooldown&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;celebration&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;celebration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MediatorDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ShouldSpeak&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Reason&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"celebration"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// high significance + error = speak immediately&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Significance&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;highSignificance&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrorDetected&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MediatorDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ShouldSpeak&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Reason&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"error_detected"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Urgency&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// flow state = extend cooldown, stay silent&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;isInFlowState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MediatorDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ShouldSpeak&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Reason&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"flow_state"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// ... more rules&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;but it gets more nuanced. the Mediator also tracks recent speech to avoid repeating itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;isSimilarToRecent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// if we said something similar in the last 5 utterances, stay silent&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
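&lt;p&gt;the body of that check is elided above, but one plausible way to implement it (my guess, not the real code) is normalized word overlap against a small ring of recent utterances:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// isSimilarToRecent reports whether text shares most of its words with any
// recent utterance. This is a guess at the elided body, not the real code.
func isSimilarToRecent(recent []string, text string) bool {
	words := func(s string) map[string]bool {
		set := map[string]bool{}
		for _, w := range strings.Fields(strings.ToLower(s)) {
			set[w] = true
		}
		return set
	}
	cur := words(text)
	for _, prev := range recent {
		overlap := 0
		for w := range words(prev) {
			if cur[w] {
				overlap++
			}
		}
		// "similar" = more than half of the new utterance's words were just said
		if len(cur) > 0 && overlap*2 > len(cur) {
			return true
		}
	}
	return false
}

func main() {
	recent := []string{"looks like a nil pointer on line 23"}
	fmt.Println(isSimilarToRecent(recent, "nil pointer on line 23 again"))
	fmt.Println(isSimilarToRecent(recent, "tests are green nice work"))
}
```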



&lt;p&gt;and it generates mood-support messages dynamically using &lt;code&gt;gemini-3.1-flash-lite-preview&lt;/code&gt; when it detects sustained frustration but hasn't spoken about mood in the last 3 minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mood&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ShouldSpeak&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;sinceMood&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lastMoodSpoke&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sinceMood&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;moodCooldown&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generateMoodMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mood&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ShouldSpeak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
            &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"mood_support"&lt;/span&gt;
            &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lastMoodSpoke&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;no hardcoded messages in the normal path. every utterance is generated by the LLM, considering the developer's current context, mood, language, and what they're working on. the hardcoded pool exists only as a fallback when generation fails.&lt;/p&gt;

&lt;h2&gt;
  
  
  multimodal mood detection
&lt;/h2&gt;

&lt;p&gt;the MoodDetector doesn't just look at text. it fuses three signals:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Vision signals&lt;/strong&gt; — error frequency, repeated errors (same error 3+ times = frustrated), app switches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice tone&lt;/strong&gt; — from Gemini's AffectiveDialog, the Live API reports the emotional tone of the user's voice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temporal patterns&lt;/strong&gt; — how long since last interaction, silence duration, error-to-fix time
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;voiceTone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voiceConfidence&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;readVoiceToneFromState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mood&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voiceTone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voiceConfidence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the voice tone comes from ADK session state — the gateway extracts it from the Live API's AffectiveDialog output and writes it to &lt;code&gt;voice_tone&lt;/code&gt; in the session state. the MoodDetector reads it alongside the vision analysis to produce a fused mood classification.&lt;/p&gt;

&lt;p&gt;this is genuinely multimodal — not just "look at the screen" or "listen to the voice" but both, simultaneously, informing a single emotional model.&lt;/p&gt;

&lt;h2&gt;
  
  
  rest reminders and proactive engagement
&lt;/h2&gt;

&lt;p&gt;the EngagementAgent handles two kinds of proactive behavior:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;silence engagement&lt;/strong&gt; — if the developer hasn't interacted in 3 minutes, it speaks up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sinceLast&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;silenceThreshold&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ShouldSpeak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"silence_engagement"&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpeechText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generateSilenceMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;rest reminders&lt;/strong&gt; — the client tracks &lt;code&gt;activityMinutes&lt;/code&gt; from session start and sends it with every screen capture. after 50 minutes of continuous coding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;restReminderInterval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Minute&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;restReminderCooldown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Minute&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;activityMin&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;restReminderInterval&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Minutes&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;sinceLastReminder&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;restReminderCooldown&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ShouldSpeak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"rest_reminder"&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpeechText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generateRestMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activityMin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the full pipeline: macOS client calculates minutes since session start → sends &lt;code&gt;activityMinutes&lt;/code&gt; in the WebSocket payload → Gateway passes it to Orchestrator in &lt;code&gt;POST /analyze&lt;/code&gt; → EngagementAgent reads it from session state → triggers LLM-generated rest suggestion in the developer's language.&lt;/p&gt;

&lt;h2&gt;
  
  
  adk advanced features
&lt;/h2&gt;

&lt;p&gt;VibeCat doesn't just use ADK's basic agents. it uses the advanced stuff:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;retryandreflect&lt;/code&gt; plugin&lt;/strong&gt; — if an agent fails (network timeout, LLM error), it automatically reflects on why it failed and retries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"google.golang.org/adk/plugin/retryandreflect"&lt;/span&gt;

&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;graphAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Plugins&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Plugin&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;retryandreflect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryandreflect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithTrackingScope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryandreflect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Invocation&lt;/span&gt;&lt;span class="p"&gt;))},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;loopagent&lt;/code&gt;&lt;/strong&gt; — the SearchBuddy is wrapped in a loop agent that runs up to 2 iterations, refining search results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;searchLoop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;loopagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loopagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;AgentConfig&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;"search_refinement_loop"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;SubAgents&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;searchSubAgents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;MaxIterations&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;BeforeModel/AfterModel&lt;/code&gt; callbacks&lt;/strong&gt; — the LLM search agent hooks both sides of every model call for logging and guardrails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;llmSearchAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;llmagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llmagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;BeforeModelCallback&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CallbackContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LLMRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LLMResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[LLM_SEARCH] before model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AgentName&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;AfterModelCallback&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CallbackContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LLMResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LLMResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[LLM_SEARCH] after model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AgentName&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"has_error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;14 ADK features total. &lt;code&gt;agent.New&lt;/code&gt;, &lt;code&gt;sequentialagent&lt;/code&gt;, &lt;code&gt;parallelagent&lt;/code&gt;, &lt;code&gt;loopagent&lt;/code&gt;, &lt;code&gt;llmagent&lt;/code&gt;, &lt;code&gt;session.InMemoryService&lt;/code&gt;, &lt;code&gt;memory.InMemoryService&lt;/code&gt;, &lt;code&gt;runner.New&lt;/code&gt;, &lt;code&gt;telemetry&lt;/code&gt;, &lt;code&gt;session.State&lt;/code&gt;, &lt;code&gt;functiontool&lt;/code&gt;, &lt;code&gt;geminitool.GoogleSearch&lt;/code&gt;, &lt;code&gt;retryandreflect&lt;/code&gt;, and &lt;code&gt;BeforeModel/AfterModel&lt;/code&gt; callbacks.&lt;/p&gt;

&lt;h2&gt;
  
  
  what I learned
&lt;/h2&gt;

&lt;p&gt;the hardest thing about building a multi-agent system isn't the graph. it's the boundaries. when does MoodDetector's responsibility end and Mediator's begin? who owns the "should I speak" decision when both EngagementAgent and Mediator have opinions?&lt;/p&gt;

&lt;p&gt;the answer that worked: each agent writes to session state, and downstream agents read from it. no agent calls another agent directly. the graph topology IS the API contract. Vision writes &lt;code&gt;vision_analysis&lt;/code&gt; to state. Mood reads it and writes &lt;code&gt;mood_state&lt;/code&gt;. Mediator reads both. clean, testable, and you can swap any agent without touching the others.&lt;/p&gt;
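&lt;p&gt;the pattern is trivial to sketch outside ADK. a toy version — the agent logic here is made up for illustration, and the real agents are LLM-backed and use ADK's &lt;code&gt;session.State&lt;/code&gt; rather than a bare map:&lt;/p&gt;

```go
package main

import "fmt"

// State is a minimal stand-in for ADK session state: the shared blackboard
// every agent reads from and writes to.
type State map[string]any

// Each agent is just a function over shared state -- no agent calls another.
// The keys are the contract; the values below are illustrative.
func visionAgent(s State) { s["vision_analysis"] = "editor open, test failing" }

func moodAgent(s State) { s["mood_state"] = "frustrated" } // reads vision_analysis in the real graph

func mediatorAgent(s State) {
	// Mediator owns the final decision by reading what upstream agents wrote.
	speak := false
	if s["vision_analysis"] != nil {
		if s["mood_state"] == "frustrated" {
			speak = true
		}
	}
	s["should_speak"] = speak
}

func main() {
	s := State{}
	// Waves run in order; each wave only reads what earlier waves wrote,
	// so any agent can be swapped without touching the others.
	visionAgent(s)
	moodAgent(s)
	mediatorAgent(s)
	fmt.Println(s["should_speak"]) // prints true
}
```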

&lt;p&gt;nine agents. three waves. one decision. and a cat that knows when to shut up.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building VibeCat for the &lt;a href="https://geminiliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt;. Source: &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>go</category>
    </item>
  </channel>
</rss>
