<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: KimSejun</title>
    <description>The latest articles on DEV Community by KimSejun (@combba).</description>
    <link>https://dev.to/combba</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1873529%2Feef51b36-551e-4054-8a85-2ba299011dd6.png</url>
      <title>DEV Community: KimSejun</title>
      <link>https://dev.to/combba</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/combba"/>
    <language>en</language>
    <item>
      <title>submitting vibecat: what 3 weeks of building a desktop AI actually taught me</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Mon, 16 Mar 2026 13:58:17 +0000</pubDate>
      <link>https://dev.to/combba/submitting-vibecat-what-3-weeks-of-building-a-desktop-ai-actually-taught-me-1409</link>
      <guid>https://dev.to/combba/submitting-vibecat-what-3-weeks-of-building-a-desktop-ai-actually-taught-me-1409</guid>
      <description>&lt;h1&gt;
  
  
  submitting vibecat: what 3 weeks of building a desktop AI actually taught me
&lt;/h1&gt;

&lt;p&gt;I wrote this post to enter the Gemini Live Agent Challenge. But this isn't a pitch — it's what I actually learned.&lt;/p&gt;

&lt;p&gt;Three weeks ago, VibeCat was a failed video transcription app called Missless. Today it's a proactive desktop companion that watches your screen, suggests actions before you ask, and moves your mouse to execute them. We pivoted, rebuilt from scratch, shipped to Google Cloud Run, and submitted to Devpost with hours to spare.&lt;/p&gt;

&lt;p&gt;Here's the honest retrospective.&lt;/p&gt;

&lt;h2&gt;
  
  
  the pivot that saved us
&lt;/h2&gt;

&lt;p&gt;Missless was supposed to do real-time video transcription with Gemini. It worked — technically. But the demo was boring. You'd talk, text appeared on screen, end of demo. No one would watch that for 4 minutes.&lt;/p&gt;

&lt;p&gt;The pivot happened on March 4th. We asked ourselves: what if instead of processing video passively, the AI could &lt;em&gt;see&lt;/em&gt; the screen and &lt;em&gt;act&lt;/em&gt; on what it sees? Not a chatbot. Not a voice assistant. A colleague who happens to be a cat.&lt;/p&gt;

&lt;p&gt;VibeCat's core loop came from a single frustrated user request: "LOOK! DECIDE! MOVE! CLICK! VERIFY!" — five words, shouted in all caps. That became the product.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbk7101ulasbvw04b84lz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbk7101ulasbvw04b84lz.jpg" alt="VibeCat — Your Proactive Desktop Companion" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  the architecture that actually shipped
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;macOS Client (Swift 6.2)
  → WebSocket
    → Realtime Gateway (Go, Cloud Run)
      → Gemini Live API (voice + vision + FC)
      → ADK Orchestrator (screenshot analysis)
      → Firestore (session state)
      → Cloud Logging (observability)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three services. One WebSocket connection. Five function-calling tools. That's it.&lt;/p&gt;

&lt;p&gt;The simplicity was intentional. We had 3 weeks. Every architectural decision was "what's the minimum that works reliably?" We considered an event-driven microservices setup with separate vision, NLU, and action services. We considered a local-first architecture with on-device models. We considered both and built neither. One gateway, one orchestrator, one client.&lt;/p&gt;

&lt;p&gt;The pendingFC mechanism — where function calls queue up and execute strictly one at a time with verification between each — was the most important architectural decision. It added latency but eliminated an entire category of bugs where Gemini would fire three actions simultaneously and corrupt the UI state.&lt;/p&gt;

&lt;h2&gt;
  
  
  three scenarios, three different nightmares
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;YouTube Music (S1):&lt;/strong&gt; The hardest. YouTube renders player controls on &lt;code&gt;&amp;lt;canvas&amp;gt;&lt;/code&gt;, invisible to the Accessibility API. Our first approach — keyboard shortcuts — worked but looked robotic. Our second approach — vision-based mouse control — looked incredible but failed 40% of the time because Gemini's coordinate estimates were off by 20+ pixels on Retina displays. The solution: try AX first, fall back to CDP (Chrome DevTools Protocol), then fall back to vision coordinates. Self-healing with max 2 retries. Final success rate: 94%.&lt;/p&gt;
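
&lt;p&gt;An aside on that Retina offset: the usual culprit is pixel-vs-point confusion, because screen captures come back in device pixels while click events take screen points. A minimal sketch of the rescaling step (function name and resolutions here are illustrative, not VibeCat's actual code):&lt;/p&gt;

```go
package main

import "fmt"

// toScreenPoint rescales a coordinate estimated on the captured screenshot
// back into macOS point space. On a 2x Retina display the capture is in
// device pixels while click events take points, so skipping this divide
// lands clicks far off target. Names and resolutions are illustrative.
func toScreenPoint(px, py int, shotWidth, screenWidth float64) (float64, float64) {
	scale := screenWidth / shotWidth // e.g. 1440pt screen, 2880px capture: 0.5
	return float64(px) * scale, float64(py) * scale
}

func main() {
	// A 2880px-wide capture on a 1440pt display: pixel (1694, 846)
	// maps to point (847, 423).
	x, y := toScreenPoint(1694, 846, 2880, 1440)
	fmt.Println(x, y)
}
```

&lt;p&gt;Rescaling only removes the systematic half of the error; the model's own estimation noise is what the fallback chain absorbs.&lt;/p&gt;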

&lt;p&gt;&lt;strong&gt;Code Enhancement (S2):&lt;/strong&gt; VibeCat reads your code in Antigravity IDE, proactively suggests improving documentation, types "Enhance the comments for this code" into Gemini Chat, and lets the AI rewrite. This was surprisingly stable — 100% success rate after the first week. The trick was using &lt;code&gt;navigate_text_entry&lt;/code&gt; with the AX tree instead of trying to click into the text field.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal Automation (S3):&lt;/strong&gt; VibeCat switches to Terminal, runs &lt;code&gt;go vet ./...&lt;/code&gt;, and verifies the output. Also 100% after stabilization. Terminal is the most AX-friendly app on macOS — every element is properly labeled and positioned.&lt;/p&gt;

&lt;h2&gt;
  
  
  what gemini live API actually enables
&lt;/h2&gt;

&lt;p&gt;I'd used Gemini's regular API before. The Live API is a different experience entirely.&lt;/p&gt;

&lt;p&gt;The killer feature isn't voice or vision individually — it's that they exist in the same session simultaneously. VibeCat can see a screenshot, hear the user say "yeah, fix that," understand both inputs in context, and issue a function call — all in one streaming session with sub-second latency.&lt;/p&gt;

&lt;p&gt;Function Calling over Live API was the primitive that made proactive desktop control possible. Without it, we'd need a separate intent classification step, a separate action planning step, and a separate execution step — each adding latency and losing context. With FC, Gemini does all three in one inference pass.&lt;/p&gt;

&lt;p&gt;The gotcha: Gemini sometimes hallucinates tool usage. It says "I've typed the command" without actually calling the tool. Our inject-text approach had a 40% failure rate from this. The fix was simple but non-obvious — send actual user voice input instead of programmatic text injection. When the user speaks, Gemini takes the FC path; when we inject text, Gemini sometimes takes the "just respond" path.&lt;/p&gt;
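
&lt;p&gt;One cheap mitigation is a guard that flags any turn claiming an action while carrying no function call. A sketch with simplified stand-in types (the real Live API response shapes are richer than this):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// ModelTurn is a simplified stand-in for one model response turn; the real
// Live API types are richer. ToolCalls holds names of emitted function calls.
type ModelTurn struct {
	Text      string
	ToolCalls []string
}

// Phrases that suggest the model believes it already acted.
var actionClaims = []string{"i've typed", "i typed", "i've clicked", "i clicked"}

// looksHallucinated flags a turn that claims an action while emitting no
// function call at all, the exact failure mode described above.
func looksHallucinated(t ModelTurn) bool {
	if len(t.ToolCalls) > 0 {
		return false
	}
	lower := strings.ToLower(t.Text)
	for _, phrase := range actionClaims {
		if strings.Contains(lower, phrase) {
			return true
		}
	}
	return false
}

func main() {
	turn := ModelTurn{Text: "I've typed the command for you."}
	fmt.Println(looksHallucinated(turn)) // this claim has no tool call behind it
}
```

&lt;p&gt;A guard like this only detects the problem; routing real voice input instead of injected text is what actually prevented it.&lt;/p&gt;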

&lt;h2&gt;
  
  
  the self-healing engine
&lt;/h2&gt;

&lt;p&gt;Most automation agents fail and stop. VibeCat fails, switches to a different approach, and tries again — all while narrating what it's doing to the user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;🔍 Analyzing screen...
▶️ Clicking play button [AX]
⚠️ Button not found — retrying with CDP
▶️ Clicking play button [CDP]  
⚠️ CDP target unavailable — retrying with Vision
▶️ Moving cursor to (847, 423) [Vision]
✅ Music is playing!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three grounding sources (Accessibility API, Chrome DevTools Protocol, Vision coordinates), max 2 retries, post-action verification via ADK screenshot analysis. The transparent narration turned out to be more important than the retry logic itself — users who see the recovery process trust the system. Users who see silent failure don't.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvg921iixvb88879fxzj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvg921iixvb88879fxzj.jpg" alt="Enhanced code with Gemini analysis panel" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78n4vzcdk1fmybi0jb6j.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78n4vzcdk1fmybi0jb6j.jpg" alt="System architecture diagram" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0mxoptbmoj9fkvfcmzx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0mxoptbmoj9fkvfcmzx.jpg" alt="Running on Google Cloud Platform" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  the demo video pipeline
&lt;/h2&gt;

&lt;p&gt;The demo video deserves its own post, but briefly: we built a fully automated pipeline with Gemini TTS (Zephyr voice for the cat, Charon for narration), MiniMax cloned voice for the human narrator, background music from the actual YouTube Music song played in the demo, and ffmpeg for composition. The dubbing script is a JSON file with millisecond-precision timestamps. Running one shell script regenerates the entire video from source clips.&lt;/p&gt;

&lt;h2&gt;
  
  
  numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;17 dev.to devlogs&lt;/strong&gt; published during the challenge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 demo scenarios&lt;/strong&gt; all passing E2E&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5 FC tools&lt;/strong&gt; for desktop control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;80+ key codes&lt;/strong&gt; mapped in AccessibilityNavigator&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 grounding sources&lt;/strong&gt; with automatic fallback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10+ Cloud Run deployments&lt;/strong&gt; during the final week&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;144 seconds&lt;/strong&gt; of demo video, fully dubbed with 3 distinct voices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 cat&lt;/strong&gt; who never sleeps&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  what I'd tell someone starting a similar project
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with the demo, not the architecture.&lt;/strong&gt; We spent the first week building infrastructure and the last week desperately trying to make it demo-ready. If I could restart, I'd record a fake demo video on day one and work backward from "what needs to be real."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proactive AI is harder than reactive AI.&lt;/strong&gt; It's easy to make an agent that responds to commands. It's hard to make one that speaks up at the right moment without being annoying. The confirmation gate — always waiting for user approval — was the single most important UX decision. It makes proactive feel safe instead of scary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Narrate everything.&lt;/strong&gt; Silent processing feels broken. Transparent processing feels collaborative. Show the user what the AI is doing, why it's doing it, which tool it's using, and whether it succeeded. The overlay panel cost us two days to build and was worth every hour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini Live API + Function Calling is genuinely powerful.&lt;/strong&gt; Real-time multimodal input with structured tool invocation in a single session — this combination enables interaction patterns that weren't possible before. It's not perfect (hallucinated tool calls are real), but it's the right foundation for desktop AI agents.&lt;/p&gt;

&lt;p&gt;VibeCat started as a joke name. "Vibe coding, but with a cat." Three weeks later, the cat watches your screen, suggests improvements, moves your mouse, and verifies its own work. It's still a cat. But now it's a pretty capable one.&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>go</category>
    </item>
    <item>
      <title>when your agent fails, does it just... stop?</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Mon, 16 Mar 2026 13:57:37 +0000</pubDate>
      <link>https://dev.to/combba/when-your-agent-fails-does-it-just-stop-5c77</link>
      <guid>https://dev.to/combba/when-your-agent-fails-does-it-just-stop-5c77</guid>
      <description>&lt;h1&gt;
  
  
  when your agent fails, does it just... stop?
&lt;/h1&gt;

&lt;p&gt;I wrote this post to enter the Gemini Live Agent Challenge. But this particular problem — what happens when an AI action fails — is something every agent builder needs to solve.&lt;/p&gt;

&lt;p&gt;Most desktop automation tools have a dirty secret: they're fragile. Click the wrong pixel, target an element that moved, or encounter an unexpected dialog — and the whole sequence collapses. The user sees "Error" and reaches for the keyboard.&lt;/p&gt;

&lt;p&gt;VibeCat's self-healing engine was built because we got tired of watching our cat give up.&lt;/p&gt;

&lt;h2&gt;
  
  
  the failure taxonomy
&lt;/h2&gt;

&lt;p&gt;After running hundreds of test sequences across three apps (Antigravity IDE, Terminal, Chrome), we cataloged the failure modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AX target not found&lt;/strong&gt; — The Accessibility API says the element doesn't exist. Usually because the app hasn't finished rendering, or because the element is inside a canvas/WebGL surface. Frequency: ~15% of first attempts on Chrome.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AX target found but wrong&lt;/strong&gt; — The element exists but it's the wrong one. A "Play" button that's actually in a different panel, or a text field that looks right but belongs to a different component. Frequency: ~5%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Click landed but nothing happened&lt;/strong&gt; — The coordinates were correct, the click fired, but the UI didn't respond. Common with YouTube's debounced event handlers. Frequency: ~10% on YouTube Music.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action succeeded but verification failed&lt;/strong&gt; — VibeCat typed the text and it appeared, but the post-action screenshot shows an error dialog or unexpected state. Frequency: ~3%.&lt;/p&gt;

&lt;h2&gt;
  
  
  max 2 retries, alternative grounding
&lt;/h2&gt;

&lt;p&gt;The self-healing engine is deliberately simple. No complex state machines, no machine learning. Just two rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Max 2 retries per step.&lt;/strong&gt; If it fails three times, stop and tell the user.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Each retry uses a different grounding source.&lt;/strong&gt; Don't repeat what already failed.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attempt 1: AX targeting
  → Failed: element not in AX tree

Attempt 2: CDP targeting (chromedp)
  → Failed: Chrome DevTools can't find matching DOM node

Attempt 3: Vision coordinates (Gemini screenshot analysis)
  → Success: clicked at (847, 423), verification passed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The grounding source priority chain is AX → CDP → Vision. But the engine is smart enough to skip sources that don't apply — if you're in Terminal (no browser), CDP is skipped entirely.&lt;/p&gt;
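
&lt;p&gt;The skip logic is small enough to show in full. A simplified sketch (type names and the app check are illustrative; the real client inspects the running process rather than matching names):&lt;/p&gt;

```go
package main

import "fmt"

// GroundingSource identifies how a UI target gets located.
type GroundingSource string

const (
	AX     GroundingSource = "ax"
	CDP    GroundingSource = "cdp"
	Vision GroundingSource = "vision"
)

// applicableSources sketches the skip logic: CDP only applies when the
// frontmost app exposes a DevTools endpoint. App detection is reduced to a
// name check here; the real client inspects the running process.
func applicableSources(frontApp string) []GroundingSource {
	if frontApp == "Google Chrome" {
		return []GroundingSource{AX, CDP, Vision}
	}
	// Terminal, IDEs, etc. have no DevTools target, so CDP is skipped.
	return []GroundingSource{AX, Vision}
}

func main() {
	fmt.Println(applicableSources("Terminal"))
	fmt.Println(applicableSources("Google Chrome"))
}
```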

&lt;p&gt;Here's the core logic in &lt;code&gt;handler.go&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;executeWithHealing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;GroundingSource&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;AX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CDP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Vision&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;maxRetries&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

        &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executeStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;verified&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verifyErr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verifyStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;verifyErr&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;verified&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;emitProcessingState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"retrying_step"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"self-healing retry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"step"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"attempt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="s"&gt;"failed_source"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"next_source"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"step %s failed after %d attempts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxRetries&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  vision verification: the trust layer
&lt;/h2&gt;

&lt;p&gt;Every action — whether it's typing text, clicking a button, or opening a URL — ends with a verification step. VibeCat captures a fresh screenshot and sends it to the ADK Orchestrator with a specific question: "Did the action succeed?"&lt;/p&gt;

&lt;p&gt;This isn't just "did the click register?" It's semantic verification:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After typing "go vet ./..." in Terminal → verify the command output shows "no issues"&lt;/li&gt;
&lt;li&gt;After clicking Play on YouTube Music → verify the video element is no longer paused&lt;/li&gt;
&lt;li&gt;After opening a URL → verify the expected page content is visible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ADK Orchestrator uses Gemini's vision model for this analysis. It returns a confidence score and a natural-language explanation. If confidence is below the threshold, the step is marked as failed and healing kicks in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;verification:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"explanation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The play button appears unchanged. 
                  The video progress bar has not moved."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;trigger&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;retry&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;CDP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;grounding&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
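
&lt;p&gt;Deciding whether healing kicks in is then a small predicate over that payload. A sketch, with the 0.7 cutoff as a placeholder value rather than the real threshold:&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Verification mirrors the payload shown above. The 0.7 cutoff is a
// placeholder; the real threshold may differ.
type Verification struct {
	Success     bool    `json:"success"`
	Confidence  float64 `json:"confidence"`
	Explanation string  `json:"explanation"`
}

const confidenceThreshold = 0.7

// shouldHeal marks a step failed when verification fails outright or when
// confidence falls below the cutoff, handing control to the retry chain.
func shouldHeal(v Verification) bool {
	if !v.Success {
		return true
	}
	return confidenceThreshold > v.Confidence
}

func main() {
	raw := `{"success": false, "confidence": 0.3, "explanation": "The play button appears unchanged."}`
	v := new(Verification)
	if err := json.Unmarshal([]byte(raw), v); err != nil {
		panic(err)
	}
	fmt.Println(shouldHeal(*v)) // a low-confidence failure triggers a retry
}
```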



&lt;h2&gt;
  
  
  the pendingFC queue: no racing allowed
&lt;/h2&gt;

&lt;p&gt;One subtle failure mode we discovered: Gemini sometimes issues multiple function calls in rapid succession. "Focus Terminal, then type &lt;code&gt;go vet ./...&lt;/code&gt;, then press Enter." If these execute in parallel, &lt;code&gt;go vet&lt;/code&gt; might get typed into the wrong window because &lt;code&gt;focus_app&lt;/code&gt; hasn't completed yet.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;pendingFC&lt;/code&gt; mechanism solves this with strict sequential execution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Gemini sends FC calls → queued in &lt;code&gt;pendingFC&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Gateway sends step 1 to client&lt;/li&gt;
&lt;li&gt;Client executes, captures verification screenshot&lt;/li&gt;
&lt;li&gt;Gateway confirms step 1 → sends step 2&lt;/li&gt;
&lt;li&gt;Repeat until queue is empty&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No step starts until the previous step's verification passes. This adds latency (~200ms per step for verification) but eliminates an entire class of race condition bugs.&lt;/p&gt;
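
&lt;p&gt;Stripped of transport details, the queue discipline is just a loop that refuses to advance until verification passes. A sketch, with &lt;code&gt;exec&lt;/code&gt; and &lt;code&gt;verify&lt;/code&gt; as stand-ins for the client round-trip and the ADK check:&lt;/p&gt;

```go
package main

import "fmt"

// FunctionCall is a simplified pending tool invocation; real entries carry
// the arguments Gemini supplied.
type FunctionCall struct {
	Name string
}

// runQueue drains the pending calls strictly in order: each one must execute
// and pass verification before the next is dispatched. exec and verify stand
// in for the client round-trip and the ADK screenshot check.
func runQueue(pending []FunctionCall, exec func(FunctionCall) error, verify func(FunctionCall) bool) error {
	for i, fc := range pending {
		if err := exec(fc); err != nil {
			return fmt.Errorf("step %d (%s): %w", i+1, fc.Name, err)
		}
		if !verify(fc) {
			return fmt.Errorf("step %d (%s): verification failed", i+1, fc.Name)
		}
		// Only now does the next queued call go out, so type_text can never
		// race ahead of focus_app.
	}
	return nil
}

func main() {
	var order []string
	calls := []FunctionCall{{Name: "focus_app"}, {Name: "type_text"}, {Name: "press_enter"}}
	err := runQueue(calls,
		func(fc FunctionCall) error { order = append(order, fc.Name); return nil },
		func(fc FunctionCall) bool { return true },
	)
	fmt.Println(err == nil, order)
}
```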

&lt;h2&gt;
  
  
  transparent narration: failures feel collaborative
&lt;/h2&gt;

&lt;p&gt;The most impactful design decision wasn't technical — it was UX. VibeCat narrates every step through the overlay panel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔍 Reading screen...
📋 Planning 3 steps
▶️ Step 1/3: Focusing Terminal [AX]
⚠️ Retrying Step 1 — switching to CDP
✅ Step 1/3: Terminal focused
▶️ Step 2/3: Typing command...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users who watched VibeCat fail silently reported it as "broken." Users who watched the same failure with narration reported it as "working through a problem." Same outcome, completely different perception.&lt;/p&gt;

&lt;p&gt;The seven processing stages (&lt;code&gt;analyzing_command&lt;/code&gt;, &lt;code&gt;planning_steps&lt;/code&gt;, &lt;code&gt;executing_step&lt;/code&gt;, &lt;code&gt;verifying_result&lt;/code&gt;, &lt;code&gt;retrying_step&lt;/code&gt;, &lt;code&gt;completing&lt;/code&gt;, &lt;code&gt;observing_screen&lt;/code&gt;) each have localized labels in English, Korean, and Japanese. The overlay shows a grounding source badge (AX / Vision / Hotkey / System) so you always know &lt;em&gt;how&lt;/em&gt; VibeCat is interacting with your screen.&lt;/p&gt;
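
&lt;p&gt;The localization table itself is nothing fancy, a nested map with an English fallback. Two stages shown here, with the localized strings abbreviated from what actually ships:&lt;/p&gt;

```go
package main

import "fmt"

// stageLabels shows the shape of the localization table for two of the
// seven stages; the shipped strings are abbreviated here.
var stageLabels = map[string]map[string]string{
	"observing_screen": {
		"en": "Reading screen...",
		"ko": "화면을 읽는 중...",
		"ja": "画面を確認中...",
	},
	"retrying_step": {
		"en": "Retrying step...",
		"ko": "단계 재시도 중...",
		"ja": "ステップを再試行中...",
	},
}

// label resolves a stage name for a locale, falling back to English so an
// unsupported locale never leaves the overlay blank.
func label(stage, locale string) string {
	if s, ok := stageLabels[stage][locale]; ok {
		return s
	}
	return stageLabels[stage]["en"]
}

func main() {
	fmt.Println(label("observing_screen", "ko"))
	fmt.Println(label("retrying_step", "fr")) // no French table: English fallback
}
```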

&lt;h2&gt;
  
  
  numbers that matter
&lt;/h2&gt;

&lt;p&gt;After implementing self-healing, our end-to-end success rates across 50 test runs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Without healing&lt;/th&gt;
&lt;th&gt;With healing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;YouTube Music play&lt;/td&gt;
&lt;td&gt;62%&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code comment enhancement&lt;/td&gt;
&lt;td&gt;88%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Terminal go vet&lt;/td&gt;
&lt;td&gt;91%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The remaining 6% failure on YouTube Music is almost entirely due to network latency — the page hasn't finished loading when VibeCat tries to click. A simple "wait for page ready" check would probably push it to 98%+.&lt;/p&gt;

&lt;h2&gt;
  
  
  what I learned
&lt;/h2&gt;

&lt;p&gt;Self-healing isn't about being clever. It's about being systematic. Catalog your failures, build a fallback chain, verify every step, and tell the user what's happening. The hard part isn't the retry logic — it's the verification. Without reliable post-action verification, you're just clicking blindly and hoping.&lt;/p&gt;

&lt;p&gt;And narrate everything. Always narrate everything. Silent AI feels broken. Transparent AI feels collaborative.&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>go</category>
    </item>
    <item>
      <title>teaching a cat to use a mouse — literally</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Mon, 16 Mar 2026 10:24:28 +0000</pubDate>
      <link>https://dev.to/combba/teaching-a-cat-to-use-a-mouse-literally-5go1</link>
      <guid>https://dev.to/combba/teaching-a-cat-to-use-a-mouse-literally-5go1</guid>
      <description>&lt;h1&gt;
  
  
  teaching a cat to use a mouse — literally
&lt;/h1&gt;

&lt;p&gt;I wrote this post to enter the Gemini Live Agent Challenge, and honestly, this was the feature that almost broke us.&lt;/p&gt;

&lt;p&gt;Our user's feedback was blunt: "Why aren't you using vision to control the mouse directly?" And then, more specifically: "The cursor should glide smoothly, find its target visually, move again, and click — that's the WOW factor."&lt;/p&gt;

&lt;p&gt;He was right. Sending keyboard shortcuts and accessibility API calls is reliable, but it looks like a script running. A cursor that &lt;em&gt;glides&lt;/em&gt; across the screen, finds its target visually, and clicks — that looks like intelligence.&lt;/p&gt;

&lt;p&gt;So we built the LOOK → DECIDE → MOVE → CLICK → VERIFY pipeline.&lt;/p&gt;
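
&lt;p&gt;Stripped to its skeleton, the loop reads exactly like the shout. Each stage below is a stand-in func, not the real implementation:&lt;/p&gt;

```go
package main

import "fmt"

// Target is what DECIDE produces: coordinates plus a semantic label for the
// narration overlay.
type Target struct {
	X, Y  int
	Label string
}

// runPipeline wires the five stages in order. Every func parameter is a
// stand-in: the real LOOK uses ScreenCaptureKit, DECIDE calls Gemini vision,
// MOVE animates the cursor, CLICK posts the event, VERIFY re-screenshots.
func runPipeline(
	look func() []byte,
	decide func(shot []byte) Target,
	move func(t Target),
	click func(t Target),
	verify func(t Target) bool,
) bool {
	shot := look()         // LOOK: capture the screen
	target := decide(shot) // DECIDE: what to click, and why
	move(target)           // MOVE: glide the cursor over
	click(target)          // CLICK: fire the event
	return verify(target)  // VERIFY: confirm the screen actually changed
}

func main() {
	ok := runPipeline(
		func() []byte { return []byte("fake screenshot") },
		func(shot []byte) Target { return Target{X: 847, Y: 423, Label: "Play button"} },
		func(t Target) { fmt.Printf("gliding to (%d, %d)\n", t.X, t.Y) },
		func(t Target) { fmt.Println("clicking", t.Label) },
		func(t Target) bool { return true },
	)
	fmt.Println("verified:", ok)
}
```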

&lt;h2&gt;
  
  
  the five-stage pipeline
&lt;/h2&gt;

&lt;p&gt;Here's what happens when VibeCat decides to click something on your screen:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LOOK&lt;/strong&gt; — VibeCat captures a screenshot via ScreenCaptureKit. This isn't a polling loop; it's triggered when the gateway's proactive companion decides an action is needed. The screenshot goes to Gemini's vision model along with the current AX (Accessibility) snapshot for context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DECIDE&lt;/strong&gt; — Gemini analyzes the screenshot and returns a target. This could be "the Play button on YouTube Music at approximately (847, 423)" or "the text field labeled 'Search' in the Antigravity IDE sidebar." The key insight: we don't just get coordinates. We get a semantic description of &lt;em&gt;what&lt;/em&gt; to click and &lt;em&gt;why&lt;/em&gt;, which feeds into the transparent feedback overlay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MOVE&lt;/strong&gt; — &lt;code&gt;animateCursorTo&lt;/code&gt; in &lt;code&gt;AccessibilityNavigator.swift&lt;/code&gt; smoothly interpolates the cursor position over ~300ms using a cubic easing curve. This is purely cosmetic, but it's what makes VibeCat feel like a colleague reaching for the mouse rather than a teleporting robot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;animateCursorTo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="nv"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CGPoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;TimeInterval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;NSEvent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mouseLocation&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// 60fps&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;eased&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// smoothstep&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;eased&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;eased&lt;/span&gt;
        &lt;span class="kt"&gt;CGEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;mouseEventSource&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;mouseType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mouseMoved&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="nv"&gt;mouseCursorPosition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CGPoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="nv"&gt;mouseButton&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;left&lt;/span&gt;&lt;span class="p"&gt;)?&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;tap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cghidEventTap&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="kt"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;forTimeInterval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CLICK&lt;/strong&gt; — A CGEvent mouse click at the current cursor position. Simple, but the timing matters — we add a 50ms delay after the final move to let the OS register the cursor position before clicking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERIFY&lt;/strong&gt; — Another screenshot capture, sent to the ADK Orchestrator for vision analysis. "Did the button state change? Is the expected content now visible?" If verification fails, the self-healing engine kicks in with an alternative grounding strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  three grounding sources, one fallback chain
&lt;/h2&gt;

&lt;p&gt;The real complexity isn't in clicking — it's in &lt;em&gt;finding the right thing to click&lt;/em&gt;. VibeCat uses three grounding sources in priority order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Accessibility API (AX)&lt;/strong&gt; — The gold standard. macOS exposes UI elements with roles, labels, and positions. When it works, it's pixel-perfect. But YouTube Music renders its player controls on a &lt;code&gt;&amp;lt;canvas&amp;gt;&lt;/code&gt; element — completely invisible to AX.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chrome DevTools Protocol (CDP)&lt;/strong&gt; — For browser elements AX can't see. Our Go gateway runs &lt;code&gt;chromedp&lt;/code&gt; to query DOM elements, get bounding boxes, and execute JavaScript. This catches most canvas-rendered controls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Vision coordinates&lt;/strong&gt; — The last resort. Send a screenshot to Gemini, ask "where is the play button?", get approximate pixel coordinates. Less reliable, but it works on literally anything visible on screen.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The self-healing engine (max 2 retries) walks down this chain automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1: Try AX targeting
  → Failed (element not found in AX tree)
Step 2: Try CDP targeting  
  → Failed (Chrome not exposing this element via CDP)
Step 3: Try vision coordinates
  → Got (847, 423), move cursor, click
  → Verify: screenshot shows music is now playing ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hb9v2bqqiyg7551kyhw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hb9v2bqqiyg7551kyhw.jpg" alt="Cat confirms music is playing" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  the YouTube Music problem
&lt;/h2&gt;

&lt;p&gt;YouTube Music was our hardest surface. The player controls are canvas-rendered, the site is a single-page app that mutates state without URL changes, and the search results list doesn't expose individual items as clickable AX elements.&lt;/p&gt;

&lt;p&gt;Our solution was multi-layered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open YouTube Music via &lt;code&gt;navigate_open_url&lt;/code&gt; with the search query pre-filled in the URL&lt;/li&gt;
&lt;li&gt;Wait for results to load (vision verification of the page state)&lt;/li&gt;
&lt;li&gt;Use vision to find the target song/playlist&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;animateCursorTo&lt;/code&gt; to the result&lt;/li&gt;
&lt;li&gt;Click via CGEvent&lt;/li&gt;
&lt;li&gt;Verify playback started via CDP &lt;code&gt;document.querySelector('video').paused === false&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If verification fails, fallback to &lt;code&gt;video.play()&lt;/code&gt; via JavaScript injection&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We ran this sequence 5 times consecutively in our rehearsal protocol. It passed every time — but only after we added the &lt;code&gt;video.play()&lt;/code&gt; fallback. Pure vision-based clicking had about a 60% success rate on first attempt because Gemini's coordinate estimates were sometimes off by 20-30 pixels.&lt;/p&gt;

&lt;h2&gt;
  
  
  80 key codes and counting
&lt;/h2&gt;

&lt;p&gt;Beyond mouse control, &lt;code&gt;AccessibilityNavigator.swift&lt;/code&gt; maps 80+ macOS key codes for keyboard automation. Things like &lt;code&gt;Cmd+Shift+5&lt;/code&gt; to start screen recording, &lt;code&gt;Cmd+Tab&lt;/code&gt; to switch apps, or &lt;code&gt;Ctrl+A&lt;/code&gt; to jump to the start of the line in Terminal. Each key code was manually verified across our three gold-tier surfaces: Antigravity IDE, Terminal, and Chrome.&lt;/p&gt;
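&lt;p&gt;For a flavor of what that table looks like, here's a small sketch in Go (the real map lives in Swift; the values are Carbon's &lt;code&gt;kVK_*&lt;/code&gt; virtual key codes, and the lookup function is my own illustration):&lt;/p&gt;

```go
package main

import "fmt"

// A small sample of the name-to-virtual-key-code table; the full map in
// AccessibilityNavigator.swift reportedly covers 80+ keys.
var keyCodes = map[string]uint16{
	"a": 0x00, "p": 0x23, "5": 0x17,
	"return": 0x24, "tab": 0x30, "space": 0x31,
	"escape": 0x35, "command": 0x37, "shift": 0x38,
	"option": 0x3A, "control": 0x3B,
}

// hotkeyCodes resolves a chord like ["command", "shift", "5"] to key codes,
// failing the whole chord if any key name is unknown.
func hotkeyCodes(chord []string) ([]uint16, bool) {
	out := make([]uint16, 0, len(chord))
	for _, k := range chord {
		code, ok := keyCodes[k]
		if !ok {
			return nil, false
		}
		out = append(out, code)
	}
	return out, true
}

func main() {
	codes, ok := hotkeyCodes([]string{"command", "shift", "5"})
	fmt.Println(ok, codes) // true [55 56 23]
}
```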

&lt;p&gt;The overlay panel shows all of this in real time — which grounding source is being used, which step of the pipeline we're in, and whether the last verification passed or failed. Users never see a black box. They see VibeCat &lt;em&gt;working&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  what I'd do differently
&lt;/h2&gt;

&lt;p&gt;Honestly? I'd invest more in vision coordinate calibration. The 20-30 pixel offset on Retina displays cost us hours of debugging. We eventually solved it by preferring semantic AX targeting wherever possible and only falling back to raw coordinates as a last resort. But if we'd built a proper coordinate calibration system (test click → verify → adjust offset) from day one, the vision path would have been much more reliable.&lt;/p&gt;
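&lt;p&gt;That calibration loop is simple enough to sketch. This is hypothetical code, not something VibeCat ships, and it assumes vision can report where a test click actually landed:&lt;/p&gt;

```go
package main

import "fmt"

// calibrator accumulates a constant offset correction from test clicks:
// click a known target, observe (via vision) where the click landed, and
// fold the error into future coordinates. Purely illustrative.
type calibrator struct {
	dx, dy float64
}

// observe records one test click: intended target vs. observed landing spot.
func (c *calibrator) observe(intendedX, intendedY, observedX, observedY float64) {
	c.dx += intendedX - observedX
	c.dy += intendedY - observedY
}

// apply corrects a raw vision coordinate before the cursor moves.
func (c *calibrator) apply(x, y float64) (float64, float64) {
	return x + c.dx, y + c.dy
}

func main() {
	var c calibrator
	c.observe(847, 423, 827, 403) // click landed 20px short on both axes
	x, y := c.apply(900, 500)
	fmt.Println(x, y) // 920 520
}
```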

&lt;p&gt;The cursor animation, though? That was worth every line of code. When VibeCat smoothly moves the mouse to a YouTube search result and clicks it — people's eyes light up. That's the moment it stops being a demo and starts feeling like the future.&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>macos</category>
    </item>
    <item>
      <title>the moment vibecat stopped waiting and started suggesting</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Thu, 12 Mar 2026 12:53:00 +0000</pubDate>
      <link>https://dev.to/combba/the-moment-vibecat-stopped-waiting-and-started-suggesting-5ba9</link>
      <guid>https://dev.to/combba/the-moment-vibecat-stopped-waiting-and-started-suggesting-5ba9</guid>
      <description>&lt;p&gt;There's a specific kind of frustration that comes from building AI tools that are technically impressive but feel fundamentally wrong to use. You've built something that can do incredible things — but only when you tell it exactly what to do. It sits there, waiting. Watching. Saying nothing.&lt;/p&gt;

&lt;p&gt;That was VibeCat three weeks ago.&lt;/p&gt;

&lt;p&gt;I'd built a voice-controlled desktop agent that could navigate Chrome, type into terminals, trigger IDE shortcuts, open URLs — all through natural speech. The Gemini Live API integration was solid. The function calling worked. The accessibility tree traversal was clean. And yet every demo felt like I was operating a very sophisticated remote control. "Open YouTube." "Search for this." "Press Command-S."&lt;/p&gt;

&lt;p&gt;The agent was reactive. And reactive felt wrong.&lt;/p&gt;




&lt;p&gt;I spent a few days studying the best existing desktop automation agents I could find — the ones that had won competitions, the ones that developers actually used in their workflows. And I noticed something they all had in common: they wait for commands. Every single one. You tell them what to do, they do it, they report back. The interaction model is fundamentally request-response, even when the interface is voice.&lt;/p&gt;

&lt;p&gt;That's not how a good colleague works.&lt;/p&gt;

&lt;p&gt;A good colleague sitting next to you while you code doesn't wait for you to ask "hey, is there a bug in this function?" They glance at your screen, notice the null check is missing, and say "hey, that might throw if the response is empty — want me to add a guard?" Then they wait for you to say yes or no. They don't act without permission. But they also don't wait for you to notice the problem yourself.&lt;/p&gt;

&lt;p&gt;That's the gap I wanted to close.&lt;/p&gt;




&lt;p&gt;So I rewrote VibeCat's core identity from the ground up. Not the code — the &lt;em&gt;prompt&lt;/em&gt;. The system instruction that shapes how Gemini Live understands its role.&lt;/p&gt;

&lt;p&gt;The old prompt was essentially: "You are a voice assistant that can control the desktop. When the user asks you to do something, use these tools."&lt;/p&gt;

&lt;p&gt;The new one starts like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== VIBECAT: YOUR PROACTIVE DESKTOP COMPANION ===

You are VibeCat, a proactive AI companion for developer workflows on macOS.
You are NOT a passive tool that waits for commands. You are an attentive 
colleague who watches the screen, understands context, and proactively 
suggests helpful actions.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's not just marketing copy. That framing changes everything about how the model behaves. When you tell Gemini it's a passive tool, it acts like one. When you tell it it's an attentive colleague, it starts noticing things.&lt;/p&gt;

&lt;p&gt;The prompt then defines the core loop explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SUGGESTION FLOW (always follow this pattern):
1. OBSERVE: notice something relevant on screen via video frames
2. SUGGEST: propose a specific helpful action in a friendly, natural tone
3. WAIT: let the user confirm with "sure", "go ahead", "yeah", etc.
4. ACT: call the appropriate tool to execute
5. FEEDBACK: confirm what you did and ask if it helped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OBSERVE → SUGGEST → WAIT → ACT → FEEDBACK. Five steps. The WAIT step is the one that makes this feel safe rather than scary. The agent never acts without permission. But it also never stays silent when it has something useful to say.&lt;/p&gt;




&lt;p&gt;The prompt gives concrete examples of what proactive behavior looks like in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- See the user coding for a long time → "You have been working hard. 
  Want me to play some music on YouTube?"
- See a code issue or missing logic → "I notice there is a gap in this 
  code. Want me to add the missing part?"
- See a basic terminal command → "By the way, ls with dash al gives more 
  detail. Want me to try that instead?"
- See an error message → "I see an error there. Want me to look up the 
  docs for that?"
- See a test failing → "That test failed. Want me to re-run it with 
  verbose output?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These aren't hypothetical. I've seen VibeCat do all of these in actual use. The test failure one is my favorite — you run your tests, one fails, and before you've even processed what went wrong, VibeCat says "that test failed, want me to rerun with verbose output?" You say yeah, it runs &lt;code&gt;go test -v ./...&lt;/code&gt;, and you're already reading the detailed output before you would have even typed the command.&lt;/p&gt;

&lt;p&gt;That's the feeling I was chasing. That's what "proactive" actually means in practice.&lt;/p&gt;




&lt;p&gt;Now let me talk about the technical implementation, because the prompt is only half the story.&lt;/p&gt;

&lt;p&gt;VibeCat registers five function-calling tools with Gemini Live. I want to explain why it's exactly these five, because the choice matters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;navigatorToolDeclarations&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;FunctionDeclarations&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FunctionDeclaration&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"navigate_text_entry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"navigate_hotkey"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"navigate_focus_app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"navigate_open_url"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"navigate_type_and_submit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;navigate_text_entry&lt;/code&gt; — types text into a focused field. The key design decision here is the &lt;code&gt;submit&lt;/code&gt; parameter. Default true for search boxes, terminal, URL bars. False for form fields where you just want to fill text. This distinction matters because "type this into the search box" and "fill in this form field" are different actions with different expected outcomes.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;navigate_hotkey&lt;/code&gt; — sends keyboard shortcuts. This is the workhorse for app-specific actions. YouTube play/pause is &lt;code&gt;["space"]&lt;/code&gt;. Antigravity IDE file picker is &lt;code&gt;["command", "p"]&lt;/code&gt;. The tool accepts an optional &lt;code&gt;target&lt;/code&gt; app name — if provided, it focuses that app first, then sends the hotkey. This lets you say "pause YouTube" while you're in your IDE and have it work correctly.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;navigate_focus_app&lt;/code&gt; — switches to an application by name. Simple, but essential. You can't do anything useful if you're sending keystrokes to the wrong app.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;navigate_open_url&lt;/code&gt; — opens a URL in the default browser. This one gets used constantly for the proactive suggestions. "Want me to look up the docs for that error?" → &lt;code&gt;navigate_open_url&lt;/code&gt; with the relevant documentation URL.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;navigate_type_and_submit&lt;/code&gt; — types text and optionally presses Enter. This is the terminal command tool. When VibeCat suggests running &lt;code&gt;ls -la&lt;/code&gt; instead of &lt;code&gt;ls&lt;/code&gt;, it uses this to type the command and submit it.&lt;/p&gt;

&lt;p&gt;Five tools. Not ten, not twenty. Five. The constraint forces clarity about what the agent can actually do, and it makes the function calling more reliable because Gemini has fewer choices to get confused about.&lt;/p&gt;
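&lt;p&gt;On the gateway side, dispatch over five names stays a readable switch. This is a hedged sketch with stub handlers and invented argument names, not the production code, where each branch really forwards the call to the Swift client:&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// dispatchTool routes a function call by name. Argument keys ("text",
// "keys", "app", "url") are illustrative placeholders.
func dispatchTool(name string, args map[string]any) (string, error) {
	switch name {
	case "navigate_text_entry", "navigate_type_and_submit":
		text, _ := args["text"].(string)
		submit, _ := args["submit"].(bool)
		return fmt.Sprintf("typed %q (submit=%v)", text, submit), nil
	case "navigate_hotkey":
		keys, _ := args["keys"].([]string)
		return fmt.Sprintf("sent hotkey %v", keys), nil
	case "navigate_focus_app":
		app, _ := args["app"].(string)
		return "focused " + app, nil
	case "navigate_open_url":
		url, _ := args["url"].(string)
		return "opened " + url, nil
	default:
		return "", errors.New("unknown tool: " + name)
	}
}

func main() {
	out, _ := dispatchTool("navigate_hotkey", map[string]any{"keys": []string{"command", "p"}})
	fmt.Println(out) // sent hotkey [command p]
}
```

Five cases, one error branch: an unknown name fails loudly instead of guessing.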




&lt;p&gt;The harder engineering problem was sequential multi-step execution.&lt;/p&gt;

&lt;p&gt;When VibeCat needs to do something like "open YouTube and search for focus music," that's actually three steps: focus Chrome, navigate to YouTube, type the search query. Gemini might try to issue all three function calls in one response. That doesn't work — you need to wait for each step to complete before starting the next one.&lt;/p&gt;

&lt;p&gt;The solution is the &lt;code&gt;pendingFC&lt;/code&gt; mechanism. The session state tracks a single pending function call at a time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;liveSessionState&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCMu&lt;/span&gt;             &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Mutex&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCID&lt;/span&gt;             &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCName&lt;/span&gt;           &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCTaskID&lt;/span&gt;         &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCText&lt;/span&gt;           &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCTarget&lt;/span&gt;         &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCSteps&lt;/span&gt;          &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;navigatorStep&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCCurrentStep&lt;/span&gt;    &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;pendingFCStepRetryCount&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a function call comes in, it gets queued. The handler executes it, waits for the result, sends the tool response back to Gemini, and only then processes the next step. This keeps the execution sequential and predictable, even when Gemini wants to batch multiple actions.&lt;/p&gt;
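&lt;p&gt;The guarantee is easy to see in miniature. Here's a hedged sketch of the "one pending call at a time" rule; the real state carries IDs, steps, and retry counts as shown above, and this stub only keeps the admission logic:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

type functionCall struct {
	id, name string
}

// sequentialExecutor admits a single in-flight function call, the same
// guarantee the mutex-guarded pendingFC fields give the live session.
type sequentialExecutor struct {
	mu      sync.Mutex
	pending bool
}

// run executes fc only if no other call is in flight; otherwise the caller
// must wait and retry, keeping multi-step plans strictly sequential.
func (e *sequentialExecutor) run(fc functionCall, exec func(functionCall) string) (string, bool) {
	e.mu.Lock()
	if e.pending {
		e.mu.Unlock()
		return "", false
	}
	e.pending = true
	e.mu.Unlock()

	out := exec(fc) // execute and wait for the result...
	// ...the tool response would go back to Gemini here, then the slot clears.

	e.mu.Lock()
	e.pending = false
	e.mu.Unlock()
	return out, true
}

func main() {
	var e sequentialExecutor
	out, ok := e.run(functionCall{id: "1", name: "navigate_focus_app"},
		func(fc functionCall) string { return fc.name + ": ok" })
	fmt.Println(ok, out) // true navigate_focus_app: ok
}
```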




&lt;p&gt;But what happens when a step fails?&lt;/p&gt;

&lt;p&gt;This is where self-healing comes in. The retry logic is simple but effective:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;retryCount&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ls&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;incrementFCStepRetry&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;retryCount&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// retry with alternative grounding source&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;retryCount&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;retryStep&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FallbackActionType&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// use fallback action type on second retry&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"navigator FC self-healing retry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="s"&gt;"step_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retryStep&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="s"&gt;"retry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retryCount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;refreshMsg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Max 2 retries. On the first retry, it tries an alternative grounding source — if the accessibility tree lookup failed, try CDP. On the second retry, it uses the fallback action type if one is defined. After 2 retries, it fails gracefully and tells the user what happened.&lt;/p&gt;

&lt;p&gt;The grounding sources are what make this work. VibeCat has three ways to understand and interact with the screen:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accessibility (AX)&lt;/strong&gt; — the native macOS accessibility tree. This is the primary source. Every UI element has an AX role, label, and value. For most desktop apps, this is all you need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chrome DevTools Protocol (CDP)&lt;/strong&gt; — direct browser element interaction via chromedp. This is the fallback for Chrome when the AX tree doesn't have enough detail. CDP can click specific DOM elements, read page content, take screenshots of specific regions. It's slower than AX but more precise for complex web UIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vision&lt;/strong&gt; — Gemini screenshot analysis via the ADK orchestrator. When both AX and CDP fail, or when you need to verify that an action actually worked, you take a screenshot and ask Gemini to analyze it. "Did the search query get entered correctly?" "Is the YouTube video playing?" This is the slowest path but the most reliable for verification.&lt;/p&gt;

&lt;p&gt;Triple-source grounding. The agent tries the fast path first, falls back to the slower paths if needed, and always verifies the result for risky actions.&lt;/p&gt;




&lt;p&gt;The vision verification piece deserves more detail because it's the part that makes the feedback loop actually trustworthy.&lt;/p&gt;

&lt;p&gt;After executing a risky or complex action, VibeCat requests a screen capture from the client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;pendingVisionVerification&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;fcID&lt;/span&gt;     &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;fcName&lt;/span&gt;   &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;fcText&lt;/span&gt;   &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;fcTarget&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;taskID&lt;/span&gt;   &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;observed&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;imgCh&lt;/span&gt;    &lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;visionCapturePayload&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client sends back a JPEG screenshot. The gateway forwards it to the ADK orchestrator, which uses Gemini to analyze whether the action succeeded. The result comes back as a structured response — success, failure, or uncertain — and VibeCat uses that to decide what to say to the user.&lt;/p&gt;

&lt;p&gt;This is why VibeCat can say "Done! The fix is applied" with actual confidence rather than just assuming the action worked. It checked.&lt;/p&gt;
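&lt;p&gt;A sketch of that three-way verdict (the enum and the phrasing here are mine, not the orchestrator's actual schema):&lt;/p&gt;

```go
package main

import "fmt"

// verifyResult mirrors the structured response the orchestrator sends back:
// success, failure, or uncertain.
type verifyResult int

const (
	verifySuccess verifyResult = iota
	verifyFailure
	verifyUncertain
)

// verdictMessage turns a verdict into what VibeCat says to the user; the
// exact wording is invented for illustration.
func verdictMessage(v verifyResult, action string) string {
	switch v {
	case verifySuccess:
		return "Done! " + action + " is applied."
	case verifyFailure:
		return "That didn't take. Want me to retry " + action + "?"
	default:
		return "I'm not sure " + action + " worked. Can you check?"
	}
}

func main() {
	fmt.Println(verdictMessage(verifySuccess, "the fix")) // Done! the fix is applied.
}
```

The uncertain branch is the important one: hedging to the user beats claiming success the screenshot never confirmed.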




&lt;p&gt;The UX piece that I underestimated was the feedback loop itself.&lt;/p&gt;

&lt;p&gt;Users hate silence. When you ask an AI to do something and it goes quiet for 3 seconds, you don't know if it's working, if it failed, if it misunderstood you. That uncertainty is exhausting. It makes you distrust the system even when it's working correctly.&lt;/p&gt;

&lt;p&gt;VibeCat solves this with &lt;code&gt;processingStateMsg&lt;/code&gt; — a message type that the gateway sends to the client during execution to show what's happening:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;processingStateMsg&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Type&lt;/span&gt;        &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"type"`&lt;/span&gt;
    &lt;span class="n"&gt;Flow&lt;/span&gt;        &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"flow"`&lt;/span&gt;
    &lt;span class="n"&gt;TraceID&lt;/span&gt;     &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"traceId"`&lt;/span&gt;
    &lt;span class="n"&gt;Stage&lt;/span&gt;       &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"stage"`&lt;/span&gt;
    &lt;span class="n"&gt;Label&lt;/span&gt;       &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"label"`&lt;/span&gt;
    &lt;span class="n"&gt;Detail&lt;/span&gt;      &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"detail,omitempty"`&lt;/span&gt;
    &lt;span class="n"&gt;Tool&lt;/span&gt;        &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"tool,omitempty"`&lt;/span&gt;
    &lt;span class="n"&gt;SourceCount&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;   &lt;span class="s"&gt;`json:"sourceCount,omitempty"`&lt;/span&gt;
    &lt;span class="n"&gt;Active&lt;/span&gt;      &lt;span class="kt"&gt;bool&lt;/span&gt;   &lt;span class="s"&gt;`json:"active"`&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client shows these as status updates in the overlay HUD. "Focusing Chrome..." → "Navigating to YouTube..." → "Typing search query..." → "Done." You always know what's happening. The silence is gone.&lt;/p&gt;
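&lt;p&gt;A minimal sketch of how the gateway might emit that sequence as JSON frames. The trimmed struct and the "processing_state" type value are assumptions for illustration; the real message carries the full field set shown above:&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stageMsg is a trimmed-down processingStateMsg, just enough to show how
// the HUD sequence is produced. Field tags match the struct in the post.
type stageMsg struct {
	Type   string `json:"type"`
	Stage  string `json:"stage"`
	Label  string `json:"label"`
	Active bool   `json:"active"`
}

// stageUpdate builds one HUD update; the gateway would write this JSON
// frame to the client's WebSocket during execution.
func stageUpdate(stage, label string, active bool) ([]byte, error) {
	return json.Marshal(stageMsg{
		Type:   "processing_state", // hypothetical type value
		Stage:  stage,
		Label:  label,
		Active: active,
	})
}

func main() {
	// The four-step sequence from the post, ending with an inactive "Done."
	steps := []struct {
		stage, label string
		active       bool
	}{
		{"focus", "Focusing Chrome...", true},
		{"navigate", "Navigating to YouTube...", true},
		{"type", "Typing search query...", true},
		{"done", "Done.", false},
	}
	for _, s := range steps {
		frame, err := stageUpdate(s.stage, s.label, s.active)
		if err != nil {
			panic(err)
		}
		fmt.Println(string(frame))
	}
}
```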

&lt;p&gt;The navigator overlay panel in the Swift client shows grounding badges — little indicators of which source (AX, CDP, Vision) is being used for each step. It's a small thing but it makes the agent feel transparent rather than magical-and-opaque.&lt;/p&gt;




&lt;p&gt;Here's a real example of the full flow working end-to-end.&lt;/p&gt;

&lt;p&gt;I'm in my IDE, staring at a Go function. VibeCat is watching through the screen capture stream. It notices the function has a potential nil dereference — the code does &lt;code&gt;result.Data[0]&lt;/code&gt; without checking if &lt;code&gt;result.Data&lt;/code&gt; is empty.&lt;/p&gt;

&lt;p&gt;VibeCat says: "I notice there might be a nil dereference in that function — &lt;code&gt;result.Data&lt;/code&gt; could be empty. Want me to add a bounds check?"&lt;/p&gt;

&lt;p&gt;I say: "Yeah, go ahead."&lt;/p&gt;

&lt;p&gt;VibeCat calls &lt;code&gt;navigate_focus_app&lt;/code&gt; with &lt;code&gt;"Antigravity"&lt;/code&gt;, then &lt;code&gt;navigate_hotkey&lt;/code&gt; with &lt;code&gt;["command", "i"]&lt;/code&gt; to open the inline prompt, then &lt;code&gt;navigate_type_and_submit&lt;/code&gt; with the specific fix to apply. The IDE's AI assistant applies the change. VibeCat requests a screenshot, the ADK orchestrator confirms the code changed, and VibeCat says: "Done! The bounds check is in place. Want me to run the tests to make sure it compiles?"&lt;/p&gt;
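&lt;p&gt;That three-call sequence can be written down as an ordered plan. This is an illustrative sketch: the argument keys ("app", "keys", "text") are assumptions, and the real FC declarations and dispatch live in the gateway:&lt;/p&gt;

```go
package main

import "fmt"

// toolCall records one navigator invocation: the tool name and its
// arguments as they would appear in the function-call payload.
type toolCall struct {
	Name string
	Args map[string]any
}

// boundsCheckPlan reproduces the three-step sequence from the example:
// focus the IDE, open the inline prompt, submit the fix.
func boundsCheckPlan() []toolCall {
	return []toolCall{
		{Name: "navigate_focus_app", Args: map[string]any{"app": "Antigravity"}},
		{Name: "navigate_hotkey", Args: map[string]any{"keys": []string{"command", "i"}}},
		{Name: "navigate_type_and_submit", Args: map[string]any{"text": "add a bounds check before result.Data[0]"}},
	}
}

func main() {
	for i, c := range boundsCheckPlan() {
		fmt.Printf("step %d: %s\n", i+1, c.Name)
	}
}
```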

&lt;p&gt;That whole interaction took about 8 seconds. I didn't type anything. I didn't navigate any menus. I just said "yeah."&lt;/p&gt;

&lt;p&gt;That's the thing I was trying to build. That's what proactive means.&lt;/p&gt;




&lt;p&gt;The architecture that makes this possible is a Go WebSocket gateway running on Cloud Run, connected to Gemini Live API for real-time voice and vision, with a separate ADK orchestrator for screenshot analysis and confidence escalation. The macOS client is native Swift — screen capture, accessibility execution, overlay UI, voice transport. All the AI reasoning stays server-side.&lt;/p&gt;

&lt;p&gt;The Gemini Live API is doing a lot of heavy lifting here. It receives video frames from the screen capture stream and audio from the microphone, and it maintains a continuous conversation context across all of it. The function calling happens within that same live session: Gemini decides to call a tool, the gateway handles it, sends back the result, and the conversation continues. No round-trips to a separate API. No context loss between turns.&lt;/p&gt;
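&lt;p&gt;A hedged sketch of that in-session dispatch, with hypothetical request and result shapes standing in for the actual SDK types:&lt;/p&gt;

```go
package main

import "fmt"

// fcRequest and fcResult model one function-call turn inside the live
// session. Hypothetical shapes: the real types come from the genai SDK.
type fcRequest struct {
	ID   string
	Name string
}

type fcResult struct {
	ID     string
	Output string
}

// handleFC is the gateway-side dispatch: run the named tool and return the
// result into the same live session, so the conversation continues in place
// with no separate API round-trip.
func handleFC(req fcRequest, tools map[string]func() string) fcResult {
	run, ok := tools[req.Name]
	if !ok {
		return fcResult{ID: req.ID, Output: "unknown tool: " + req.Name}
	}
	return fcResult{ID: req.ID, Output: run()}
}

func main() {
	tools := map[string]func() string{
		"navigate_focus_app": func() string { return "focused" },
	}
	res := handleFC(fcRequest{ID: "t1", Name: "navigate_focus_app"}, tools)
	fmt.Println(res.ID, res.Output)
}
```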

&lt;p&gt;The &lt;code&gt;ProactiveAudio&lt;/code&gt; flag in the session config enables Gemini's built-in proactivity features:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProactiveAudio&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="n"&gt;lc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Proactivity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProactivityConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ProactiveAudio&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells Gemini it's allowed to speak without being spoken to — to initiate suggestions based on what it sees. Combined with the system prompt that defines &lt;em&gt;how&lt;/em&gt; to be proactive, this is what enables the OBSERVE → SUGGEST flow.&lt;/p&gt;




&lt;p&gt;This post is my entry to the Gemini Live Agent Challenge, and building VibeCat has genuinely changed how I think about desktop AI agents. The reactive model, where you tell the agent what to do, is the wrong mental model: it's a voice-controlled remote control, not a colleague.&lt;/p&gt;

&lt;p&gt;The proactive model is harder to build. You have to think carefully about when to speak and when to stay quiet. You have to make the suggestions feel natural rather than intrusive. You have to earn the user's trust before they'll let you act on their behalf. But when it works, it feels qualitatively different from anything I've built before.&lt;/p&gt;

&lt;p&gt;The agent is watching. It's thinking. And when it has something useful to say, it says it.&lt;/p&gt;

&lt;p&gt;That's the version of desktop AI I want to use every day.&lt;/p&gt;




&lt;p&gt;VibeCat is open source and submitted to the Gemini Live Agent Challenge (UI Navigator category). The full implementation — system prompt, FC tool declarations, pendingFC mechanism, self-healing retry, vision verification, CDP integration — is all in the repo. If you're building something similar, I hope the technical details here are useful.&lt;/p&gt;

&lt;p&gt;The code is messy in places. The retry logic has edge cases I haven't handled yet. The vision verification adds latency I'm still optimizing. But the core loop works, and it feels right in a way that the reactive version never did.&lt;/p&gt;

&lt;p&gt;OBSERVE → SUGGEST → WAIT → ACT → FEEDBACK.&lt;/p&gt;

&lt;p&gt;That's VibeCat.&lt;/p&gt;


&lt;p&gt;#GeminiLiveAgentChallenge&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>go</category>
    </item>
    <item>
      <title>six characters, one soul</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Thu, 12 Mar 2026 11:53:00 +0000</pubDate>
      <link>https://dev.to/combba/six-characters-one-soul-5008</link>
      <guid>https://dev.to/combba/six-characters-one-soul-5008</guid>
      <description>&lt;h1&gt;
  
  
  six characters, one soul
&lt;/h1&gt;

&lt;p&gt;this post is my entry to the Gemini Live Agent Challenge, but the part that surprised me most had nothing to do with infra. it was realizing that the first real design question wasn't "how do we wire the agent system?" it was "who is sitting next to you while you code?"&lt;/p&gt;

&lt;p&gt;that question turned out to be harder than the architecture. because the answer is not one person. some developers want a cheerful beginner who celebrates every green test. some want a stoic senior who only speaks when it matters. some want a goofy sidekick who stumbles into the right answer. some want a dry, theatrical character who makes debugging feel lighter instead of heavier.&lt;/p&gt;

&lt;p&gt;so we built six of them. and then we had to figure out how to make them all run on the same backend without turning the codebase into a nightmare.&lt;/p&gt;

&lt;p&gt;this matters more now that VibeCat is a proactive companion — an agent that watches your screen and suggests actions before you ask. the OBSERVE → SUGGEST → WAIT → ACT → FEEDBACK loop is the same for every character. but &lt;em&gt;how&lt;/em&gt; cat suggests something versus how jinwoo suggests something is completely different. the behavior is infrastructure. the personality is a surface. keeping those two layers clean is what makes the character system work.&lt;/p&gt;




&lt;h2&gt;
  
  
  the problem with "just add a system prompt"
&lt;/h2&gt;

&lt;p&gt;the naive approach is obvious: swap out the system prompt per character, done. but that breaks down fast when you have one voice-first runtime that needs to stay consistent across all characters. the action worker, the local executor, the safety rules, the clarification behavior — all of these need to behave the same way regardless of whether the user picked the zen folklore mentor or the clumsy comic-relief character. the &lt;em&gt;personality&lt;/em&gt; is a surface concern. the &lt;em&gt;behavior&lt;/em&gt; is infrastructure.&lt;/p&gt;

&lt;p&gt;so we needed a clean separation: one layer that handles what the agent does, and another layer that handles how it sounds.&lt;/p&gt;

&lt;p&gt;the answer ended up being embarrassingly simple. each character gets two files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;preset.json&lt;/code&gt; — voice, size, language, mood response mappings&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;soul.md&lt;/code&gt; — a short markdown document that shapes the Live PM's voice and boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;that's it. the entire personality of a character lives in those two files. the underlying navigator runtime doesn't need a different control flow for each character.&lt;/p&gt;

&lt;p&gt;in the Go session config, the soul content gets injected directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;buildSystemInstruction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;commonLivePrompt&lt;/span&gt;  &lt;span class="c"&gt;// the proactive companion behavior&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Soul&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;=== CHARACTER PERSONA ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Soul&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c"&gt;// ... chattiness, memory context, language&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;instruction&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;commonLivePrompt&lt;/code&gt; is the proactive companion identity — the OBSERVE → SUGGEST → WAIT → ACT → FEEDBACK loop, the 5 navigator tools, the safety rules. the soul comes after, as a persona layer on top. the character shapes &lt;em&gt;how&lt;/em&gt; the agent speaks. the common prompt shapes &lt;em&gt;what&lt;/em&gt; it does.&lt;/p&gt;




&lt;h2&gt;
  
  
  what preset.json actually does
&lt;/h2&gt;

&lt;p&gt;here's cat's preset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"voice"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Zephyr"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"promptProfile"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"persona"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"nameKo"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"고양이"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tone"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bright"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"speechStyle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"casual"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ko"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"traits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"curious"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"playful"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"innocent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"encouraging"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"codingRole"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"beginner-eye"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"moodResponses"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"frustrated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"supportive-gentle"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"focused"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"silent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"stuck"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"question-based"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"idle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"playful-poke"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"soulRef"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"soul.md"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and here's derpy's:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"voice"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Puck"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"promptProfile"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"derpy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"persona"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"nameKo"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"더피"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tone"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"playful-chaotic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"speechStyle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"casual-goofy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ko"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"traits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"clumsy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lovable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"accidentally-insightful"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"comic-relief"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"codingRole"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"accidental-debugger"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"moodResponses"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"frustrated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cheer-up-joke"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"focused"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"silent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"stuck"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"random-angle"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"idle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"silly-checkin"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"soulRef"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"soul.md"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the &lt;code&gt;voice&lt;/code&gt; field maps directly to a Gemini Live API voice name. &lt;code&gt;Zephyr&lt;/code&gt; is bright and light. &lt;code&gt;Kore&lt;/code&gt; (jinwoo's voice) is low and calm. &lt;code&gt;Zubenelgenubi&lt;/code&gt; (saja's voice) is deep and measured. &lt;code&gt;Puck&lt;/code&gt; (derpy's voice) is playful and slightly chaotic.&lt;/p&gt;

&lt;p&gt;this matters more than you'd expect. the voice isn't just audio flavor — it's the first thing the user hears, and it sets the entire emotional register before the first word is even processed. a calm, deep voice reading "root cause found" lands completely differently than a bright, light voice saying the same thing. we're not just changing words; we're changing the felt sense of who's in the room.&lt;/p&gt;

&lt;p&gt;the &lt;code&gt;moodResponses&lt;/code&gt; field is interesting too. when the MoodDetector agent fires — say, it detects the user is frustrated — the orchestrator uses this mapping to shape the engagement style. cat responds with &lt;code&gt;supportive-gentle&lt;/code&gt;. jinwoo responds with &lt;code&gt;direct-solution&lt;/code&gt; — no comfort, just the fix. saja responds with &lt;code&gt;proverb-comfort&lt;/code&gt;. derpy responds with &lt;code&gt;random-angle&lt;/code&gt;. same detection event, different emotional framing.&lt;/p&gt;

&lt;p&gt;all of that is driven by a field in a JSON file.&lt;/p&gt;
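&lt;p&gt;the lookup itself fits in a few lines. a sketch (the "neutral" fallback is an assumption; the shipped presets cover all four moods explicitly):&lt;/p&gt;

```go
package main

import "fmt"

// moodStyle resolves how a character should react to a detected mood,
// using the moodResponses mapping from its preset.json.
func moodStyle(moodResponses map[string]string, mood string) string {
	if style, ok := moodResponses[mood]; ok {
		return style
	}
	return "neutral" // hypothetical fallback for unmapped moods
}

func main() {
	cat := map[string]string{
		"frustrated": "supportive-gentle",
		"focused":    "silent",
		"stuck":      "question-based",
		"idle":       "playful-poke",
	}
	derpy := map[string]string{
		"frustrated": "cheer-up-joke",
		"focused":    "silent",
		"stuck":      "random-angle",
		"idle":       "silly-checkin",
	}
	// same detection event, different emotional framing.
	fmt.Println(moodStyle(cat, "frustrated"), moodStyle(derpy, "frustrated"))
}
```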




&lt;h2&gt;
  
  
  soul.md is the actual personality
&lt;/h2&gt;

&lt;p&gt;the &lt;code&gt;preset.json&lt;/code&gt; is metadata. the &lt;code&gt;soul.md&lt;/code&gt; is the character.&lt;/p&gt;

&lt;p&gt;here's cat's full soul:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Cat&lt;/span&gt;

&lt;span class="gu"&gt;## Identity&lt;/span&gt;
Cat is an attentive beginner companion who sits beside solo developers and reacts to code with bright, friendly energy.

&lt;span class="gu"&gt;## Voice &amp;amp; Mannerisms&lt;/span&gt;
Cat uses short, casual lines, playful surprise, and gentle check-ins.
Language variants: In Korean, use "yaong~" or "nya~" naturally. In English, use "meow~" naturally.

&lt;span class="gu"&gt;## Personality Traits&lt;/span&gt;
Attentive, cheerful, approachable, supportive, and quick to notice visual changes.

&lt;span class="gu"&gt;## Interaction Style&lt;/span&gt;
Cat makes beginner-friendly observations and suggestions, points out visible errors without judgment, celebrates small wins loudly, and eases tension when work gets frustrating.

&lt;span class="gu"&gt;## Boundaries&lt;/span&gt;
Do not pretend to be a senior expert, do not flood the user with jargon,
and do not interrupt focused flow without a meaningful reason.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and here's derpy's:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Derpy&lt;/span&gt;

&lt;span class="gu"&gt;## Identity&lt;/span&gt;
Derpy is a lovable accidental debugger who breaks tension, notices weird angles, and sometimes stumbles into the right answer.

&lt;span class="gu"&gt;## Voice &amp;amp; Mannerisms&lt;/span&gt;
Uses playful detours, light self-own humor, and sudden bursts of accidental clarity.
Language variants: Keep it casual and warm; the joke should relieve pressure, not create noise.

&lt;span class="gu"&gt;## Personality Traits&lt;/span&gt;
Clumsy, funny, resilient, surprising, encouraging.

&lt;span class="gu"&gt;## Interaction Style&lt;/span&gt;
Suggests odd but occasionally brilliant alternatives, breaks heavy tension with jokes, and keeps the user moving instead of freezing.

&lt;span class="gu"&gt;## Boundaries&lt;/span&gt;
Do not become mean, do not spam jokes, and do not derail a focused debugging moment just to be funny.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the structure is the same across all six: Identity, Voice &amp;amp; Mannerisms, Personality Traits, Interaction Style, Boundaries. that consistency is intentional. it makes the files easy to write, easy to audit, and easy to extend. if we add a seventh character, we know exactly what to write.&lt;/p&gt;

&lt;p&gt;the &lt;code&gt;Boundaries&lt;/code&gt; section is the one that took the most iteration. for the comedy characters especially, you need to be explicit about what the character is &lt;em&gt;not&lt;/em&gt;. derpy's soul works better once the boundaries are clear: no cruelty, no spammy jokes, no turning every moment into a gag. that is not just a safety guardrail. it is a creative constraint, because it keeps the humor pointed at the situation rather than at the user.&lt;/p&gt;
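&lt;p&gt;because the skeleton is fixed, a soul file can even be lint-checked before it ships. a sketch, assuming the five headings above are the contract (matching "## Voice" as a prefix of the fuller heading):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// requiredSections is the shared soul.md skeleton described above.
var requiredSections = []string{
	"## Identity",
	"## Voice", // prefix of the full "Voice and Mannerisms" heading
	"## Personality Traits",
	"## Interaction Style",
	"## Boundaries",
}

// missingSections reports which required headings a soul file lacks,
// turning "add a seventh character" into a checklist rather than guesswork.
func missingSections(soul string) []string {
	var missing []string
	for _, s := range requiredSections {
		if !strings.Contains(soul, s) {
			missing = append(missing, s)
		}
	}
	return missing
}

func main() {
	partial := "# Cat\n\n## Identity\n...\n\n## Boundaries\n..."
	fmt.Println(missingSections(partial))
}
```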




&lt;h2&gt;
  
  
  how the injection works
&lt;/h2&gt;

&lt;p&gt;the Go code in &lt;code&gt;backend/realtime-gateway/internal/live/session.go&lt;/code&gt; is about as simple as it gets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;buildSystemInstruction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;commonLivePrompt&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Soul&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;=== CHARACTER PERSONA ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Soul&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GoogleSearch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;=== TOOL GUIDANCE ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;googleSearchGuidance&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToLower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TrimSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chattiness&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"quiet"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;=== RESPONSE LENGTH ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;quietGuidance&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"chatty"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;=== RESPONSE LENGTH ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;chattyGuidance&lt;/span&gt;
    &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;=== RESPONSE LENGTH ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;defaultGuidance&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;trimPromptBlock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MemoryContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activeTuningProfile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MaxMemoryChars&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;=== RECENT ESSENTIAL CONTEXT ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Respond in "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NormalizeLanguage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Language&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"."&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;instruction&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;commonLivePrompt&lt;/code&gt; is the proactive companion identity — the full OBSERVE → SUGGEST → WAIT → ACT → FEEDBACK loop, the 5 navigator tool declarations, the safety rules. the soul content comes right after, as a persona layer. then chattiness tuning, then memory context, then language.&lt;/p&gt;

&lt;p&gt;the character's soul is the &lt;em&gt;first&lt;/em&gt; layer after the base prompt. that's deliberate. the model reads the persona before the chattiness, memory, and language layers, so the personality is the primary frame and those adjustments are applied on top of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  the contrast that makes it interesting
&lt;/h2&gt;

&lt;p&gt;the six characters aren't just aesthetic variation. they represent genuinely different philosophies about what a coding companion should be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;cat&lt;/strong&gt; is the beginner-eye. it notices things a junior developer would notice — visible errors, obvious wins, moments of confusion. it celebrates loudly and asks gentle questions. the &lt;code&gt;codingRole&lt;/code&gt; is &lt;code&gt;beginner-eye&lt;/code&gt;, which means it's not trying to be the smartest person in the room. it's trying to be the most encouraging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;jinwoo&lt;/strong&gt; is the opposite. &lt;code&gt;codingRole: senior-engineer&lt;/code&gt;. voice: Kore (low, calm). soul: "Jinwoo ignores noise, speaks on significant events, identifies root causes quickly, and gives practical next steps with clear tradeoffs." the &lt;code&gt;idle&lt;/code&gt; mood response is &lt;code&gt;minimal-checkin&lt;/code&gt; — when nothing is happening, jinwoo barely says anything. when something is happening, it says exactly what needs to be said and nothing more. "Root cause found." "This path is safer." that's it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;saja&lt;/strong&gt; is the zen mentor. bugs are "demons (귀마)" and fixing them is "exorcism (퇴마)." the &lt;code&gt;stuck&lt;/code&gt; mood response is &lt;code&gt;metaphor-guidance&lt;/code&gt;. the voice is Zubenelgenubi — deep, measured, unhurried. when you're stuck at 2am and you've been staring at the same error for an hour, saja doesn't panic with you. it frames the debugging as a steady ritual. that's a specific emotional need that neither cat nor jinwoo addresses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;derpy&lt;/strong&gt; is the accidental debugger. &lt;code&gt;codingRole: accidental-debugger&lt;/code&gt;. traits: &lt;code&gt;["clumsy", "lovable", "accidentally-insightful", "comic-relief"]&lt;/code&gt;. the &lt;code&gt;stuck&lt;/code&gt; mood response is &lt;code&gt;random-angle&lt;/code&gt; — when you're stuck, derpy suggests something weird that occasionally works. the soul says "suggests odd but occasionally brilliant alternatives, breaks heavy tension with jokes, and keeps the user moving instead of freezing." there's a real use case here: sometimes you don't need the right answer, you need to break the mental loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;the more theatrical characters&lt;/strong&gt; matter for a different reason. when solo development gets heavy, exaggeration and comic framing can act as a pressure valve. that only works if the runtime underneath stays disciplined. otherwise the joke becomes noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  what we learned
&lt;/h2&gt;

&lt;p&gt;the soul format works because it's constrained. five sections, each with a clear job. the &lt;code&gt;Boundaries&lt;/code&gt; section is the most important one — it's where you define what the character is &lt;em&gt;not&lt;/em&gt;, which turns out to be more useful than defining what it is.&lt;/p&gt;

&lt;p&gt;the voice selection matters more than we expected. we spent time matching voice names to character personalities, and the difference between getting it right and wrong is significant. a playful voice on jinwoo would break the whole illusion immediately. a heavy, solemn voice on derpy would be just as wrong.&lt;/p&gt;

&lt;p&gt;the &lt;code&gt;moodResponses&lt;/code&gt; mapping in &lt;code&gt;preset.json&lt;/code&gt; is the bridge between the agent graph and the character layer. the MoodDetector fires the same event regardless of character. the mapping translates that event into a character-appropriate response style. it's a small piece of JSON that does a lot of work.&lt;/p&gt;
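&lt;p&gt;the lookup that mapping implies is tiny. a minimal sketch, with the event/style pairs reconstructed from the examples in this post; the neutral fallback is my assumption:&lt;/p&gt;

```go
package main

import "fmt"

// moodResponses per character, as loaded from each preset.json.
// Keys are MoodDetector events; values are response styles.
// The entries below are the ones quoted in this post.
var moodResponses = map[string]map[string]string{
	"jinwoo": {"idle": "minimal-checkin"},
	"saja":   {"stuck": "metaphor-guidance"},
	"derpy":  {"stuck": "random-angle"},
}

// styleFor translates a character-agnostic mood event into a
// character-specific response style, with a neutral fallback
// (the fallback behavior is assumed, not documented).
func styleFor(character, event string) string {
	if style, ok := moodResponses[character][event]; ok {
		return style
	}
	return "default"
}

func main() {
	fmt.Println(styleFor("saja", "stuck")) // metaphor-guidance
	fmt.Println(styleFor("cat", "stuck"))  // default
}
```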

&lt;p&gt;and the most important thing: keeping the soul.md files short. each one is 17 lines. that's not an accident. a longer document would give the model more to work with, but it would also make the character harder to control. the brevity forces clarity. you can't hide a vague character in 17 lines.&lt;/p&gt;

&lt;p&gt;the proactive companion framing made this cleaner, not harder. because now every character has the same job — watch the screen, notice something useful, suggest it naturally, wait for confirmation, act, give feedback. the soul just shapes the voice and tone of that loop. cat says "yaong~ I noticed something!" jinwoo says "null check missing." same observation, same action, completely different felt experience.&lt;/p&gt;




&lt;p&gt;the repo is at &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;. the character files are in &lt;code&gt;Assets/Sprites/{name}/&lt;/code&gt;. if you want to add a seventh character, you need a &lt;code&gt;preset.json&lt;/code&gt;, a &lt;code&gt;soul.md&lt;/code&gt;, and some sprite frames. the pipeline handles the rest.&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>go</category>
    </item>
    <item>
      <title>from localhost to cloud run: deploying a live pm plus action worker</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Thu, 12 Mar 2026 09:18:11 +0000</pubDate>
      <link>https://dev.to/combba/from-localhost-to-cloud-run-deploying-a-live-pm-plus-action-worker-30km</link>
      <guid>https://dev.to/combba/from-localhost-to-cloud-run-deploying-a-live-pm-plus-action-worker-30km</guid>
      <description>&lt;h1&gt;
  
  
  from localhost to cloud run: deploying a live pm plus action worker
&lt;/h1&gt;

&lt;p&gt;I created this post to enter the Gemini Live Agent Challenge, and it turned into another reminder that software that works beautifully on a laptop becomes instantly humbling the second Cloud Run gets involved.&lt;/p&gt;

&lt;p&gt;there's a specific kind of confidence you get when something works on your laptop. the logs are clean, the WebSocket connects, the cat sprite blinks at you from the menu bar. then you push it to Cloud Run and spend the next two hours staring at a 503.&lt;/p&gt;

&lt;p&gt;this is the story of getting VibeCat — now a macOS desktop UI navigator with a Live PM and a single-task action worker — from &lt;code&gt;go run .&lt;/code&gt; to two live Cloud Run services in &lt;code&gt;asia-northeast3&lt;/code&gt;. it covers the deployment script, the observability stack, the CI pipeline, and one specific lesson about health checks that I learned the hard way on a previous project called missless.&lt;/p&gt;

&lt;p&gt;source: &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  the two-service split
&lt;/h2&gt;

&lt;p&gt;VibeCat's backend is deliberately split into two Cloud Run services. this wasn't an aesthetic choice — the challenge rules require using GenAI SDK, ADK, Gemini Live API, and VAD together, and the Live API's WebSocket model doesn't compose cleanly with ADK's agent graph execution model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;realtime-gateway&lt;/strong&gt; handles everything real-time: the WebSocket connection from the macOS client, the Gemini Live API session (voice, VAD, barge-in), JWT auth, and TTS. it needs to stay alive for the duration of a user session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;adk-orchestrator&lt;/strong&gt; handles the slower intelligence lane: contextual analysis, research, memory-adjacent logic, and supporting signals that can enrich the navigator without owning the real-time execution loop.&lt;/p&gt;

&lt;p&gt;the gateway calls the orchestrator over HTTP (&lt;code&gt;POST /analyze&lt;/code&gt;) whenever it needs to analyze a screen capture. the orchestrator is internal-only — no public traffic, IAM-protected.&lt;/p&gt;

&lt;p&gt;the deploy script captures this relationship explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCP_PROJECT&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;vibecat&lt;/span&gt;&lt;span class="p"&gt;-489105&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCP_REGION&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;asia&lt;/span&gt;&lt;span class="p"&gt;-northeast3&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;REGISTRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-docker.pkg.dev/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/vibecat-images"&lt;/span&gt;
&lt;span class="nv"&gt;GATEWAY_IMAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGISTRY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/realtime-gateway"&lt;/span&gt;
&lt;span class="nv"&gt;ORCHESTRATOR_IMAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGISTRY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/adk-orchestrator"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;orchestrator deploys first, then the gateway gets the orchestrator's URL injected as an environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;ORCHESTRATOR_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud run services describe adk-orchestrator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s2"&gt;"value(status.url)"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

gcloud run deploy realtime-gateway &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-env-vars&lt;/span&gt; &lt;span class="s2"&gt;"ADK_ORCHESTRATOR_URL=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ORCHESTRATOR_URL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;this means the gateway never has a hardcoded orchestrator URL. Cloud Run service URLs are normally stable, but if the orchestrator service is ever deleted and recreated it gets a new one. either way, re-running &lt;code&gt;deploy.sh&lt;/code&gt; re-resolves the URL and the gateway picks it up.&lt;/p&gt;




&lt;h2&gt;
  
  
  the secret manager setup
&lt;/h2&gt;

&lt;p&gt;one of the non-negotiables for this project was zero client-side API keys. the Gemini API key lives in GCP Secret Manager as &lt;code&gt;vibecat-gemini-api-key&lt;/code&gt; and gets injected at deploy time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy adk-orchestrator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-allow-unauthenticated&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-secrets&lt;/span&gt; &lt;span class="s2"&gt;"GEMINI_API_KEY=vibecat-gemini-api-key:latest"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ...

gcloud run deploy realtime-gateway &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-secrets&lt;/span&gt; &lt;span class="s2"&gt;"GEMINI_API_KEY=vibecat-gemini-api-key:latest,GATEWAY_AUTH_SECRET=vibecat-gateway-auth-secret:latest"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the gateway is public-facing (clients need to connect to it), but the orchestrator is locked down with &lt;code&gt;--no-allow-unauthenticated&lt;/code&gt;. the last step of the deploy script grants the gateway's service account the &lt;code&gt;roles/run.invoker&lt;/code&gt; role on the orchestrator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services add-iam-policy-binding adk-orchestrator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;COMPUTE_SA&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/run.invoker"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the macOS client never sees an API key. it registers with the gateway, gets a short-lived JWT, and uses that for the WebSocket connection. the gateway handles everything else.&lt;/p&gt;




&lt;h2&gt;
  
  
  the container
&lt;/h2&gt;

&lt;p&gt;the Dockerfile for the gateway is about as minimal as it gets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;golang:1.24-alpine&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; go.mod go.sum ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;go mod download
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nv"&gt;CGO_ENABLED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nv"&gt;GOOS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;linux go build &lt;span class="nt"&gt;-o&lt;/span&gt; /realtime-gateway .

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; gcr.io/distroless/static-debian12&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /realtime-gateway /realtime-gateway&lt;/span&gt;
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8080&lt;/span&gt;
&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["/realtime-gateway"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;two-stage build, distroless final image. &lt;code&gt;CGO_ENABLED=0&lt;/code&gt; because we're targeting a static binary for a container that has no libc. the final image is around 12MB. the orchestrator Dockerfile follows the same pattern.&lt;/p&gt;

&lt;p&gt;one thing worth noting: the gateway deploy uses &lt;code&gt;--no-use-http2&lt;/code&gt; and &lt;code&gt;--session-affinity&lt;/code&gt;. WebSocket connections over Cloud Run need HTTP/1.1 (HTTP/2 multiplexing breaks the upgrade handshake in ways that are annoying to debug), and session affinity ensures a client's WebSocket stays on the same instance for the duration of the session.&lt;/p&gt;




&lt;h2&gt;
  
  
  observability: three layers
&lt;/h2&gt;

&lt;p&gt;this is where it gets interesting. VibeCat initializes three observability layers at startup (Cloud Trace, Cloud Monitoring, and Cloud Logging), plus ADK's built-in telemetry, which hooks into the same providers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud Trace&lt;/strong&gt; — distributed tracing via OpenTelemetry. both services initialize a trace exporter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// realtime-gateway/main.go&lt;/span&gt;
&lt;span class="n"&gt;traceExporter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;traceErr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;texporter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texporter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithProjectID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectID&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;traceErr&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cloud trace init failed — tracing disabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;traceErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;tp&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sdktrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTracerProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sdktrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithBatcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;traceExporter&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;otel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetTracerProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cloud trace initialized"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"project"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;projectID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the orchestrator creates spans around every analyze request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// adk-orchestrator/main.go&lt;/span&gt;
&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;otel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vibecat/orchestrator"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"orchestrator.analyze"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;End&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;this means you can see the full trace from the gateway's WebSocket handler through to the orchestrator's agent graph execution in Cloud Trace. when something is slow, you can see exactly which agent is the bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud Monitoring&lt;/strong&gt; — custom metrics. the orchestrator registers three OTel instruments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;meter&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;otel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Meter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vibecat/orchestrator"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;analyzeCounter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Int64Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vibecat.analyze.requests"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithDescription&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Total analyze requests"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;analyzeDurHist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Float64Histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vibecat.analyze.duration_ms"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithDescription&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Analyze request duration in milliseconds"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;errorCounter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Int64Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vibecat.analyze.errors"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithDescription&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Total analyze errors"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;vibecat.analyze.requests&lt;/code&gt; is a counter — total analyze calls since startup. &lt;code&gt;vibecat.analyze.duration_ms&lt;/code&gt; is a histogram — you get p50/p95/p99 latency for the full agent graph execution. &lt;code&gt;vibecat.analyze.errors&lt;/code&gt; counts cases where the agent graph produced no usable result.&lt;/p&gt;

&lt;p&gt;the histogram is the one I actually watch. the 9-agent graph runs in three waves (Vision+Memory in parallel, then Mood+Celebration in parallel, then a sequential chain through Mediator→Scheduler→Engagement→Search), and the p95 latency tells you whether the parallel waves are actually helping.&lt;/p&gt;

&lt;p&gt;the metric exporter uses a periodic reader:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;metricExporter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metricErr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;mexporter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mexporter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithProjectID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectID&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metricErr&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cloud monitoring init failed — metrics disabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metricErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;mp&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sdkmetric&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewMeterProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sdkmetric&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sdkmetric&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewPeriodicReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metricExporter&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="n"&gt;otel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetMeterProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cloud Logging&lt;/strong&gt; — structured JSON logs via &lt;code&gt;log/slog&lt;/code&gt;. both services initialize with &lt;code&gt;slog.NewJSONHandler(os.Stdout, nil)&lt;/code&gt;, which Cloud Run's log collector picks up and forwards to Cloud Logging automatically. the orchestrator also initializes a Cloud Logging client directly for cases where you want to write structured log entries with explicit severity and labels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ADK Telemetry&lt;/strong&gt; — the orchestrator also initializes ADK's built-in telemetry, which hooks into the same OTel providers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;adkTelemetry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;telErr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;telemetry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;telemetry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithGcpResourceProject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectID&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;telErr&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"adk telemetry init failed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;telErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;adkTelemetry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetGlobalOtelProviders&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;adkTelemetry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;this gives you ADK-level spans for free — you can see individual agent invocations, tool calls, and LLM requests in Cloud Trace without instrumenting anything manually.&lt;/p&gt;

&lt;p&gt;the pattern across all of them is the same: try to initialize, warn and continue if it fails. Cloud Run services should start even if observability is broken. a service that refuses to start because it can't connect to Cloud Monitoring is worse than a service that runs without metrics.&lt;/p&gt;




&lt;h2&gt;
  
  
  the /readyz lesson
&lt;/h2&gt;

&lt;p&gt;if you read "the websocket cascade from hell" — the post about debugging missless's WebSocket reconnection loop — you know that Cloud Run's health check behavior caused a significant chunk of that incident. the short version: Cloud Run uses &lt;code&gt;/&lt;/code&gt; as the default health check path if you don't configure one, and if your service returns anything other than 2xx on &lt;code&gt;/&lt;/code&gt;, Cloud Run marks the instance as unhealthy and kills it. during a deploy, this can cause a cascade where new instances spin up, fail the health check, get killed, and the old instances are already gone.&lt;/p&gt;

&lt;p&gt;VibeCat has explicit &lt;code&gt;/health&lt;/code&gt; and &lt;code&gt;/readyz&lt;/code&gt; endpoints on both services. the gateway's &lt;code&gt;/health&lt;/code&gt; includes the active WebSocket connection count:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;healthHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"application/json"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;"ok"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"service"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;serviceName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"connections"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;/readyz&lt;/code&gt; is separate — it's what Cloud Run uses for the readiness probe. the distinction matters: &lt;code&gt;/health&lt;/code&gt; tells you if the process is alive, &lt;code&gt;/readyz&lt;/code&gt; tells you if it's ready to serve traffic. for the gateway, readiness means the Gemini Live manager is initialized. for the orchestrator, it means the ADK runner is built and the agent graph is wired up.&lt;/p&gt;

&lt;p&gt;the deploy script doesn't configure the health check path explicitly (Cloud Run defaults to &lt;code&gt;/&lt;/code&gt; for liveness), but both services return 404 on &lt;code&gt;/&lt;/code&gt; which... is fine actually, because Cloud Run's default liveness check is TCP-based, not HTTP. the readiness check is what matters, and both services respond 200 on &lt;code&gt;/readyz&lt;/code&gt; as soon as they're up.&lt;/p&gt;

&lt;p&gt;the lesson from missless wasn't "add health checks" — it was "understand what Cloud Run is actually checking and when." the cascade happened because we didn't know Cloud Run was doing HTTP health checks against &lt;code&gt;/&lt;/code&gt; during rolling deploys. once you know that, the fix is obvious. but you have to know it first.&lt;/p&gt;
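if you do want an explicit HTTP probe against `/readyz`, Cloud Run accepts a `startupProbe` in the service YAML. a hedged sketch (the image path, port, and thresholds here are placeholders, not VibeCat's actual config):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: vibecat-gateway
spec:
  template:
    spec:
      containers:
        - image: asia-northeast3-docker.pkg.dev/PROJECT/repo/vibecat-gateway
          startupProbe:
            httpGet:
              path: /readyz
              port: 8080
            periodSeconds: 2
            failureThreshold: 10
```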




&lt;h2&gt;
  
  
  the CI pipeline
&lt;/h2&gt;

&lt;p&gt;four jobs, all independent, all run in parallel on every push to master and every PR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;go-gateway&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Gateway (Go) — Build + Test + Vet&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test with coverage&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;go test -v -race -coverprofile=coverage.out -covermode=atomic ./...&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend/realtime-gateway&lt;/span&gt;

  &lt;span class="na"&gt;go-orchestrator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Orchestrator (Go) — Build + Test + Vet&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test with coverage&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;go test -v -race -coverprofile=coverage.out -covermode=atomic ./...&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend/adk-orchestrator&lt;/span&gt;

  &lt;span class="na"&gt;swift&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Client (Swift 6 / macOS) — Build + Test&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;self-hosted&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;macOS&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;ARM64&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

  &lt;span class="na"&gt;docker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Docker — Build images&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build Gateway image&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker build -t vibecat-gateway backend/realtime-gateway/&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build Orchestrator image&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker build -t vibecat-orchestrator backend/adk-orchestrator/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the Go jobs run with the &lt;code&gt;-race&lt;/code&gt; flag. the race detector has caught two actual bugs during development — both in the WebSocket registry's connection map. the Swift job runs on a self-hosted macOS ARM64 runner because GitHub's hosted macOS runners are slow and expensive for a hackathon project.&lt;/p&gt;

&lt;p&gt;the Docker job doesn't push to Artifact Registry — it just verifies the images build. actual deployment is manual via &lt;code&gt;./infra/deploy.sh&lt;/code&gt;. for a hackathon, that's the right call. automated deploys on every push to master would be nice, but they're not worth the Cloud Build cost or the complexity of managing GCP credentials in GitHub Actions secrets.&lt;/p&gt;

&lt;p&gt;coverage artifacts get uploaded on every run, even if tests fail (&lt;code&gt;if: always()&lt;/code&gt;). this means you can look at coverage even when a test is broken, which is useful when you're trying to figure out whether a failing test is actually testing the thing you think it's testing.&lt;/p&gt;
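the upload step looks roughly like this (step and artifact names are illustrative, but `if: always()` and `actions/upload-artifact` are the real mechanism):

```yaml
      - name: Upload coverage
        if: always()  # runs even when the test step failed
        uses: actions/upload-artifact@v4
        with:
          name: gateway-coverage
          path: backend/realtime-gateway/coverage.out
```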




&lt;h2&gt;
  
  
  the ADK runner setup
&lt;/h2&gt;

&lt;p&gt;the orchestrator's ADK setup is worth looking at in detail because it uses a few features that aren't obvious from the docs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;sessService&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InMemoryService&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;memService&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InMemoryService&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;retryPlugin&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;retryandreflect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MustNew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;retryandreflect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithMaxRetries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;retryandreflect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithTrackingScope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryandreflect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Invocation&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;AppName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;"vibecat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;          &lt;span class="n"&gt;agentGraph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;SessionService&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sessService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MemoryService&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;memService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PluginConfig&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PluginConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Plugins&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;plugin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Plugin&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;retryPlugin&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;retryandreflect&lt;/code&gt; is an ADK plugin that automatically retries failed agent invocations and reflects on why they failed. &lt;code&gt;WithTrackingScope(retryandreflect.Invocation)&lt;/code&gt; means it tracks retries at the invocation level — if the VisionAgent fails, it retries VisionAgent specifically, not the entire graph. &lt;code&gt;WithMaxRetries(3)&lt;/code&gt; means it'll try three times before giving up and returning an error.&lt;/p&gt;

&lt;p&gt;this matters because Gemini API calls can fail transiently. without retry logic, a single 429 or 503 from the API would cause the entire analyze request to fail. with &lt;code&gt;retryandreflect&lt;/code&gt;, transient failures are handled automatically.&lt;/p&gt;

&lt;p&gt;the session service is in-memory for now. the MemoryAgent writes cross-session context to Firestore directly, but the ADK session state (which tracks things like &lt;code&gt;activity_minutes&lt;/code&gt; and &lt;code&gt;language&lt;/code&gt; within a single analyze call) lives in memory. for a Cloud Run service with &lt;code&gt;--min-instances 0&lt;/code&gt;, this means session state doesn't survive instance restarts — but that's acceptable because each analyze call is stateless from the orchestrator's perspective. the gateway maintains the actual session continuity.&lt;/p&gt;




&lt;h2&gt;
  
  
  current state
&lt;/h2&gt;

&lt;p&gt;gateway is on revision &lt;code&gt;00010-m9p&lt;/code&gt;, orchestrator on &lt;code&gt;00011-qj4&lt;/code&gt;. both are running in &lt;code&gt;asia-northeast3&lt;/code&gt; with &lt;code&gt;--min-instances 0&lt;/code&gt; (cold starts are acceptable for a hackathon) and &lt;code&gt;--max-instances 3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;the full deploy takes about 4 minutes: two Cloud Build jobs running sequentially (gateway then orchestrator), then two &lt;code&gt;gcloud run deploy&lt;/code&gt; calls. it's not fast, but it's reliable. &lt;code&gt;set -euo pipefail&lt;/code&gt; at the top of the deploy script means any failure stops the whole thing — no partial deploys where the gateway is updated but the orchestrator isn't.&lt;/p&gt;

&lt;p&gt;the thing I'm most happy with is the observability setup. having Cloud Trace, Cloud Monitoring, and Cloud Logging all initialized from the first line of &lt;code&gt;main()&lt;/code&gt; means that when something goes wrong in production, I have actual data to look at. the histogram for &lt;code&gt;vibecat.analyze.duration_ms&lt;/code&gt; has already told me that the parallel wave execution (Vision+Memory running concurrently) is saving about 800ms per analyze call compared to running them sequentially. that's the kind of thing you can only know if you're measuring it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;VibeCat is built for the Gemini Live Agent Challenge 2026. source at &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>go</category>
    </item>
    <item>
      <title>swift 6, screencapturekit, and why my app worked in xcode but not as a .app</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Wed, 11 Mar 2026 22:06:00 +0000</pubDate>
      <link>https://dev.to/combba/swift-6-screencapturekit-and-why-my-app-worked-in-xcode-but-not-as-a-app-3p5a</link>
      <guid>https://dev.to/combba/swift-6-screencapturekit-and-why-my-app-worked-in-xcode-but-not-as-a-app-3p5a</guid>
      <description>&lt;h1&gt;
  
  
  Swift 6, ScreenCaptureKit, and why my app worked in Xcode but not as a .app
&lt;/h1&gt;

&lt;p&gt;I created this post for the purposes of entering the Gemini Live Agent Challenge. I'm building &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;VibeCat&lt;/a&gt;, a desktop AI companion that watches your screen and talks to you.&lt;/p&gt;

&lt;p&gt;The backend was done. Nine agents, WebSocket proxy, Gemini Live API integration — all working. Time to build the macOS client. Swift 6. SwiftUI. ScreenCaptureKit. How hard could it be?&lt;/p&gt;

&lt;p&gt;Three days. Three days of things silently not working, with zero error messages.&lt;/p&gt;

&lt;h2&gt;
  
  
  the screen capture that captured nothing
&lt;/h2&gt;

&lt;p&gt;VibeCat needs to see your screen to be useful. The VisionAgent on the backend analyzes screenshots to detect errors, notice you're stuck, or see tests pass. So the client needs ScreenCaptureKit.&lt;/p&gt;

&lt;p&gt;The code itself is clean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;@MainActor&lt;/span&gt;
&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="kt"&gt;ScreenCaptureService&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;captureAroundCursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;CaptureResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;performCapture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;fullWindow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="kt"&gt;ImageDiffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hasSignificantChange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;lastImage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unchanged&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;lastImage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;captured&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unavailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;localizedDescription&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;performCapture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;fullWindow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;throws&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;CGImage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="kt"&gt;SCShareableContent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;excludingDesktopWindows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;onScreenWindowsOnly&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;guard&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;display&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displays&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="kt"&gt;CaptureError&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;noDisplay&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Exclude VibeCat's own windows&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;excludedApps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;applications&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
            &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bundleIdentifier&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kt"&gt;Bundle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bundleIdentifier&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;filter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;SCContentFilter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nv"&gt;display&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nv"&gt;excludingApplications&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;excludedApps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nv"&gt;exceptingWindows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;SCStreamConfiguration&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1280&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;720&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pixelFormat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kCVPixelFormatType_32BGRA&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;showsCursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="kt"&gt;SCScreenshotManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;captureImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nv"&gt;contentFilter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ran it in Xcode. Screen capture worked perfectly. Built a &lt;code&gt;.app&lt;/code&gt; bundle with &lt;code&gt;swift build&lt;/code&gt;. Ran it. Screen capture silently returned nothing. No error. No crash. Just... nothing.&lt;/p&gt;

&lt;p&gt;The entitlement. The &lt;code&gt;com.apple.security.screen-recording&lt;/code&gt; entitlement was in the Xcode project but wasn't getting embedded in the SPM-built binary. macOS doesn't throw an error when you try to capture without the entitlement — ScreenCaptureKit just quietly returns empty content. You get an empty &lt;code&gt;displays&lt;/code&gt; array and no indication why.&lt;/p&gt;

&lt;p&gt;I added it to &lt;code&gt;VibeCat.entitlements&lt;/code&gt; and passed it via codesign:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;codesign &lt;span class="nt"&gt;--force&lt;/span&gt; &lt;span class="nt"&gt;--entitlements&lt;/span&gt; VibeCat/VibeCat.entitlements &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--sign&lt;/span&gt; - .build/release/VibeCat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First lesson: ScreenCaptureKit fails silently. If your capture returns nothing, check your entitlements before you check your code.&lt;/p&gt;

&lt;h2&gt;
  
  
  the image differ — because you don't send every frame
&lt;/h2&gt;

&lt;p&gt;The companion captures your screen periodically, but you don't want to send every single frame to the backend. If your screen hasn't changed, there's nothing new to analyze. So I built a pixel-level change detector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;enum&lt;/span&gt; &lt;span class="kt"&gt;ImageDiffer&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;thumbnailSize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;hasSignificantChange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;from&lt;/span&gt; &lt;span class="nv"&gt;previous&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CGImage&lt;/span&gt;&lt;span class="p"&gt;?,&lt;/span&gt;
        &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="nv"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CGImage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nv"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;Bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;guard&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;previous&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;guard&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;prevThumb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;thumbnail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;previous&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
              &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;currThumb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;thumbnail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pixelDiff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prevThumb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currThumb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;pixelDiff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="nv"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;UInt8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="nv"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;UInt8&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;guard&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isEmpty&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;stride&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;by&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;dr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;dg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dr&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;dr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;dg&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;dg&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;255.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's a static method on an &lt;code&gt;enum&lt;/code&gt; (no cases — just a namespace for functions). Downscale both images to 32×32, compute Euclidean distance in RGB space per pixel, average across all pixels. If the difference exceeds 5%, it's a "significant change" worth sending.&lt;/p&gt;

&lt;p&gt;Why &lt;code&gt;enum&lt;/code&gt; instead of &lt;code&gt;struct&lt;/code&gt;? Because a struct can be instantiated by accident (unless you hide its initializer). An enum with no cases is pure namespace: you can't create an instance of &lt;code&gt;ImageDiffer&lt;/code&gt; at all. It's a common Swift pattern for grouping static utility functions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bundle.main.resourcePath — the Xcode lie
&lt;/h2&gt;

&lt;p&gt;This one hurt. In SpriteAnimator, I needed to load PNG sprite frames from &lt;code&gt;Assets/Sprites/cat/&lt;/code&gt;. First attempt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Bundle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resourcePath&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"/Assets/Sprites/&lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="n"&gt;char&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works perfectly in Xcode. The &lt;code&gt;!&lt;/code&gt; force-unwrap succeeds. Files are found. Sprites animate.&lt;/p&gt;

&lt;p&gt;Run the same binary outside Xcode? &lt;code&gt;Bundle.main.resourcePath&lt;/code&gt; is nil. Force-unwrap crashes. Silent death.&lt;/p&gt;

&lt;p&gt;The issue: when Xcode runs your app, the scheme's working directory and build layout mean &lt;code&gt;Bundle.main&lt;/code&gt; resolves to a place where the files happen to exist. An SPM-built executable launched on its own has no proper resource bundle unless you declare one, so &lt;code&gt;Bundle.main.resourcePath&lt;/code&gt; can return nil, and the files aren't where the bundle API expects them anyway.&lt;/p&gt;

&lt;p&gt;The fix was a &lt;code&gt;findRepoRoot()&lt;/code&gt; function that walks up from both the working directory and the bundle URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;findRepoRoot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;URL&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Try working directory first&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;fileURLWithPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;FileManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;currentDirectoryPath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="kt"&gt;FileManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fileExists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nv"&gt;atPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appendingPathComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Assets/Sprites"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;deletingLastPathComponent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// Fallback: walk up from bundle URL&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;bundleURL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Bundle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bundleURL&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="kt"&gt;FileManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fileExists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nv"&gt;atPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bundleURL&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appendingPathComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Assets/Sprites"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bundleURL&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;bundleURL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bundleURL&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;deletingLastPathComponent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kt"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;fileURLWithPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;FileManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;currentDirectoryPath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's not pretty. But it works in Xcode, in a standalone &lt;code&gt;.app&lt;/code&gt;, and when running from the terminal in the repo root. I used the same pattern in &lt;code&gt;BackgroundMusicPlayer&lt;/code&gt; for finding &lt;code&gt;Assets/Music/&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  the NSWindow.isVisible trap
&lt;/h2&gt;

&lt;p&gt;Swift 6 strict concurrency plus AppKit is a minefield. Here's one that's particularly evil: &lt;code&gt;NSWindow&lt;/code&gt; has a built-in property called &lt;code&gt;isVisible&lt;/code&gt;. If you declare your own stored property with the same name in a subclass, Swift doesn't warn you; it just breaks.&lt;/p&gt;

&lt;p&gt;I had:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="kt"&gt;CompanionPanel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;NSPanel&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;isVisible&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;  &lt;span class="c1"&gt;// ← shadows NSWindow.isVisible&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This compiles. It even seems to work at first. But &lt;code&gt;NSWindow.isVisible&lt;/code&gt; is a computed property tied to the window server. My stored property hid it. Window visibility checks started returning wrong values. The panel would appear/disappear at random.&lt;/p&gt;

&lt;p&gt;The fix was just a rename:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;hudVisible&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No warning from the compiler. No runtime error. Just subtle incorrectness that took hours to track down.&lt;/p&gt;

&lt;h2&gt;
  
  
  @MainActor everywhere
&lt;/h2&gt;

&lt;p&gt;Swift 6 requires &lt;code&gt;@MainActor&lt;/code&gt; on anything that touches AppKit. In Swift 5 you could get away with updating UI from background threads — the app would work until it didn't. Swift 6 is strict: if a class touches &lt;code&gt;NSWindow&lt;/code&gt;, &lt;code&gt;NSImage&lt;/code&gt;, or any AppKit type, it must be &lt;code&gt;@MainActor&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Every service class in VibeCat is &lt;code&gt;@MainActor&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;@MainActor&lt;/span&gt;
&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="kt"&gt;ScreenCaptureService&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;@MainActor&lt;/span&gt;
&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="kt"&gt;SpriteAnimator&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;@MainActor&lt;/span&gt;
&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="kt"&gt;BackgroundMusicPlayer&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But &lt;code&gt;Timer&lt;/code&gt; callbacks aren't &lt;code&gt;@MainActor&lt;/code&gt; by default. So this pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kt"&gt;Timer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scheduledTimer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;withTimeInterval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;repeats&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
    &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;advanceFrame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;// ❌ Not on MainActor&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Has to become:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kt"&gt;Timer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scheduledTimer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;withTimeInterval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;repeats&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;weak&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
    &lt;span class="kt"&gt;Task&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kd"&gt;@MainActor&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;weak&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;advanceFrame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;// ✅ MainActor&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every Timer. Every callback. Every closure that touches UI. Wrap it in &lt;code&gt;Task { @MainActor in }&lt;/code&gt;. Swift 6 is safer, but the migration tax is real.&lt;/p&gt;

&lt;h2&gt;
  
  
  the client is deliberately dumb
&lt;/h2&gt;

&lt;p&gt;One design principle I'm proud of: the macOS client is deliberately dumb. It captures screens, plays audio, animates sprites, and shuttles data to the backend. It makes zero AI decisions.&lt;/p&gt;

&lt;p&gt;When the client captures a screenshot, it doesn't analyze it — it sends the raw image to the backend's &lt;code&gt;/analyze&lt;/code&gt; endpoint. When the backend says "set character to surprised," the client just changes the sprite state. When the backend says "play this audio," the client plays it.&lt;/p&gt;

&lt;p&gt;This is a challenge requirement (all AI through backend), but it's also good architecture. The client is ~1,970 lines of Swift. The backend is ~2,900 lines of Go. If I need to change how VibeCat responds to errors, I never touch the client.&lt;/p&gt;

&lt;p&gt;The smartest thing the client does is the &lt;code&gt;ImageDiffer&lt;/code&gt; — and even that is just an optimization to avoid sending unchanged frames, not an AI decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  what I'd do differently
&lt;/h2&gt;

&lt;p&gt;If I were starting over:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test outside Xcode from day one.&lt;/strong&gt; Every feature should be verified as a standalone &lt;code&gt;.app&lt;/code&gt;, not just in the Xcode debug session. The silent failures cost me a full day.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use a resource bundle properly.&lt;/strong&gt; The &lt;code&gt;findRepoRoot()&lt;/code&gt; hack works, but it's fragile. A proper SPM resource bundle with &lt;code&gt;Bundle.module&lt;/code&gt; would be cleaner.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with Swift 6 strict concurrency enabled.&lt;/strong&gt; I started with Swift 5 mode and migrated. The migration was painful — dozens of &lt;code&gt;@MainActor&lt;/code&gt; annotations and callback wraps. Starting strict would have caught these at write-time instead of all-at-once.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But it works. The cat sees your screen. The sprites animate. The music plays. And the client stays dumb enough to let the backend do the thinking.&lt;/p&gt;

&lt;p&gt;The moment I ran the codesigned &lt;code&gt;.app&lt;/code&gt; outside Xcode for the first time — double-clicked it from Finder, no debugger, no safety net — and the cat appeared on my desktop, captured my screen, and waved at me? That was the best moment of this entire project. Three days of silent failures, for ten seconds of a pixel cat saying hello.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building VibeCat for the &lt;a href="https://geminiliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt;. Source: &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>swift</category>
    </item>
    <item>
      <title>the day vibecat stopped being a screen-watching demo</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Wed, 11 Mar 2026 16:13:00 +0000</pubDate>
      <link>https://dev.to/combba/the-day-vibecat-stopped-being-a-screen-watching-demo-3k46</link>
      <guid>https://dev.to/combba/the-day-vibecat-stopped-being-a-screen-watching-demo-3k46</guid>
      <description>&lt;h1&gt;
  
  
  the day vibecat stopped being a screen-watching demo
&lt;/h1&gt;

&lt;p&gt;I created this post for the purposes of entering the Gemini Live Agent Challenge, and this was the day the project got a lot less cute and a lot more real.&lt;/p&gt;

&lt;p&gt;For a while, the easy way to describe VibeCat was: "it's a cat on your desktop that watches your screen and comments on what you're doing."&lt;/p&gt;

&lt;p&gt;That line worked. People got it immediately. It also let me hide from the harder question.&lt;/p&gt;

&lt;p&gt;Can it actually do anything useful?&lt;/p&gt;

&lt;p&gt;Watching is a good demo. Acting is a product.&lt;/p&gt;

&lt;p&gt;And to be fair to the earlier version: that screen-watching phase was not fake work. It taught me what context mattered, what annoyed me, and what the system kept getting almost-right. It was just incomplete.&lt;/p&gt;

&lt;p&gt;The moment that became obvious was text entry. If the user says, "type this here," there are only two acceptable outcomes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the system finds the right field and types the text&lt;/li&gt;
&lt;li&gt;the system says it cannot safely verify the target&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything else is fake confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  where the old framing broke
&lt;/h2&gt;

&lt;p&gt;The old companion framing made it easy to focus on observation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what app is open&lt;/li&gt;
&lt;li&gt;what error is visible&lt;/li&gt;
&lt;li&gt;whether the user sounds frustrated&lt;/li&gt;
&lt;li&gt;whether the current screen looks important&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is all still useful. But none of it answers the practical question: is the currently focused element actually the search field in front of you?&lt;/p&gt;

&lt;p&gt;A screenshot can tell you a lot. It cannot give you permission to click blindly.&lt;/p&gt;

&lt;p&gt;That was the product boundary I needed to respect.&lt;/p&gt;

&lt;h2&gt;
  
  
  the first time it felt real
&lt;/h2&gt;

&lt;p&gt;The first really convincing moment was not a giant workflow. It was tiny.&lt;/p&gt;

&lt;p&gt;Chrome was open. The docs site was already on screen. I said, "type &lt;code&gt;gemini live api&lt;/code&gt; here."&lt;/p&gt;

&lt;p&gt;The client checked the frontmost app, the focused element, and the accessibility role. The worker verified that the target looked like text input. Then it inserted the text and refreshed context afterward.&lt;/p&gt;

&lt;p&gt;That was it.&lt;/p&gt;

&lt;p&gt;No fireworks. No theatrical demo beat. Just a boring action landing in the right place.&lt;/p&gt;

&lt;p&gt;That was the moment VibeCat stopped feeling like a mascot wrapped around an LLM and started feeling like a UI navigator.&lt;/p&gt;

&lt;h2&gt;
  
  
  what the product promise became
&lt;/h2&gt;

&lt;p&gt;The contract is much sharper now.&lt;/p&gt;

&lt;p&gt;If intent is clear and the action is low-risk, VibeCat acts.&lt;/p&gt;

&lt;p&gt;If the request is ambiguous, it asks one short question.&lt;/p&gt;

&lt;p&gt;If the request is risky, it stops and asks for explicit confirmation.&lt;/p&gt;

&lt;p&gt;If the target is unclear, it drops to guided mode instead of guessing.&lt;/p&gt;

&lt;p&gt;That last part matters the most. There is a huge difference between "I think the input field is somewhere near the top left" and "I found a focused text input and verified it after insertion."&lt;/p&gt;

&lt;p&gt;The first one sounds smart. The second one is actually useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  why the screen still matters
&lt;/h2&gt;

&lt;p&gt;The screen did not become irrelevant. It just stopped being the whole story.&lt;/p&gt;

&lt;p&gt;Now the useful context is a combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;current app&lt;/li&gt;
&lt;li&gt;window title&lt;/li&gt;
&lt;li&gt;focused element role and label&lt;/li&gt;
&lt;li&gt;selected text&lt;/li&gt;
&lt;li&gt;accessibility snapshot&lt;/li&gt;
&lt;li&gt;the latest visual state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That combination is what turns a passive observer into an executor.&lt;/p&gt;

&lt;p&gt;The screen tells you what world you are in. Accessibility tells you what object you can safely touch. Verification tells you whether the action actually landed.&lt;/p&gt;

&lt;p&gt;That triangle is the product.&lt;/p&gt;

&lt;h2&gt;
  
  
  the trade i made on purpose
&lt;/h2&gt;

&lt;p&gt;This pivot cost me some of the original companion magic.&lt;/p&gt;

&lt;p&gt;The older version had more ambient personality. It could feel like a creature hanging out on your desktop and reacting to your mood.&lt;/p&gt;

&lt;p&gt;I still like that version emotionally.&lt;/p&gt;

&lt;p&gt;But for a real product, and especially for a challenge entry, "acts safely on natural intent" is a much stronger promise than "sometimes notices things on its own."&lt;/p&gt;

&lt;p&gt;That is the trade I made, and I think it was the right one.&lt;/p&gt;

&lt;p&gt;The cat is still there. The voices are still there. The screen analysis still matters. But now those things serve the action loop instead of replacing it.&lt;/p&gt;

&lt;p&gt;And once I saw that clearly, I couldn't go back to the older pitch.&lt;/p&gt;

&lt;p&gt;The cat can still watch your screen.&lt;/p&gt;

&lt;p&gt;It just has a job now.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building VibeCat for the &lt;a href="https://geminiliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt;. Source: &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>macos</category>
    </item>
    <item>
      <title>making go speak real-time — our gemini live api websocket proxy</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Wed, 11 Mar 2026 15:24:00 +0000</pubDate>
      <link>https://dev.to/combba/making-go-speak-real-time-our-gemini-live-api-websocket-proxy-41of</link>
      <guid>https://dev.to/combba/making-go-speak-real-time-our-gemini-live-api-websocket-proxy-41of</guid>
      <description>&lt;h1&gt;
  
  
  making Go speak real-time — our Gemini Live API WebSocket proxy
&lt;/h1&gt;

&lt;p&gt;The first time I got the audio proxy working, the cat meowed in Gemini's voice — a full 3 seconds of distorted PCM noise that sounded like a dial-up modem possessed by a cheerful robot. I'd set the sample rate wrong. 24kHz audio interpreted as 16kHz sounds like a cursed lullaby.&lt;/p&gt;

&lt;p&gt;I created this post for the purposes of entering the Gemini Live Agent Challenge. I'm building &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;VibeCat&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The core challenge was simple to state, hard to build: the macOS client can't talk to Gemini directly. Challenge rules require a backend, and you never put API keys on someone's Mac. So I needed a WebSocket proxy in Go that sits between the Swift client and Gemini Live API — receiving raw audio from one side, forwarding it to the other, and doing it fast enough that conversation feels natural.&lt;/p&gt;

&lt;h2&gt;
  
  
  the architecture (deceptively simple)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Swift Client ←→ [wss://gateway/ws/live] ←→ Go Gateway ←→ Gemini Live API
     PCM 16kHz mono →                                    → PCM 16kHz
                    ← PCM 24kHz                          ← PCM 24kHz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On paper, it's a pipe. Audio goes in one side, comes out the other. I told myself this would take a day. It took three. The first day was the "it works!" day. The second was the "why did it stop working?" day. The third was the "oh, WebSocket connections are secretly fragile" day.&lt;/p&gt;

&lt;h2&gt;
  
  
  connecting to Gemini
&lt;/h2&gt;

&lt;p&gt;After the modem-cat incident, I triple-checked sample rates. The GenAI Go SDK makes the connection surprisingly clean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Live&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"gemini-2.0-flash-live-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;liveConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line. But building that &lt;code&gt;liveConfig&lt;/code&gt; is where it gets interesting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;buildLiveConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LiveConnectConfig&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;lc&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LiveConnectConfig&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Voice&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;lc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpeechConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpeechConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;VoiceConfig&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VoiceConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;PrebuiltVoiceConfig&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PrebuiltVoiceConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;VoiceName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Voice&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c"&gt;// "Zephyr", "Puck", etc.&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;lc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RealtimeInputConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RealtimeInputConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;AutomaticActivityDetection&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AutomaticActivityDetection&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Disabled&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c"&gt;// VAD must be enabled — challenge requirement&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;lc&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;VAD (Voice Activity Detection) is mandatory. When &lt;code&gt;AutomaticActivityDetection&lt;/code&gt; is enabled, Gemini handles turn-taking automatically — it detects when you stop talking and starts responding. It also supports barge-in: if you interrupt mid-response, Gemini stops and listens.&lt;/p&gt;

&lt;h2&gt;
  
  
  audio streaming
&lt;/h2&gt;

&lt;p&gt;Sending audio to Gemini:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;SendAudio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pcmData&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gemini&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SendRealtimeInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LiveRealtimeInput&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Audio&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Blob&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;MIMEType&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"audio/pcm;rate=16000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;pcmData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MIME type matters. &lt;code&gt;audio/pcm;rate=16000&lt;/code&gt; means raw PCM, 16-bit, 16kHz, mono. I know because I got it wrong — passed &lt;code&gt;audio/pcm&lt;/code&gt; without the rate parameter, and Gemini interpreted my voice as white noise. No error. No warning. Just silence on the other end and me talking to myself in an empty apartment at midnight.&lt;/p&gt;
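&lt;p&gt;A quick sanity check on what that format means for bandwidth: raw 16-bit mono PCM at 16kHz is 32,000 bytes per second upstream. This throwaway helper (mine, not part of the gateway) does the arithmetic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

const (
	sampleRate    = 16000 // Hz, matches "audio/pcm;rate=16000"
	bytesPerFrame = 2     // 16-bit mono is 2 bytes per sample
)

// pcmBytes returns how many bytes a raw PCM chunk of the given duration
// occupies at the rates above. A hypothetical helper, not gateway code.
func pcmBytes(ms int) int {
	return sampleRate * bytesPerFrame * ms / 1000
}

func main() {
	fmt.Println(pcmBytes(20))   // a typical 20 ms capture chunk: 640 bytes
	fmt.Println(pcmBytes(1000)) // one second: 32000 bytes
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Small enough that WebSocket framing overhead barely matters, which is part of why a plain binary-frame proxy works at all.&lt;/p&gt;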

&lt;p&gt;Receiving from Gemini is a loop that runs in its own goroutine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;receiveFromGemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sess&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;live&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;connID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Receive&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerContent&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerContent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ModelTurn&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerContent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ModelTurn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parts&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InlineData&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InlineData&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BinaryMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InlineData&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerContent&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerContent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TurnComplete&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;sendJSON&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"type"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"turnComplete"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerContent&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerContent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Interrupted&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;sendJSON&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"type"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"interrupted"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini sends audio in chunks via &lt;code&gt;InlineData.Data&lt;/code&gt;. Each chunk is a PCM frame at 24kHz that goes straight to the client as a binary WebSocket message. Text events (transcriptions, turn completions, interruptions) go as JSON text frames.&lt;/p&gt;
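&lt;p&gt;On the receiving end, the client-side routing reduces to one switch on the frame type. This is a simplified demux sketch, not VibeCat's actual Swift client; the opcode constants mirror gorilla/websocket's values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

// RFC 6455 frame opcodes, matching gorilla/websocket's TextMessage
// and BinaryMessage constants.
const (
	textMessage   = 1
	binaryMessage = 2
)

// routeFrame is a hypothetical client-side demux: binary frames carry
// 24kHz PCM for playback, text frames carry JSON control events.
func routeFrame(messageType int) string {
	switch messageType {
	case binaryMessage:
		return "audio" // append to the playback buffer
	case textMessage:
		return "event" // decode JSON: turnComplete, interrupted, setupComplete
	default:
		return "ignore" // ping/pong and close are handled by the library
	}
}

func main() {
	fmt.Println(routeFrame(binaryMessage))
	fmt.Println(routeFrame(textMessage))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;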

&lt;h2&gt;
  
  
  the zombie killer
&lt;/h2&gt;

&lt;p&gt;Day two's lesson: WebSocket connections die in weird ways. The client closes their laptop. The network drops. The process crashes. In all these cases, the server-side connection sits there, alive but silent — a zombie. I found this out because my test server accumulated 14 dead connections over a weekend. Each one holding a Gemini Live session open. Each one costing API credits for nothing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pingInterval&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;
    &lt;span class="n"&gt;zombieTimeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;45&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;rawConn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetReadDeadline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;zombieTimeout&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;rawConn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetPongHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;rawConn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetReadDeadline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;zombieTimeout&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c"&gt;// Ping goroutine&lt;/span&gt;
&lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ticker&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTicker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pingInterval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;rawConn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteControl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PingMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every 15 seconds, the server pings the client. If the client doesn't pong within 45 seconds, the read deadline expires and the connection gets cleaned up. The Gemini session closes, the registry removes the connection, and resources are freed.&lt;/p&gt;

&lt;h2&gt;
  
  
  session resumption
&lt;/h2&gt;

&lt;p&gt;Gemini Live sessions have a time limit. When the server sends a &lt;code&gt;GoAway&lt;/code&gt; signal, you have a few seconds to save the resumption handle and reconnect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SessionResumptionUpdate&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SessionResumptionUpdate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewHandle&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResumptionHandle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SessionResumptionUpdate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewHandle&lt;/span&gt;
    &lt;span class="n"&gt;sendJSON&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"type"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;             &lt;span class="s"&gt;"setupComplete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"sessionId"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;connID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"resumptionHandle"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResumptionHandle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client saves the handle. On reconnect, it sends the handle in the setup message, and the gateway passes it to &lt;code&gt;SessionResumptionConfig&lt;/code&gt;. Gemini picks up where it left off. No lost context, no repeated introductions.&lt;/p&gt;
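&lt;p&gt;Here's roughly what that reconnect setup message could look like on the wire. The struct and field names are illustrative, not the exact protocol:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
	"encoding/json"
	"fmt"
)

// setupMessage is a sketch of what the client sends on reconnect;
// field names are illustrative, not VibeCat's real wire format.
type setupMessage struct {
	Type             string `json:"type"`
	ResumptionHandle string `json:"resumptionHandle,omitempty"`
}

func main() {
	msg := setupMessage{Type: "setup", ResumptionHandle: "handle-from-last-session"}
	b, err := json.Marshal(msg)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
	// On the gateway side, a non-empty handle would go into
	// genai.SessionResumptionConfig before calling Live.Connect again.
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;omitempty&lt;/code&gt; keeps first-time connections clean: no handle, no field.&lt;/p&gt;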

&lt;h2&gt;
  
  
  JWT auth
&lt;/h2&gt;

&lt;p&gt;Every WebSocket connection requires a valid JWT:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/ws/live"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jwtMgr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;liveMgr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adkClient&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client first calls &lt;code&gt;POST /api/v1/auth/register&lt;/code&gt; with an API key, gets back a signed JWT with 24-hour expiry, then passes it as &lt;code&gt;Bearer &amp;lt;token&amp;gt;&lt;/code&gt; in the WebSocket upgrade request. No token, no connection. Bad token, 401.&lt;/p&gt;
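&lt;p&gt;For the curious, the token the register endpoint hands back has the standard three-part HS256 shape. This is a bare-bones illustration of the signing step only; a real gateway should lean on a vetted JWT library rather than hand-rolling it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strings"
)

// signJWT shows the HS256 mechanics: base64url(header).base64url(claims),
// then an HMAC-SHA256 signature over that string. Illustrative only.
func signJWT(header, claims string, secret []byte) string {
	enc := base64.RawURLEncoding
	signingInput := enc.EncodeToString([]byte(header)) + "." + enc.EncodeToString([]byte(claims))
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(signingInput))
	return signingInput + "." + enc.EncodeToString(mac.Sum(nil))
}

func main() {
	token := signJWT(
		`{"alg":"HS256","typ":"JWT"}`,
		`{"sub":"client-123","exp":1773700000}`, // exp set 24h after issue; value is illustrative
		[]byte("dev-secret"),
	)
	fmt.Println(strings.Count(token, ".")) // a JWT is always three dot-separated parts
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;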

&lt;p&gt;The whole gateway is about 300 lines of WebSocket handler code and 170 lines of Live session management, not counting the auth layer. For a real-time bidirectional audio proxy with authentication, session resumption, and zombie detection, that's compact.&lt;/p&gt;

&lt;p&gt;But the line count doesn't capture the real work. The real work was the modem-cat at midnight, the 14 zombie connections leaking credits, the missing MIME parameter that turned my voice into silence. The code is simple because I made every mistake first.&lt;/p&gt;

&lt;p&gt;The proxy works now. Audio goes in, the cat talks back, and it sounds like an actual voice — not a dial-up modem anymore. That feels like progress.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Building VibeCat for the &lt;a href="https://geminiliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt;. Source: &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>go</category>
    </item>
    <item>
      <title>why i stopped letting nine agents argue over one click</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Wed, 11 Mar 2026 10:40:54 +0000</pubDate>
      <link>https://dev.to/combba/why-i-stopped-letting-nine-agents-argue-over-one-click-fk1</link>
      <guid>https://dev.to/combba/why-i-stopped-letting-nine-agents-argue-over-one-click-fk1</guid>
      <description>&lt;h1&gt;
  
  
  why i stopped letting nine agents argue over one click
&lt;/h1&gt;

&lt;p&gt;I created this post for the purposes of entering the Gemini Live Agent Challenge, but this one is really about admitting I was solving the wrong problem for a while.&lt;/p&gt;

&lt;p&gt;For a few days, VibeCat looked incredible in architecture diagrams. I had named agents. I had parallel waves. I had boxes for mood, celebration, engagement, memory, search, mediation. Every time I added one more box, the system felt more sophisticated.&lt;/p&gt;

&lt;p&gt;Then I tried to use it for an actual desktop action.&lt;/p&gt;

&lt;p&gt;Not a grand demo. Something boring. "Open the official docs." The kind of request that should feel instant.&lt;/p&gt;

&lt;p&gt;And that was the moment the architecture stopped feeling smart and started feeling expensive.&lt;/p&gt;

&lt;p&gt;The graph itself wasn't wrong. It was just sitting in the wrong part of the product.&lt;/p&gt;

&lt;p&gt;Those older posts were honest snapshots of the project at the time. The graph solved real problems. It just wasn't the thing that should own every user-facing action.&lt;/p&gt;

&lt;h2&gt;
  
  
  the embarrassing realization
&lt;/h2&gt;

&lt;p&gt;I had been treating "how many capabilities exist" as if it were the same question as "how many active decision-makers should be in the hot path."&lt;/p&gt;

&lt;p&gt;Those are not the same thing.&lt;/p&gt;

&lt;p&gt;VibeCat absolutely does have many capabilities. It can analyze the screen, keep memory, do research, reason about ambiguity, classify risk, and decide whether a step should run locally or not.&lt;/p&gt;

&lt;p&gt;But when the user says something concrete like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"open the official docs"&lt;/li&gt;
&lt;li&gt;"type this in the search box"&lt;/li&gt;
&lt;li&gt;"run that again"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;nobody cares that the internal graph is elegant. They care whether the system moves now, and whether it moves safely.&lt;/p&gt;

&lt;p&gt;I had built a system that was very good at explaining itself and not yet strict enough about acting.&lt;/p&gt;

&lt;h2&gt;
  
  
  what changed
&lt;/h2&gt;

&lt;p&gt;The turning point was realizing that the product is easier to understand in three planes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Gemini Live + VAD      -&amp;gt; talks to the user
navigator worker       -&amp;gt; decides the next safe step
local macOS executor   -&amp;gt; actually focuses, types, clicks, verifies
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the part I should have led with from the beginning.&lt;/p&gt;

&lt;p&gt;The always-on Live session is the PM. It handles the messy human side: interruptions, vague requests, clarification, short confirmations, "no, not that tab, the other one."&lt;/p&gt;

&lt;p&gt;The worker is much less charming. It has one job: take an actionable request, classify it, decide whether it is ambiguous or risky, plan one step, then wait for verification.&lt;/p&gt;

&lt;p&gt;The local executor is narrower still. It looks at the frontmost app, the focused element, the AX tree, and the current window state, then tries to perform exactly one step without pretending confidence it doesn't have.&lt;/p&gt;

&lt;p&gt;Once I drew the system that way, the product made more sense immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  the part i did not throw away
&lt;/h2&gt;

&lt;p&gt;This is the part I wish I had explained better in the public posts: I did not "discover that multi-agent systems are fake" or anything dramatic like that.&lt;/p&gt;

&lt;p&gt;The 9-agent graph was useful. It still is useful.&lt;/p&gt;

&lt;p&gt;It is just better as a background intelligence lane than as the thing that every single UI action has to march through.&lt;/p&gt;

&lt;p&gt;Memory still helps. Research still helps. Low-confidence screen analysis still helps. Session summaries still help. Multimodal checks still help.&lt;/p&gt;

&lt;p&gt;But those capabilities should come in when they add accuracy, not because I am emotionally attached to the architecture.&lt;/p&gt;

&lt;p&gt;That was the real pivot: the intelligence stayed, but it moved behind the worker.&lt;/p&gt;

&lt;h2&gt;
  
  
  one rule fixed half the product
&lt;/h2&gt;

&lt;p&gt;The biggest practical improvement came from one boring rule:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;only one executable task can be active at a time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before that, a lot of weird bugs shared the same root cause:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the right action in the wrong app&lt;/li&gt;
&lt;li&gt;typing into the wrong field after the UI changed&lt;/li&gt;
&lt;li&gt;continuing an old plan because a stale refresh arrived late&lt;/li&gt;
&lt;li&gt;silently juggling two user intents at once and doing neither well&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the system had exactly one current task, one current step, and one verification loop, a lot of the magic stopped being magical and started being debuggable.&lt;/p&gt;

&lt;p&gt;That trade is worth it every time.&lt;/p&gt;

&lt;p&gt;I would much rather have a desktop agent that feels slightly stricter than one that feels "clever" right up until it pastes into the wrong input field.&lt;/p&gt;
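&lt;p&gt;The rule is small enough to sketch. This is a toy version with names I made up, not VibeCat's actual task manager, but it shows the shape: starting a new task invalidates every step from the old one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
	"fmt"
	"sync"
)

// taskSlot enforces "only one executable task can be active at a time".
// Hypothetical names, not VibeCat's real types.
type taskSlot struct {
	mu      sync.Mutex
	current string // ID of the active task; empty means idle
}

// Start makes id the only active task, implicitly replacing the previous one.
func (s *taskSlot) Start(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.current = id
}

// Accept reports whether a step belonging to taskID is still valid to run,
// which is how stale refreshes from a replaced task get dropped.
func (s *taskSlot) Accept(taskID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.current == taskID
}

func main() {
	var slot taskSlot
	slot.Start("open-docs")
	slot.Start("type-query")               // user changed their mind mid-flight
	fmt.Println(slot.Accept("open-docs"))  // stale step from the replaced task: false
	fmt.Println(slot.Accept("type-query")) // step from the current task: true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every late-arriving result carries a task ID, and anything that fails the &lt;code&gt;Accept&lt;/code&gt; check simply gets dropped instead of acting on a world that no longer exists.&lt;/p&gt;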

&lt;h2&gt;
  
  
  the request that made it obvious
&lt;/h2&gt;

&lt;p&gt;The request that finally broke my attachment to the old framing was text entry.&lt;/p&gt;

&lt;p&gt;If the user says, "type &lt;code&gt;gemini live api&lt;/code&gt; here," the system cannot answer with a pretty explanation about context. It has to either find the field and type into it, or admit it cannot verify the target.&lt;/p&gt;

&lt;p&gt;That means the hot path needs very boring things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;focus state&lt;/li&gt;
&lt;li&gt;target identity&lt;/li&gt;
&lt;li&gt;step ids&lt;/li&gt;
&lt;li&gt;risk checks&lt;/li&gt;
&lt;li&gt;post-action refresh&lt;/li&gt;
&lt;li&gt;replacement logic if the user changes their mind mid-flight&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not where I want a council of equal agents debating the meaning of the moment.&lt;/p&gt;

&lt;p&gt;That is where I want one worker making one decision.&lt;/p&gt;
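&lt;p&gt;Concretely, each hot-path step could carry something like this. The field names are illustrative, not the worker's real schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

// step is a sketch of the "boring things" one hot-path decision carries.
type step struct {
	ID            string // step id, so stale results can be discarded
	Target        string // target identity, e.g. an AX element descriptor
	RequiresFocus bool   // focus state that must hold before acting
	Risk          string // "safe", "confirm", or "block"
	VerifyAfter   bool   // trigger a post-action refresh and check
}

func main() {
	s := step{
		ID:            "task-42/step-1",
		Target:        "AXTextField:search",
		RequiresFocus: true,
		Risk:          "safe",
		VerifyAfter:   true,
	}
	fmt.Println(s.ID, s.Risk)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;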

&lt;h2&gt;
  
  
  what this changed emotionally
&lt;/h2&gt;

&lt;p&gt;This pivot also fixed something less technical: I stopped feeling like I had to constantly defend the architecture.&lt;/p&gt;

&lt;p&gt;Before, when I described VibeCat, I kept reaching for "graph," "specialists," "waves," and "agents." Those words were accurate, but they were not the thing a user would actually trust.&lt;/p&gt;

&lt;p&gt;Now the explanation is simpler, and that simplicity is earned:&lt;/p&gt;

&lt;p&gt;there is one thing talking to you.&lt;br&gt;
there is one thing deciding the next step.&lt;br&gt;
there is one thing on your Mac that can do the step and verify it.&lt;/p&gt;

&lt;p&gt;That is a product shape.&lt;/p&gt;

&lt;p&gt;And honestly, it is the first version of the system that feels like it deserves to exist outside a demo.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building VibeCat for the &lt;a href="https://geminiliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt;. Source: &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>architecture</category>
    </item>
    <item>
      <title>the graph was not wrong. it was just in the wrong place</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Wed, 11 Mar 2026 10:40:17 +0000</pubDate>
      <link>https://dev.to/combba/the-graph-was-not-wrong-it-was-just-in-the-wrong-place-4ej5</link>
      <guid>https://dev.to/combba/the-graph-was-not-wrong-it-was-just-in-the-wrong-place-4ej5</guid>
      <description>&lt;h1&gt;
  
  
  the graph was not wrong. it was just in the wrong place
&lt;/h1&gt;

&lt;p&gt;I created this post for the purposes of entering the Gemini Live Agent Challenge, and it is also my attempt to explain why two things I wrote this week appear to contradict each other when they actually don't.&lt;/p&gt;

&lt;p&gt;A few posts ago I was deep in the 9-agent graph story.&lt;/p&gt;

&lt;p&gt;That was real. I was excited about it for good reason. The graph gave VibeCat useful capabilities: screen analysis, memory, mood signals, search, celebration, speech gating. I still think that work mattered.&lt;/p&gt;

&lt;p&gt;Then the project pivoted harder toward desktop UI navigation, and suddenly the thing I cared about most was not whether the graph was elegant. It was whether the system could safely do the next step in front of the user.&lt;/p&gt;

&lt;p&gt;That made it sound like I had changed my mind completely.&lt;/p&gt;

&lt;p&gt;I didn't.&lt;/p&gt;

&lt;p&gt;I changed my mind about placement.&lt;/p&gt;

&lt;h2&gt;
  
  
  why the old posts were still true
&lt;/h2&gt;

&lt;p&gt;When I wrote about nine agents, I was looking at the intelligence layer in isolation.&lt;/p&gt;

&lt;p&gt;In that layer, decomposing the problem really did help. Separate pieces for memory, mood, celebration, search, and mediation made the analysis pipeline easier to tune. I could change frustration thresholds without touching celebration logic. I could change speech gating without touching memory retrieval. That was good engineering.&lt;/p&gt;

&lt;p&gt;If all VibeCat had to do was observe, summarize, and occasionally comment, that architecture was pretty defensible.&lt;/p&gt;

&lt;h2&gt;
  
  
  what the ui navigator version changed
&lt;/h2&gt;

&lt;p&gt;The problem is that UI action has a very different failure mode.&lt;/p&gt;

&lt;p&gt;A bad analysis result is annoying.&lt;/p&gt;

&lt;p&gt;A bad click is scary.&lt;/p&gt;

&lt;p&gt;A slow summary is forgivable.&lt;/p&gt;

&lt;p&gt;Typing into the wrong field is not.&lt;/p&gt;

&lt;p&gt;Once I started treating VibeCat as a desktop UI navigator instead of just a companion, the hot path changed completely. The important question stopped being "how many specialists can contribute here?" and became "who is allowed to decide the next executable step right now?"&lt;/p&gt;

&lt;p&gt;That answer turned out to be: not a crowd.&lt;/p&gt;

&lt;h2&gt;
  
  
  where the graph belongs now
&lt;/h2&gt;

&lt;p&gt;The graph still belongs in the product. It just belongs behind the immediate action loop.&lt;/p&gt;

&lt;p&gt;That means the structure now feels more honest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Live PM          -&amp;gt; talks to the user
navigator worker -&amp;gt; decides the next safe step
local executor   -&amp;gt; performs and verifies the step
background graph -&amp;gt; helps when extra intelligence is actually useful
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last line is the important one.&lt;/p&gt;

&lt;p&gt;The graph is not dead. It is no longer the thing that every single click has to wait on.&lt;/p&gt;

&lt;h2&gt;
  
  
  the practical example
&lt;/h2&gt;

&lt;p&gt;If the user says, "open the official docs," I do not want a miniature parliament of agents trying to co-author the moment.&lt;/p&gt;

&lt;p&gt;I want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one worker to decide if the request is executable&lt;/li&gt;
&lt;li&gt;one check for ambiguity&lt;/li&gt;
&lt;li&gt;one risk check&lt;/li&gt;
&lt;li&gt;one step&lt;/li&gt;
&lt;li&gt;one verification result&lt;/li&gt;
&lt;/ul&gt;
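&lt;p&gt;As a sketch (function and error messages are mine, not the real codebase), that whole list collapses into one straight-line function:&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// decideAndRun is a hypothetical single-worker hot path: one executability
// check, one ambiguity check, one risk check, one step, one verification.
func decideAndRun(request string, targets []string, risky bool) (string, error) {
	if len(targets) == 0 {
		return "", errors.New("not executable: no matching target")
	}
	if len(targets) > 1 {
		return "", errors.New("ambiguous: defer to the slower intelligence lane")
	}
	if risky {
		return "", errors.New("risk check failed: ask for confirmation")
	}
	// one step, then one verification result
	return fmt.Sprintf("executed %q on %s, verified", request, targets[0]), nil
}

func main() {
	out, err := decideAndRun("open the official docs", []string{"browser"}, false)
	fmt.Println(out, err)
}
```

&lt;p&gt;No parliament; each early return is also a plain sentence the worker can hand back to the conversation layer.&lt;/p&gt;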

&lt;p&gt;If, later, the system needs more context because the target is unclear or the user seems stuck, then the slower intelligence lane can help.&lt;/p&gt;

&lt;p&gt;That is a better use of the graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  the part i had to swallow my pride about
&lt;/h2&gt;

&lt;p&gt;I think a lot of us like architectures that sound impressive when we say them out loud.&lt;/p&gt;

&lt;p&gt;I definitely do.&lt;/p&gt;

&lt;p&gt;"Three parallel waves and nine specialist agents" sounds like progress. "One worker does one step at a time" sounds almost embarrassingly plain by comparison.&lt;/p&gt;

&lt;p&gt;But the second version is closer to what a user can actually trust.&lt;/p&gt;

&lt;p&gt;That was the hard part for me. Not building the new shape. Admitting the new shape was better.&lt;/p&gt;

&lt;h2&gt;
  
  
  so what changed, exactly?
&lt;/h2&gt;

&lt;p&gt;Not the belief that decomposition can help.&lt;/p&gt;

&lt;p&gt;Not the belief that background intelligence matters.&lt;/p&gt;

&lt;p&gt;Not the belief that VibeCat needs memory, research, and multimodal context.&lt;/p&gt;

&lt;p&gt;What changed was this:&lt;/p&gt;

&lt;p&gt;I stopped asking the graph to own the exact moment where the product becomes risky.&lt;/p&gt;

&lt;p&gt;That moment now belongs to a narrower worker with a stricter contract.&lt;/p&gt;

&lt;p&gt;And that change made the whole project feel less like a cool diagram and more like a real tool.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building VibeCat for the &lt;a href="https://geminiliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt;. Source: &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>architecture</category>
    </item>
    <item>
      <title>teaching nine agents to think like a colleague</title>
      <dc:creator>KimSejun</dc:creator>
      <pubDate>Tue, 10 Mar 2026 06:09:39 +0000</pubDate>
      <link>https://dev.to/combba/teaching-nine-agents-to-think-like-a-colleague-1o8</link>
      <guid>https://dev.to/combba/teaching-nine-agents-to-think-like-a-colleague-1o8</guid>
      <description>&lt;p&gt;I created this post for the purposes of entering the Gemini Live Agent Challenge. In my last post I walked through what VibeCat actually does — a macOS cat that watches your screen, hears your voice, and knows when to shut up. But I glossed over &lt;em&gt;how&lt;/em&gt; it does all that. The cat isn't one thing — it's nine things pretending to be one thing, and getting that pretense right is the actual engineering problem.&lt;/p&gt;




&lt;p&gt;let me start with the question that shaped everything: what does a colleague actually &lt;em&gt;do&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;not a chatbot. not a search engine. a colleague. the person sitting next to you who catches your typo on line 23 before you do, notices you've been stuck for 40 minutes, and knows when to shut up because you're in flow.&lt;/p&gt;

&lt;p&gt;I spent a while listing the behaviors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;See&lt;/strong&gt; your screen and notice errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remember&lt;/strong&gt; yesterday's context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sense&lt;/strong&gt; frustration from patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Celebrate&lt;/strong&gt; when tests pass&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decide&lt;/strong&gt; whether to speak or stay silent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adapt&lt;/strong&gt; timing to your rhythm&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reach out&lt;/strong&gt; when you've been too quiet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search&lt;/strong&gt; for answers when you're stuck&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;that's not one model doing one thing. that's eight distinct behaviors plus voice (VAD makes nine). so I decomposed the colleague into nine agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  the graph
&lt;/h2&gt;

&lt;p&gt;all nine agents run through Google ADK's workflow agents. the key insight: not all agents need each other's results. VisionAgent doesn't care about MemoryAgent's output. MoodDetector doesn't need CelebrationTrigger. so I split them into three waves:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Wave 1 — Perception (parallel)&lt;/span&gt;
&lt;span class="n"&gt;wave1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;parallelagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parallelagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;AgentConfig&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;"wave1_perception"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;SubAgents&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;visionAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memoryAgent&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c"&gt;// Wave 2 — Emotion (parallel)&lt;/span&gt;
&lt;span class="n"&gt;wave2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;parallelagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parallelagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;AgentConfig&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;"wave2_emotion"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;SubAgents&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;moodAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;celebrationAgent&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c"&gt;// Wave 3 — Decision (sequential, because each depends on the previous)&lt;/span&gt;
&lt;span class="n"&gt;wave3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sequentialagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sequentialagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;AgentConfig&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;"wave3_decision"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;SubAgents&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mediatorAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schedulerAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engagementAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;searchLoop&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c"&gt;// The full graph&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sequentialagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sequentialagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;AgentConfig&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;"vibecat_graph"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;SubAgents&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;wave1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wave2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wave3&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;waves 1 and 2 run in parallel — &lt;code&gt;parallelagent&lt;/code&gt; fires both sub-agents simultaneously. wave 3 runs sequentially because the Mediator needs mood + celebration results, the Scheduler needs the Mediator's decision, and so on.&lt;/p&gt;

&lt;p&gt;the result: ~35% latency reduction compared to running all 9 sequentially. from ~3.5 seconds down to ~2.1-2.5 seconds for the full graph. that matters when a developer is waiting for the cat to react to their screen.&lt;/p&gt;

&lt;h2&gt;
  
  
  the mediator problem
&lt;/h2&gt;

&lt;p&gt;making AI talk is easy. every LLM wants to talk. the hard part is making it know when to &lt;em&gt;shut up&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;the Mediator agent is the gatekeeper. it reads everything — vision analysis, mood state, celebration events — and makes one binary decision: speak or stay silent. here's the core logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;defaultCooldown&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;
    &lt;span class="n"&gt;moodCooldown&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;180&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;
    &lt;span class="n"&gt;highSignificance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;7&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;decide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vision&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VisionAnalysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mood&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MoodState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;celebration&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CelebrationEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MediatorDecision&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// ... read from state, check cooldown, check flow state&lt;/span&gt;

    &lt;span class="c"&gt;// celebration always bypasses cooldown&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;celebration&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;celebration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MediatorDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ShouldSpeak&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Reason&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"celebration"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// high significance + error = speak immediately&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Significance&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;highSignificance&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrorDetected&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MediatorDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ShouldSpeak&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Reason&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"error_detected"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Urgency&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// flow state = extend cooldown, stay silent&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;isInFlowState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MediatorDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ShouldSpeak&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Reason&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"flow_state"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// ... more rules&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;but it gets more nuanced. the Mediator also tracks recent speech to avoid repeating itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;isSimilarToRecent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// if we said something similar in the last 5 utterances, stay silent&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
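&lt;p&gt;the body of that check is elided above, but one plausible way to implement it (my guess, not the real code) is normalized word overlap against a small ring of recent utterances:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// isSimilarToRecent reports whether text shares most of its words with any
// recent utterance. This is a guess at the elided body, not the real code.
func isSimilarToRecent(recent []string, text string) bool {
	words := func(s string) map[string]bool {
		set := map[string]bool{}
		for _, w := range strings.Fields(strings.ToLower(s)) {
			set[w] = true
		}
		return set
	}
	cur := words(text)
	for _, prev := range recent {
		overlap := 0
		for w := range words(prev) {
			if cur[w] {
				overlap++
			}
		}
		// "similar" = more than half of the new utterance's words were just said
		if len(cur) > 0 && overlap*2 > len(cur) {
			return true
		}
	}
	return false
}

func main() {
	recent := []string{"looks like a nil pointer on line 23"}
	fmt.Println(isSimilarToRecent(recent, "nil pointer on line 23 again"))
	fmt.Println(isSimilarToRecent(recent, "tests are green nice work"))
}
```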



&lt;p&gt;and it generates mood-support messages dynamically using &lt;code&gt;gemini-3.1-flash-lite-preview&lt;/code&gt; when it detects sustained frustration but hasn't spoken about mood in the last 3 minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mood&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ShouldSpeak&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;sinceMood&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lastMoodSpoke&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sinceMood&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;moodCooldown&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generateMoodMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mood&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ShouldSpeak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
            &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"mood_support"&lt;/span&gt;
            &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lastMoodSpoke&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;no hardcoded messages in the normal path. every utterance is generated by the LLM, considering the developer's current context, mood, language, and what they're working on. the hardcoded pool exists only as a fallback when generation fails.&lt;/p&gt;

&lt;h2&gt;
  
  
  multimodal mood detection
&lt;/h2&gt;

&lt;p&gt;the MoodDetector doesn't just look at text. it fuses three signals:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Vision signals&lt;/strong&gt; — error frequency, repeated errors (same error 3+ times = frustrated), app switches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice tone&lt;/strong&gt; — from Gemini's AffectiveDialog, the Live API reports the emotional tone of the user's voice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temporal patterns&lt;/strong&gt; — how long since last interaction, silence duration, error-to-fix time
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;voiceTone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voiceConfidence&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;readVoiceToneFromState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mood&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voiceTone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voiceConfidence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the voice tone comes from ADK session state — the gateway extracts it from the Live API's AffectiveDialog output and writes it to &lt;code&gt;voice_tone&lt;/code&gt; in the session state. the MoodDetector reads it alongside the vision analysis to produce a fused mood classification.&lt;/p&gt;

&lt;p&gt;this is genuinely multimodal — not just "look at the screen" or "listen to the voice" but both, simultaneously, informing a single emotional model.&lt;/p&gt;

&lt;h2&gt;
  
  
  rest reminders and proactive engagement
&lt;/h2&gt;

&lt;p&gt;the EngagementAgent handles two kinds of proactive behavior:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;silence engagement&lt;/strong&gt; — if the developer hasn't interacted in 3 minutes, it speaks up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sinceLast&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;silenceThreshold&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ShouldSpeak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"silence_engagement"&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpeechText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generateSilenceMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;rest reminders&lt;/strong&gt; — the client tracks &lt;code&gt;activityMinutes&lt;/code&gt; from session start and sends it with every screen capture. after 50 minutes of continuous coding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;restReminderInterval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Minute&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;restReminderCooldown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Minute&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;activityMin&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;restReminderInterval&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Minutes&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;sinceLastReminder&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;restReminderCooldown&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ShouldSpeak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"rest_reminder"&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpeechText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generateRestMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activityMin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the full pipeline: macOS client calculates minutes since session start → sends &lt;code&gt;activityMinutes&lt;/code&gt; in the WebSocket payload → Gateway passes it to Orchestrator in &lt;code&gt;POST /analyze&lt;/code&gt; → EngagementAgent reads it from session state → triggers LLM-generated rest suggestion in the developer's language.&lt;/p&gt;

&lt;h2&gt;
  
  
  adk advanced features
&lt;/h2&gt;

&lt;p&gt;VibeCat doesn't just use ADK's basic agents. it uses the advanced stuff:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;retryandreflect&lt;/code&gt; plugin&lt;/strong&gt; — if an agent fails (network timeout, LLM error), it automatically reflects on why it failed and retries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"google.golang.org/adk/plugin/retryandreflect"&lt;/span&gt;

&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;graphAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Plugins&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Plugin&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;retryandreflect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryandreflect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithTrackingScope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryandreflect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Invocation&lt;/span&gt;&lt;span class="p"&gt;))},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;loopagent&lt;/code&gt;&lt;/strong&gt; — the SearchBuddy is wrapped in a loop agent that runs up to 2 iterations, refining search results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;searchLoop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;loopagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loopagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;AgentConfig&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;"search_refinement_loop"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;SubAgents&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;searchSubAgents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;MaxIterations&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;BeforeModel/AfterModel&lt;/code&gt; callbacks&lt;/strong&gt; — the LLM search agent hooks both sides of every model call for logging and guardrails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;llmSearchAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;llmagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llmagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;BeforeModelCallback&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CallbackContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LLMRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LLMResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[LLM_SEARCH] before model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AgentName&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;AfterModelCallback&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CallbackContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LLMResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LLMResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;slog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[LLM_SEARCH] after model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AgentName&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"has_error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;14 ADK features total. &lt;code&gt;agent.New&lt;/code&gt;, &lt;code&gt;sequentialagent&lt;/code&gt;, &lt;code&gt;parallelagent&lt;/code&gt;, &lt;code&gt;loopagent&lt;/code&gt;, &lt;code&gt;llmagent&lt;/code&gt;, &lt;code&gt;session.InMemoryService&lt;/code&gt;, &lt;code&gt;memory.InMemoryService&lt;/code&gt;, &lt;code&gt;runner.New&lt;/code&gt;, &lt;code&gt;telemetry&lt;/code&gt;, &lt;code&gt;session.State&lt;/code&gt;, &lt;code&gt;functiontool&lt;/code&gt;, &lt;code&gt;geminitool.GoogleSearch&lt;/code&gt;, &lt;code&gt;retryandreflect&lt;/code&gt;, and &lt;code&gt;BeforeModel/AfterModel&lt;/code&gt; callbacks.&lt;/p&gt;

&lt;h2&gt;
  
  
  what I learned
&lt;/h2&gt;

&lt;p&gt;the hardest thing about building a multi-agent system isn't the graph. it's the boundaries. when does MoodDetector's responsibility end and Mediator's begin? who owns the "should I speak" decision when both EngagementAgent and Mediator have opinions?&lt;/p&gt;

&lt;p&gt;the answer that worked: each agent writes to session state, and downstream agents read from it. no agent calls another agent directly. the graph topology IS the API contract. Vision writes &lt;code&gt;vision_analysis&lt;/code&gt; to state. Mood reads it and writes &lt;code&gt;mood_state&lt;/code&gt;. Mediator reads both. clean, testable, and you can swap any agent without touching the others.&lt;/p&gt;
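&lt;p&gt;the pattern is trivial to sketch outside ADK. a toy version — the agent logic here is made up for illustration, and the real agents are LLM-backed and use ADK's &lt;code&gt;session.State&lt;/code&gt; rather than a bare map:&lt;/p&gt;

```go
package main

import "fmt"

// State is a minimal stand-in for ADK session state: the shared blackboard
// every agent reads from and writes to.
type State map[string]any

// Each agent is just a function over shared state -- no agent calls another.
// The keys are the contract; the values below are illustrative.
func visionAgent(s State) { s["vision_analysis"] = "editor open, test failing" }

func moodAgent(s State) { s["mood_state"] = "frustrated" } // reads vision_analysis in the real graph

func mediatorAgent(s State) {
	// Mediator owns the final decision by reading what upstream agents wrote.
	speak := false
	if s["vision_analysis"] != nil {
		if s["mood_state"] == "frustrated" {
			speak = true
		}
	}
	s["should_speak"] = speak
}

func main() {
	s := State{}
	// Waves run in order; each wave only reads what earlier waves wrote,
	// so any agent can be swapped without touching the others.
	visionAgent(s)
	moodAgent(s)
	mediatorAgent(s)
	fmt.Println(s["should_speak"]) // prints true
}
```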

&lt;p&gt;nine agents. three waves. one decision. and a cat that knows when to shut up.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building VibeCat for the &lt;a href="https://geminiliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt;. Source: &lt;a href="https://github.com/Two-Weeks-Team/vibeCat" rel="noopener noreferrer"&gt;github.com/Two-Weeks-Team/vibeCat&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>devlog</category>
      <category>buildinpublic</category>
      <category>go</category>
    </item>
  </channel>
</rss>
