## 💡 Inspiration

Every developer knows the feeling: you're in a deep flow state — three terminals open, a half-written function in your head — and you need to look something up. You alt-tab, open a browser, lose your place, lose your thought. The AI assistant is in another window, waiting for you to come to it.

I wanted to flip that. What if the AI came to you? Not in a sidebar. Not a chatbot. Just... there, invisible, floating above every window — summoned with one keystroke and gone just as fast.

That's the idea behind GhostOps: an AI that lives on your screen like a ghost. You never have to context-switch to talk to it.


## 🏗️ How I Built It

### The Overlay

The foundation is an Electron window with `transparent: true`, `frame: false`, and `alwaysOnTop: 'screen-saver'`. On macOS this is a panel-type window — it floats above everything, including full-screen apps and Mission Control. By default it's `focusable: false` and ignores all mouse events, so it's completely invisible to your workflow.

When you press ⌘ + Shift + Space, a command bar fades up. The window becomes interactive just long enough for you to type or speak your request — then retreats to ghost mode.
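The ghost-window setup above can be sketched as a plain options object plus a mode toggle. Option names are standard Electron `BrowserWindow` options; the exact values and the `type: 'panel'` choice are illustrative, not GhostOps' literal config:

```javascript
// Sketch of the ghost-window configuration. In the real app this object
// would be passed to `new BrowserWindow(overlayOptions)` in the main process.
const overlayOptions = {
  transparent: true,   // no background — the desktop shows through
  frame: false,        // no title bar or window chrome
  focusable: false,    // never steals keyboard focus in ghost mode
  alwaysOnTop: true,   // the 'screen-saver' level is set after creation
  hasShadow: false,    // avoid a phantom drop shadow around the window
  type: 'panel',       // macOS panel: floats over full-screen apps
};

// Flags for toggling between ghost mode and interactive mode when the
// hotkey fires (mirrors the focusable / ignore-mouse-events dance).
function ghostModeFlags(interactive) {
  return {
    ignoreMouseEvents: !interactive, // clicks pass through in ghost mode
    focusable: interactive,          // only focusable while the bar is open
  };
}
```

After creating the window you'd typically also call `win.setAlwaysOnTop(true, 'screen-saver')` and `win.setIgnoreMouseEvents(true, { forward: true })` — both are real Electron APIs, though the exact combination GhostOps uses isn't shown here.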

### Gemini as the Brain

Every command is routed by Gemini 2.5 Flash to one of six specialist agents:

  • `direct_response` — instant answers without tools
  • `invoke_screen_annotator` — takes a screenshot, sends it to Gemini multimodal, draws floating bounding boxes over the identified elements
  • `invoke_cua_vision` — computer use: screenshot → Gemini 2.5 Flash → tool calls (move cursor, click, type) → repeat until complete
  • `invoke_cua_cli` — shell commands, file ops, `open -a AppName`
  • `invoke_browser` — full Playwright browser automation
  • `request_screen_context` — reads the screen first, then decides what to do

The routing itself uses Gemini's function calling. A single structured output decides which agent handles the task — no keyword matching, no regex trees.
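The routing described above can be sketched as one function declaration plus a dispatch table. The six agent names come from the list; the schema follows the JSON-Schema style that Gemini function declarations use, but the SDK call itself is elided — only the dispatch side is shown:

```javascript
// Router declaration handed to the model: pick exactly one agent.
// (Schema shape is JSON-Schema-style; field names here are illustrative.)
const ROUTER_DECLARATION = {
  name: 'route_command',
  description: 'Pick exactly one specialist agent for the user request.',
  parameters: {
    type: 'object',
    properties: {
      agent: {
        type: 'string',
        enum: [
          'direct_response',
          'invoke_screen_annotator',
          'invoke_cua_vision',
          'invoke_cua_cli',
          'invoke_browser',
          'request_screen_context',
        ],
      },
      task: { type: 'string' }, // normalized task for the chosen agent
    },
    required: ['agent', 'task'],
  },
};

// One structured output → one dispatch. No keyword matching, no regex trees.
function dispatch(functionCall, handlers) {
  const { agent, task } = functionCall.args;
  if (!handlers[agent]) throw new Error(`unknown agent: ${agent}`);
  return handlers[agent](task);
}
```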

### Gemini Live API for Voice

The most magical part: hold the mic button and just talk. The Gemini Live API streams bidirectional audio — you can interrupt mid-sentence and GhostOps adapts in real time. The same six agents are available over voice, dispatched the same way.

### Workflow Recording and Replay

GhostOps can watch you work and learn to replicate it. Say "watch me" — it starts screenshotting every 2 seconds and transcribing your keystrokes. Say "remember this as setup-project" — it sends the final frame to Gemini multimodal, extracts a structured list of steps as JSON, and saves them to Firestore.

Next time: "replay setup-project" — GhostOps runs through each step using the vision agent, taking a fresh screenshot at each step to find and interact with the correct element on screen.
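The replay loop above can be sketched as follows. The step fields (`action`, `target`, `text`, `description`) are assumptions for illustration — the real Firestore schema isn't shown here:

```javascript
// Hypothetical shape of a saved workflow (field names are assumptions).
const savedWorkflow = {
  name: 'setup-project',
  steps: [
    { action: 'open_app', target: 'Terminal' },
    { action: 'type', text: 'git clone <repo-url>' },
    { action: 'click', description: 'the Run button in the toolbar' },
  ],
};

// Replay loop: each step takes a *fresh* screenshot so the vision agent can
// re-locate the element even if windows moved since the recording was made.
async function replay(workflow, { screenshot, locateAndAct }) {
  for (const step of workflow.steps) {
    const frame = await screenshot();  // current screen state
    await locateAndAct(frame, step);   // vision agent finds + acts on target
  }
  return workflow.steps.length;        // number of steps executed
}
```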

### Cloud Backend

The persistent memory and workflow store runs on Google Cloud Run (FastAPI) with Firestore as the database. On startup, GhostOps fetches your last N conversation turns and injects them as memory into the system prompt — so it remembers context across restarts.
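The memory injection can be sketched as a pure prompt-building step. The turn shape (`{ role, text }`) and the section heading in the prompt are assumptions:

```javascript
// Build the system prompt from persisted conversation turns fetched at
// startup. Turn shape and prompt wording are illustrative assumptions.
function buildSystemPrompt(basePrompt, turns, n = 10) {
  const recent = turns.slice(-n); // keep only the last N turns
  if (recent.length === 0) return basePrompt;
  const memory = recent.map((t) => `${t.role}: ${t.text}`).join('\n');
  return `${basePrompt}\n\n## Memory from previous sessions\n${memory}`;
}
```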


## 🧗 Challenges

### The Mouse Problem

Electron windows with `focusable: false` on macOS don't receive `mousemove` events. Ever. This is by design — a ghost window can't intercept events meant for the real windows underneath. But drag-to-reposition needs `mousemove`.

The solution: a cursor poller. Every 30 ms, the Electron main process calls `screen.getCursorScreenPoint()` and sends the coordinates to the renderer via IPC. The renderer checks whether the cursor is hovering over any interactive element and updates drag state accordingly. Completely invisible to the OS, perfectly smooth.
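The renderer-side half of the poller reduces to a pure hit test — given the polled cursor point and the overlay's interactive element rectangles (both in screen coordinates), report what's hovered. A minimal sketch, with the element shape as an assumption:

```javascript
// Hit test driven by the 30ms cursor poll. `cursor` arrives over IPC from
// screen.getCursorScreenPoint(); `elements` are the overlay's interactive
// regions in the same screen coordinate space.
function hitTest(cursor, elements) {
  for (const el of elements) {
    const inside =
      cursor.x >= el.x && cursor.x < el.x + el.width &&
      cursor.y >= el.y && cursor.y < el.y + el.height;
    if (inside) return el.id; // hovered element → update drag state
  }
  return null; // ghost mode: cursor isn't over anything interactive
}
```

Because the main process only *reads* the cursor position, the OS never sees the ghost window participate in event handling at all.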

### The Overlay Input Bug

Recording a workflow leaves the overlay in an `overlayActive = true` state (the bar is collapsed but technically still shown). The next ⌘ + Shift + Space hits an early-return guard — `if (overlayActive && !overlayClosing) return` — and silently does nothing. The user stares at a blank screen wondering if the app crashed.

The fix: call `forceResetCommandOverlay()` before `showCommandOverlay()` in the hotkey handler, wiping all state so the bar always appears cleanly.
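A minimal model of the bug and the fix, using the function names from the write-up (the state fields mirror the guard; everything else is a simplification):

```javascript
// Overlay state as described: the guard blocks a show while active.
const state = { overlayActive: false, overlayClosing: false };

function showCommandOverlay() {
  if (state.overlayActive && !state.overlayClosing) return false; // the bug path
  state.overlayActive = true;
  state.overlayClosing = false;
  return true; // bar shown
}

function forceResetCommandOverlay() {
  state.overlayActive = false;
  state.overlayClosing = false;
}

// Hotkey handler after the fix: reset first, then show — always succeeds,
// even when a recording left the overlay in a stale "active" state.
function onHotkey() {
  forceResetCommandOverlay();
  return showCommandOverlay();
}
```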

### Screen Resolution in the Vision Loop

The computer-use agent takes a screenshot and asks Gemini to return bounding-box coordinates for the target element. But screenshots are captured at native retina resolution (4000×1440 on my machine), while the coordinates need to match the logical screen space `pyautogui` operates in. Getting the coordinate-space mapping right — accounting for device pixel ratio, window scaling, and multi-monitor offsets — took more iteration than any other part of the project.
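The core of the mapping is dividing by the display's scale factor and adding the monitor's logical origin. A simplified sketch (the monitor-offset handling here is an assumption; the real multi-monitor logic is more involved):

```javascript
// Map a bounding-box center from screenshot (physical) pixels to logical
// screen coordinates for the cursor controller. scaleFactor is the display's
// device pixel ratio (2 on most retina Macs); monitorOrigin is that
// display's logical origin, needed on multi-monitor setups.
function toLogicalPoint(box, scaleFactor, monitorOrigin = { x: 0, y: 0 }) {
  const cxPhysical = box.x + box.width / 2;  // box center, physical pixels
  const cyPhysical = box.y + box.height / 2;
  return {
    x: monitorOrigin.x + cxPhysical / scaleFactor,
    y: monitorOrigin.y + cyPhysical / scaleFactor,
  };
}
```

For example, a box at (100, 200) sized 40×20 in a 2× retina screenshot maps to the logical point (60, 105) — clicking at the raw pixel center instead would land the cursor in the wrong place entirely.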

### Gemini Live API Latency

The voice session uses a persistent WebSocket to the Gemini Live API. The challenge was threading: the audio capture runs in one thread, the agent dispatch runs in another, and the overlay renders in the Electron main thread. Getting all three to communicate without deadlocks — especially when an agent is mid-execution and a new voice command arrives — required careful session state management.
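One way to keep agent execution and incoming voice commands from racing is to serialize runs and let a newer command supersede an older one that hasn't started yet. This is a generic sketch of that pattern, not GhostOps' actual session manager:

```javascript
// Serialize agent executions: commands run one at a time, and a command
// that is still queued when a newer one arrives is skipped as stale.
class CommandQueue {
  constructor() {
    this.current = null; // tail of the execution chain
    this.seq = 0;        // monotonically increasing command id
  }

  submit(run) {
    const id = ++this.seq;
    const prev = this.current || Promise.resolve();
    // Chain onto the previous run; by the time we execute, a newer
    // command may have bumped seq — if so, skip this stale one.
    this.current = prev.then(() => (id === this.seq ? run() : 'superseded'));
    return this.current;
  }
}
```

The same idea extends to interruption: a mid-execution agent would additionally poll a cancellation flag, which is where the careful session state management comes in.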


## 📚 What I Learned

  • Gemini 2.5 Flash is genuinely good at screen understanding. Give it a screenshot and ask "where is the New Tab button?" — it finds it. This makes the computer-use loop surprisingly reliable.
  • Multimodal routing is more powerful than text routing. The request_screen_context agent pre-sends the screen to Gemini before routing, so "open this repo in Cursor" works even if you never typed the repo name — Gemini reads it off the screen.
  • The Gemini Live API changes how interaction feels. Text input has latency you notice. Streaming audio feels like talking to someone in the room.
  • Transparent overlays are deep OS-level territory. Between window types, z-order, focus management, and accessibility permissions, building something that truly sits above all other windows without breaking normal desktop behavior is a surprisingly complex systems problem.

## 🔮 What's Next

  • Proactive heartbeats — GhostOps checks in on you based on what it last knew you were working on
  • Browser workflow recording — capture Playwright traces, not just desktop screenshots
  • Multi-user Firestore isolation — proper per-user memory with auth
  • Windows support — the pyautogui + Electron stack is already cross-platform
