## 💡 Inspiration
Every developer knows the feeling: you're in a deep flow state — three terminals open, a half-written function in your head — and you need to look something up. You alt-tab, open a browser, lose your place, lose your thought. The AI assistant is in another window, waiting for you to come to it.
I wanted to flip that. What if the AI came to you? Not in a sidebar. Not a chatbot. Just... there, invisible, floating above every window — summoned with one keystroke and gone just as fast.
That's the idea behind GhostOps: an AI that lives on your screen like a ghost. You never have to context-switch to talk to it.
## 🏗️ How I Built It
### The Overlay
The foundation is an Electron window created with `transparent: true`, `frame: false`, and an always-on-top level of `'screen-saver'`. On macOS this is a panel-type window — it floats above everything, including full-screen apps and Mission Control. By default it is `focusable: false` and ignores all mouse events, so it's completely invisible to your workflow.
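A minimal sketch of the window setup, assuming an Electron main process. Only `transparent`, `frame`, `focusable`, and the `'screen-saver'` level come from the post; the other options and the exact combination are illustrative:

```javascript
// Overlay window options (sketch). These are real Electron BrowserWindow
// options, but this exact combination is an assumption, not GhostOps source.
const overlayOptions = {
  transparent: true,  // no background: the desktop shows through
  frame: false,       // no title bar or borders
  focusable: false,   // never steals keyboard focus
  hasShadow: false,   // avoid a drop shadow under the "invisible" window
  skipTaskbar: true,  // keep it out of the Dock / task switcher
};

// In the main process you would then do roughly:
//   const win = new BrowserWindow(overlayOptions);
//   win.setAlwaysOnTop(true, 'screen-saver'); // float above full-screen apps
//   win.setIgnoreMouseEvents(true);           // clicks pass through to apps below
```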
When you press ⌘ + Shift + Space, a command bar fades up. The window becomes interactive just long enough for you to type or speak your request — then retreats to ghost mode.
### Gemini as the Brain
Every command is routed by Gemini 2.5 Flash to one of six specialist agents:
- `direct_response` — instant answers without tools
- `invoke_screen_annotator` — takes a screenshot, sends it to Gemini multimodal, and draws floating bounding boxes over the identified elements
- `invoke_cua_vision` — computer use: screenshot → Gemini 2.5 Flash → tool calls (move cursor, click, type) → repeat until complete
- `invoke_cua_cli` — shell commands, file ops, `open -a AppName`
- `invoke_browser` — full Playwright browser automation
- `request_screen_context` — reads the screen first, then decides what to do
The routing itself uses Gemini's function calling. A single structured output decides which agent handles the task — no keyword matching, no regex trees.
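The dispatch step can be sketched as a plain lookup table keyed by the function name that Gemini's structured output returns. The agent names come from the post; the handler bodies and argument fields are stand-ins, not the real agents:

```javascript
// Map each routable agent name (from the post) to a handler.
// Handlers are placeholders; real ones would drive tools, Playwright, etc.
const agents = {
  direct_response:         (args) => `answer: ${args.query}`,
  invoke_screen_annotator: (args) => `annotate: ${args.target}`,
  invoke_cua_vision:       (args) => `vision loop: ${args.task}`,
  invoke_cua_cli:          (args) => `shell: ${args.command}`,
  invoke_browser:          (args) => `browser: ${args.url}`,
  request_screen_context:  (args) => `read screen, then: ${args.task}`,
};

// Gemini function calling yields one structured call: { name, args }.
// Routing is then a single lookup — no keyword matching, no regex trees.
function dispatch(functionCall) {
  const handler = agents[functionCall.name];
  if (!handler) throw new Error(`unknown agent: ${functionCall.name}`);
  return handler(functionCall.args);
}
```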
### Gemini Live API for Voice
The most magical part: hold the mic button and just talk. The Gemini Live API streams bidirectional audio — you can interrupt mid-sentence and GhostOps adapts in real time. The same six agents are available over voice, dispatched the same way.
### Workflow Recording and Replay
GhostOps can watch you work and learn to replicate it. Say "watch me" — it starts screenshotting every 2 seconds and transcribing your keystrokes. Say "remember this as setup-project" — it sends the final frame to Gemini multimodal, extracts a structured list of steps as JSON, and saves them to Firestore.
Next time: "replay setup-project" — GhostOps runs through each step using the vision agent, taking a fresh screenshot at each step to find and interact with the correct element on screen.
### Cloud Backend
The persistent memory and workflow store runs on Google Cloud Run (FastAPI) with Firestore as the database. On startup, GhostOps fetches your last N conversation turns and injects them as memory into the system prompt — so it remembers context across restarts.
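The startup memory injection can be sketched like this; the function name, field names, and prompt layout are assumptions (the real backend is FastAPI + Firestore):

```javascript
// Build a system prompt that carries the last N conversation turns,
// fetched from the backend, into a fresh session.
function buildSystemPrompt(basePrompt, turns, n = 10) {
  const recent = turns.slice(-n); // keep only the most recent N turns
  const memory = recent.map((t) => `${t.role}: ${t.text}`).join('\n');
  return `${basePrompt}\n\n# Memory from previous sessions\n${memory}`;
}
```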
## 🧗 Challenges
### The Mouse Problem
Electron's `focusable: false` windows on macOS don't receive `mousemove` events. Ever. This is by design — a ghost window can't intercept events meant for real windows underneath. But drag-to-reposition needs `mousemove`.

The solution: a cursor poller. Every 30 ms, the Electron main process calls `screen.getCursorScreenPoint()` and sends the coordinates to the renderer via IPC. The renderer checks whether the cursor is hovering over any interactive element and updates drag state accordingly. Completely invisible to the OS, perfectly smooth.
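The renderer-side half of the poller — deciding whether the polled cursor position lands on an interactive element — is a plain point-in-rect test. A sketch, not the GhostOps source:

```javascript
// Given the cursor point forwarded over IPC and the bounding rects of the
// overlay's interactive elements, decide whether the overlay should react.
// Rects use { x, y, width, height } in logical screen coordinates.
function hitTest(cursor, rects) {
  return rects.some(
    (r) =>
      cursor.x >= r.x &&
      cursor.x < r.x + r.width &&
      cursor.y >= r.y &&
      cursor.y < r.y + r.height
  );
}

// The main process would feed this every ~30 ms, roughly:
//   setInterval(() => {
//     win.webContents.send('cursor', screen.getCursorScreenPoint());
//   }, 30);
```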
### The Overlay Input Bug
Recording a workflow leaves the overlay in an `overlayActive = true` state (the bar is collapsed but technically shown). The next ⌘ + Shift + Space hits an early-return guard — `if (overlayActive && !overlayClosing) return` — and silently does nothing. The user stares at a blank screen wondering if the app crashed.

The fix: call `forceResetCommandOverlay()` before `showCommandOverlay()` in the hotkey handler, wiping all state so the bar always appears cleanly.
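The guard and the fix can be modeled as a tiny state machine. Only `overlayActive`, `overlayClosing`, and the two function names come from the post; the rest is a sketch:

```javascript
// Minimal model of the overlay state bug and its fix.
const state = { overlayActive: false, overlayClosing: false };

function forceResetCommandOverlay() {
  state.overlayActive = false;
  state.overlayClosing = false;
}

function showCommandOverlay() {
  // The early-return guard that swallowed the hotkey when a recording
  // left the overlay "active" but collapsed.
  if (state.overlayActive && !state.overlayClosing) return false;
  state.overlayActive = true;
  return true; // bar shown
}

function onHotkey() {
  forceResetCommandOverlay(); // the fix: wipe stale state first
  return showCommandOverlay();
}
```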
### Screen Resolution in the Vision Loop
The computer-use agent takes a screenshot and asks Gemini to return bounding-box coordinates for the target element. But screenshots are captured at native Retina resolution (4000×1440 on my machine) while coordinates need to match the logical screen space pyautogui expects. Getting the coordinate-space mapping right — accounting for device pixel ratio, window scaling, and multi-monitor offsets — took more iteration than any other part of the project.
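The core of the mapping is dividing by the device pixel ratio and adding the display's logical origin. A sketch with assumed parameter names; `display` mirrors the shape of Electron's `Display` object (`bounds`, `scaleFactor`):

```javascript
// Convert a bounding-box point from native screenshot pixels to the logical
// coordinates pyautogui expects. `display.bounds` is the display's origin in
// logical space (nonzero on secondary monitors); `scaleFactor` is the device
// pixel ratio (2 on Retina).
function screenshotToLogical(px, py, display) {
  return {
    x: display.bounds.x + px / display.scaleFactor,
    y: display.bounds.y + py / display.scaleFactor,
  };
}
```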
### Gemini Live API Latency
The voice session uses a persistent WebSocket to the Gemini Live API. The challenge was threading: the audio capture runs in one thread, the agent dispatch runs in another, and the overlay renders in the Electron main thread. Getting all three to communicate without deadlocks — especially when an agent is mid-execution and a new voice command arrives — required careful session state management.
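One common pattern for the "new voice command arrives while an agent is mid-execution" case is a generation counter: each new command bumps the counter, and results from a superseded generation are dropped. A sketch of the pattern, not the GhostOps implementation:

```javascript
// Session state shared between voice dispatch and agent execution.
let generation = 0;

// Run an agent task; if a newer command arrives while it executes,
// discard its result instead of racing to update the UI.
async function runAgent(task, execute) {
  const myGen = ++generation;            // a new command supersedes any in flight
  const result = await execute(task);
  if (myGen !== generation) return null; // superseded: drop the stale result
  return result;
}
```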
## 📚 What I Learned
- Gemini 2.5 Flash is genuinely good at screen understanding. Give it a screenshot and ask "where is the New Tab button?" — it finds it. This makes the computer-use loop surprisingly reliable.
- Multimodal routing is more powerful than text routing. The `request_screen_context` agent pre-sends the screen to Gemini before routing, so "open this repo in Cursor" works even if you never typed the repo name — Gemini reads it off the screen.
- The Gemini Live API changes how interaction feels. Text input has latency you notice. Streaming audio feels like talking to someone in the room.
- Transparent overlays are deep OS-level territory. Between window types, z-order, focus management, and accessibility permissions, building something that truly sits above all other windows without breaking normal desktop behavior is a surprisingly complex systems problem.
## 🔮 What's Next
- Proactive heartbeats — GhostOps checks in on you based on what it last knew you were working on
- Browser workflow recording — capture Playwright traces, not just desktop screenshots
- Multi-user Firestore isolation — proper per-user memory with auth
- Windows support — the pyautogui + Electron stack is already cross-platform
## Built With
- electron
- gemini
- google-cloud-run
- javascript
- pyautogui
- python
- websockets