Built for the DigitalOcean Gradient AI Hackathon

A transparent, always-on desktop overlay powered by DigitalOcean Gradient AI that sees, hears, speaks, and acts.
GhostOps is a transparent Electron overlay that sits invisibly above every window on your desktop. Press one shortcut and it appears — ready to answer questions, annotate your screen, control your computer, automate browser tasks, or learn and replay your entire workflows.
All AI inference is powered by DigitalOcean Gradient AI's serverless inference — routing decisions, vision understanding, and tool calling all run through DO's OpenAI-compatible endpoint with models like llama3.3-70b-instruct and openai-gpt-4o.
It's not a chatbot in a window. It is the window.
You press Cmd+Shift+Space
|
v
+--------------------------------------+
| Hey Kanishkha -- what do you need? | <-- floating over your real screen
+--------------------------------------+
|
You type: "watch me set up this repo"
|
v
GhostOps records every action you take
Then replays it perfectly on any machine
Watch the demo video — Screen annotation -> CLI control -> Mouse automation -> Workflow learning
GhostOps is built on DigitalOcean Gradient AI as its primary AI infrastructure:
| DO Gradient AI Feature | How GhostOps Uses It |
|---|---|
| Serverless Inference API | All LLM calls route through https://inference.do-ai.run/v1/ — text generation, vision analysis, and function calling |
| Model Access Keys | Authentication via DO Model Access Keys for secure, scoped API access |
| llama3.3-70b-instruct | Powers the intelligent agent router — classifies user intent and delegates to the right specialist agent |
| openai-gpt-4o (via DO) | Vision model for screenshot analysis, screen understanding, element detection, and GUI automation |
| OpenAI-Compatible API | Drop-in integration via the standard openai Python SDK, pointing at DO's inference endpoint |
| Multi-Model Catalog | Access to 30+ models (Claude, GPT, Llama, DeepSeek, Nemotron) through a single endpoint |
| App Platform | Backend API deployed on DO App Platform — auto-scaling, zero-ops hosting (Live) |
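As a concrete sketch of the OpenAI-compatible integration described above, here is a stdlib-only client against the serverless inference endpoint. The helper names (`build_request`, `chat`) are ours, not from the repo; the payload shape follows the standard `/v1/chat/completions` contract.

```python
import json
import os
import urllib.request

DO_ENDPOINT = "https://inference.do-ai.run/v1/chat/completions"

def build_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_completion_tokens": max_tokens,
    }

def chat(model: str, prompt: str) -> str:
    """POST the payload, authenticated with a DO Model Access Key."""
    req = urllib.request.Request(
        DO_ENDPOINT,
        data=json.dumps(build_request(model, prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['GRADIENT_MODEL_ACCESS_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read())["choices"][0]["message"]["content"]

if __name__ == "__main__" and os.environ.get("GRADIENT_MODEL_ACCESS_KEY"):
    print(chat("llama3.3-70b-instruct", "Say hello in five words."))
```

In practice GhostOps uses the `openai` SDK pointed at the same `base_url`, so the request body is identical either way.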
User Input (text or voice)
|
v
+---------------------------------------------------+
| DO Gradient AI - Serverless Inference |
| https://inference.do-ai.run/v1/ |
| |
| llama3.3-70b --> Agent Router (classify intent) |
| openai-gpt-4o --> Vision (screen understanding) |
| llama3.3-70b --> CLI Agent (command generation) |
+--------+------------------------------------------+
|
v
+---------------------------------------------------+
| DO App Platform — Backend API |
| https://clownfish-app-dqd9h.ondigitalocean.app |
| FastAPI (vision, memory, health) |
+--------+------------------------------------------+
|
v
+---------------------------------------------------+
| Local Desktop Agent (Python + Electron) |
| pyautogui (mouse/KB) + Playwright (browser) |
| Electron overlay (transparent, always-on-top) |
+---------------------------------------------------+
+-------------------------------------------------------------------------+
| USER'S DESKTOP |
| |
| +------------------------------------------------------------------+ |
| | ELECTRON OVERLAY (always on top) | |
| | Transparent, focusable-on-demand panel | |
| | Canvas: bounding boxes, dots, annotation text | |
| | Command bar: text input + voice mic + drag handle | |
| | Status bubbles: real-time task progress | |
| +---------------------+--------------------------------------------+ |
| | WebSocket (ws://127.0.0.1:PORT) |
+-------------------------+------------------------------------------------+
|
+--------------------------+-----------------------------------------------+
| PYTHON CORE (app.py) |
| |
| +------------------------------------------------------------------+ |
| | MULTI-AGENT ROUTER (DO Gradient AI) | |
| | llama3.3-70b classifies intent, delegates to specialists | |
| | | |
| | +----------+ +----------+ +----------+ +----------+ | |
| | | answer | | annotate | | control | | browse | | |
| | | directly | | screen | | computer | | web | | |
| | +----------+ +----------+ +----------+ +----------+ | |
| | +----------+ +----------+ +----------+ +----------+ | |
| | | run_shell| | read | | workflow | | workflow | | |
| | | command | | screen | | record | | replay | | |
| | +----------+ +----------+ +----------+ +----------+ | |
| +------------------------------------------------------------------+ |
| |
| +------------------+ +------------------+ +-------------------+ |
| | DO Gradient AI | | DO App Platform | | Google Cloud | |
| | Serverless | | Backend API | | +- Firestore | |
| | Inference | | (FastAPI) | | | (memory) | |
| | (vision + text) | | /vision /memory | | +- Gemini Live | |
| +------------------+ +------------------+ | | (voice) | |
| +-------------------+ |
+--------------------------------------------------------------------------+
| Feature | Description | Example Command |
|---|---|---|
| Direct Q&A | Instant answers via DO inference | "what is 42 x 37" |
| Screen Annotation | Floating bounding boxes over live UI | "what's on my screen" |
| Computer Use | Sees screen, moves cursor, clicks | "click the new note button" |
| CLI Control | Shell commands, file ops, open apps | "open notion" |
| Browser Agent | Full Playwright web automation | "search google for X" |
| Screen Context | Reads screen then acts on what it sees | "open this repo in Cursor" |
| Voice Input | STT via mic button | Click mic in overlay |
| Workflow Record | Watch user, extract steps | "watch me" |
| Workflow Replay | Replay saved workflows via vision | "replay my-workflow" |
| Memory | Firestore session memory across restarts | Auto on startup |
| Personalized | Name-aware, personality-driven responses | settings.json |
Every input is routed by llama3.3-70b-instruct on DigitalOcean Gradient AI to the right specialist:
User Input
|
v
+-----------------------------------------------------+
| ROUTER (llama3.3-70b via DO Gradient AI) |
+--+----------+----------+----------+----------+------+
| | | | |
v v v v v
direct screen cua_cli cua_vision browser
response annotator (shell) (mouse+KB) (playwright)
| | | | |
| bounding open -a go_to_ navigate
| boxes Notion element click
| + labels git clone click_left fill form
v | ls ~/ type_str submit
answer overlay | | |
text terminal cursor chrome
output moves
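A minimal sketch of the classification step above, using the route names from the diagram. The prompt wording and the parsing fallback are illustrative, not the repo's actual code in models/models.py:

```python
# Routes taken from the router diagram; "direct" is the safe default.
ROUTES = ["direct", "screen", "cua_cli", "cua_vision", "browser"]

ROUTER_PROMPT = (
    "Classify the user's request into exactly one route: "
    + ", ".join(ROUTES)
    + ". Reply with the route name only.\n\nRequest: {request}"
)

def parse_route(model_reply: str) -> str:
    """Normalize the model's reply to a known route, defaulting to 'direct'."""
    token = model_reply.strip().lower().split()[0].strip(".,")
    return token if token in ROUTES else "direct"
```

Defaulting to `direct` means a malformed model reply degrades into a plain chat answer rather than an unintended desktop action.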
The standout feature. GhostOps watches you work and learns to replicate it:
RECORD EXTRACT REPLAY
------ ------- ------
User: "watch me" Last frame -> Gemini For each step:
| vision -> |
v JSON steps: v
Screenshot every 2s [{ VisionAgent.execute(
+ action: "click", "click the New Page
voice transcription target: "New Page btn", button"
captured into frames value: "" )
| }, ...] |
v | v
User: "remember this Saved to Screenshot ->
as new-page" Firestore + find element ->
local cache move cursor ->
click ->
verify -> next step
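The extracted steps in the middle column can be sketched as a small schema. Field names (`action`, `target`, `value`) come from the diagram; the real serialization in agents/workflow/engine.py may differ.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class WorkflowStep:
    action: str   # e.g. "click", "type"
    target: str   # e.g. "New Page btn"
    value: str = ""  # text to type, if any

@dataclass
class Workflow:
    name: str
    steps: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize for Firestore and the local cache."""
        return json.dumps(
            {"name": self.name, "steps": [asdict(s) for s in self.steps]}
        )

wf = Workflow("new-page")
wf.steps.append(WorkflowStep(action="click", target="New Page btn"))
```

On replay, each step becomes a natural-language instruction for the vision agent ("click the New Page button"), so the recording survives layout changes between machines.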
| Layer | Technology | Purpose |
|---|---|---|
| AI Inference | DigitalOcean Gradient AI (Serverless) | All LLM routing, text generation, vision analysis, function calling |
| Models | llama3.3-70b-instruct, openai-gpt-4o (via DO) | Agent routing, screen understanding, command generation |
| Overlay UI | Electron 35 + HTML Canvas | Transparent, always-on-top, cross-workspace overlay |
| IPC | WebSocket (Python <-> Electron) | Low-latency bidirectional drawing commands |
| Voice | Gemini Live API (2.5 Flash) | Real-time streaming audio I/O |
| Browser | Playwright + browser-use | Reliable cross-browser automation |
| Screenshot | PIL ImageGrab + mss | macOS-native screen capture |
| Mouse/KB | pyautogui | Cross-platform desktop control |
| Memory | Google Cloud Firestore | Real-time, serverless, persistent sessions |
| Backend | FastAPI on DO App Platform | Auto-scaling, zero-ops hosting on DigitalOcean |
| TTS | ElevenLabs (optional) | Natural voice output |
| Language | Python 3.13 + Node.js 18+ | Backend + UI |
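For the IPC row above, here is one way the Python core could encode a drawing command as a JSON text frame for the overlay. The message fields are illustrative; the real protocol is defined by ui/server.py and renderer.js.

```python
import json

def bounding_box_msg(x: int, y: int, w: int, h: int, label: str) -> str:
    """Serialize one overlay drawing command for the canvas renderer."""
    return json.dumps({
        "type": "bounding_box",
        "rect": {"x": x, "y": y, "w": w, "h": h},
        "label": label,
    })
```

Keeping commands as small self-describing JSON messages is what makes the WebSocket hop low-latency: the renderer just dispatches on `type` and draws.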
| Model (on DO) | Used For | Notes |
|---|---|---|
| llama3.3-70b-instruct | Agent routing + CLI command generation | Fast, accurate classification and text gen |
| openai-gpt-4o | Vision: screenshots, element detection, GUI automation | Multimodal, understands screen layouts |
| gemini-2.5-flash | Voice sessions (Gemini Live API) | Real-time bidirectional audio streaming |
All inference calls go through:
POST https://inference.do-ai.run/v1/chat/completions
Authorization: Bearer $GRADIENT_MODEL_ACCESS_KEY
The DO provider (`core/do_provider.py`) is a drop-in replacement using the OpenAI-compatible API, with full support for:
- Text generation (`generate_text`)
- Vision analysis (`generate_vision`)
- Vision + function calling (`generate_vision_with_tools`)
- Audio transcription (fallback to Groq Whisper)
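For vision calls, screenshot bytes are inlined as a base64 data URL in an OpenAI-style multimodal message, which keeps them in memory end to end. The helper name below is ours; the exact code in `core/do_provider.py` may differ.

```python
import base64

def vision_messages(prompt: str, png_bytes: bytes) -> list:
    """One user turn containing text plus an inline screenshot."""
    data_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }]
```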
| Requirement | Version |
|---|---|
| macOS | 12+ (Monterey or later) |
| Python | 3.13+ |
| Node.js | 18+ |
| uv | latest |
| DigitalOcean account | Sign up for $200 free credits |
git clone https://github.com/jkanishkha0305/ghostops.git
cd ghostops

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create venv and install dependencies
uv venv .venv --python 3.13
source .venv/bin/activate
uv pip install -r requirements.txt

cd ui
npm install
cd ..

cp .env.example .env

Edit .env and fill in your keys:
# DigitalOcean Gradient AI (Required)
GRADIENT_MODEL_ACCESS_KEY=your_do_model_access_key
# DigitalOcean API Token
DIGITAL_OCEAN_KEY=your_digitalocean_api_token
# Google (for Gemini Live API voice + Firestore)
GEMINI_API_KEY=your_gemini_api_key
GOOGLE_CLOUD_PROJECT=your_gcp_project_id
# Optional
ELEVENLABS_API_KEY=your_elevenlabs_key # for voice output
CLOUD_RUN_URL=https://your-service.run.app # for persistent memory
FIRESTORE_SESSION_ID=your_username

Get your DO Model Access Key:
- Go to DigitalOcean Control Panel
- Navigate to Gradient AI Platform -> Serverless Inference
- Scroll to Model Access Keys -> Create Access Key
Or via API:
curl -X POST "https://api.digitalocean.com/v2/gen-ai/models/api_keys" \
  -H "Authorization: Bearer $DIGITAL_OCEAN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "ghostops"}'
Edit settings.json:
{
"user_name": "YourName",
"agent_name": "GhostOps",
"personalization": "Be concise and slightly witty."
}

- System Settings -> Privacy & Security -> Screen Recording -> add Terminal + Electron
- System Settings -> Privacy & Security -> Accessibility -> add Terminal + Electron
- System Settings -> Privacy & Security -> Microphone -> add Electron (for voice input)
Open two terminals:
Terminal 1 -- Python backend:
source .venv/bin/activate
python app.py

Terminal 2 -- Electron overlay:
cd ui
npm run dev

You should see:
Models loaded - using DO Gradient AI serverless inference
Visualization server listening at ws://127.0.0.1:XXXX
Overlay client connected.
| Shortcut | Action |
|---|---|
| Cmd + Shift + Space | Show/hide the command overlay |
| Cmd + Shift + C | Stop all running tasks immediately |
| Cmd + Shift + M | Toggle TTS mute |
| Escape | Dismiss overlay |
| Enter | Submit command |
Direct answers
"what's the square root of 144"
"explain what a webhook is"
Screen annotation
"what's on my screen"
"explain what I'm looking at"
"point to the settings icon"
CLI tasks
"open notion"
"create a folder called projects on my desktop"
"what's my local IP address"
Computer use (app must be open)
"click the new note button"
"type hello world in the search bar"
"calculate 18% tip on $84 using the calculator"
Browser automation
"search google for best coffee shops near me"
"go to github.com and search for electron"
Workflow learning
# Start recording
"watch me"
# Do your workflow manually (GhostOps records every 2 seconds)
# Save it
"remember this as setup-project"
# Replay any time
"replay setup-project"
When GhostOps performs a computer-use task, this is the per-step loop:
+-------------------------------------------------------------+
| SINGLE CALL VISION ENGINE |
+----------------------------+--------------------------------+
|
+---------------v---------------+
| 1. Capture screenshot |
| (active window or full) |
+---------------+---------------+
|
+---------------v---------------+
| 2. Send to openai-gpt-4o |
| via DO Gradient AI |
| with task + tool schema |
+---------------+---------------+
|
+---------------v---------------+
| 3. Model returns tool calls: |
| go_to_element(bbox) |
| click_left_click() |
| type_string("hello") |
| press_ctrl_hotkey("s") |
| task_is_complete() |
+---------------+---------------+
|
+---------------v---------------+
| 4. Execute tool calls |
| (pyautogui / subprocess) |
+---------------+---------------+
|
+---------------v---------------+
| 5. Loop detection: |
| same action x3 -> stop |
| same click x5 -> fallback |
+---------------+---------------+
|
task_is_complete?
YES --+ NO -> back to step 1
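The loop-detection rule in step 5 can be sketched as a small state machine. The thresholds (same action x3 stops, same click x5 falls back) are taken from the diagram; the real logic lives in agents/cua_vision/single_call.py and may differ in detail.

```python
from collections import deque

class LoopDetector:
    """Flag repeated identical actions so a stuck vision loop halts."""

    def __init__(self):
        self.history = deque(maxlen=5)  # sliding window of recent actions

    def record(self, action: str) -> str:
        """Return 'stop', 'fallback', or 'ok' after recording an action."""
        self.history.append(action)
        run = list(self.history)
        if action.startswith("click"):
            # Clicks get more slack (x5) before triggering a fallback.
            if len(run) >= 5 and all(a == action for a in run):
                return "fallback"
        elif len(run) >= 3 and all(a == action for a in run[-3:]):
            # Any other action repeated 3x in a row means we are stuck.
            return "stop"
        return "ok"
```

Without a guard like this, a mis-detected element would have the model clicking the same pixel forever, burning inference calls.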
ghostops/
|
+-- app.py <-- Main entry point
+-- settings.json <-- User config (name, personality, models)
+-- .env <-- API keys (never committed)
+-- LICENSE <-- MIT License
|
+-- agents/
| +-- adk_orchestrator.py <-- Google ADK multi-agent orchestrator (9 tools)
| +-- screen/ <-- Screen annotation (bounding boxes + labels)
| | +-- agent.py
| | +-- tools.py <-- draw_bounding_box, draw_text, etc.
| | +-- prompts.py
| |
| +-- cua_vision/ <-- Computer use (sees screen -> clicks)
| | +-- agent.py <-- VisionAgent
| | +-- single_call.py <-- Vision execution loop + loop detection
| | +-- tools.py <-- go_to_element, click, type_string, etc.
| | +-- prompts.py
| |
| +-- cua_cli/ <-- Shell agent (subprocess execution)
| | +-- agent.py <-- CLIAgent, runs shell commands safely
| |
| +-- browser/ <-- Browser automation (Playwright + browser-use)
| | +-- agent.py
| |
| +-- workflow/ <-- Record + replay engine
| +-- engine.py <-- start_recording, stop_and_save, replay
|
+-- models/
| +-- models.py <-- Router + agent dispatch via DO Gradient AI
| +-- function_calls.py <-- Tool declarations for all 6 routes
| +-- prompts.py <-- Personalized system prompts
|
+-- core/
| +-- do_provider.py <-- DigitalOcean Gradient AI provider (primary)
| +-- gemini_provider.py <-- Gemini provider (voice/fallback)
| +-- groq_provider.py <-- Groq provider (STT fallback)
| +-- settings.py <-- Read/write settings.json
| +-- registry.py <-- Shared overlay state
|
+-- voice/
| +-- live_api.py <-- Gemini Live API voice session
|
+-- ui/
| +-- main.js <-- Electron main process
| +-- renderer.js <-- Canvas rendering + WebSocket client
| +-- preload.js <-- IPC bridge
| +-- index.html <-- Overlay HTML
| +-- server.py <-- WebSocket VisualizationServer
| +-- animations/ <-- UI component JS + CSS
| +-- dom_nodes/ <-- Draggable response bubbles + annotation boxes
| +-- package.json <-- Electron + forge dependencies
|
+-- backend/
| +-- main.py <-- FastAPI backend (DO App Platform)
| +-- memory.py <-- Firestore memory read/write
|
+-- .do/
| +-- app.yaml <-- DO App Platform deployment spec
|
+-- desktop/
| +-- screen.py <-- Screenshot capture (PIL/mss)
|
+-- integrations/
| +-- audio/
| +-- tts.py <-- ElevenLabs TTS
|
+-- deploy/
+-- deploy.sh <-- One-command Cloud Run deployment
The backend is deployed on DigitalOcean App Platform with auto-deploy on push:
Live URL: https://clownfish-app-dqd9h.ondigitalocean.app
The app spec is in .do/app.yaml. To deploy your own:
- Fork this repo
- Go to DigitalOcean App Platform -> Create App
- Connect your GitHub repo, select the `main` branch
- It auto-detects the `Dockerfile` and deploys
- Add env vars: `AI_PROVIDER=do`, `GRADIENT_MODEL_ACCESS_KEY`, `GEMINI_API_KEY`
Or via CLI:
doctl apps create --spec .do/app.yaml

Then set the live URL in your .env:

CLOUD_RUN_URL=https://clownfish-app-dqd9h.ondigitalocean.app

| Collection | Purpose |
|---|---|
| `sessions/{id}/turns` | Conversation memory per user |
| `workflows/{name}` | Saved workflow steps |
- No data leaves your machine except API calls to DO Gradient AI inference, DO App Platform backend, and Firestore
- Screenshots are captured in-memory and sent directly to the model -- never written to disk
- `.env` is gitignored -- API keys are never committed
- The overlay window is `focusable: false` by default -- it doesn't steal keyboard focus until summoned
- All shell commands are safety-checked -- dangerous commands (`rm -rf /`, `mkfs`, etc.) are blocked
- All inference routes through DO's secure, authenticated endpoints
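A blocklist-style safety check like the one described above might look as follows. The patterns are illustrative; CLIAgent's real list in agents/cua_cli/agent.py may be broader.

```python
import re

# Illustrative dangerous-command patterns (not the repo's actual list).
DANGEROUS = [
    r"\brm\s+-[a-z]*r[a-z]*f?\s+/(\s|$)",  # rm -rf / and close variants
    r"\bmkfs\b",                           # filesystem formatting
    r"\bdd\s+if=.*of=/dev/",               # raw disk writes
    r":\(\)\{.*\};:",                      # classic fork bomb
]

def is_safe(command: str) -> bool:
    """Reject any command matching a dangerous pattern."""
    return not any(re.search(p, command) for p in DANGEROUS)
```

A blocklist is a last line of defense, not a sandbox: pairing it with confirmation prompts for destructive-looking commands is the safer design.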
Overlay doesn't appear
# Check Python server is running
python app.py  # should print "Overlay client connected"

Mouse clicks land in wrong place
Ensure macOS Accessibility permission is granted for Terminal and the Electron app.
No audio from voice input
Check microphone permission in System Settings -> Privacy -> Microphone -> allow Electron.
DO inference errors
Verify your `GRADIENT_MODEL_ACCESS_KEY` is set correctly in `.env`. Test with:

curl -X POST https://inference.do-ai.run/v1/chat/completions \
  -H "Authorization: Bearer $GRADIENT_MODEL_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3-70b-instruct","messages":[{"role":"user","content":"hello"}],"max_completion_tokens":256}'
- Deploy agents via DO Agent Development Kit (ADK)
- DO Knowledge Bases for RAG-powered agent memory
- DO Guardrails for safe agent action filtering
- Multi-user sessions with per-user isolation
- Windows + Linux support
- Workflow sharing and export/import
- Proactive agent heartbeat (checks in on you periodically)
- DigitalOcean Gradient AI -- Serverless inference powering all AI agents
- Gemini Live API -- Real-time voice streaming
- browser-use -- Browser automation framework
- Google GenAI SDK -- Multimodal AI backbone
- Electron -- Desktop overlay framework
MIT License -- see LICENSE for details.
Built with DigitalOcean Gradient AI for the DigitalOcean Gradient AI Hackathon
GhostOps -- because the best interface is the one that's invisible