Inspiration
The chatbot is dead. Long live the Operator.
When Google announced Project Astra, we saw the future of an assistant that could see the world. When Gemini Live dropped, we saw the future of an assistant that could speak fluently.
But we noticed a massive gap: Who is actually doing the work?
We didn't want another sidebar chat window. We didn't want to copy-paste code blocks. We wanted an AI that could reach out of the LLM context window and actually touch the interface. We wanted to build the "Hands" of Gemini.
CLOVIS was born from a simple question: If Gemini can understand my screen, why can't it control it? We set out to build the bridge between the pixel and the shell—an agent that doesn't just suggest solutions, but executes them.
What it does
CLOVIS is a System-Native Spatial Agent. It is an AI operator that lives on your screen, not in a browser tab.
- It Sees & Draws: Unlike standard agents that are "blind," CLOVIS draws directly on your screen (Spatial UI) to show you what it's looking at. It highlights buttons, boxes logic gates, and visually confirms its targets before acting.
- Dual-Core Agency: CLOVIS utilizes two distinct agentic loops:
- The Vision Agent: Handles GUI interactions. It navigates Spotify, finds "Waldo" in a crowd, and adjusts macOS System Settings using mouse and keyboard injection.
- The Terminal Agent: Handles deep system tasks. It can debug code, run scripts, and manage files in the background without needing a GUI.
- The "Boomerang" UI: A novel UX pattern where the agent exists as a non-intrusive status pill when working, but expands into a full command center when needed, respecting the user's screen real estate.
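The dual-loop split above can be sketched as a small dispatcher. This is an illustrative assumption, not CLOVIS's actual code: the keyword heuristic and agent names are hypothetical stand-ins (the real router presumably asks Gemini to classify the task).

```python
# Hypothetical sketch of the Dual-Core dispatch: decide whether a task
# belongs to the Vision Agent (GUI) or the Terminal Agent (CLI).
# The keyword list is an assumption for illustration only.

TERMINAL_HINTS = ("run", "script", "debug", "file", "install")

def route(task: str) -> str:
    """Return which sub-agent should handle a natural-language task."""
    lowered = task.lower()
    if any(hint in lowered for hint in TERMINAL_HINTS):
        return "terminal"   # deep system work: shell commands, no GUI
    return "vision"         # GUI work: screenshot -> coordinates -> click
```

In practice a classification like this would come from the model itself rather than string matching, but the shape of the decision is the same.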
How we built it
We engineered a high-performance loop around Google Gemini 1.5 Pro and Flash.
- The Eyes (Perception): We built a high-speed screen capture pipeline that feeds the current OS state into Gemini's multimodal vision window.
- The Brain (Reasoning): We prompt-engineered Gemini to act as a coordinate-mapper. The model analyzes the screenshot and returns precise (x, y) coordinates for UI elements, rather than just describing them.
- The Hands (Action): These coordinates are fed into a native Python automation layer (similar to PyAutoGUI/Quartz) that injects human-like mouse movements and keystrokes.
- The Spatial Overlay: We built a transparent UI layer that sits on top of the OS (Z-Index 9999). This allows Gemini to "draw" bounding boxes and annotations in real-time, creating a HUD (Heads Up Display) for the AI's thought process.
- Agent Orchestration: A router determines if a task requires "Vision" (GUI) or "Terminal" (CLI) and delegates the task to the specialized sub-agent to minimize latency and hallucinations.
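Put together, the Eyes/Brain/Hands pipeline is a single perceive-reason-act loop. The sketch below is a minimal, hedged version of that loop: `ask_gemini` is a placeholder for the real Gemini call, and the screenshot/click callables stand in for the capture pipeline and the PyAutoGUI-style injection layer. The JSON reply shape is our assumption about how a coordinate-mapper prompt might answer.

```python
# Minimal sketch of the screenshot -> (x, y) -> click loop described above.
# All callables are injected stubs; only the JSON parsing is concrete.
import json

def parse_target(model_reply: str) -> tuple[int, int]:
    """Parse the (x, y) target the model is prompted to return as JSON."""
    payload = json.loads(model_reply)
    return int(payload["x"]), int(payload["y"])

def act_once(ask_gemini, screenshot, click):
    """One iteration: capture the screen, map a pixel target, click it."""
    frame = screenshot()          # Eyes: current OS state
    reply = ask_gemini(frame)     # Brain: screenshot -> coordinate JSON
    x, y = parse_target(reply)
    click(x, y)                   # Hands: inject the mouse action
    return x, y
```

Asking the model for structured JSON rather than free text is what makes the Hands layer mechanical: parsing either succeeds with a clickable point or fails loudly.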
Challenges we ran into
- The "Look vs. Touch" Gap: LLMs are great at describing images, but terrible at pixel-perfect coordinates. We spent days refining the coordinate-mapping prompts to ensure CLOVIS hits the "Play" button on Spotify and not the "Shuffle" button next to it.
- Latency vs. Accuracy: Real-time computer use requires low latency, but high accuracy. Balancing the speed of Gemini 1.5 Flash for UI navigation with the reasoning power of 1.5 Pro for complex logic was a constant tuning process.
- OS Permissions: macOS does not like software taking control of the mouse. We had to navigate complex Accessibility permissions to give CLOVIS the "God Mode" access it needs to function.
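The Flash-vs-Pro tuning described above amounts to a routing policy over the two public Gemini model IDs. A minimal sketch, assuming a simple heuristic (the trigger conditions are our invention, not the team's actual policy):

```python
# Sketch of the latency/accuracy trade-off: quick UI hops go to Flash,
# multi-step or code-heavy reasoning goes to Pro. The thresholds are
# illustrative assumptions.

FLASH = "gemini-1.5-flash"
PRO = "gemini-1.5-pro"

def pick_model(steps_planned: int, needs_code_reasoning: bool) -> str:
    """Favor Flash for single-click navigation, Pro for complex logic."""
    if needs_code_reasoning or steps_planned > 3:
        return PRO
    return FLASH
```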
Accomplishments that we're proud of
- The "Waldo" Test: We successfully proved that CLOVIS has pixel-perfect visual acuity by having it find and highlight a specific character in a dense "Where's Waldo" scene.
- Breaking the Browser: Most agents are stuck in Chrome. We are proud that CLOVIS can control native Desktop apps (System Settings, VS Code, Terminal) just as easily as a website.
- The "Spatial" Feedback Loop: Building the system that lets the AI draw on the screen. Seeing the green box appear around a button moments before the mouse moves to click it feels like magic—it builds immediate trust with the user.
What we learned
- Context is King: An agent without context is just a click-bot. By giving Gemini access to both the screen (pixels) and the terminal (system state), the model became exponentially more capable than when it had access to just one.
- Users Need Visuals: In "Black Box" automation, users get anxious. By having CLOVIS draw its thought process on the screen, we solved the trust gap.
- Gemini is Ready: The 1.5 models handle multimodal context (text + image + code) significantly better than previous generations, making this kind of "Visual OS Agent" finally possible.
What's next for CLOVIS
- Speed: Optimizing the screenshot-to-action loop to reach <500ms latency.
- Voice Mode: Integrating Gemini Live real-time voice API so you can talk to CLOVIS while it works, creating a true "Iron Man" experience.
- Safety Guardrails: Implementing a "Human-in-the-Loop" confirmation for high-stakes actions (like deleting files or sending messages).
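A human-in-the-loop gate like the one planned above can be sketched in a few lines. The risk markers and the `confirm` callback are illustrative assumptions, not a description of CLOVIS's implementation:

```python
# Hypothetical confirmation gate: pause before executing anything that
# looks destructive. Marker list and callback shape are assumptions.

HIGH_STAKES = ("delete", "rm ", "send", "overwrite")

def guarded_execute(command: str, execute, confirm) -> bool:
    """Run `command`, but ask the user first when it looks risky."""
    risky = any(marker in command.lower() for marker in HIGH_STAKES)
    if risky and not confirm(command):
        return False          # user vetoed: do nothing
    execute(command)
    return True
```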
