Inspiration
I'd been using Gemini Live on my Android phone a lot — just talking to it, getting instant natural replies, super conversational. And one day I was sitting at my MacBook clicking through tabs, dragging windows around, copy-pasting between apps, and I thought: what if my Mac could do this? Not just chat — actually see my screen and control it. The Gemini app on Android doesn't really control your phone, it has its own internal tools. But what if I gave an AI a mouse, a keyboard, and a live view of the screen? It should be able to do everything I do.
What it does
Aura is a native macOS app you talk to. It sees your screen, hears you, and can control your Mac — clicking, typing, scrolling, opening apps, filling forms, navigating the web. You just talk to it like a person.
"Open Safari and search for flights to Tokyo." "What's in this spreadsheet?" It looks at the screen, figures out what's going on, and does it. No typing prompts. No describing what you see. You just talk and it acts. It has real-time screen capture, mouse and keyboard control, voice conversation through Gemini Live's native audio, JavaScript injection for browser automation, and it remembers things about you across sessions. It's not perfect — mouse targeting still needs the right guidance and some tasks need a nudge in the right direction — but when it works, it genuinely feels like talking to someone who can use your computer for you.
How I built it
I went with Rust because I needed near-native speed for real-time audio and screen capture. That decision almost killed the project early — there's no official Google SDK for Rust. The community-maintained one I found didn't have what I needed. So I ended up building the Gemini Live API WebSocket client from scratch, figuring out the protocol from pretty sparse docs and a lot of trial and error.
The whole thing is a Rust workspace with 10 crates — audio capture, playback, screen capture, Gemini WebSocket client, AppleScript bridge, accessibility reader, mouse/keyboard injection via CoreGraphics, SQLite memory, a Firestore client, and an orchestrator daemon that wires everything together. The frontend is a SwiftUI app that talks to the Rust daemon over Unix domain sockets.
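The daemon/frontend split can be sketched in a few lines of std-only Rust. The socket path and the newline-delimited framing here are illustrative assumptions, not Aura's actual protocol:

```rust
use std::io::{BufRead, BufReader, Write};
use std::os::unix::net::{UnixListener, UnixStream};
use std::thread;

// Hypothetical socket path; the real app's path and message format may differ.
const SOCK: &str = "/tmp/aura-demo.sock";

// Daemon side: accept connections, read one newline-delimited request,
// reply with an acknowledgement.
fn serve(listener: UnixListener) {
    for stream in listener.incoming().flatten() {
        let mut reader = BufReader::new(stream.try_clone().unwrap());
        let mut stream = stream;
        let mut line = String::new();
        if reader.read_line(&mut line).is_ok() {
            let _ = writeln!(stream, "ack:{}", line.trim_end());
        }
    }
}

// Frontend side: connect, send a request, wait for the reply.
fn request(msg: &str) -> std::io::Result<String> {
    let mut stream = UnixStream::connect(SOCK)?;
    writeln!(stream, "{msg}")?;
    let mut reply = String::new();
    BufReader::new(stream).read_line(&mut reply)?;
    Ok(reply.trim_end().to_string())
}

fn main() -> std::io::Result<()> {
    let _ = std::fs::remove_file(SOCK); // clear a stale socket from a previous run
    let listener = UnixListener::bind(SOCK)?;
    thread::spawn(move || serve(listener));
    println!("{}", request("click {\"x\":100,\"y\":200}")?);
    Ok(())
}
```

In the real app the SwiftUI side plays the client role; Unix domain sockets keep the IPC local-only with filesystem permissions for free.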
On the cloud side there's a Cloud Run proxy for multi-device auth, and a Python memory agent built on Google's Agent Development Kit running on Cloud Run. The ADK agents handle three jobs — an ingest agent that extracts memories from conversation sessions, a consolidate agent that finds patterns across memories, and a query agent that retrieves relevant context when Aura needs it. All backed by Cloud Firestore for persistent storage.
I also set up full CI/CD with GitHub Actions, staging and prod environments, Firestore rules, and automated deployments. I probably spent more time on that than I expected, but I wanted it to feel like a real product, not just a demo.
Built the whole thing solo, with Claude Code helping me through the worst debugging sessions.
Challenges I ran into
Finding out there's no Rust SDK for Gemini was the first surprise. Then I discovered the Live API sends Binary WebSocket frames, not Text — my code was silently ignoring every message from the server. Spent hours wondering why nothing worked before I figured that out.
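The fix itself was tiny once found: match the Binary arm, not just Text. A minimal sketch, using a simplified stand-in for tungstenite's Message enum (the real enum also has Ping, Pong, Close, and Frame variants):

```rust
// Simplified stand-in for tokio-tungstenite's `Message` enum.
enum Message {
    Text(String),
    Binary(Vec<u8>),
}

/// Extract the server payload. The Live API sends its JSON inside *Binary*
/// WebSocket frames, so code that matches only `Text` silently drops every
/// message from the server.
fn payload(msg: Message) -> Option<Vec<u8>> {
    match msg {
        Message::Text(s) => Some(s.into_bytes()),
        Message::Binary(b) => Some(b), // this is the arm that was missing
    }
}

fn main() {
    let frame = Message::Binary(br#"{"serverContent":{}}"#.to_vec());
    assert!(payload(frame).is_some());
    println!("binary frame handled");
}
```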
Mouse targeting is honestly still my biggest problem. I'm streaming screenshots at 2 FPS to keep things fast, but that means Gemini is working with approximate visual context. Getting it to click the right button, the right link — consistently — is really hard. I built a vision oracle that refines Gemini's approximate coordinates into precise click targets, an accessibility tree reader that can find and click UI elements by label and role, and a post-action verification pipeline that checks if the click actually landed. And it's still not perfect. This one kept me up for nights straight, maybe four hours of sleep each night, and Aura still wouldn't click the right place. That was my lowest point. I genuinely thought about giving up. I was pretty close to abandoning the project as a whole.
The other big one is barge-in. Gemini's API is designed for you to stream audio continuously and it handles interruption server-side. But it has no echo cancellation. So on a MacBook with speakers, Aura hears its own voice through the mic and interrupts itself. Google's answer is literally "use headphones." I had to build client-side energy gating as a workaround — calibrate ambient noise on startup, apply a threshold multiplier during playback, require consecutive frames above threshold to filter transients. It works but it's a constant battle between catching real speech and ignoring false triggers.
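The gating logic is roughly this shape; the multiplier and frame counts below are illustrative guesses, not Aura's calibrated values:

```rust
/// Client-side barge-in gate: only treat mic input as user speech when frame
/// energy exceeds a multiple of the calibrated ambient floor for several
/// consecutive frames (so clicks and transients don't trigger it).
struct EnergyGate {
    ambient_rms: f32,     // calibrated at startup from a stretch of silence
    multiplier: f32,      // how far above ambient counts as speech during playback
    required_frames: u32, // consecutive loud frames needed to confirm speech
    streak: u32,
}

impl EnergyGate {
    fn new(ambient_rms: f32) -> Self {
        Self { ambient_rms, multiplier: 4.0, required_frames: 3, streak: 0 }
    }

    fn rms(frame: &[i16]) -> f32 {
        let sum: f64 = frame.iter().map(|&s| (s as f64) * (s as f64)).sum();
        ((sum / frame.len().max(1) as f64) as f32).sqrt()
    }

    /// Returns true once speech is confirmed and playback should be interrupted.
    fn is_user_speech(&mut self, frame: &[i16]) -> bool {
        if Self::rms(frame) > self.ambient_rms * self.multiplier {
            self.streak += 1;
        } else {
            self.streak = 0; // one quiet frame resets the streak
        }
        self.streak >= self.required_frames
    }
}

fn main() {
    let mut gate = EnergyGate::new(100.0);
    let loud = vec![2000i16; 160]; // one 10 ms frame at 16 kHz
    assert!(!gate.is_user_speech(&vec![50i16; 160]));
    assert!(!gate.is_user_speech(&loud)); // 1 loud frame: not yet
    assert!(!gate.is_user_speech(&loud)); // 2 loud frames: not yet
    assert!(gate.is_user_speech(&loud));  // 3 consecutive: barge in
}
```

The trade-off lives in those two constants: raise the multiplier and you miss soft speech; lower it and the speaker output leaks through as false triggers.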
macOS Cocoa FFI in Rust was also a minefield. Things crash silently with no error messages. NSApp is a global variable, not an Objective-C class, so class!(NSApp) just crashes with nothing useful in the logs. Menu bar click handlers need specific event mask bitflags. Every one of these cost hours of debugging with zero guidance, since there are no Rust examples to reference.
Accomplishments that I'm proud of
The first time Aura actually replied to me — with personality, sounding like a real person — I couldn't stop grinning. It felt like I was actually talking to my Mac.
It's a 17MB native app. No Electron, no web wrapper. It can actually click buttons, fill forms, run scripts, navigate websites. Not just chat about doing it. Feels like it belongs on macOS.
The audio pipeline runs end-to-end in real-time — mic capture at 48kHz, resampled to 16kHz, streamed over WebSocket, Gemini responds with native audio at 24kHz, and it plays back through speakers with a 40ms pre-buffer. The whole loop feels like a conversation, not a request-response cycle.
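The 48 kHz to 16 kHz step in that pipeline is an integer 3:1 decimation. A minimal sketch of the idea (a real resampler would low-pass filter first to avoid aliasing; this just averages each group of three samples):

```rust
/// Naive 48 kHz -> 16 kHz downsampler: average each group of 3 samples.
/// Any trailing partial group is dropped by chunks_exact.
fn downsample_3to1(input: &[i16]) -> Vec<i16> {
    input
        .chunks_exact(3)
        .map(|c| ((c[0] as i32 + c[1] as i32 + c[2] as i32) / 3) as i16)
        .collect()
}

fn main() {
    let mic_48k = vec![300i16; 480]; // 10 ms of audio at 48 kHz
    let out_16k = downsample_3to1(&mic_48k);
    assert_eq!(out_16k.len(), 160);  // the same 10 ms at 16 kHz
    assert_eq!(out_16k[0], 300);
}
```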
Screen awareness actually works — Aura sees what you see, tracks changes with perceptual hashing so static screens cost zero tokens, and drains stale frames so Gemini always gets the latest view. It can read accessibility trees, inject JavaScript into browsers, run AppleScript, and verify its own actions landed correctly.
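The perceptual-hashing trick fits in a few lines. This sketch assumes a dHash-style approach over a downscaled grayscale frame; Aura's actual algorithm and thresholds may differ:

```rust
/// Difference hash (dHash) over a tiny 9x8 grayscale thumbnail: each bit
/// records whether a pixel is brighter than its right neighbor. A static
/// screen hashes identically frame after frame, so unchanged frames can
/// be skipped before they cost any tokens.
fn dhash_9x8(gray: &[u8]) -> u64 {
    assert_eq!(gray.len(), 9 * 8); // 9 columns x 8 rows, downscaled elsewhere
    let mut hash = 0u64;
    for row in 0..8 {
        for col in 0..8 {
            let left = gray[row * 9 + col];
            let right = gray[row * 9 + col + 1];
            hash = (hash << 1) | u64::from(left > right);
        }
    }
    hash
}

/// Frames whose hashes differ in only a few bits count as "the same screen".
fn changed(prev: u64, next: u64, max_bits: u32) -> bool {
    (prev ^ next).count_ones() > max_bits
}

fn main() {
    let flat = [128u8; 72];
    let mut edited = flat;
    edited[40] = 255; // one bright pixel flips at most a couple of hash bits
    let (a, b) = (dhash_9x8(&flat), dhash_9x8(&edited));
    assert!(!changed(a, b, 4)); // below threshold: skip sending this frame
}
```

The Hamming-distance threshold is what makes it robust: compression noise moves a bit or two, while a real screen change flips many.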
The memory system spans local and cloud — SQLite for fast session storage on device, Firestore for cross-device persistence, and three ADK agents on Cloud Run that extract, consolidate, and query memories. Aura actually remembers things about you between sessions.
Session resumption means Aura can reconnect after a network drop without losing conversation context. Proactive session rotation prevents the latency degradation that happens in long-lived Gemini audio sessions. Exponential backoff with jitter, stale handle detection, graceful degradation when permissions are missing — the kind of reliability stuff that nobody sees but everything breaks without.
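The reconnect delay is the classic exponential-backoff-with-full-jitter pattern. A sketch, where the base delay, cap, and the tiny xorshift stand-in for a real RNG are all illustrative, not Aura's actual values:

```rust
/// Exponential backoff with full jitter: the ceiling doubles each attempt up
/// to a cap, and the actual sleep is a random fraction of it, so clients that
/// dropped at the same moment don't all reconnect in sync.
fn backoff_ms(attempt: u32, seed: &mut u64) -> u64 {
    const BASE_MS: u64 = 250;
    const CAP_MS: u64 = 30_000;
    let ceiling = CAP_MS.min(BASE_MS.saturating_mul(1u64 << attempt.min(20)));
    // xorshift64: a minimal PRNG, good enough for jitter (not for crypto).
    *seed ^= *seed << 13;
    *seed ^= *seed >> 7;
    *seed ^= *seed << 17;
    *seed % (ceiling + 1)
}

fn main() {
    let mut seed = 0x9E3779B97F4A7C15u64;
    for attempt in 0..6 {
        let delay = backoff_ms(attempt, &mut seed);
        assert!(delay <= 30_000);
        println!("attempt {attempt}: sleep {delay} ms");
    }
}
```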
The whole thing ships as one self-contained macOS app with full CI/CD — automated builds, tests, staging and prod deployments to Cloud Run, release DMG generation, all through GitHub Actions.
What I learned
- "How hard could it be?" is always the wrong question. Every layer had hidden complexity that only showed up after I started building.
- The simplest pipeline wins. I kept adding buffers, filters, gates, smoothing to the audio path. Every one added latency. The best optimizations were removing things, not adding them.
- Rust is incredible for this kind of work but it's lonely — no SDK, no examples. You figure it out yourself.
- Building a product is way different from building a feature. The CI/CD, environments, permissions, error handling, graceful degradation — that's like 60% of the work and none of the glory.
What's next for Aura
Better mouse targeting is number one — that's the thing holding it back the most. After that, WebRTC for proper echo cancellation so barge-in works reliably without headphones. Honestly, as Gemini gets faster at processing video and native audio improves, Aura just gets better for free. The architecture is ready for it.
Built With
- accessibility-api
- applescript/jxa
- cloud-build
- cocoa/appkit
- docker
- fastapi
- firebase
- firestore
- gemini-live-api-(v1beta-native-audio)
- github
- google-agent-development-kit-(adk)
- google-cloud-platform-(cloud-run)
- macos
- macos-coregraphics
- python
- rust
- screencapturekit
- secret-manager
- sqlite
- swift/swiftui
- webrtc-vad
- websockets-(tokio-tungstenite)