Inspiration
AI is evolving so fast that even the people building it are falling behind on the tools they use. Every one of us has felt it – a designer picking up a new 3D tool, a first-generation student opening Xcode for the first time, a grandmother trying to book a telehealth appointment in MyChart, or even a developer staring at a Kubernetes dashboard they've never used before. The bottleneck is no longer intelligence or effort; it's the time between "I want to do this on a computer" and "I figured out which button to click." Generic YouTube tutorials are frozen in time, documentation assumes you already know the vocabulary to search for, and real human tutors cost money and sleep. The gap stays open, and the people it hurts most are the people who most need the thing on the other side of it.
After playing around with Farza’s Clicky assistant on GitHub, we wanted to build the friend who sits next to you while you learn – a companion that can actually see your screen, hear your voice, and point at the exact thing you're looking for, in whatever app you happen to be in. And after reading “The Emperor Has No Clothes: How to Code Claude Code in 200 Lines of Code,” we wanted to build a Claude Code-style tool registry for live computer use, one with access to the user’s screen when granted.
What it does
Claude Cursor is an AI-native cursor companion that tutors you on any question you ask and can even build step-by-step tutorials from a YouTube URL. You hold Ctrl+Option, ask a question in plain voice, and Claude Cursor flies across your screen to the answer while Claude narrates through an ElevenLabs voice. Under the hood, AssemblyAI streams your voice as you speak, ScreenCaptureKit grabs a screenshot of every monitor, and Claude Sonnet 4.6 receives the transcript and the images in a single vision request. Claude then decides through native tool use whether to point at a single element, deploy up to eight colored sub-cursors to label an entire interface at once (explain_screen_elements), pop open a persistent markdown answer panel for long-form content, query its own memory of your past sessions, research a topic on the web, or start a full Computer Use automation loop with your consent.
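The tool-use decision above can be sketched as a small registry that maps tool names to handlers. Only `explain_screen_elements` is named in this write-up; the other tool names, the input fields, and the handler signatures are illustrative assumptions, not the actual implementation:

```typescript
// Minimal sketch of a tool registry dispatching Claude tool_use blocks.
// Tool names other than explain_screen_elements are hypothetical.
type ToolHandler = (input: Record<string, unknown>) => string;

const registry = new Map<string, ToolHandler>([
  // Move the on-screen cursor to a single element.
  ["point_at", (i) => `pointing at (${i.x as number}, ${i.y as number})`],
  // Deploy colored sub-cursors to label an entire interface at once.
  ["explain_screen_elements", (i) =>
    `deploying ${(i.labels as string[]).length} sub-cursors`],
  // Open the persistent markdown answer panel for long-form content.
  ["show_answer_panel", (i) =>
    `panel opened (${(i.markdown as string).length} chars)`],
]);

function dispatch(toolName: string, input: Record<string, unknown>): string {
  const handler = registry.get(toolName);
  if (!handler) throw new Error(`unknown tool: ${toolName}`);
  return handler(input);
}
```

Keeping every tool in one registry means the streaming client only needs a single dispatch point when a tool_use block completes.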
On top of that core loop, Claude Cursor ships a YouTube lesson extractor that turns any tutorial video into a step-by-step lesson with a picture-in-picture player that auto-seeks to the right timestamp, a tutor mode that watches silently and offers help after three seconds of user idleness, a chat window with a Granola-inspired sidebar that groups past sessions by app and even by browser tab, and a personal wiki that grows with every session and can be queried later. Every interaction is screen-aware, voice-driven, and tuned to teach rather than replace – the cursor points, the voice narrates, and you do the clicking, so skill accumulates instead of atrophying.
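The lesson extractor's auto-seek can be sketched with a simple step model: each extracted step carries a start timestamp, and the player seeks to the step covering the current position. The field names here are assumptions; the write-up only says the player auto-seeks to the right timestamp per step:

```typescript
// Hypothetical lesson-step model behind the picture-in-picture player.
interface LessonStep {
  title: string;
  startSeconds: number; // timestamp extracted from the video
}

// Given steps ordered by startSeconds, return the index of the step
// that covers playback time t (the last step whose start is <= t).
function stepAt(steps: LessonStep[], t: number): number {
  let idx = 0;
  for (let i = 0; i < steps.length; i++) {
    if (steps[i].startSeconds <= t) idx = i;
  }
  return idx;
}
```

When the user advances to a step, the player simply seeks to that step's `startSeconds`; when the video plays freely, `stepAt` keeps the highlighted step in sync.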
How we built it
The app is a Swift and SwiftUI codebase, with AppKit for the menu bar icon, floating panel, and full-screen cursor overlay. The main loop is global push-to-talk, microphone audio through a pluggable transcription stack, multi-monitor screenshots, Claude over server-sent events with all tool calls handled in one place, an animated on-screen pointer, and streamed text-to-speech, while a small router chooses whether to show on-screen navigation, a lesson, a long answer, or chat, based on signals like tools used, content type, and mode. API keys never ship in the binary: a Cloudflare Worker proxies every provider and also drives the Computer Use loop with stepwise screenshots, safety refusals that stop the run, stuck detection, and token telemetry in the logs. Long-term memory is a local markdown wiki built from session logs (scrubbed before write), then summarized and deduplicated, with retrieval ranked by weighted keyword scoring into a fixed-size context bundle. Shorter-lived operational data, such as chat segments, tutor limits, and automation stats, lives in SQLite, with session-grouping logic that batches cleanup work at the end of a session.
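The retrieval step can be sketched as weighted keyword scoring over wiki notes, packed into a fixed-size bundle. The weights (title hits worth more than body hits) and the character budget are assumptions for illustration; the write-up only specifies weighted keyword scoring and a fixed-size context bundle:

```typescript
// Sketch of weighted keyword retrieval into a fixed-size context bundle.
interface WikiNote {
  title: string;
  body: string;
}

function retrieve(notes: WikiNote[], query: string, budget: number): WikiNote[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const scored = notes
    .map((note) => {
      const title = note.title.toLowerCase();
      const body = note.body.toLowerCase();
      let score = 0;
      for (const term of terms) {
        if (title.includes(term)) score += 3; // title hits weighted higher
        if (body.includes(term)) score += 1;
      }
      return { note, score };
    })
    .filter((s) => s.score > 0)
    .sort((a, b) => b.score - a.score);

  // Pack highest-scoring notes until the fixed-size budget is spent.
  const bundle: WikiNote[] = [];
  let used = 0;
  for (const { note } of scored) {
    if (used + note.body.length > budget) break;
    bundle.push(note);
    used += note.body.length;
  }
  return bundle;
}
```

Because the bundle is capped by size rather than by note count, long notes crowd out fewer, shorter ones, which keeps the model's context predictable.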
Challenges we ran into
The hardest work was not the model calls but operating-system plumbing and streaming edge cases. A listen-only CGEvent tap is what reliably catches modifier-only shortcuts like Ctrl+Option while the app stays in the background, because the global monitor drops events under load. Reusing one long-lived URLSession for AssemblyAI streaming fixed connection-pool corruption that showed up as socket errors after rapid reconnects. Claude tool JSON arrives in fragments over server-sent events until a block completes, so the client needs an accumulator that can handle partial JSON. YouTube captions required moving off a deprecated Innertube client and routing through the Worker, with a fallback scrape of the watch page's embed data. Computer Use screenshots must match the declared tool resolution exactly, or click coordinates drift. For safety we defaulted to Anthropic's Computer Use with per-step screenshots, leaving local one-shot automation as a debug-only path, and added bundle deny lists, stopping on the first model refusal, a perceptual-hash stuck detector, and a consent UI with keyboard support and timeouts. The overlay also needed correct multi-monitor coordinates, no focus stealing, coverage across all Spaces, click handling only on the consent bubble, and smooth fades for transient mode – each of which came down to careful panel setup and a lot of iteration.
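The partial-JSON problem can be sketched as a small accumulator: buffer every fragment as it arrives and parse only once the content block completes, since a fragment may split the JSON mid-token. This is a simplified illustration, not the shipped client; error handling is reduced to a thrown parse error:

```typescript
// Sketch of an accumulator for tool input that arrives as JSON
// fragments over server-sent events.
class ToolInputAccumulator {
  private buffer = "";

  // Called for each streamed fragment; never parse here, because a
  // fragment can end in the middle of a string or number token.
  append(fragment: string): void {
    this.buffer += fragment;
  }

  // Called when the content block completes; parses the full JSON
  // and resets the buffer for the next tool call.
  finish(): Record<string, unknown> {
    const text = this.buffer === "" ? "{}" : this.buffer;
    const parsed: Record<string, unknown> = JSON.parse(text);
    this.buffer = "";
    return parsed;
  }
}
```

The empty-buffer default matters in practice: a tool that takes no arguments can complete without emitting any JSON fragments at all.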
Accomplishments that we're proud of
We are proud that we built a companion that closes that gap across any piece of software, on any screen, in any language Claude speaks, and we are proud that the interaction grammar of the product – pointing instead of clicking for you, narrating instead of executing, observing before intervening – actually teaches people while it helps them. In other words, what we are proudest of is not any single technical feat, but instead the specific humans this app was built to meet, and the moments it was built to meet them in. A small-business owner who has been paying for QuickBooks for three years and never felt confident enough to run a report. A career switcher teaching herself Figma at two in the morning between shifts. A parent with dyslexia who needed the same software walkthrough explained four different ways before it clicked. These are the exact people every interaction in Claude Cursor was designed around, and the reason the product exists at all is that none of them currently have access to the thing a well-resourced person takes entirely for granted: a patient expert sitting next to them who can see their screen and show them where to click. The skill doesn't evaporate when we leave the room; it accumulates. That is what "empowerment over replacement" means in practice, and we took it seriously every time we made a design decision. On the technical side, we are proud that Claude is the actual brain of this product rather than a feature bolted onto the side. 
We lean on the model across the full lifecycle of a session: understanding the user's screen in real time, deciding which action to take and which surface to render the answer on, running self-correcting agent loops when the user opts into automation, distilling raw session transcripts into durable long-term memory, consolidating duplicate memories so the knowledge base stays clean, greeting returning users with a recap of what they were working on last time, and turning arbitrary YouTube tutorials into structured step-by-step lessons, with the player seeking to the right timestamp for each step. The long-term memory is real: personal sessions are stripped of sensitive information, compressed into structured notes, indexed, and retrieved later with weighted relevance scoring, so the app genuinely learns the user's tools and terminology over weeks of use rather than starting from zero every time.
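The scrub-before-write step can be sketched as a pass of redaction patterns over the session text. The specific patterns below are assumptions for illustration; the write-up only says sessions are stripped of sensitive information before anything reaches the wiki:

```typescript
// Sketch of scrubbing session text before it is written to the wiki.
// These patterns are hypothetical examples, not the shipped rule set.
const redactions: [RegExp, string][] = [
  [/\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, "[email]"], // email addresses
  [/\b(?:\d[ -]?){13,16}\b/g, "[card]"],       // card-like digit runs
  [/\bsk-[A-Za-z0-9]{8,}\b/g, "[api-key]"],    // API-key-like tokens
];

function scrub(text: string): string {
  return redactions.reduce((t, [re, repl]) => t.replace(re, repl), text);
}
```

Scrubbing happens before the summarization pass, so sensitive strings never reach the model that distills the session into wiki notes.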
What we learned
We learned that shipping on a frontier model depends less on prompt tweaks than on interaction design. The biggest lever for how the product feels was choosing where each answer goes, which we treat as a small deterministic classification into the cursor overlay, lesson overlay, answer panel, or chat; narrowing the answer panel to heavier math and code content stopped the UI from feeling overbuilt for light questions. We also learned that safety for a screen-aware agent is not a bolt-on but the shape of the system: opt-in gates, per-run consent, bundle deny lists, stopping the loop on the first refusal instead of retrying around it, stuck detection, logging, and a kill switch only work as a bundle. Treating model refusals as hard stops, not errors to bypass, was the single most important safety behavior. Finally, empowerment over replacement is visible in the defaults: pointing keeps the user in control and builds habit, while full automation stays an explicit exception, which is why the product reads as teaching rather than taking over.
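The deterministic surface classification can be sketched as a few ordered rules over the session's signals. The signal names and the rule order here are illustrative assumptions; only the four surfaces and the signal categories (tools used, content type, mode) come from the write-up:

```typescript
// Sketch of the deterministic router choosing where an answer renders.
type Surface = "cursor-overlay" | "lesson-overlay" | "answer-panel" | "chat";

interface Signals {
  isLesson: boolean;         // request came from the YouTube extractor
  usedPointingTool: boolean; // a pointing tool fired during the turn
  hasHeavyContent: boolean;  // answer contains math or code blocks
  inChatMode: boolean;       // user already has the chat window open
}

function routeSurface(s: Signals): Surface {
  if (s.isLesson) return "lesson-overlay";
  if (s.usedPointingTool) return "cursor-overlay";
  // Panel is reserved for heavy content so light questions stay lightweight.
  if (s.hasHeavyContent) return "answer-panel";
  return s.inChatMode ? "chat" : "cursor-overlay";
}
```

Keeping this a plain function rather than a model decision makes the routing predictable: the same signals always land on the same surface.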
What's next for Claude Cursor
The next big step is a Claude Cursor that can act on the user’s behalf in a structured, consent-driven way: not one-off prompt dumps, but trusted routines users define once and run on a schedule or trigger, such as weekly MyChart checks, weekly bookkeeping drafts, or nightly school announcements. The hard part is ethics more than the existing agent loop, because recurring automation needs stronger guardrails than a single consented run. We expect routines to reconfirm consent on a cadence, log every run in a readable ledger, offer one-click pause or delete, keep high-risk domains off routine automation by default, and stop loudly when runs fail repeatedly, so silent breakage is not an option. Beyond that, the roadmap includes visible disclaimers when content touches medical, legal, or financial topics; a fully offline transcription path for restricted networks; richer, user-editable wiki memory; installable wiki packs for common tools; and institutional deployments where partners self-host the proxy and subsidize access, so we can measure whether this kind of companion improves real outcomes for people who are not already power users.
Built With
- anthropic
- api
- avfoundation
- cloudflare
- markdown
- objective-c
- screencapturekit
- sql
- swift
- swiftui
- typescript
- websockets