DEMO WILL BE AVAILABLE ON MY WEBSITE SOON, VIEW LINK

Inspiration

Games have always been a proving ground for AI, but teaching AI to play them has historically required massive compute, millions of hours of training data, and teams of PhD researchers. DeepMind's SIMA, NVIDIA's Voyager, and OpenAI's VPT sit at the frontier of this research, but they are isolated systems fine-tuned for specific games. I wanted to see if there was a different way to make AI gameplay more accessible and more general.

You might have heard of tools that sound similar (like Open Interpreter, a general-purpose agent that can control your computer). GhostPlayer differs in a specific and interesting way. General computer-use agents operate on 2D interfaces: clicking buttons, reading text. Many games, however, are 3D environments, and that introduces a problem those agents are not built to handle yet: depth perception. A vision model can see that a tree is in front of you; it cannot reliably tell you whether you are close enough to hit it. Rather than trying to solve monocular depth estimation from a single rendered frame, which is an open research problem, GhostPlayer treats depth as unreliable by design and moves that responsibility to the control layer. The agent performs an action, checks whether anything actually changed in the environment, and uses that outcome as ground truth. Action feedback replaces visual estimation. It is a small insight, but it is the right one for this problem.
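That feedback loop can be sketched in a few lines. The function and callback names below are hypothetical, and the content hash is a simplification: a real implementation would use a thresholded pixel diff so HUD animation and noise don't count as "change".

```python
import hashlib


def frame_fingerprint(frame_bytes: bytes) -> str:
    # Cheap content hash: any pixel change produces a different digest.
    return hashlib.sha256(frame_bytes).hexdigest()


def act_and_verify(capture, perform_action) -> bool:
    """Run one action and report whether the environment visibly changed.

    `capture` returns raw frame bytes; `perform_action` fires one input.
    The before/after comparison is the ground truth, not a depth estimate.
    """
    before = frame_fingerprint(capture())
    perform_action()
    after = frame_fingerprint(capture())
    return before != after
```

The point is that the return value is derived from what actually happened on screen, so the planner never has to trust a distance guess.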

What it does

GhostPlayer is an AI agent that plays games autonomously using only what it can see on screen: no in-game mods, no special APIs. You give it a goal in plain English ("chop the tree, then mine some coal"), and it figures out the rest. It captures the screen in real time, sends frames to Gemini Vision for perception and planning, and executes keyboard and mouse actions just as a human player would. A state machine keeps it on track, a memory layer helps it learn across runs, and a web dashboard lets you watch and control it live.

How we built it

The core loop is: capture a frame, ask Gemini what it sees and what to do next, execute those actions via raw system-level input, verify the result, repeat. On macOS, Minecraft reads raw HID mouse delta events, which normal cursor-positioning APIs never generate, so I had to drop down to the Quartz CGEvent level and post events carrying actual hardware-style deltas. The agent is structured as a state machine (SCAN, MINE, SEARCH, APPROACH) with Gemini driving perception at each tick. Persistent memory is handled by Backboard, which summarizes run history and injects past lessons into future prompts.
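The state machine can be sketched as deterministic transitions over the states named above. The transition conditions and observation field names here are illustrative assumptions; in GhostPlayer the observation is filled in by Gemini each tick, while the transitions themselves stay hard-coded.

```python
from enum import Enum, auto


class State(Enum):
    SCAN = auto()      # look around for a target
    APPROACH = auto()  # walk toward it
    MINE = auto()      # swing at it
    SEARCH = auto()    # wander when nothing is visible


def next_state(state: State, observation: dict) -> State:
    """Deterministic transitions; the vision model only supplies the observation."""
    if state is State.SCAN:
        return State.APPROACH if observation.get("target_visible") else State.SEARCH
    if state is State.SEARCH:
        return State.SCAN
    if state is State.APPROACH:
        return State.MINE if observation.get("target_reachable") else State.APPROACH
    if state is State.MINE:
        return State.SCAN if observation.get("target_gone") else State.MINE
    return State.SCAN
```

Keeping the transitions deterministic means a bad model output can at worst delay progress for one tick, never derail the plan.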

Challenges we ran into

Minecraft's raw input mode was the first wall. Every mouse movement library I tried was silently ignored by the game. Figuring out Quartz CGEventPost with explicit delta fields took real digging. The second challenge was depth perception: Gemini can see that a tree is in front of you, but it cannot reliably judge whether you are close enough to hit it. I solved this not by trying to fix Gemini's depth perception but by building a fallback in the orchestrator: if the agent swings and misses twice, it steps forward. Simple, but it works.

Accomplishments that we're proud of

Getting the agent to autonomously chain actions, such as "walk to a tree, chop multiple logs while looking up to find each one, then pivot to mine coal", with zero game-specific training data, feels genuinely surprising every time it works. The depth perception workaround is a good example of the philosophy: instead of making the AI smarter, we can make the system more robust around its blind spots. I also think the architecture itself is worth noting, because the whole thing runs locally, costs pennies per run in API calls, and requires no research infrastructure.

What we learned

Frontier vision models are more capable than most people realize for real-time interaction tasks. However, they are not plug and play. Getting reliable behaviour out of Gemini required carefully constraining what it could and could not do at each stage, designing the right prompts, and building deterministic fallbacks for everything it consistently got wrong. The lesson is that LLM-based agents need structure around them: you cannot just drop Gemini into a loop and hope for the best.
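One concrete form that structure takes is never trusting the model's action string directly: validate it against a fixed vocabulary and degrade deterministically. The action names here are illustrative, not GhostPlayer's actual bindings.

```python
# Illustrative action vocabulary; the real set depends on the game's key bindings.
ALLOWED_ACTIONS = {"move_forward", "turn_left", "turn_right",
                   "swing", "look_up", "jump"}


def sanitize_action(raw: str, fallback: str = "move_forward") -> str:
    """Accept only whitelisted actions from the model.

    Anything unexpected (hallucinated verbs, prose, formatting noise)
    degrades to a safe deterministic default instead of doing nothing.
    """
    action = raw.strip().lower()
    return action if action in ALLOWED_ACTIONS else fallback
```

The same pattern applies at every stage: the model proposes, deterministic code disposes.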

What's next for GhostPlayer

Gemini 2.0, GPT-4V, and similar models only recently reached the visual-reasoning fidelity needed to understand game environments in real time. GhostPlayer sits at the intersection of that capability arriving and no one having yet built an accessible consumer layer on top of it. The research frontier is building increasingly specialized models; we'll build the platform that makes all of them usable.
