Inspiration

a16z's Anish Acharya put out an open ask for a startup: a "contextual companion" for his son who plays minecraft. https://x.com/a16z/status/2022014770682245610

"one of the products that i would love to exist... is what i call a contextual companion for my son who plays minecraft."

"the other kids playing minecraft may or may not be the best influence, often not the best influence. there's this context in which they interact and it sort of models pro-social behaviors and is still cool and chill."

"there's a lot of room for teaching through these types of relationships, and technology can help provide that."

that hit. so we figured: don't build a demo, build the thing a16z is literally asking someone to build.

cause minecraft is already better with another person around, but not everyone has someone online the exact moment they want to play. we wanted a companion that feels less like a chatbot and more like a real duo partner: joins your world, sticks close, helps out, talks in voice, and slowly learns how you play. quiet when it should be, cool and chill, present.

What it does

itto is an ai minecraft co-op buddy. it spawns into your world, follows you around, and actually does things: chops trees, mines veins, digs staircases down to go mining, fights mobs, fetches items from chests it remembers, scouts ahead, crafts tools, and places blocks. it picks the right tool on its own (axe for wood, pickaxe for ore, sword for mobs). and it hangs in a discord voice call while you play, talking back in real time.
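
a minimal sketch of the tool-picking rule described above. the type names, the `pickTool` function, and the suffix matching on block ids are all assumptions for illustration, not itto's actual code:

```typescript
// Hypothetical sketch: names and matching rules are assumptions.
type Tool = "axe" | "pickaxe" | "sword" | "hand";

type Target =
  | { kind: "block"; name: string }
  | { kind: "mob"; name: string };

// Maps a target to the tool the writeup describes:
// axe for wood, pickaxe for ore, sword for mobs.
function pickTool(target: Target): Tool {
  if (target.kind === "mob") return "sword";
  if (target.name.endsWith("_log") || target.name.endsWith("_planks")) return "axe";
  if (target.name.endsWith("_ore") || target.name === "stone") return "pickaxe";
  return "hand"; // fallback when nothing specific applies
}
```

a real version would more likely ask the game data which tool harvests a block fastest instead of matching on names.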

the goal isn't a tutorial bot or an overbearing assistant. itto is meant to feel like it's in the world with you: mostly chill, useful when needed, casually part of the session.

How we built it

itto runs on a two-loop architecture split across a brain and a body.

the fast loop runs through mineflayer at ~15Hz with no llm in the path. it owns the physical, low-latency stuff: pathfinding follow, staying in range, lava/creeper reflexes, auto-eat. itto never freezes mid-step waiting on a model.
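
roughly, one tick of the fast loop can be thought of as a pure priority check with no model call anywhere in it. this is a sketch under assumptions: the `BodyState` shape, the thresholds, and `reflexTick` are hypothetical, and the real loop reads its state from mineflayer:

```typescript
// Hypothetical sketch: state shape and thresholds are assumptions.
interface Vec3 { x: number; y: number; z: number }

interface BodyState {
  self: Vec3;
  player: Vec3;
  food: number;                      // 0..20 hunger bar
  lavaAdjacent: boolean;
  nearestCreeperDist: number | null; // null when no creeper is nearby
}

type Reflex =
  | { type: "flee"; reason: "lava" | "creeper" }
  | { type: "eat" }
  | { type: "follow"; target: Vec3 }
  | { type: "idle" };

const TICK_MS = Math.round(1000 / 15); // ~15Hz
const FOLLOW_RANGE = 3;   // blocks; assumed
const CREEPER_DANGER = 4; // blocks; assumed
const HUNGRY_BELOW = 14;  // assumed

function distance(a: Vec3, b: Vec3): number {
  return Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);
}

// Safety first, then upkeep, then following. No LLM in this path.
function reflexTick(s: BodyState): Reflex {
  if (s.lavaAdjacent) return { type: "flee", reason: "lava" };
  if (s.nearestCreeperDist !== null && s.nearestCreeperDist < CREEPER_DANGER) {
    return { type: "flee", reason: "creeper" };
  }
  if (s.food < HUNGRY_BELOW) return { type: "eat" };
  if (distance(s.self, s.player) > FOLLOW_RANGE) {
    return { type: "follow", target: s.player };
  }
  return { type: "idle" };
}
```

the body would call something like `reflexTick` on a timer every `TICK_MS` and apply the result, independent of whatever the brain is doing.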

the slow loop is the reasoning layer. it reads compact structured minecraft state (position, health, nearby blocks/mobs, inventory, what you're looking at, current goal) and fires off intents. the seam between brain and body is a single typed control interface (BotControl) exposed over an mcp server: tools like move_to, mine_block, place_block, craft_item, run_skill, set_goal, plus resources for live world state and persistent memory. the body never knows who the brain is, it just exposes a clean control surface.
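
to make the seam concrete, here's a sketch of what a typed control surface like that could look like. the tool names come from the list above; the signatures, the `Result` type, `FakeBody`, and `gatherWood` are assumptions for illustration, and everything is synchronous for brevity where real calls over mcp would be async:

```typescript
// Hypothetical sketch: tool names are from the writeup, signatures are assumed.
interface Vec3 { x: number; y: number; z: number }
type Result = { ok: true } | { ok: false; error: string };

interface BotControl {
  move_to(pos: Vec3): Result;
  mine_block(pos: Vec3): Result;
  place_block(pos: Vec3, item: string): Result;
  craft_item(item: string, count: number): Result;
  run_skill(name: string, args?: Record<string, unknown>): Result;
  set_goal(goal: string): Result;
}

const OK: Result = { ok: true };
const fmt = (p: Vec3): string => `${p.x},${p.y},${p.z}`;

// A stand-in body that only records calls. It has no idea who is driving it.
class FakeBody implements BotControl {
  calls: string[] = [];
  knownSkills: string[] = ["chop_tree", "mine_vein"];

  private record(line: string): Result { this.calls.push(line); return OK; }

  move_to(pos: Vec3): Result { return this.record(`move_to ${fmt(pos)}`); }
  mine_block(pos: Vec3): Result { return this.record(`mine_block ${fmt(pos)}`); }
  place_block(pos: Vec3, item: string): Result { return this.record(`place_block ${item} ${fmt(pos)}`); }
  craft_item(item: string, count: number): Result { return this.record(`craft_item ${item} x${count}`); }
  run_skill(name: string): Result {
    if (this.knownSkills.indexOf(name) === -1) return { ok: false, error: `unknown skill: ${name}` };
    return this.record(`run_skill ${name}`);
  }
  set_goal(goal: string): Result { return this.record(`set_goal ${goal}`); }
}

// Any brain can drive any body through the same interface.
function gatherWood(control: BotControl, tree: Vec3): Result {
  const steps = [
    () => control.set_goal("gather wood"),
    () => control.move_to(tree),
    () => control.run_skill("chop_tree", { at: tree }),
  ];
  for (const step of steps) {
    const r = step();
    if (!r.ok) return r; // stop at the first failure
  }
  return OK;
}
```

swapping `FakeBody` for a mineflayer-backed implementation wouldn't change `gatherWood` at all, which is the point of keeping the boundary typed.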

the brain is an agent harness, loaded with a stack of mcp skills. it reads the world over mcp, decides, and drives the bot through those same tools. on top of the primitives we built a real skill library (chop_tree, mine_vein, mine_down, combat_assist, fetch_item, scout_ahead, build_helper, craft, make_tools), a goal loop so one instruction can kick off a multi-step task that runs in the background and reports back when it's done, and a sqlite world memory for waypoints + chest contents that survives across sessions.
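
the goal loop idea, sketched: one instruction becomes a list of steps, something ticks the runner in the background, and a single report fires when it finishes or fails. `GoalRunner`, `Step`, and the tick-based shape are assumptions for illustration, not the actual implementation:

```typescript
// Hypothetical sketch: class and type names are assumptions.
type StepResult = "running" | "done" | "failed";
type GoalStatus = "running" | "done" | "failed";

interface Step {
  name: string;
  run(): StepResult; // called once per tick until it stops returning "running"
}

class GoalRunner {
  status: GoalStatus = "running";
  private index = 0;
  private goal: string;
  private steps: Step[];
  private report: (msg: string) => void;

  constructor(goal: string, steps: Step[], report: (msg: string) => void) {
    this.goal = goal;
    this.steps = steps;
    this.report = report;
  }

  // Advance the current step by one tick; report exactly once at the end.
  tick(): void {
    if (this.status !== "running") return;
    const step = this.steps[this.index];
    if (step === undefined) { this.finish("done"); return; }
    const result = step.run();
    if (result === "failed") { this.finish("failed", step.name); return; }
    if (result === "done") {
      this.index += 1;
      if (this.index >= this.steps.length) this.finish("done");
    }
  }

  private finish(status: "done" | "failed", at?: string): void {
    this.status = status;
    this.report(status === "failed" ? `${this.goal}: failed at ${at}` : `${this.goal}: done`);
  }
}
```

because the runner only moves when ticked, it can sit alongside the fast loop without blocking it, and the report callback is where a spoken "done" line could be queued.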

the part we're hyped on: a 3-way voice bridge. elevenlabs conversational ai joins the discord call and handles ears + mouth (STT, TTS, turn-taking). it forwards what you say to our agent harness (the brain), the brain decides and acts in-world, then pushes a line back through an outbox that elevenlabs voices in the call. so the loop is: you talk -> brain hears it -> bot acts -> brain talks back, all live. discord + voice + game state + bot control, one connected system.
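
the talk -> act -> talk-back loop can be sketched without any real voice or game code. everything here is hypothetical: `Outbox`, `onTranscript`, and `toyBrain` are illustration names, and the keyword rule stands in for the actual agent harness:

```typescript
// Hypothetical sketch: no real voice or agent APIs are used here.
interface Decision {
  action: string | null; // what the bot should do in-world, if anything
  reply: string | null;  // what to say back, if anything worth saying
}
type Brain = (utterance: string) => Decision;

// Lines waiting to be spoken by the voice side.
class Outbox {
  private lines: string[] = [];
  push(line: string): void { this.lines.push(line); }
  drain(): string[] {
    const out = this.lines;
    this.lines = [];
    return out;
  }
}

// Called with whatever the voice layer transcribed from the call.
function onTranscript(
  text: string,
  brain: Brain,
  act: (action: string) => void,
  outbox: Outbox,
): void {
  const decision = brain(text);
  if (decision.action !== null) act(decision.action);
  if (decision.reply !== null) outbox.push(decision.reply);
}

// Toy stand-in for the real brain: a keyword rule instead of a model.
const toyBrain: Brain = (utterance) => {
  const u = utterance.toLowerCase();
  if (u.indexOf("wood") !== -1) return { action: "chop_tree", reply: "on it, grabbing wood" };
  if (u.indexOf("follow") !== -1) return { action: "follow_player", reply: null };
  return { action: null, reply: null };
};
```

the voice side would drain the outbox and speak each line, so the brain never has to know how audio works.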

Landing page

we also built a reverse-engineering skill: point it at literally ANY existing website like a framer template, and it reconstructs a 1:1 fidelity landing page from it. we handed that skill to our agent and let it build itto's actual landing page. our tooling literally shipped a major part of our own project.

Challenges we ran into

the hardest part was making itto feel alive without putting the model in the critical path. if every movement or safety call waited on an llm, the whole thing would feel laggy and dead. the fast-reflex / slow-reason split was the key design decision. keeping one owner of pathfinding (so skills don't fight the follow loop) took real care and still isn't fully solved.

we also had to be ruthless about what the brain actually sees. mineflayer hands you a ton of structured state, so we leaned on that instead of screenshots and kept the per-tick payload token-cheap, with richer perception available on-demand. getting crafting to actually work was its own saga: vanilla recipe calls silently no-op'd on the server version, so we drive the crafting table window by hand. the last challenge was designing itto's personality so it reads like a friend, not a support agent stuck in a block game.
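
one way to keep the per-tick payload small, sketched under assumptions: collapse the raw nearby-block list into counts per block type and round the numbers. `RawState`, `CompactState`, and `compactState` are hypothetical names, and the real payload carries more fields (mobs, inventory, what you're looking at):

```typescript
// Hypothetical sketch: state shapes are assumptions, not itto's real payload.
interface Vec3 { x: number; y: number; z: number }
interface NearbyBlock { name: string; distance: number }

interface RawState {
  position: Vec3;
  health: number;
  goal: string | null;
  blocks: NearbyBlock[]; // potentially hundreds of entries
}

interface BlockSummary { name: string; count: number; nearest: number }

interface CompactState {
  pos: [number, number, number];
  hp: number;
  goal: string | null;
  blocks: BlockSummary[];
}

// Groups blocks by type, keeps the most common kinds, rounds coordinates.
function compactState(raw: RawState, maxKinds = 5): CompactState {
  const byName = new Map<string, { count: number; nearest: number }>();
  for (const b of raw.blocks) {
    const entry = byName.get(b.name);
    if (entry) {
      entry.count += 1;
      entry.nearest = Math.min(entry.nearest, b.distance);
    } else {
      byName.set(b.name, { count: 1, nearest: b.distance });
    }
  }
  const blocks = Array.from(byName.entries())
    .map(([name, e]) => ({ name, count: e.count, nearest: Math.round(e.nearest) }))
    .sort((a, b) => b.count - a.count || a.name.localeCompare(b.name))
    .slice(0, maxKinds);
  const p = raw.position;
  return {
    pos: [Math.round(p.x), Math.round(p.y), Math.round(p.z)],
    hp: raw.health,
    goal: raw.goal,
    blocks,
  };
}
```

the summary grows with the number of block kinds, not the number of blocks, which keeps the serialized state short even in a dense area.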

Accomplishments that we're proud of

itto has a real architecture, not a demo prompt glued to minecraft. there's a body, a typed control surface, a clean brain boundary over mcp, live world state, persistent memory, a goal loop, a skill library, a 3-way voice bridge, and a landing page our own agent built from a skill we wrote.

the two-loop design solves a real product problem: presence. itto keeps following, reacting, and surviving while the brain reasons in the background, which is what makes it feel believable instead of robotic.

What we learned

building agents for games is less about making the model smarter and more about giving it the right body. if the bot can't move, survive, and sense the world reliably, the ai layer barely matters. the ceiling is set by perception and actions, not by the prompt.

we also learned "less chatty" wins. a good companion doesn't narrate everything. it listens, helps when asked, and speaks when there's actually something worth saying.

What's next for itto

deeper play: better memory for bases and chests, stronger long-horizon task execution, more reliable combat and building help, tighter voice integration. and we want to lean fully into the a16z vision, a companion that models the cool, chill, pro-social duo-partner energy a kid would actually want to play with.

after that, take itto cross-platform: the brain/body/mcp split isn't minecraft-specific, so the same harness can drive a companion in other games. one long-term gaming buddy, everywhere you play.

Built With

  • claude
  • discord
  • mcp
  • minecraft
  • mineflayer
  • next.js
  • typescript