Inspiration

Robotics is often locked behind massive datasets, reinforcement learning, and months of training. We wanted to flip that on its head and let anyone “teach” a robot instantly — just by describing what they want. Inspired by Daniel Kahneman’s System 1 & System 2 thinking, we asked: what if robots could learn and act the same way?

What it does

Monkey See, Monkey Do turns natural language into real robot skills. Type “learn the jab punch” and our system searches, reasons, and executes that move on a custom-built robot in real time. A chat interface + 3D viewer makes the process simple, transparent, and interactive.

How we built it

  • Frontend: Next.js chat app + 3D robot viewer
  • Backend: Flask server coordinating skill learning
  • Reasoning (“System 2”): Tavily Search API + Cohere Agent to convert messy data into structured motion guides
  • Execution (“System 1”): SkillCompiler + RobotControlGenerator pipelines turning guides into servo sequences, streamed to an ESP32
  • Visualization: Arduino UNO with a TFT screen showing a monkey avatar reflecting the robot’s mood and state
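The flow above, from a typed skill name to a timed servo sequence, can be sketched as stubbed pure-Python functions. Everything here is an illustrative assumption (function names, the guide and sequence formats), not the project's actual code; in the real system the search step calls the Tavily Search API, the reasoning step runs a Cohere Agent, and the compiled sequence is streamed from the Flask server to the ESP32.

```python
# Stubbed sketch of the learn-a-skill pipeline: search -> reason -> compile.
# Every function and field name here is an illustrative assumption.

def search_skill(skill_name):
    # "System 2", step 1: fetch raw web descriptions of the move
    # (the real system calls the Tavily Search API here).
    return [f"How to throw a {skill_name}: extend the lead arm, then retract."]

def reason_to_guide(raw_docs):
    # "System 2", step 2: an LLM agent (Cohere in the real system) turns
    # messy text into a structured motion guide.
    return {"steps": [{"joint": "shoulder", "angle": 90, "ms": 300},
                      {"joint": "elbow", "angle": 45, "ms": 200}]}

def compile_to_servo_sequence(guide):
    # "System 1": flatten the guide into timed servo commands, which the
    # backend would then stream to the ESP32.
    return [(s["joint"], s["angle"], s["ms"]) for s in guide["steps"]]

guide = reason_to_guide(search_skill("jab punch"))
sequence = compile_to_servo_sequence(guide)
# sequence == [("shoulder", 90, 300), ("elbow", 45, 200)]
```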

Challenges we ran into

  • Translating noisy, unstructured web data into precise movement sequences
  • Timing + synchronization issues between frontend visualization and physical servo actions
  • Getting Socket.IO to handle real-time feedback without lag or dropped packets
  • Designing servo sequences that worked across multiple motions without breaking hardware
  • Debugging erratic robot movements when the system misinterpreted web data
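One way to tame the timing issues above is to schedule each servo step against an absolute deadline on a monotonic clock instead of sleeping for fixed per-step deltas, so network and processing jitter doesn't accumulate across a long sequence. A minimal sketch, assuming the `(joint, angle, ms)` sequence format and a `send` callback (e.g. a Socket.IO emit) are as hypothesized here:

```python
import time

def run_sequence(sequence, send, now=time.monotonic, sleep=time.sleep):
    """Play timed servo steps without accumulating drift.

    sequence: list of (joint, angle, ms) tuples (illustrative format).
    send: callback that ships one command to the robot, e.g. over Socket.IO.
    """
    start = now()
    elapsed_ms = 0
    for joint, angle, ms in sequence:
        send(joint, angle)                    # emit the command immediately
        elapsed_ms += ms
        deadline = start + elapsed_ms / 1000  # absolute deadline, not a delta
        remaining = deadline - now()
        if remaining > 0:
            sleep(remaining)                  # wait only the remaining time

# Usage: collect emitted commands instead of talking to real hardware.
sent = []
run_sequence([("shoulder", 90, 10), ("elbow", 45, 10)],
             lambda joint, angle: sent.append((joint, angle)))
# sent == [("shoulder", 90), ("elbow", 45)]
```

Because each step waits until its absolute deadline, a slow `send` on one step shortens the next wait instead of pushing every later step back.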

Accomplishments that we're proud of

  • Built a fully working natural language → action pipeline in under 36 hours
  • Integrated Cohere, Tavily Search, ESP32, and a custom robot seamlessly
  • Created a playful visualization layer (the monkey avatar) that makes the system more engaging
  • Demonstrated true on-the-spot learning — no pretraining required

What we learned

  • The power (and pain) of combining reasoning LLMs with low-level robotics
  • How fragile real-time systems can be when every millisecond matters
  • That designing approachable user interfaces makes complex robotics feel simple and magical
  • The importance of aligning high-level “thought” with low-level “muscle”

What's next for Monkey See, Monkey Do

We see this framework as a step toward robots that adapt instantly to human intent, without pre-training or datasets. In education, it could let students teach robots experiments or routines by simply describing them; in industry, it could speed up prototyping by allowing engineers to define tasks in plain language; and in healthcare, it could give therapists and doctors flexible assistive robots tailored to patient needs. The innovation lies in replacing rigid training cycles with real-time adaptability, making robotics more accessible, scalable, and human-centered.

Built With

  • Next.js
  • Flask
  • Socket.IO
  • Cohere
  • Tavily Search API
  • ESP32
  • Arduino UNO