Inspiration

Robotics is often locked behind massive datasets, reinforcement learning, and months of training. We wanted to flip that on its head and let anyone “teach” a robot instantly — just by describing what they want. Inspired by Daniel Kahneman’s System 1 & System 2 thinking, we asked: what if robots could learn and act the same way?

What it does

Monkey See, Monkey Do turns natural language into real robot skills. Type “learn the jab punch” and our system searches, reasons, and executes that move on a custom-built robot in real time. A chat interface + 3D viewer makes the process simple, transparent, and interactive.

How we built it

  • Frontend: Next.js chat app + 3D robot viewer
  • Backend: Flask server coordinating skill learning
  • Reasoning (“System 2”): Tavily Search API + Cohere Agent to convert messy data into structured motion guides
  • Execution (“System 1”): SkillCompiler + RobotControlGenerator pipelines turning guides into servo sequences, streamed to an ESP32
  • Visualization: Arduino UNO with a TFT screen showing a monkey avatar reflecting the robot’s mood and state
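The flow above, from a typed skill name to a timed servo sequence, can be sketched as stubbed pure-Python functions. Everything here is an illustrative assumption (function names, the guide and sequence formats), not the project's actual code; in the real system the search step calls the Tavily Search API, the reasoning step runs a Cohere Agent, and the compiled sequence is streamed from the Flask server to the ESP32.

```python
# Stubbed sketch of the learn-a-skill pipeline: search -> reason -> compile.
# Every function and field name here is an illustrative assumption.

def search_skill(skill_name):
    # "System 2", step 1: fetch raw web descriptions of the move
    # (the real system calls the Tavily Search API here).
    return [f"How to throw a {skill_name}: extend the lead arm, then retract."]

def reason_to_guide(raw_docs):
    # "System 2", step 2: an LLM agent (Cohere in the real system) turns
    # messy text into a structured motion guide.
    return {"steps": [{"joint": "shoulder", "angle": 90, "ms": 300},
                      {"joint": "elbow", "angle": 45, "ms": 200}]}

def compile_to_servo_sequence(guide):
    # "System 1": flatten the guide into timed servo commands, which the
    # backend would then stream to the ESP32.
    return [(s["joint"], s["angle"], s["ms"]) for s in guide["steps"]]

guide = reason_to_guide(search_skill("jab punch"))
sequence = compile_to_servo_sequence(guide)
# sequence == [("shoulder", 90, 300), ("elbow", 45, 200)]
```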

Challenges we ran into

  • Translating noisy, unstructured web data into precise movement sequences
  • Timing + synchronization issues between frontend visualization and physical servo actions
  • Getting Socket.IO to handle real-time feedback without lag or dropped packets
  • Designing servo sequences that worked across multiple motions without breaking hardware
  • Debugging erratic robot movements when the system misinterpreted web data
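One way to tame the timing issues above is to schedule each servo step against an absolute deadline on a monotonic clock instead of sleeping for fixed per-step deltas, so network and processing jitter doesn't accumulate across a long sequence. A minimal sketch, assuming the `(joint, angle, ms)` sequence format and a `send` callback (e.g. a Socket.IO emit) are as hypothesized here:

```python
import time

def run_sequence(sequence, send, now=time.monotonic, sleep=time.sleep):
    """Play timed servo steps without accumulating drift.

    sequence: list of (joint, angle, ms) tuples (illustrative format).
    send: callback that ships one command to the robot, e.g. over Socket.IO.
    """
    start = now()
    elapsed_ms = 0
    for joint, angle, ms in sequence:
        send(joint, angle)                    # emit the command immediately
        elapsed_ms += ms
        deadline = start + elapsed_ms / 1000  # absolute deadline, not a delta
        remaining = deadline - now()
        if remaining > 0:
            sleep(remaining)                  # wait only the remaining time

# Usage: collect emitted commands instead of talking to real hardware.
sent = []
run_sequence([("shoulder", 90, 10), ("elbow", 45, 10)],
             lambda joint, angle: sent.append((joint, angle)))
# sent == [("shoulder", 90), ("elbow", 45)]
```

Because each step waits until its absolute deadline, a slow `send` on one step shortens the next wait instead of pushing every later step back.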

Accomplishments that we're proud of

  • Built a fully working natural language → action pipeline in under 36 hours
  • Integrated Cohere, Tavily Search, ESP32, and a custom robot seamlessly
  • Created a playful visualization layer (the monkey avatar) that makes the system more engaging
  • Demonstrated true on-the-spot learning — no pretraining required

What we learned

  • The power (and pain) of combining reasoning LLMs with low-level robotics
  • How fragile real-time systems can be when every millisecond matters
  • That designing approachable user interfaces makes complex robotics feel simple and magical
  • The importance of aligning high-level “thought” with low-level “muscle”

What's next for Monkey See, Monkey Do

We see this framework as a step toward robots that adapt instantly to human intent, without pre-training or datasets. In education, it could let students teach robots experiments or routines by simply describing them; in industry, it could speed up prototyping by allowing engineers to define tasks in plain language; and in healthcare, it could give therapists and doctors flexible assistive robots tailored to patient needs. The innovation lies in replacing rigid training cycles with real-time adaptability, making robotics more accessible, scalable, and human-centered.

Built With

  • Next.js
  • Flask
  • Socket.IO
  • Cohere
  • Tavily Search API
  • ESP32
  • Arduino UNO