Inspiration
Training robots requires massive amounts of data, and on top of that, writing reward functions is a painful, time-consuming process that often takes longer than training the policy itself. I figured there had to be a way to just show a robot what to do instead of hand-coding every incentive. After writing about RL for lab automation on Medium and diving deeper into dexterous manipulation research, I wanted to answer one question: can a single human video replace all that manual reward engineering?
What it does
OVERFIT (One Video Episode Reward Function from Imitation Tracking) takes a single human manipulation video, uses Gemini 3.0 to analyze the task structure and detect milestones, tracks the hand with MediaPipe, retargets the motion to a simulated Adroit hand, and generates shaped reward functions automatically. The system uses Residual RL: a behavior-cloning baseline plus a TD3 residual policy, so the robot learns to refine the demonstrated behavior rather than starting from scratch.
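The residual-RL idea boils down to adding a learned correction on top of the imitation baseline. A minimal sketch of that action composition, with `bc_policy`, `residual_policy`, and `alpha` as hypothetical stand-ins (the real system uses Stable Baselines3 internals):

```python
import numpy as np

def residual_action(bc_policy, residual_policy, obs, alpha=1.0):
    """Combine a frozen behavior-cloning action with a learned residual.

    bc_policy / residual_policy are stand-in callables for illustration;
    alpha scales the residual so early training stays close to the demo.
    """
    base = bc_policy(obs)         # action proposed by the BC baseline
    delta = residual_policy(obs)  # correction learned by TD3
    # Keep the combined action inside the normalized action bounds.
    return np.clip(base + alpha * delta, -1.0, 1.0)

# Toy usage: BC outputs a fixed grasp pose, the residual nudges it.
obs = np.zeros(30)
bc = lambda o: np.full(24, 0.5)   # 24-DoF Adroit-style action vector
res = lambda o: np.full(24, 0.1)
act = residual_action(bc, res, obs)
```

Because the residual starts near zero, the policy begins at the demonstrated behavior instead of random exploration, which is what stabilizes early training.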
How we built it
MediaPipe for hand tracking, Gemini 3.0 Flash Preview for video analysis and automated reward generation, and MuJoCo with the Adroit ShadowHand for simulation. I built a React dashboard for experiment management with a chat interface where Gemini helps iterate on reward designs in real time. The training pipeline uses Stable Baselines3 (TD3) with a residual policy architecture. I did most of the prototyping with Gemini 3.0, falling back to Claude when I hit rate limits. The approach draws on DAPG and RRL for demo-augmented policy gradients and DextrAH-G for generalisable dexterous manipulation.
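The retargeting step turns MediaPipe's 3-D hand landmarks into joint angles the Adroit hand can track. A minimal sketch of one flexion angle from three landmark positions (the `joint_angle` helper and the specific landmark indices are illustrative, not the project's actual code):

```python
import numpy as np

def joint_angle(p_parent, p_joint, p_child):
    """Flexion angle at a hand joint from three 3-D landmark positions.

    Returns radians, with 0 meaning a fully straight finger segment.
    """
    u = p_parent - p_joint
    v = p_child - p_joint
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return np.pi - np.arccos(np.clip(cos, -1.0, 1.0))

# e.g. MediaPipe index-finger landmarks 5 (MCP), 6 (PIP), 7 (DIP):
mcp = np.array([0.0, 0.0, 0.0])
pip = np.array([1.0, 0.0, 0.0])
dip = np.array([1.0, 1.0, 0.0])
angle = joint_angle(mcp, pip, dip)  # right-angle bend at the PIP joint
```

Repeating this per joint gives a target joint-angle vector per video frame, which is the bridge between human anatomy and robot kinematics mentioned below.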
Challenges we ran into
Training was wildly unstable. Success rate would hit 30%, then collapse back to zero. Balancing reward components against each other, making milestone detection robust, and closing the gap between human hand anatomy and robot hand kinematics took more iteration than expected. Getting Gemini to output complete, modular reward functions instead of partial snippets was its own challenge.
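The reward-balancing problem can be made concrete with a sketch of the kind of shaped reward the pipeline generates: a sparse bonus per detected milestone plus a dense distance term. The function name and weights here are placeholders, not the project's actual generated code:

```python
def shaped_reward(milestones_hit, dist_to_next,
                  w_milestone=10.0, w_dist=1.0):
    """Sparse milestone bonus plus dense distance shaping.

    milestones_hit: count of milestones completed so far this episode
    dist_to_next:   distance to the next milestone's target configuration
    The weights are illustrative; mis-balancing terms like these is
    exactly what produces hit-30%-then-collapse instability.
    """
    return w_milestone * milestones_hit - w_dist * dist_to_next

# Two milestones reached, 0.5 units from the next target.
r = shaped_reward(2, 0.5)
```

If `w_dist` dominates, the policy farms the dense term and never commits to the next milestone; if `w_milestone` dominates, the gradient signal between milestones vanishes.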
Accomplishments that we're proud of
The full pipeline works end to end. Raw human video goes in, Gemini analyzes it, generates reward code, and a trained dexterous manipulation policy comes out. The experiment designer lets you chat with Gemini to refine rewards without touching code. No hand-engineered reward terms anywhere in the loop.
What we learned
Reward design matters as much as the algorithm. Gemini 3.0's video understanding is genuinely useful for extracting task structure, and that structure is what makes reward shaping tractable afterwards. I also learned how fragile RL training is without a good reward signal, and how much Residual RL helps by giving the policy a reasonable starting point.
What's next for Overfit Labs
Multi-task support where each task only needs one video, sim-to-real transfer, and Gemini-powered automated reward iteration that analyzes training curves and rewrites rewards when it detects failure modes. A fully end-to-end pipeline via API: upload a video, get a working policy back.