Inspiration

Millions of books, printed texts, and historical documents sit completely out of reach for AI systems. Not because the knowledge doesn't exist, but because nobody has automated the grunt work of getting it off the page. Manual digitisation is slow and tiresome, and it does not scale. We asked a simple question: what if an agent could just read the book for you?

What We Built

Bookworm is an agentic book scanning pipeline built on the LeRobot SO100 arm. A Gemini-3.1-Flash model acts as the central decision maker, issuing tool calls to move the arm, capture pages, and trigger OCR and TTS as needed. The agent flips through a real physical book, extracts text from each page, and reads it back aloud to confirm the pipeline is working. The digitised output is structured as clean training data, ready for LLM pipelines via Cyberwave.

The TTS readback is not a demo flourish. It is a sanity check: if the agent reads something coherent aloud, every step of the tool call chain (flip, capture, OCR, TTS) must have worked.
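The per-page loop can be sketched as a small tool registry plus one flip-capture-OCR-speak cycle. This is a minimal illustration, not the real implementation: the tool names (flip_page, capture_page, ocr_page, speak) are assumptions, and in the actual pipeline Gemini-3.1-Flash decides which tool to call next rather than a fixed sequence.

```python
# Hypothetical tool registry; in the real system these handlers wrap the
# SO100 arm, the camera, the OCR service, and Smallest AI TTS.
TOOLS = {}

def tool(fn):
    """Register a function as a callable tool."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def flip_page():
    return {"status": "flipped"}

@tool
def capture_page():
    return {"image": "page.jpg"}

@tool
def ocr_page(image):
    return {"text": f"text extracted from {image}"}

@tool
def speak(text):
    return {"spoken": text}

def run_page_cycle():
    """One flip -> capture -> OCR -> speak cycle, in the order the agent issues it."""
    flip_page()
    image = capture_page()["image"]
    text = ocr_page(image)["text"]
    speak(text)  # audible sanity check that the whole chain worked
    return text
```

In the live system each of these calls is surfaced to the model as a declared tool, and the model's tool-call responses are dispatched through a registry like `TOOLS`.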

How We Built It

  • Gemini-3.1-Flash serves as the agentic brain, deciding when to flip, capture, read, and speak via tool calls
  • LeRobot SO100 executes physical page turning based on arm movement tool calls
  • OCR is invoked as a tool to extract text from each captured page image
  • Smallest AI STT/TTS is called as a tool for lightweight speech confirmation without heavy cloud overhead
  • Cyberwave handles downstream data collection and structuring for LLM training pipelines
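A rough sketch of the last step, turning an OCR'd page into a structured training record before it reaches the downstream pipeline. The field names (`book_id`, `page`, `text`, `source`) are illustrative assumptions, not Cyberwave's actual schema.

```python
import json

def page_record(book_id: str, page: int, text: str) -> str:
    """Serialise one OCR'd page as a JSONL line for LLM training pipelines.
    Field names are hypothetical placeholders for the real schema."""
    record = {
        "book_id": book_id,
        "page": page,
        "text": text.strip(),
        "source": "bookworm-so100",
    }
    return json.dumps(record, ensure_ascii=False)
```

One line per page keeps the output append-only, so a partially scanned book still yields usable data if the run is interrupted.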

Challenges

Getting consistent page flips without tearing or skipping is harder than it looks. Physical book geometry varies: spine stiffness, page weight, and curl all affect the arm's reliability. Synchronising the camera capture to the exact moment after a flip settles required careful timing. On the agentic side, keeping tool call latency low enough for the physical pipeline to feel fluid was a real coordination challenge.
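One way to handle the capture-timing problem is to wait until consecutive camera frames stop changing before photographing the page. This is a simplified sketch under assumptions: the threshold value is arbitrary, and a real version would pull frames from the arm's camera rather than an in-memory list.

```python
import numpy as np

def frame_settled(prev: np.ndarray, curr: np.ndarray, threshold: float = 2.0) -> bool:
    """True when the mean absolute pixel difference drops below the threshold,
    i.e. the page has stopped moving. The threshold is an assumed tuning value."""
    return float(np.mean(np.abs(curr.astype(float) - prev.astype(float)))) < threshold

def wait_for_settle(frames, threshold: float = 2.0):
    """Return the first frame after motion stops, or the last frame seen."""
    prev = None
    for frame in frames:
        if prev is not None and frame_settled(prev, frame, threshold):
            return frame
        prev = frame
    return prev
```

Gating the capture tool on a settle check like this avoids photographing a page mid-curl without resorting to a fixed, conservative sleep.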

What We Learned

Agentic pipelines that control physical hardware expose every latency problem you would normally paper over in software. The gap between a clean tool call chain in simulation and one that reliably handles paper in the real world is enormous and humbling. Closing the loop with audio confirmation also showed us how valuable cheap, observable feedback signals are when debugging hardware pipelines.

What's Next

Expanding to multi-language OCR for regional language texts and handwritten manuscripts. The long term vision is deploying Bookworm in libraries and archives to bulk digitise collections that would otherwise take decades to process manually, creating open datasets for underrepresented languages and domains that current LLMs know almost nothing about.

Built With

  • cyberwave
  • fastapi
  • gemini-3.1-flash
  • python
  • smallestai