Inspiration

Programming can be demanding and mentally exhausting, which is why new tools and languages have been created to reduce friction and improve productivity. Our product aims to deliver comparable quality-of-life improvements to the music-composition process.

What it does

Our product provides two additional methods for composing music. Hand-gesture tracking, deployed on Vultr, maps gestures to actions such as raising the pitch of the current context pane, changing the tempo, changing the key, playing back the current track, pausing, creating playback checkpoints, resuming playback from a checkpoint, and more. Additionally, users can generate entire sections of music within the current composition through voice prompts, transcribed by ElevenLabs and interpreted by Gemini. If a generated section is undesirable, it can be regenerated quickly, enabling rapid iteration without disrupting the workflow. Users can still edit note-by-note when they want more granular control. The composition software also supports changing keys and notes via keyboard shortcuts, generating and regenerating sections, and saving and loading compositions. Scores are saved to and loaded from a MongoDB-backed database, making them available across multiple devices (see the sketch below).
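
To make the persistence path concrete, here is a minimal sketch of how a composition might be saved to and loaded from MongoDB with pymongo; the connection string, database, collection, and field names are illustrative assumptions, not our actual schema.

```python
# Minimal persistence sketch. The connection string, database/collection
# names, and document shape are hypothetical placeholders.
from typing import Optional

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
scores = client["paimaestro"]["scores"]

def save_composition(user_id: str, title: str, composition: dict) -> None:
    """Upsert one composition document keyed by (user_id, title)."""
    scores.update_one(
        {"user_id": user_id, "title": title},
        {"$set": {"composition": composition}},
        upsert=True,
    )

def load_composition(user_id: str, title: str) -> Optional[dict]:
    """Fetch a previously saved composition, or None if it does not exist."""
    doc = scores.find_one({"user_id": user_id, "title": title})
    return doc["composition"] if doc else None
```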

How we built it

A Raspberry Pi 4 running Linux hosts all of the software the product needs, except MediaPipe, which is deployed on Vultr. The Pi acts as the main orchestration node and maintains the authoritative composition state. It is connected to a speaker, a microphone, a large display, and basic computer accessories (a keyboard and mouse), effectively forming a standalone composition workstation. Our demo composition software is built in Python with a Python game/graphics library, and it receives inputs from our MediaPipe server, which converts hand gestures into direct actions.

Due to performance and compatibility limitations of the Raspberry Pi 4 (ARM architecture and limited compute), MediaPipe runs on a Vultr cloud instance rather than locally. The Pi streams video to the Vultr server, where hand landmarks are extracted and classified into predefined gesture categories. These gestures are then converted into structured commands and sent back to the Pi over a secure Tailscale VPN connection. This lets us offload compute-heavy vision processing while maintaining low-latency control (a sketch of the server side follows).

ElevenLabs converts user speech to text, which is fed into Gemini to parse and apply changes to our composition's JSON file. Rather than executing free-form LLM output directly, we constrain responses to a predefined JSON schema to ensure predictable, valid modifications to the musical state (see the second sketch below). Both gesture and voice inputs ultimately resolve into the same internal command pipeline, which simplifies state management and makes the system easier to extend.
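
As a concrete (and heavily simplified) illustration of the Vultr side, the sketch below uses MediaPipe's hand-landmark model to process one frame and map it to a structured command. The open-palm rule and the JSON command shape are toy assumptions standing in for our real classifier and protocol.

```python
# Hedged sketch of the Vultr-side gesture service: MediaPipe extracts hand
# landmarks from a frame, a toy rule classifies them into a gesture category,
# and the result becomes a structured command for the Pi. Frame transport
# and the gesture rule are simplified assumptions.
import json

import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.6)

def classify(landmarks) -> str:
    """Toy classifier: open palm (all four fingertips above their middle
    joints in image coordinates) is interpreted as 'pause'."""
    tips, pips = [8, 12, 16, 20], [6, 10, 14, 18]
    extended = sum(landmarks[t].y < landmarks[p].y for t, p in zip(tips, pips))
    return "pause" if extended == 4 else "none"

def handle_frame(frame_bgr) -> bytes:
    """Process one video frame and return a JSON command as bytes."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)  # MediaPipe expects RGB
    results = hands.process(rgb)
    gesture = "none"
    if results.multi_hand_landmarks:
        gesture = classify(results.multi_hand_landmarks[0].landmark)
    return json.dumps({"type": "gesture", "action": gesture}).encode("utf-8")
```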
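
The voice path's guardrail can be sketched as follows, assuming the google-generativeai SDK: Gemini is asked for JSON output, and the parsed result is checked against an allow-list of actions before it may touch the composition state. The model name, prompt, and action set are illustrative, not our exact schema.

```python
# Hedged sketch of constraining LLM output so free-form text can never
# corrupt the composition state. Model name, prompt, and actions are
# illustrative assumptions.
import json

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder, not a real key
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

# Only these actions may reach the composition state.
ALLOWED_ACTIONS = {"generate_section", "regenerate_section",
                   "change_key", "change_tempo"}

PROMPT = ("Convert the user's request into a JSON object with keys "
          "'action' (one of {actions}) and 'params' (an object). "
          "Request: {req}")

def transcript_to_edit(transcript: str) -> dict:
    """Turn a speech transcript into a validated edit command."""
    response = model.generate_content(
        PROMPT.format(actions=sorted(ALLOWED_ACTIONS), req=transcript),
        generation_config={"response_mime_type": "application/json"},
    )
    edit = json.loads(response.text)  # JSON mode keeps this parseable
    if edit.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"rejected action: {edit.get('action')!r}")
    return edit
```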

Challenges we ran into

  • Working around the limitations of the Raspberry Pi 4 and library compatibility issues
  • Bidirectional communication between the local server and the cloud
  • Video-processing latency
  • The Pi's limited number of USB ports
  • Debugging transient errors (a retry sketch follows this list)
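
Many of these issues were transient network or hardware hiccups, so a small retry helper like the sketch below goes a long way when wrapped around flaky calls between the Pi, Vultr, and MongoDB; attempt counts and backoff constants here are assumptions, not tuned values from the project.

```python
# Illustrative retry helper for transient failures; the defaults are
# placeholder values, not constants taken from the project.
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(fn: Callable[[], T], attempts: int = 3,
                 base_delay: float = 0.5) -> T:
    """Run fn, retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```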

Accomplishments that we're proud of

  • Low-latency video-feed processing over the internet
  • Putting a Vultr cloud instance to effective use for compute-heavy vision work
  • Vultr-to-Pi communication over a Tailscale VPN (sketched below)
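
Because Tailscale gives every machine a stable tailnet address, pushing a command from Vultr back to the Pi can be as simple as the sketch below; the IP, port, and newline-delimited framing are assumptions for illustration.

```python
# Hypothetical command push from the Vultr instance to the Pi over the
# tailnet; address, port, and framing are illustrative assumptions.
import json
import socket

PI_TAILNET_ADDR = ("100.64.0.1", 5005)  # placeholder Tailscale IP and port

def send_command(action: str) -> None:
    """Send one newline-delimited JSON command to the Pi."""
    payload = json.dumps({"type": "gesture", "action": action}) + "\n"
    with socket.create_connection(PI_TAILNET_ADDR, timeout=1.0) as sock:
        sock.sendall(payload.encode("utf-8"))
```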

What we learned

  • Containerizing applications
  • Microservice architecture
  • Cloud development
  • Port exposure

What's next for pAImaestro

  • More audio samples
  • More composition features on par with established notation software
  • Improved UI
  • Better hand-gesture model training and lower latency

Built With

  • Python
  • Raspberry Pi 4
  • MediaPipe
  • Vultr
  • Tailscale
  • ElevenLabs
  • Gemini
  • MongoDB
