About the project
Inspiration
Agentic tools are everywhere, but most interfaces still assume you can chase tiny targets with a pointer and memorize nested menus. That breaks down fast for accessibility, hands-busy work, and anyone who wants intent in, action out without handing every utterance to a remote service.
We wanted the same "AI drives the computer" energy people talk about with desktop agents, but with a different north star: local-first, inspectable boundaries, and voice as a first-class input, not dictation dumped into a search box.
So we built VoiceAgents as a small suite with one thesis: two surfaces people actually live in (the open web and an open DAW), controlled through the same mental model.
What we built
VoiceAgents pairs:
- Chromium voice agent (chromium-voice-agent/): A browser extension prototype that turns spoken intent into navigation and control on real pages: tabs, search, scrolling, media, and site-specific flows, so the browser feels less like an obstacle course.
- LMMS agent (lmmsagent/): A DAW-side path through LMMS's AgentControl plugin boundary: transport, tracks, plugins, and session flow via voice and text, with commands staying in the session instead of on a stranger's GPU.
Same pattern in two industries: intent to grounded action to visible result.
We also ship a pitch site (index.html) that explains the story for judges and readers.
How we built it
- Chromium: Manifest V3 extension structure (manifest.json), a service worker (background.js) for orchestration, a speech capture / command path (speech.js), and a small popup UI (popup.html, popup.js) so status and controls stay legible; a sketch of the capture-and-dispatch path follows this list.
- LMMS: An AgentControl plugin under lmmsagent/integrations/lmms/AgentControl/, host patches where needed (integrations/lmms/patches/), and voice + text entry points (lmms-voice-agent/, lmms-text-agent/) sharing a normalized command client (lmmsagent/shared/).
- Design doc: A longer architecture and roadmap lives in chromium-voice-agent/voice_agent_full_plan_v4.md so the implementation has a single north star (local inference, policy, grounding, UX).
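As a rough illustration of how the Chromium pieces hand off to each other, here is a minimal sketch of that capture-and-dispatch path. It assumes speech capture runs in a page context (such as the popup) and uses a made-up message shape and command set; the real speech.js and background.js differ.

```javascript
// Minimal sketch of the capture-and-dispatch path (illustrative, not the real files).

// Page-context side (e.g. the popup): capture an utterance with the Web Speech API
// and hand it to the service worker. The message shape is a placeholder.
const SpeechRec = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRec();
recognition.continuous = true;
recognition.onresult = (event) => {
  const text = event.results[event.results.length - 1][0].transcript.trim();
  chrome.runtime.sendMessage({ type: "voice-utterance", text });
};
recognition.start();

// Service-worker side (background.js analogue): map the utterance to a browser action.
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
  if (message.type !== "voice-utterance") return;
  const text = message.text.toLowerCase();

  if (text.startsWith("new tab")) {
    chrome.tabs.create({});                                           // open a fresh tab
  } else if (text.startsWith("search for ")) {
    chrome.search.query({ text: text.slice("search for ".length) }); // default engine; needs the "search" permission
  } else {
    sendResponse({ ok: false, reason: "unrecognized command" });
    return;
  }
  sendResponse({ ok: true });
});
```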
At a high level, we treat the stack as a pipeline you can reason about:
$$ \text{utterance} \;\Rightarrow\; \text{intent} \;\Rightarrow\; \text{ground on UI/session} \;\Rightarrow\; \text{act} \;\Rightarrow\; \text{confirm / undo} $$
(You do not need heavy math to ship the demo. This just names the loop we optimized for.)
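To show what that loop looks like as code rather than notation, here is a hedged sketch; handleUtterance, parseIntent, groundIntent, act, and describe are illustrative names, not the project's actual API.

```javascript
// Sketch of the utterance -> intent -> ground -> act -> confirm/undo loop.
// Every name here is illustrative; only the shape of the loop is the point.
async function handleUtterance(utterance, surface) {
  const intent = parseIntent(utterance);          // e.g. { action: "scroll", args: { direction: "down" } }
  if (!intent) return { ok: false, feedback: "Didn't catch that." };

  const target = groundIntent(intent, surface);   // bind to a real control: DOM node, track, transport, ...
  if (!target) return { ok: false, feedback: `Nothing to ${intent.action} here.` };

  const undo = await act(intent, target);         // perform the action and capture how to reverse it
  return { ok: true, feedback: describe(intent, target), undo };
}
```

The return value is what the popup (or DAW-side UI) surfaces as confirmation, and the undo handle is what makes "confirm / undo" cheap to offer.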
What we learned
- "Agent" is not a chat bubble. The hard part is binding language to real controls (DOM, media, DAW state) with clear feedback when the mic is live.
- Two domains, one discipline. Browser and DAW feel unrelated until you notice the same failure mode: high-dimensional UIs that punish motor and vision load. Voice lands when grounding and recovery are honest.
- Local-first is a product decision. It changes what you promise: latency, privacy, and who owns the session.
Challenges
- Grounding speech to real UI on arbitrary sites (and in a live DAW) without fragile one-off hacks, balancing regex / heuristics with optional semantic help when phrasing goes human (see the sketch after this list).
- Extension constraints: permissions, lifecycle, and keeping the surface truthful (users always know when audio or automation is in play).
- LMMS integration: staying inside a clean plugin boundary while still making automation predictable for a hackathon timeline.
- Scope: two agents means twice the demo risk. We focused on credible vertical slices rather than pretending to automate everything.
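For a sense of what "regex / heuristics with optional semantic help" means in practice, here is a small illustrative command table with a fallback hook; the patterns and the semanticMatch parameter are placeholders, not the extension's actual grammar.

```javascript
// Sketch: deterministic command matching first, semantic fallback only if nothing hits.
// Patterns and the semanticMatch hook are placeholders, not the real grammar.
const COMMANDS = [
  { pattern: /^(scroll|go) down$/i,                   intent: () => ({ action: "scroll", args: { dy: +1 } }) },
  { pattern: /^(scroll|go) up$/i,                     intent: () => ({ action: "scroll", args: { dy: -1 } }) },
  { pattern: /^search for (.+)$/i,                    intent: (m) => ({ action: "search", args: { query: m[1] } }) },
  { pattern: /^(play|pause)( the)? (video|music)$/i,  intent: (m) => ({ action: "media", args: { verb: m[1].toLowerCase() } }) },
];

function matchCommand(utterance, semanticMatch) {
  for (const { pattern, intent } of COMMANDS) {
    const m = utterance.match(pattern);
    if (m) return intent(m);                              // cheap, deterministic, easy to test
  }
  return semanticMatch ? semanticMatch(utterance) : null; // optional local model when phrasing drifts
}
```

The useful property is that the cheap path stays fully inspectable; the semantic fallback can be a local model (or nothing at all) without changing the callers.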
Try it / links
- Repo layout and entry points: see README.md
- Chromium: load unpacked extension from chromium-voice-agent/
- LMMS: follow lmmsagent/docs and scripts for build/install notes