About the project

Inspiration

Agentic tools are everywhere, but most interfaces still assume you can chase tiny targets with a pointer and memorize nested menus. That breaks down fast for accessibility, hands-busy work, and anyone who wants intent in, action out without handing every utterance to a remote service.

We wanted the same "AI drives the computer" energy people talk about with desktop agents, but with a different north star: local-first, inspectable boundaries, and voice as a first-class input, not dictation dumped into a search box.

So we built VoiceAgents as a small suite with one thesis: two surfaces people actually live in (the open web and an open DAW), controlled through the same mental model.

What we built

VoiceAgents pairs:

  1. Chromium voice agent (chromium-voice-agent/)
    A browser extension prototype that turns spoken intent into navigation and control on real pages: tabs, search, scrolling, media, and site-specific flows, so the browser feels less like an obstacle course.

  2. LMMS agent (lmmsagent/)
    A DAW-side path through LMMS's AgentControl plugin boundary: transport, tracks, plugins, and session flow via voice and text, with commands staying in the session instead of on a stranger's GPU.

Same pattern in two domains: intent to grounded action to visible result.
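On the browser side, a recognized intent ultimately lands on an ordinary extension API call. A minimal sketch of that dispatch step (the intent shape and the action table here are assumptions for illustration, not the extension's actual code):

```javascript
// Illustrative mapping from a recognized voice intent to Chrome extension
// API calls. The verb names and intent shape are hypothetical.
const ACTIONS = {
  "new-tab":   () => chrome.tabs.create({}),
  "close-tab": () => chrome.tabs.query({ active: true, currentWindow: true })
                       .then(([tab]) => chrome.tabs.remove(tab.id)),
};

function dispatch(intent) {
  const action = ACTIONS[intent.verb];
  // Unknown verbs fail loudly instead of guessing at an action.
  if (!action) throw new Error(`no action for "${intent.verb}"`);
  return action();
}
```

Keeping the table explicit is what makes the "visible result" half of the loop cheap: every verb the agent can act on is enumerable, so the popup UI can show exactly what is possible.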

We also ship a pitch site (index.html) that explains the story for judges and readers.

How we built it

  • Chromium: Manifest V3 extension structure (manifest.json), a service worker (background.js) for orchestration, speech capture / command path (speech.js), and a small popup UI (popup.html, popup.js) so status and controls stay legible.
  • LMMS: An AgentControl plugin under lmmsagent/integrations/lmms/AgentControl/, host patches where needed (integrations/lmms/patches/), and voice + text entry points (lmms-voice-agent/, lmms-text-agent/) sharing a normalized command client (lmmsagent/shared/).
  • Design doc: A longer architecture and roadmap lives in chromium-voice-agent/voice_agent_full_plan_v4.md so the implementation has a single north star (local inference, policy, grounding, UX).
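The point of the shared command client is that voice and text entry points emit one shape before anything touches LMMS. A sketch of that normalization step (the verb list and `normalizeCommand` name are illustrative, not the repo's actual API):

```javascript
// Hypothetical normalizer shared by voice and text entry points.
// Verb names and the command shape are assumptions for this sketch.
const VERBS = new Set(["play", "stop", "add-track", "set-tempo"]);

function normalizeCommand(raw) {
  // Lowercase, trim, and split so "Set tempo 120" and "set tempo 120"
  // collapse to the same { verb, args } shape downstream.
  const parts = raw.trim().toLowerCase().split(/\s+/);
  const twoWord = parts.slice(0, 2).join("-"); // try two-word verbs first
  if (VERBS.has(twoWord)) {
    return { verb: twoWord, args: parts.slice(2) };
  }
  if (VERBS.has(parts[0])) {
    return { verb: parts[0], args: parts.slice(1) };
  }
  return null; // unknown commands are rejected, not guessed
}
```

For example, `normalizeCommand("Set tempo 120")` yields `{ verb: "set-tempo", args: ["120"] }`, and anything outside the verb list comes back `null` so the agent never improvises an action.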

At a high level, we treat the stack as a pipeline you can reason about:

$$ \text{utterance} \;\Rightarrow\; \text{intent} \;\Rightarrow\; \text{ground on UI/session} \;\Rightarrow\; \text{act} \;\Rightarrow\; \text{confirm / undo} $$

(You do not need heavy math to ship the demo. This just names the loop we optimized for.)

What we learned

  • "Agent" is not a chat bubble. The hard part is binding language to real controls (DOM, media, DAW state) with clear feedback when the mic is live.
  • Two domains, one discipline. Browser and DAW feel unrelated until you notice the same failure mode: high-dimensional UIs that punish motor and vision load. Voice lands when grounding and recovery are honest.
  • Local-first is a product decision. It changes what you promise: latency, privacy, and who owns the session.

Challenges

  • Grounding speech to real UI on arbitrary sites (and in a live DAW) without fragile one-off hacks: balancing regex/heuristics with optional semantic matching when phrasing drifts from the expected patterns.
  • Extension constraints: permissions, lifecycle, and keeping the surface truthful (users always know when audio or automation is in play).
  • LMMS integration: staying inside a clean plugin boundary while still making automation predictable for a hackathon timeline.
  • Scope: two agents means twice the demo risk. We focused on credible vertical slices rather than pretending to automate everything.
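The regex-first, semantic-fallback balance from the grounding challenge above can be pictured like this (the patterns and the fallback hook are illustrative assumptions, not the shipped matcher):

```javascript
// Sketch of regex-first intent matching with an optional semantic fallback.
// Patterns and intent shapes are hypothetical.
const PATTERNS = [
  { re: /^(?:go to|open)\s+(.+)$/i, intent: (m) => ({ verb: "navigate", target: m[1] }) },
  { re: /^scroll (up|down)$/i,      intent: (m) => ({ verb: "scroll", dir: m[1] }) },
];

function matchIntent(phrase, semanticFallback) {
  for (const { re, intent } of PATTERNS) {
    const m = phrase.trim().match(re);
    if (m) return intent(m); // cheap, deterministic path first
  }
  // Only when exact patterns miss do we pay for fuzzier matching.
  return semanticFallback ? semanticFallback(phrase) : null;
}
```

The ordering is the point: deterministic patterns keep the common case fast and predictable, and the fuzzier path is opt-in, so a misheard phrase degrades to "nothing happened" rather than a surprising action.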

Try it / links

  • Repo layout and entry points: see README.md
  • Chromium: load unpacked extension from chromium-voice-agent/
  • LMMS: follow lmmsagent/ docs and scripts for build/install notes
