About the project

Inspiration

Agentic tools are everywhere, but most interfaces still assume you can chase tiny targets with a pointer and memorize nested menus. That breaks down fast for accessibility, hands-busy work, and anyone who wants intent in, action out without handing every utterance to a remote service.

We wanted the same "AI drives the computer" energy people talk about with desktop agents, but with a different north star: local-first, inspectable boundaries, and voice as a first-class input, not dictation dumped into a search box.

So we built VoiceAgents as a small suite with one thesis: two surfaces people actually live in (the open web and an open DAW), controlled through the same mental model.

What we built

VoiceAgents pairs:

  1. Chromium voice agent (chromium-voice-agent/)
    A browser extension prototype that turns spoken intent into navigation and control on real pages: tabs, search, scrolling, media, and site-specific flows, so the browser feels less like an obstacle course.

  2. LMMS agent (lmmsagent/)
    A DAW-side path through LMMS's AgentControl plugin boundary: transport, tracks, plugins, and session flow via voice and text, with commands staying in the session instead of on a stranger's GPU.

Same pattern in two domains: intent to grounded action to visible result.
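On the browser side, a recognized intent ultimately lands on an ordinary extension API call. A minimal sketch of that dispatch step (the intent shape and the action table here are assumptions for illustration, not the extension's actual code):

```javascript
// Illustrative mapping from a recognized voice intent to Chrome extension
// API calls. The verb names and intent shape are hypothetical.
const ACTIONS = {
  "new-tab":   () => chrome.tabs.create({}),
  "close-tab": () => chrome.tabs.query({ active: true, currentWindow: true })
                       .then(([tab]) => chrome.tabs.remove(tab.id)),
};

function dispatch(intent) {
  const action = ACTIONS[intent.verb];
  // Unknown verbs fail loudly instead of guessing at an action.
  if (!action) throw new Error(`no action for "${intent.verb}"`);
  return action();
}
```

Keeping the table explicit is what makes the "visible result" half of the loop cheap: every verb the agent can act on is enumerable, so the popup UI can show exactly what is possible.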

We also ship a pitch site (index.html) that explains the story for judges and readers.

How we built it

  • Chromium: Manifest V3 extension structure (manifest.json), a service worker (background.js) for orchestration, speech capture / command path (speech.js), and a small popup UI (popup.html, popup.js) so status and controls stay legible.
  • LMMS: An AgentControl plugin under lmmsagent/integrations/lmms/AgentControl/, host patches where needed (integrations/lmms/patches/), and voice + text entry points (lmms-voice-agent/, lmms-text-agent/) sharing a normalized command client (lmmsagent/shared/).
  • Design doc: A longer architecture and roadmap lives in chromium-voice-agent/voice_agent_full_plan_v4.md so the implementation has a single north star (local inference, policy, grounding, UX).
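The point of the shared command client is that voice and text entry points emit one shape before anything touches LMMS. A sketch of that normalization step (the verb list and `normalizeCommand` name are illustrative, not the repo's actual API):

```javascript
// Hypothetical normalizer shared by voice and text entry points.
// Verb names and the command shape are assumptions for this sketch.
const VERBS = new Set(["play", "stop", "add-track", "set-tempo"]);

function normalizeCommand(raw) {
  // Lowercase, trim, and split so "Set tempo 120" and "set tempo 120"
  // collapse to the same { verb, args } shape downstream.
  const parts = raw.trim().toLowerCase().split(/\s+/);
  const twoWord = parts.slice(0, 2).join("-"); // try two-word verbs first
  if (VERBS.has(twoWord)) {
    return { verb: twoWord, args: parts.slice(2) };
  }
  if (VERBS.has(parts[0])) {
    return { verb: parts[0], args: parts.slice(1) };
  }
  return null; // unknown commands are rejected, not guessed
}
```

For example, `normalizeCommand("Set tempo 120")` yields `{ verb: "set-tempo", args: ["120"] }`, and anything outside the verb list comes back `null` so the agent never improvises an action.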

At a high level, we treat the stack as a pipeline you can reason about:

$$ \text{utterance} \;\Rightarrow\; \text{intent} \;\Rightarrow\; \text{ground on UI/session} \;\Rightarrow\; \text{act} \;\Rightarrow\; \text{confirm / undo} $$

(You do not need heavy math to ship the demo. This just names the loop we optimized for.)

What we learned

  • "Agent" is not a chat bubble. The hard part is binding language to real controls (DOM, media, DAW state) with clear feedback when the mic is live.
  • Two domains, one discipline. Browser and DAW feel unrelated until you notice the same failure mode: high-dimensional UIs that punish motor and vision load. Voice lands when grounding and recovery are honest.
  • Local-first is a product decision. It changes what you promise: latency, privacy, and who owns the session.

Challenges

  • Grounding speech to real UI on arbitrary sites (and in a live DAW) without fragile one-off hacks: balancing regex/heuristics with optional semantic matching when phrasing drifts from the expected patterns.
  • Extension constraints: permissions, lifecycle, and keeping the surface truthful (users always know when audio or automation is in play).
  • LMMS integration: staying inside a clean plugin boundary while still making automation predictable for a hackathon timeline.
  • Scope: two agents means twice the demo risk. We focused on credible vertical slices rather than pretending to automate everything.
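The regex-first, semantic-fallback balance from the grounding challenge above can be pictured like this (the patterns and the fallback hook are illustrative assumptions, not the shipped matcher):

```javascript
// Sketch of regex-first intent matching with an optional semantic fallback.
// Patterns and intent shapes are hypothetical.
const PATTERNS = [
  { re: /^(?:go to|open)\s+(.+)$/i, intent: (m) => ({ verb: "navigate", target: m[1] }) },
  { re: /^scroll (up|down)$/i,      intent: (m) => ({ verb: "scroll", dir: m[1] }) },
];

function matchIntent(phrase, semanticFallback) {
  for (const { re, intent } of PATTERNS) {
    const m = phrase.trim().match(re);
    if (m) return intent(m); // cheap, deterministic path first
  }
  // Only when exact patterns miss do we pay for fuzzier matching.
  return semanticFallback ? semanticFallback(phrase) : null;
}
```

The ordering is the point: deterministic patterns keep the common case fast and predictable, and the fuzzier path is opt-in, so a misheard phrase degrades to "nothing happened" rather than a surprising action.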

Try it / links

  • Repo layout and entry points: see README.md
  • Chromium: load unpacked extension from chromium-voice-agent/
  • LMMS: follow lmmsagent/ docs and scripts for build/install notes
