Inspiration
The inspiration for Ultron came from a simple question: Why are our AI assistants trapped in a tab? We realized that while LLMs are incredibly smart, they often feel disconnected from our actual browsing habits. We wanted to build an assistant that doesn't just "talk" but actually executes. Our goal was to create a sleek, "HUD-like" interface that allows you to control your web experience through natural language—making the browser feel like a living, breathing extension of your intent.
What it does
Ultron is a multimodal AI command center built to streamline your digital life.
- Natural Intent Execution: Say "open the place where I watch videos" and Ultron instantly launches YouTube.
- Custom Keywords: Define personal shortcuts like `hamburger → instagram.com` for instant, non-AI navigation.
- Multimodal Vision: Click the camera icon to let Ultron "see" your world and analyze snapshots in real time.
- Voice Intelligence: Full Speech-to-Text (STT) and Text-to-Speech (TTS) support for a hands-free, interactive experience.
- Persistent Private Data: Secure, per-user storage for chat history and personal keywords.
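The custom-keyword path can be as simple as a map lookup that runs before any AI call ever fires. A minimal sketch of that fast path (names here are illustrative, not Ultron's actual implementation):

```javascript
// Sketch of the non-AI keyword fast path (illustrative names, not
// Ultron's actual code). An exact match short-circuits the AI entirely.
const keywords = new Map([
  ["hamburger", "https://instagram.com"],
]);

// Returns a URL for an exact keyword match, or null to fall through to the AI.
function resolveKeyword(input) {
  const key = input.trim().toLowerCase();
  return keywords.get(key) ?? null;
}
```

Because the lookup is checked first, keyword navigation stays instant and free even when the AI backend is slow or unavailable.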
How we built it
We built Ultron with a focus on speed and security:
- AI Engine: Powered by Google Gemini 2.0 Flash via the `@google/generative-ai` SDK for low-latency multimodal reasoning.
- Frontend: A custom-built, glassmorphic UI using Vanilla JavaScript (ES6) and modern CSS3. We opted for a framework-less approach to ensure near-instant load times.
- Backend: A secure Node.js/Express server that handles API interactions, file processing with `Multer`, and persistence logic.
- Deployment: The entire system is containerized with Docker and deployed on Google Cloud Run for serverless scalability.
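Containerizing a Node/Express server for Cloud Run typically needs only a small Dockerfile. A generic sketch (the `server.js` entry point is an assumption, not Ultron's confirmed file layout):

```dockerfile
# Generic sketch of a Cloud Run-ready Node.js container (illustrative,
# not Ultron's exact Dockerfile).
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
# Cloud Run injects the listening port via the PORT env variable (default 8080).
ENV PORT=8080
EXPOSE 8080
CMD ["node", "server.js"]
```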
Challenges we ran into
One major challenge was intent mapping: ensuring the AI could distinguish a general question from a command to open a specific website required precise system prompt engineering. We also faced hurdles with asynchronous media handling, synchronizing voice transcription, image snapshots, and streaming AI responses without blocking the main UI thread. Finally, we had to balance Cloud Run's stateless model against the need for user persistence, ultimately building a lean, file-based data layer that keeps user history intact between deployments.
Accomplishments that we're proud of
- Fluid Performance: Achieving a "human-like" typing experience through streaming responses that make the AI feel alive.
- Multimodal Cohesion: Successfully merging vision and voice into a single interface that doesn't feel cluttered.
- Command Accuracy: Building a system that accurately understands "fuzzy" requests and maps them to the correct web destinations.
- Zero-Dependency UI: Creating a premium design from scratch without relying on heavy external UI kits.
What we learned
We discovered that Gemini 2.0 Flash is exceptionally good at "zero-shot" decision-making when given a list of available tools (commands). We also learned that modern Web APIs (like Web Speech and MediaDevices) are powerful enough to build sophisticated multimodal apps without needing proprietary desktop libraries. Build-wise, we reaffirmed that Vanilla JS often results in a better, snappier user experience than heavy frameworks when performance is priority #1.
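Handing the model a list of available tools can be as simple as interpolating command names into the system prompt. A hedged sketch of that pattern (the commands and wording are made up for illustration, not Ultron's real prompt):

```javascript
// Sketch of zero-shot command routing via the system prompt (illustrative
// commands and wording, not Ultron's actual prompt).
const commands = [
  { name: 'open_url', description: 'Open a website in a new browser tab' },
  { name: 'answer', description: 'Answer a general question in chat' },
];

function buildSystemPrompt(cmds) {
  const list = cmds
    .map((c) => `- ${c.name}: ${c.description}`)
    .join('\n');
  return (
    'You are a browser assistant. Choose exactly one command and reply ' +
    'with JSON of the form {"command": "...", "argument": "..."}.\n' +
    'Available commands:\n' + list
  );
}
```

Constraining the reply to a fixed JSON shape makes the model's decision trivial to parse and keeps "open YouTube" and "explain YouTube's history" on cleanly separate code paths.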
What's next for Ultron
We see Ultron evolving into a "Proactive Agent."
- Action Layers: Moving beyond just opening URLs to performing specific actions within sites (e.g., "Post this code snippet to GitHub").
- External API Sync: Integrating with Google Calendar and Gmail to provide a truly unified assistant experience.
- Contextual Learning: Automatically suggesting custom keywords based on a user's most frequent actions.
- Mobile Companion: Bringing the Ultron experience to mobile browsers via a specialized PWA or Extension.
Built With
- css3
- docker
- dotenv
- express.js
- github
- google-ai-sdk
- google-cloud
- google-cloud-run
- google-gemini-2.0-flash
- html5
- javascript
- json
- media-devices-api
- multer
- node.js
- vanilla-javascript
- web-api
- web-speech-api