Inspiration
We wanted a whiteboard that you can use without touching the mouse or keyboard—especially for accessibility and hands-free teaching or presenting. What if you could say "draw a red circle" or "write Hello" and see it appear? We built VocalCanvas to turn speech into shapes and text in real time, so the canvas responds to your voice.
What we learned
We learned how to wire speech (Web Speech API) through a gateway to an AI command parser (Gemini), then stream the result back over WebSocket so the UI feels responsive. We also learned to send live canvas state (current shapes and positions) with every request so the AI can return real shape IDs for "delete the red circle" or "move the last one"—solving the reference problem without a shared database.
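Concretely, a round trip for "delete the red circle" might look like the pair below. This is an illustrative sketch: the `{ speech, context: { shapes } }` request shape is from our design, but the exact shape fields and command schema shown here are simplified stand-ins.

```javascript
// Hypothetical request/response pair for one parse round-trip.
// The model sees real shape IDs in context.shapes, so it can target one directly.
const request = {
  speech: "delete the red circle",
  context: {
    shapes: [
      { id: "shape-1", type: "circle", fill: "red", x: 120, y: 80, radius: 40 },
      { id: "shape-2", type: "rect", fill: "blue", x: 300, y: 150, width: 90, height: 60 },
    ],
  },
};

// Because "shape-1" appeared in the context, the parser can return it verbatim:
const response = {
  commands: [{ action: "DELETE", id: "shape-1" }],
};
```

Sending the live canvas state on every request is what lets the model answer with an ID that actually exists, instead of a guess the frontend would have to reconcile.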
How we built it
- Frontend (Next.js + React + Konva): Mic capture, WebSocket connection to our canvas service, and a canvas that renders circles, rectangles, and text. We send `{ speech, context: { shapes } }` so the backend knows the current canvas.
- Canvas service (Node + ws): WebSocket server that receives transcripts, calls the gateway parse API, and sends back `{ commands }` so the frontend stays in sync.
- Gateway (Express): Proxies parse requests to the command parser and session save/load requests to the session service.
- Command parser (FastAPI + Gemini): Converts natural language into structured JSON commands (DRAW, WRITE, MOVE, DELETE, RESIZE, ROTATE, CLEAR). The prompt includes `context.shapes` and the canvas dimensions, so it returns exact shape IDs and supports references like "the one on the left" or "the big circle."
- Session service (Express + Firestore): Optional save/load so users can persist and restore whiteboards.
We run everything locally (five services) for the demo; no deployment required.
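The canvas service's core loop can be sketched as one small handler, separate from the WebSocket plumbing. The function and parameter names here are illustrative, and the gateway call is injected so the sketch runs without a network:

```javascript
// Sketch of the canvas service's message handler (names illustrative).
// `callGateway` posts { speech, context } to the gateway's parse API.
async function handleTranscript(rawMessage, callGateway) {
  const { speech, context } = JSON.parse(rawMessage);
  // Forward the transcript plus live canvas state so the parser sees real shape IDs.
  const { commands } = await callGateway({ speech, context });
  // Reply with structured commands; the frontend applies them to the Konva canvas.
  return JSON.stringify({ commands });
}
```

In the real service this sits inside the `ws` server's `message` handler; keeping it as a pure function made it easy to test against a stubbed gateway.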
Challenges we faced
- Timeouts: The LLM can be slow. We added a 45s timeout in the canvas service and 60s in the gateway so the chain (canvas → gateway → parser) completes before the client gives up.
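The timeout pattern can be sketched as a generic wrapper (this is a simplified illustration, not our exact code; the 45s/60s budgets are the values above):

```javascript
// Reject if `promise` doesn't settle within `ms`, and clean up the timer either way.
function withTimeout(promise, ms, label = "upstream") {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Canvas service: 45s budget on the gateway call; the gateway applies the
// same pattern with a 60s budget toward the parser.
// const result = await withTimeout(callGateway(payload), 45_000, "gateway");
```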
- Shape references: Move/delete/resize need to target a specific shape. We fixed this by sending the current shape list (id, type, fill, pixels) with every request and instructing the model to return real IDs; we also added a frontend fallback that resolves placeholders like "last" or "the red circle" when the model returns them.
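The frontend fallback can be sketched as a small resolver. The placeholder strings and shape fields here are illustrative, not our exact schema:

```javascript
// Resolve a placeholder reference from the model ("last", "the red circle")
// into a concrete shape ID, or null so the caller can drop the command.
function resolveShapeRef(ref, shapes) {
  if (!shapes.length) return null;
  if (ref === "last") return shapes[shapes.length - 1].id;
  const words = new Set(ref.toLowerCase().split(/\s+/));
  // Prefer a shape matching both colour and type, then either one.
  const both = shapes.find((s) => words.has(s.fill) && words.has(s.type));
  const either = shapes.find((s) => words.has(s.fill) || words.has(s.type));
  const match = both || either;
  return match ? match.id : null;
}
```

Returning `null` instead of guessing keeps a vague command from deleting the wrong shape.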
- 403 from Gemini: We catch 403s in the command parser and return 200 with an ERROR command plus a clear log message, so the app doesn’t break and developers know to check their API key (e.g. from Google AI Studio).
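Our actual handler lives in the FastAPI parser; the same pattern is sketched here in JavaScript for consistency with the other snippets (names and messages are illustrative):

```javascript
// Translate an upstream Gemini 403 into a 200 response carrying an ERROR
// command, so the client pipeline keeps working and the log points at the fix.
function toClientResponse(upstreamStatus, commands) {
  if (upstreamStatus === 403) {
    console.error(
      "Gemini returned 403 — check your API key (Google AI Studio)."
    );
    return {
      status: 200,
      body: {
        commands: [
          { action: "ERROR", message: "AI service rejected the request" },
        ],
      },
    };
  }
  return { status: 200, body: { commands } };
}
```

Degrading to an ERROR command rather than a 5xx means the WebSocket chain stays up and the UI can surface the problem instead of silently dropping input.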
Built With
- axios
- dotenv
- express.js
- fastapi
- firestore
- gemini
- javascript
- konva (react-konva)
- langchain
- next.js
- node.js
- pip/uvicorn
- python
- react
- web-speech-api
- ws (websocket)