Inspiration
We wanted a whiteboard that you can use without touching the mouse or keyboard—especially for accessibility and hands-free teaching or presenting. What if you could say "draw a red circle" or "write Hello" and see it appear? We built VocalCanvas to turn speech into shapes and text in real time, so the canvas responds to your voice.
What we learned
We learned how to wire speech (Web Speech API) through a gateway to an AI command parser (Gemini), then stream the result back over WebSocket so the UI feels responsive. We also learned to send live canvas state (current shapes and positions) with every request so the AI can return real shape IDs for "delete the red circle" or "move the last one"—solving the reference problem without a shared database.
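Concretely, a round trip for "delete the red circle" might look like the pair below. This is an illustrative sketch: the `{ speech, context: { shapes } }` request shape is from our design, but the exact shape fields and command schema shown here are simplified stand-ins.

```javascript
// Hypothetical request/response pair for one parse round-trip.
// The model sees real shape IDs in context.shapes, so it can target one directly.
const request = {
  speech: "delete the red circle",
  context: {
    shapes: [
      { id: "shape-1", type: "circle", fill: "red", x: 120, y: 80, radius: 40 },
      { id: "shape-2", type: "rect", fill: "blue", x: 300, y: 150, width: 90, height: 60 },
    ],
  },
};

// Because "shape-1" appeared in the context, the parser can return it verbatim:
const response = {
  commands: [{ action: "DELETE", id: "shape-1" }],
};
```

Sending the live canvas state on every request is what lets the model answer with an ID that actually exists, instead of a guess the frontend would have to reconcile.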
How we built it
- Frontend (Next.js + React + Konva): Mic capture, WebSocket connection to our canvas service, and a canvas that renders circles, rectangles, and text. We send `{ speech, context: { shapes } }` so the backend knows the current canvas.
- Canvas service (Node + ws): WebSocket server that receives transcripts, calls the gateway parse API, and sends back `{ commands }` so the frontend stays in sync.
- Gateway (Express): Proxies parse requests to the command parser and session save/load requests to the session service.
- Command parser (FastAPI + Gemini): Converts natural language into structured JSON commands (DRAW, WRITE, MOVE, DELETE, RESIZE, ROTATE, CLEAR). The prompt includes `context.shapes` and the canvas dimensions, so it returns exact shape IDs and supports references like "the one on the left" or "the big circle."
- Session service (Express + Firestore): Optional save/load so users can persist and restore whiteboards.
We run everything locally (five services) for the demo; no deployment required.
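The canvas service's core loop can be sketched as one small handler, separate from the WebSocket plumbing. The function and parameter names here are illustrative, and the gateway call is injected so the sketch runs without a network:

```javascript
// Sketch of the canvas service's message handler (names illustrative).
// `callGateway` posts { speech, context } to the gateway's parse API.
async function handleTranscript(rawMessage, callGateway) {
  const { speech, context } = JSON.parse(rawMessage);
  // Forward the transcript plus live canvas state so the parser sees real shape IDs.
  const { commands } = await callGateway({ speech, context });
  // Reply with structured commands; the frontend applies them to the Konva canvas.
  return JSON.stringify({ commands });
}
```

In the real service this sits inside the `ws` server's `message` handler; keeping it as a pure function made it easy to test against a stubbed gateway.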
Challenges we faced
- Timeouts: The LLM can be slow. We added a 45s timeout in the canvas service and 60s in the gateway so the chain (canvas → gateway → parser) completes before the client gives up.
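The timeout pattern can be sketched as a generic wrapper (this is a simplified illustration, not our exact code; the 45s/60s budgets are the values above):

```javascript
// Reject if `promise` doesn't settle within `ms`, and clean up the timer either way.
function withTimeout(promise, ms, label = "upstream") {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Canvas service: 45s budget on the gateway call; the gateway applies the
// same pattern with a 60s budget toward the parser.
// const result = await withTimeout(callGateway(payload), 45_000, "gateway");
```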
- Shape references: Move/delete/resize need to target a specific shape. We fixed this by sending the current shape list (id, type, fill, pixels) with every request and instructing the model to return real IDs; we also added a frontend fallback that resolves placeholders like "last" or "the red circle" when the model returns them.
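The frontend fallback can be sketched as a small resolver. The placeholder strings and shape fields here are illustrative, not our exact schema:

```javascript
// Resolve a placeholder reference from the model ("last", "the red circle")
// into a concrete shape ID, or null so the caller can drop the command.
function resolveShapeRef(ref, shapes) {
  if (!shapes.length) return null;
  if (ref === "last") return shapes[shapes.length - 1].id;
  const words = new Set(ref.toLowerCase().split(/\s+/));
  // Prefer a shape matching both colour and type, then either one.
  const both = shapes.find((s) => words.has(s.fill) && words.has(s.type));
  const either = shapes.find((s) => words.has(s.fill) || words.has(s.type));
  const match = both || either;
  return match ? match.id : null;
}
```

Returning `null` instead of guessing keeps a vague command from deleting the wrong shape.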
- 403 from Gemini: We catch 403s in the command parser and return 200 with an ERROR command plus a clear log message, so the app doesn’t break and developers know to check their API key (e.g. from Google AI Studio).
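Our actual handler lives in the FastAPI parser; the same pattern is sketched here in JavaScript for consistency with the other snippets (names and messages are illustrative):

```javascript
// Translate an upstream Gemini 403 into a 200 response carrying an ERROR
// command, so the client pipeline keeps working and the log points at the fix.
function toClientResponse(upstreamStatus, commands) {
  if (upstreamStatus === 403) {
    console.error(
      "Gemini returned 403 — check your API key (Google AI Studio)."
    );
    return {
      status: 200,
      body: {
        commands: [
          { action: "ERROR", message: "AI service rejected the request" },
        ],
      },
    };
  }
  return { status: 200, body: { commands } };
}
```

Degrading to an ERROR command rather than a 5xx means the WebSocket chain stays up and the UI can surface the problem instead of silently dropping input.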
Built With
- axios
- dotenv
- express.js
- fastapi
- firestore
- gemini
- javascript
- konva (react-konva)
- langchain
- next.js
- node.js
- pip/uvicorn
- python
- react
- web-speech-api
- ws (websocket)