Inspiration
College students often rewatch lectures or reread slides to understand concepts, yet still struggle to get clear explanations in real time. We wanted a tutor that actually knows your lectures and can talk back naturally, so we built a lecture-aware voice tutor that lets students select a course and lecture (or assignment), ask questions by voice or text, and receive spoken answers grounded directly in their slides, complete with citations. The tutor responds conversationally and can be interrupted mid-sentence, just like a real human tutor, creating a more natural and interactive learning experience. Our goal was not just to build another chatbot, but to design an AI system that integrates seamlessly into real academic workflows and can scale beyond a single interface.
How We Built It
We built the system with a Next.js frontend that provides a dashboard and real-time tutor interface, allowing students to switch between lecture mode and assignment mode while interacting through voice or text. The backend is powered by an Express server with WebSocket connections to maintain persistent, low-latency sessions. The server scans a structured courses directory, chunks and embeds lecture slides, and maintains an in-memory vector index to support Retrieval-Augmented Generation (RAG). Our AI and voice pipeline streams Speech-to-Text input, retrieves relevant slide context, generates streaming LLM responses, and converts them into real-time Text-to-Speech playback. A single WebSocket manages session state, audio streaming, and interruption handling. For assignments, we implemented a strict hint-only policy to ensure the tutor provides conceptual guidance and explanations without revealing full solutions.
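The in-memory vector index at the heart of the RAG step can be sketched as follows. This is a minimal illustration, not our actual implementation: in the real pipeline the chunk vectors come from an embedding API, while here the vectors are supplied directly so the ranking logic is self-contained. The `Chunk` shape and `VectorIndex` name are illustrative assumptions.

```typescript
// A slide chunk with its (assumed precomputed) embedding vector.
type Chunk = { lectureId: string; slide: number; text: string; vector: number[] };

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Brute-force in-memory index: rank every chunk against the query embedding
// and return the top-k matches to feed into the LLM prompt as context.
class VectorIndex {
  private chunks: Chunk[] = [];
  add(chunk: Chunk): void {
    this.chunks.push(chunk);
  }
  search(query: number[], topK = 3): Chunk[] {
    return [...this.chunks]
      .sort((x, y) => cosine(query, y.vector) - cosine(query, x.vector))
      .slice(0, topK);
  }
}
```

A linear scan like this is plenty fast at the scale of one student's lecture slides, which is why an in-memory store works here without a dedicated vector database.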
A key architectural decision was designing StudyBuddy to function as both a native AI tutor and a reusable tool layer for a Microsoft Copilot Studio agent. To enable this, we built a set of REST API endpoints that expose StudyBuddy’s capabilities as agent-callable tools. These endpoints include GET /actions/listCourses to list all available courses and lectures, POST /actions/searchSlides to perform RAG-based vector search across slides, POST /actions/getSlides to retrieve full lecture content, and POST /actions/generateQuiz to dynamically generate quiz questions grounded in slide material. Each of these endpoints represents a structured capability that an external AI agent can invoke as part of its reasoning process.
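The shape of that tool layer can be sketched as a dispatch table from action names to handlers. This is a hedged illustration: in the real backend each action is an Express route under `/actions/*`, and the handler bodies and payload fields shown here are placeholder assumptions, not our actual response schemas.

```typescript
// Each agent-callable action maps to a handler taking a JSON body.
type ToolHandler = (body: Record<string, unknown>) => unknown;

const tools: Record<string, ToolHandler> = {
  // GET /actions/listCourses — enumerate courses found in the courses directory
  listCourses: () => ({ courses: [] /* filled from the scanned directory */ }),
  // POST /actions/searchSlides — RAG vector search across slide chunks
  searchSlides: (body) => ({ query: body.query, hits: [] }),
  // POST /actions/getSlides — full lecture content for a given lecture
  getSlides: (body) => ({ lectureId: body.lectureId, slides: [] }),
  // POST /actions/generateQuiz — quiz questions grounded in slide material
  generateQuiz: (body) => ({ lectureId: body.lectureId, questions: [] }),
};

// A single entry point lets both the in-process agent loop and the HTTP
// routes invoke the same capability by name.
function invoke(action: string, body: Record<string, unknown>): unknown {
  const handler = tools[action];
  if (!handler) throw new Error(`unknown action: ${action}`);
  return handler(body);
}
```

Centralizing the handlers like this is what makes the dual exposure cheap: the WebSocket agent loop and the REST endpoints are thin wrappers over the same table.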
We defined these tools in an openapi.yaml specification so Microsoft Copilot Studio can import the schema and automatically understand the available actions, parameters, and response formats. This allows Copilot Studio to treat StudyBuddy’s backend as a fully typed plugin without custom integration code. Internally, the same tool logic powers our WebSocket-based agent loop inside agentStreamResponse() in session.ts, where the LLM calls functions such as searchSlides or generateQuiz in-process during a conversation. Externally, the identical capabilities are exposed over HTTP through the /actions/* endpoints, enabling a Microsoft Copilot Studio agent to orchestrate retrieval, slide access, and quiz generation as plugin actions. This dual architecture cleanly separates orchestration from execution and demonstrates practical, scalable integration of Microsoft AI agents into a real-world academic system.
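An illustrative excerpt of what the `openapi.yaml` entry for one action looks like is below; the exact field names and schemas in our spec may differ, so treat this as a representative sketch rather than the shipped file.

```yaml
openapi: 3.0.3
info:
  title: StudyBuddy Actions
  version: "1.0"
paths:
  /actions/searchSlides:
    post:
      operationId: searchSlides
      summary: RAG-based vector search across lecture slides
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [query]
              properties:
                query: { type: string }
                courseId: { type: string }
      responses:
        "200":
          description: Ranked slide chunks with citations
```

The `operationId` is what Copilot Studio surfaces as the action name, which is why keeping it identical to the in-process function name keeps the two orchestration paths aligned.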
Throughout development, we leveraged Microsoft Copilot Agent tools to accelerate debugging, iterate on protocol design, and refine our architecture.
What We Learned
Building a real-time AI tutor requires more than connecting APIs; it requires orchestrating a carefully managed streaming system and designing clear boundaries between reasoning and execution. We learned how to coordinate the full pipeline from speech recognition to grounded retrieval to response generation and speech synthesis while maintaining low latency. We implemented barge-in support so users can interrupt mid-response, disabled microphone streaming during playback to prevent echo, flushed speech buffers to avoid self-transcription, and preserved full session history so the tutor can reference earlier questions naturally. We also reinforced strict grounding to ensure responses remain tied to lecture content and reduce hallucinations. Architecturally, we learned the importance of exposing domain-specific capabilities as structured tools via OpenAPI, enabling both internal LLM function calls and external Copilot Studio agent integration through the same unified interface.
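The barge-in and echo-prevention logic above reduces to a small state transition. The sketch below is illustrative, assuming a simplified session state; the real `session.ts` manages much more (audio streams, history, retrieval context), and the names here are not the actual ones.

```typescript
// Simplified view of the per-session flags relevant to barge-in.
interface SessionState {
  speaking: boolean;    // TTS audio currently playing to the user
  micEnabled: boolean;  // mic streaming is paused during playback to avoid echo
  sttBuffer: string[];  // pending speech-to-text fragments
}

// When the user interrupts mid-response: cancel playback immediately,
// flush any transcript of the tutor's own audio (self-transcription),
// and re-enable the mic so the interruption is captured cleanly.
function handleBargeIn(_state: SessionState): SessionState {
  return { speaking: false, micEnabled: true, sttBuffer: [] };
}
```

Modeling it as a pure transition made the bug class easy to reason about: every interruption path must land the session in exactly this state before the next turn begins.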
Challenges
One of our main challenges was echo and self-transcription, where the tutor’s own voice was being captured as user input. We resolved this by disabling microphone streaming during playback and clearing the speech recognition buffer. Another challenge involved interruption handling, as previous audio responses continued playing after new questions were asked; this was solved by implementing immediate playback cancellation and resetting the streaming pipeline. Preventing assignment answer leakage required careful prompt engineering and guardrails to enforce a hint-only approach. Maintaining contextual continuity required preserving full conversation history while ensuring the system remained responsive and efficient. Designing the OpenAPI integration layer also required careful schema definition to ensure Copilot Studio could reliably interpret and invoke backend actions.
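The hint-only guardrail is ultimately a mode-dependent system prompt. A minimal sketch of that switch is below; the actual prompt wording we shipped is more elaborate, so this is an assumption-labeled illustration of the structure, not the real prompt text.

```typescript
// The tutor runs in one of two modes, as described above.
type Mode = "lecture" | "assignment";

// Select the system prompt by mode: assignment mode forbids full solutions,
// lecture mode allows full explanations grounded in retrieved slides.
function systemPrompt(mode: Mode): string {
  if (mode === "assignment") {
    return [
      "You are a tutor. Give conceptual hints and explanations only.",
      "Never reveal a full solution or final answer.",
    ].join(" ");
  }
  return "You are a tutor. Explain concepts grounded in the provided slides, with citations.";
}
```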
Result
The final result is a scalable, lecture-aware voice tutor that grounds answers in course slides, supports real-time interruption, maintains conversational memory, and differentiates clearly between lecture explanation mode and assignment hint mode. More importantly, StudyBuddy functions as both an intelligent tutoring interface and a reusable AI capability layer that can be orchestrated by a Microsoft Copilot Studio agent through standardized tool definitions. By integrating Microsoft AI agent tooling through OpenAPI-defined actions and structured plugin endpoints, we created an interruptible, citation-grounded AI tutor that feels natural, responsive, extensible, and ready to integrate into broader academic ecosystems.
Built With
- api
- audio
- azure
- azure-openai
- azure-speech
- copilot
- deepgram
- elevenlabs
- express.js
- gpt-4
- gpt-4o-mini
- javascript
- jspdf
- markdown
- next.js
- node.js
- openai
- pdf-parse
- rag
- react
- react-markdown
- search
- speech
- tsx
- typescript
- vector
- web
- web-audio-api
- web-speech-api
- websockets