abhinandan2699/study-buddy
StudyBuddy — Clemson Lecture-Aware Voice Tutor

A real-time voice tutor that answers questions about your lecture content using RAG (Retrieval-Augmented Generation), streaming STT, LLM, and TTS with full barge-in/interrupt support.


What was done (for reference)

  • Voice when ElevenLabs fails: If ElevenLabs returns an error (e.g. quota exceeded or voice not found), the server sends tts.fallback with the reply text and the browser uses built-in TTS (your OS voice) so you still get spoken replies.
  • Echo fix: The mic was being transcribed while the tutor was speaking, so the speaker output was showing up as your next message. Now the server does not forward mic audio to STT while the tutor is in the "speaking" state, and the Azure STT buffer is flushed when the tutor starts speaking, so the tutor’s voice is never transcribed as you. Use the interrupt (stop) button to cut off the tutor and ask a new question.
  • ElevenLabs 404 voice_not_found: If the configured ELEVENLABS_VOICE_ID is not found (404), the app retries once with the default voice so TTS can work without changing .env.
  • Interrupt: Interrupt (stop) now also cancels browser TTS (SpeechSynthesis) so playback stops immediately.
  • Stop old voice when starting a new question: When you type or speak a new question while the tutor is still talking, the old TTS is stopped so the new answer's voice plays from the start. The client stops playback when (1) you send a new message (Send or Enter) and (2) the server signals a new response (tutor.state === 'thinking'), so the transition to the next answer is smooth with no overlap.
  • Conversation memory: The tutor now receives the full conversation history for the current session (all previous questions and answers), not just the last few turns, so it can refer back to earlier questions and give consistent, context-aware answers.
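
The echo fix above can be sketched as a small state gate on the server: mic chunks are only forwarded to STT while the tutor is not speaking, and the STT buffer is flushed on the transition into "speaking". This is an illustrative sketch, not the actual session.ts implementation; the class and method names are assumptions.

```typescript
// Hypothetical sketch of the echo-gating logic described above.
type TutorState = "listening" | "thinking" | "speaking";

class SessionGate {
  private state: TutorState = "listening";

  setState(next: TutorState): void {
    if (next === "speaking" && this.state !== "speaking") {
      this.flushSttBuffer(); // discard partial transcript of the tutor's own voice
    }
    this.state = next;
  }

  // Returns true only when a mic chunk should reach speech-to-text.
  shouldForwardAudio(): boolean {
    return this.state !== "speaking";
  }

  private flushSttBuffer(): void {
    // reset the recognizer's pending audio (provider-specific in the real app)
  }
}
```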

Where the voice tutor lives: The tutor UI with all the above fixes is at /courses/[courseId]/study-buddy (e.g. open Dashboard → click a course → click StudyBuddy). The root / shows the dashboard (course cards), not the old single-page tutor.

  • Lecture vs Assignment mode: In Session Setup you can choose Topic type: Lecture or Assignment. For Lecture, the tutor uses RAG over that lecture’s slides (same as before). For Assignment, it uses the selected assignment’s markdown: it gives overview, deliverables, and hints only — it will not give full solutions. If the student asks for the answer or solution, it politely declines and offers hints instead.
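
The Lecture/Assignment split above amounts to choosing a different system prompt per topic type. A minimal sketch, assuming hypothetical prompt wording and function names (the app's actual prompts may differ):

```typescript
// Illustrative mode switch for the tutor's system prompt.
type TopicType = "lecture" | "assignment";

function systemPrompt(mode: TopicType): string {
  if (mode === "lecture") {
    // Lecture mode: ground answers in retrieved slide excerpts.
    return "Answer using the retrieved slide excerpts and cite slide numbers.";
  }
  // Assignment mode: overview, deliverables, and hints only.
  return [
    "You are helping with an assignment.",
    "Give an overview, the deliverables, and hints only.",
    "Never provide full solutions; if asked for the answer, politely decline and offer a hint instead.",
  ].join(" ");
}
```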

Quick Start

Prerequisites

  • Node.js 18+ (recommend 20+)
  • npm 9+
  • API keys (see below)

1. Install dependencies

npm install

2. Configure environment

cp .env.example .env

Edit .env and add your API keys:

Key                  Required?              Notes
OPENAI_API_KEY       Yes (if using OpenAI)  For embeddings + LLM
DEEPGRAM_API_KEY     Recommended            Streaming speech-to-text
ELEVENLABS_API_KEY   Optional               Voice output; text-only without it
ELEVENLABS_VOICE_ID  Optional               Defaults to a built-in voice
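
A resulting .env might look like this (placeholder values only; the real keys come from your provider dashboards):

```shell
OPENAI_API_KEY=sk-your-key-here
DEEPGRAM_API_KEY=your-deepgram-key
ELEVENLABS_API_KEY=your-elevenlabs-key
# Optional: leave unset to use the default voice
ELEVENLABS_VOICE_ID=your-voice-id
```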

3. Build shared types

npm run build:shared

4. Start the app

npm run dev

This starts:

  • Backend at http://localhost:3001 (WebSocket at ws://localhost:3001/ws)
  • Frontend at http://localhost:3000

5. Use it

  1. Open http://localhost:3000 (Dashboard).
  2. Click a course (e.g. CPSC1020), then click StudyBuddy to open the voice tutor (/courses/CPSC1020/study-buddy).
  3. Select a lecture from the dropdown and click Start Tutor.
  4. Click the mic button or type a question.
  5. The tutor answers using voice + text with slide citations.
  6. Interrupt anytime by clicking the red stop button or by sending a new message (old voice stops, new answer plays).

Lecture Directory Structure

Place your lectures under the lectures/ directory (or set LECTURE_ROOT in .env):

lectures/
  CPSC2120/
    Lecture01/
      slides.txt    # or slides.pdf
    Lecture02/
      slides.pdf
  STAT3090/
    Lecture01/
      slides.txt

Text file format

Use --- slide N --- markers:

--- slide 1 ---
Title and content of slide 1...

--- slide 2 ---
Content of slide 2...
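
Parsing this format is a matter of splitting on the marker lines. A minimal sketch (the function name is illustrative, not the actual lectureStore.ts code):

```typescript
interface Slide {
  number: number;
  text: string;
}

// Split the file on `--- slide N ---` markers; the captured group is the
// slide number, so parts alternate [preamble, num1, body1, num2, body2, ...].
function parseSlides(raw: string): Slide[] {
  const parts = raw.split(/^---\s*slide\s+(\d+)\s*---\s*$/im);
  const slides: Slide[] = [];
  for (let i = 1; i + 1 < parts.length; i += 2) {
    slides.push({ number: Number(parts[i]), text: parts[i + 1].trim() });
  }
  return slides;
}
```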

PDF files

Each page is treated as one slide. Text is extracted automatically (best-effort).

Architecture

studybuddy/
├── apps/
│   ├── server/          # Node.js + Express + WebSocket backend
│   │   └── src/
│   │       ├── index.ts          # Entry point
│   │       ├── config.ts         # Environment config
│   │       ├── session.ts        # WebSocket session handler
│   │       └── services/
│   │           ├── lectureStore.ts  # Scans & parses lectures
│   │           ├── embeddings.ts    # OpenAI embeddings
│   │           ├── vectorIndex.ts   # In-memory vector search
│   │           ├── llm.ts           # Streaming LLM
│   │           ├── tts.ts           # ElevenLabs TTS
│   │           └── stt.ts           # Deepgram/Whisper STT
│   └── web/             # Next.js frontend
│       └── src/
│           ├── app/page.tsx        # Main UI
│           └── lib/
│               ├── ws-client.ts    # WebSocket client
│               ├── mic-capture.ts  # Mic → PCM16
│               └── audio-player.ts # Audio queue + playback
├── packages/
│   └── shared/          # Shared TypeScript types
├── lectures/            # Sample lecture content
├── .env.example
└── README.md

Realtime Protocol

WebSocket messages use JSON. See packages/shared/src/index.ts for full type definitions.

Client → Server:

  • session.start — begin tutoring session for a course/lecture
  • audio.chunk — streaming mic audio (PCM16 base64)
  • user.text — typed question
  • interrupt — cancel current response
  • session.stop — end session

Server → Client:

  • session.ready — available courses/lectures
  • stt.partial / stt.final — speech transcription
  • rag.citations — retrieved slide references
  • llm.token — streaming LLM tokens
  • tts.audio — audio chunks (MP3 base64)
  • tutor.state — listening / thinking / speaking
  • error — error message
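
As discriminated unions, the messages above might look roughly like this. These shapes are illustrative; the authoritative definitions live in packages/shared/src/index.ts and the field names here are assumptions:

```typescript
// Illustrative protocol message shapes (not the real shared types).
type ClientMessage =
  | { type: "session.start"; courseId: string; lectureId: string }
  | { type: "audio.chunk"; pcm16: string } // base64-encoded PCM16
  | { type: "user.text"; text: string }
  | { type: "interrupt" }
  | { type: "session.stop" };

type ServerMessage =
  | { type: "session.ready"; courses: string[] }
  | { type: "stt.partial"; text: string }
  | { type: "stt.final"; text: string }
  | { type: "rag.citations"; slides: number[] }
  | { type: "llm.token"; token: string }
  | { type: "tts.audio"; mp3: string } // base64-encoded MP3
  | { type: "tutor.state"; state: "listening" | "thinking" | "speaking" }
  | { type: "error"; message: string };

// Narrowing on `type` gives typed access to the payload.
function isSpeaking(msg: ServerMessage): boolean {
  return msg.type === "tutor.state" && msg.state === "speaking";
}
```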

Troubleshooting

"Microphone not working"

  • Ensure you're on localhost (or HTTPS) — browsers require secure contexts for mic access
  • Check browser permissions: click the lock icon in the address bar
  • Try Chrome or Edge (best WebAudio support)

"Tutor's own voice is transcribed as my next message" (echo)

  • The app mutes STT while the tutor is speaking: mic audio is not sent to speech-to-text during TTS playback, so the speaker output is not transcribed as your next question. Use the interrupt (stop) button to cut off the tutor and ask a new question.

Still seeing echo or old voice playing after a new question?

  • Use the correct page: Open Dashboard → click a course → click StudyBuddy. The tutor is at /courses/[courseId]/study-buddy.
  • Echo with browser TTS: If ElevenLabs fails (e.g. invalid API key), the app falls back to browser TTS. The client now gates the mic: it does not send audio to the server while the tutor's voice is playing (ElevenLabs or browser), so the tutor's reply should no longer be transcribed as your message. Restart the app and do a hard refresh (Cmd+Shift+R / Ctrl+Shift+R).

"No courses found"

  • Check that LECTURE_ROOT in .env points to a directory with the correct structure
  • Click "Rescan Lectures" in the UI
  • Check server logs for scan output

"No voice output" / "Voice used to work but now it's silent"

  • ElevenLabs quota: If the server logs [tts] ElevenLabs error 401 with quota_exceeded, your API key has run out of credits. The app will automatically fall back to browser TTS (your OS voice) so you still get spoken replies; text is unchanged.
  • To use ElevenLabs again: top up credits at elevenlabs.io or add a new API key in .env.
  • Without ELEVENLABS_API_KEY at all, the tutor uses text-only mode (or browser TTS when the server sends tts.fallback).
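
The browser-TTS fallback amounts to speaking the tts.fallback text with the Web Speech API. A minimal sketch; the interface indirection just keeps it self-contained (in the browser you would pass window.speechSynthesis and wrap the text in a SpeechSynthesisUtterance), and the handler wiring is an assumption:

```typescript
// Minimal shape of the speech synthesizer the sketch needs.
interface Synth {
  cancel(): void;
  speak(utterance: { text: string }): void;
}

// Speak a fallback reply, cutting off any utterance still playing first
// (this is what keeps old and new answers from overlapping).
function speakFallback(synth: Synth, text: string): void {
  synth.cancel();
  synth.speak({ text }); // browser: new SpeechSynthesisUtterance(text)
}
```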

"WebSocket connection failed"

  • Make sure the server is running on port 3001
  • Check for firewall/proxy issues
  • Look at browser console for connection errors

"Embeddings failed"

  • Verify OPENAI_API_KEY is set and valid
  • Check server logs for API errors
  • The app caches embeddings in .cache/embeddings.json for faster restarts
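
The cache mentioned above works because embedding calls are deterministic per chunk of text, so unchanged slides never need a second API call. A sketch of the idea, keyed by a content hash (in-memory here for illustration; the cache class and its shape are assumptions, not the app's actual code):

```typescript
import { createHash } from "crypto";

// Illustrative embeddings cache: unchanged text reuses the stored vector.
class EmbeddingCache {
  private store = new Map<string, number[]>();

  private key(text: string): string {
    return createHash("sha256").update(text).digest("hex");
  }

  async embed(
    text: string,
    compute: (t: string) => Promise<number[]> // the real embeddings API call
  ): Promise<number[]> {
    const k = this.key(text);
    const hit = this.store.get(k);
    if (hit) return hit;              // cache hit: no API call
    const vec = await compute(text);  // cache miss: compute and remember
    this.store.set(k, vec);
    return vec;
  }
}
```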

Tech Stack

  • Frontend: Next.js 14, React 18, Web Audio API
  • Backend: Node.js, Express, ws (WebSocket)
  • STT: Deepgram (streaming) or Whisper (fallback)
  • LLM: OpenAI GPT-4o-mini (or Azure OpenAI)
  • TTS: ElevenLabs
  • RAG: In-memory vector search with OpenAI embeddings
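
The in-memory vector search in the last bullet reduces to cosine similarity over embedded slide chunks. A minimal sketch of the retrieval step (illustrative, not the actual vectorIndex.ts code):

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored chunks against a query embedding and keep the best k.
function topK(
  query: number[],
  docs: { id: number; vec: number[] }[],
  k: number
): { id: number; score: number }[] {
  return docs
    .map((d) => ({ id: d.id, score: cosine(query, d.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```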
