Smart Interview
Inspiration
We noticed that most mock interview platforms rely purely on text-based interactions — you type your answers into a chatbot. The few that offer voice input are English-only, which excludes non-native English speakers who want to practice in their native language.
But the bigger problem? Zero platforms support American Sign Language. Deaf and hard-of-hearing candidates are completely locked out of AI-powered interview prep tools. Even with human interpreters in real interviews, there's often delay and loss of nuance.
We wanted to fix both problems: build a voice-first interview platform that supports multiple languages (starting with English and Spanish) and — for the first time — make it accessible to ASL users through real-time computer vision. No more typing. No more barriers. Just practice the way you'd actually interview.
What it does
Smart Interview reads your resume and conducts personalized mock interviews in English voice, Spanish voice, or American Sign Language.
Here's the flow:
- Upload your resume — we parse it with `pdfplumber`, extract sections (projects, skills, experience), and store embeddings in ChromaDB
- Choose your language — English, Spanish, or ASL
- Get personalized questions — Groq's LLaMA 3.3 70B generates 8 technical questions based on your actual resume content
- Answer in your preferred mode:
- Voice (EN/ES): Web Speech API transcribes, ElevenLabs speaks the questions
- ASL: Your camera captures hand landmarks via MediaPipe; our RandomForest classifier recognizes signs letter by letter and assembles them into words
- RAG-powered follow-ups — After each answer, we retrieve relevant context from your resume and generate a contextual follow-up question
- Resume screening — 4 ML models predict your resume category, recommend job roles, and extract skills/education
Every interview adapts to your experience — no generic LeetCode questions. If your resume mentions React, we ask about React. If you built a distributed system, we drill into that.
How we built it
Frontend (Next.js 15 + TypeScript)
- Auth flow with Supabase (email/password)
- Setup page: resume upload + language selection
- Three interview modes:
  - English/Spanish Voice: Web Speech API (`en-US`/`es-ES`) for input, ElevenLabs TTS for output
  - ASL: MediaStream API captures camera frames and sends base64-encoded images to the backend at $f = 10$ FPS
- Dashboard with session history and resume screening results
- Framer Motion animations, Tailwind styling, React Three Fiber for 3D visuals
Backend (FastAPI + Python 3.13)
Resume Parsing & RAG Pipeline
PDF Extraction:
Resume.pdf → pdfplumber → Text Extraction → Section Chunking
We split documents by semantic sections (Experience, Projects, Skills) rather than fixed-size windows. For each chunk $c_i$:
$$\text{embedding}(c_i) = \text{MiniLM}(c_i) \in \mathbb{R}^{384}$$
These embeddings are stored in ChromaDB for similarity search.
Question Generation
Given resume embeddings $\{e_1, e_2, \ldots, e_n\}$, we prompt Groq LLaMA 3.3 70B:
$$Q = \text{LLM}(\text{prompt} \mid \text{resume\_context})$$
The model generates:
- 8 technical questions based on projects/skills
- 5 behavioral questions (STAR method)
RAG-Powered Follow-ups
After user answers question $Q_i$ with response $R_i$:
Query ChromaDB for top-$k$ relevant chunks: $$\text{chunks} = \text{TopK}\left(\text{similarity}(\text{embedding}(R_i), \{e_1, \ldots, e_n\}), k=3\right)$$
Generate follow-up: $$Q_{i+1} = \text{LLM}(Q_i, R_i, \text{chunks})$$
This ensures follow-ups are contextually grounded in the candidate's actual experience.
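The retrieval step above can be sketched in plain Python. This uses toy 3-d vectors standing in for the 384-d MiniLM embeddings, and a hand-rolled cosine similarity in place of ChromaDB's index; the function names are illustrative:

```python
import math

def cosine(a, b):
    # cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k_chunks(answer_emb, chunk_embs, k=3):
    # indices of the k resume chunks most similar to the answer embedding
    ranked = sorted(range(len(chunk_embs)),
                    key=lambda i: cosine(answer_emb, chunk_embs[i]),
                    reverse=True)
    return ranked[:k]

# toy 3-d vectors standing in for 384-d MiniLM embeddings
chunks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
answer = [1.0, 0.05, 0.0]
print(top_k_chunks(answer, chunks))  # [0, 2, 1]
```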
ASL Recognition Pipeline
Hand Landmark Extraction:
MediaPipe detects 21 hand landmarks per frame. For landmark $j$:
$$\mathbf{p}_j = (x_j, y_j, z_j) \in \mathbb{R}^3$$
Feature vector for one frame:
$$\mathbf{f} = [x_1, y_1, z_1, x_2, y_2, z_2, \ldots, x_{21}, y_{21}, z_{21}] \in \mathbb{R}^{63}$$
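A minimal sketch of that flattening step, assuming landmarks arrive as a list of 21 $(x, y, z)$ tuples from MediaPipe:

```python
def landmarks_to_features(landmarks):
    # Flatten 21 (x, y, z) MediaPipe hand landmarks into the 63-dim
    # feature vector f = [x1, y1, z1, ..., x21, y21, z21]
    assert len(landmarks) == 21, "MediaPipe emits 21 landmarks per hand"
    return [coord for point in landmarks for coord in point]

# toy frame: 21 identical landmarks
frame = [(0.5, 0.5, 0.0)] * 21
features = landmarks_to_features(frame)
print(len(features))  # 63
```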
Sign Classification:
We trained a RandomForest classifier on 87,000 images from the Kaggle ASL Alphabet dataset.
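The training setup looks roughly like this sketch. The data here is synthetic (two well-separated toy classes); the real model trains on 63-dim landmark features extracted from the Kaggle images, and the hyperparameters shown are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for two well-separated sign classes; the real model
# trains on 63-dim landmark features from the Kaggle ASL Alphabet images.
X = np.vstack([rng.normal(0.2, 0.02, size=(100, 63)),
               rng.normal(0.8, 0.02, size=(100, 63))])
y = ["A"] * 100 + ["B"] * 100

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

print(clf.predict([[0.2] * 63, [0.8] * 63]))  # ['A' 'B']
```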
Text-to-Speech
ElevenLabs API with model selection:
- Spanish: `eleven_multilingual_v2`
- English: `eleven_turbo_v2`
Average latency: $\mu = 1.8$ seconds, $\sigma = 0.4$ seconds
Tech Stack Summary
| Layer | Technology |
|---|---|
| Frontend | Next.js 15, React 19, TypeScript, Tailwind CSS, Framer Motion |
| Backend | FastAPI, Python 3.13, Uvicorn |
| LLM | Groq — LLaMA 3.3 70B (via OpenRouter-compatible client) |
| RAG | ChromaDB (in-memory), MiniLM embeddings, pdfplumber |
| TTS | ElevenLabs (eleven_multilingual_v2 for Spanish, eleven_turbo_v2 for English) |
| Speech Input | Web Speech API (en-US / es-ES) |
| ASL | MediaPipe Hand Landmarks + scikit-learn RandomForest classifier |
| Auth & DB | Supabase (PostgreSQL + Auth) |
| 3D / Visuals | React Three Fiber, Three.js |
Challenges we ran into
1. ASL Recognition Accuracy
Problem: Initial attempts with pre-trained models failed — they couldn't handle real-world lighting, hand angles, and signing speed.
Solution: We trained a custom RandomForest on 87,000 images, then built a letter-buffering system with temporal smoothing.
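The letter-buffering idea can be sketched as a majority vote over a sliding window of per-frame predictions (the class name, window size, and vote threshold here are illustrative, not the production values):

```python
from collections import Counter, deque

class LetterBuffer:
    """Temporal smoothing: emit a letter only once it dominates a sliding
    window of per-frame predictions (window/threshold values illustrative)."""
    def __init__(self, window=15, min_votes=10):
        self.frames = deque(maxlen=window)
        self.min_votes = min_votes
        self.last = None

    def push(self, letter):
        self.frames.append(letter)
        top, votes = Counter(self.frames).most_common(1)[0]
        if votes >= self.min_votes and top != self.last:
            self.last = top
            return top  # stable letter: append it to the word being assembled
        return None

buf = LetterBuffer()
stream = ["H"] * 12 + ["I"] * 12  # noisy per-frame classifier output
word = "".join(l for l in (buf.push(p) for p in stream) if l)
print(word)  # HI
```

Smoothing like this trades a little latency (a letter must persist for several frames) for far fewer spurious detections.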
2. RAG Context Retrieval
Problem: Early versions retrieved irrelevant chunks (e.g., asking about Java when the answer mentioned JavaScript).
Mathematical formulation of the issue:
Given query $q$ and chunks $\{c_1, \ldots, c_n\}$, naive cosine similarity:
$$\text{score}(q, c_i) = \frac{\text{embedding}(q) \cdot \text{embedding}(c_i)}{|\text{embedding}(q)| |\text{embedding}(c_i)|}$$
was retrieving chunks with high lexical overlap but low semantic relevance.
Solution: We improved chunking strategy — splitting by resume section instead of fixed-size windows (previously $|\text{chunk}| = 512$ tokens) — and added a semantic similarity threshold $\tau = 0.65$:
$$\text{retrieved} = \{c_i \mid \text{score}(q, c_i) > \tau\}$$
3. Multilingual TTS Latency
Problem: ElevenLabs API calls had latency $\mu \approx 2.5$ seconds, breaking conversation flow.
Solution:
- Implemented audio streaming (start playback before full response is generated)
- Preloaded common phrases ("Great answer", "Let's move on")
- Reduced latency to $\mu \approx 1.2$ seconds
4. Web Speech API Reliability
Problem: SpeechRecognition would randomly stop listening in Chrome after $t \sim 30$ seconds.
Solution: Implemented exponential backoff retries with maximum wait time:
$$\text{retry\_delay} = \min(2^n \cdot 100\text{ms}, 5000\text{ms})$$
where $n$ is the retry attempt number. Added visual indicator so users know when the mic is active.
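The actual retry loop runs in the browser frontend; the delay schedule itself is easy to sketch (in Python for brevity):

```python
def retry_delay_ms(n, base_ms=100, cap_ms=5000):
    # exponential backoff with a cap: min(2^n * 100 ms, 5000 ms)
    return min((2 ** n) * base_ms, cap_ms)

print([retry_delay_ms(n) for n in range(7)])  # [100, 200, 400, 800, 1600, 3200, 5000]
```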
5. Resume PDF Parsing Edge Cases
Problem: Some resumes had tables, multi-column layouts, or scanned images. pdfplumber extraction quality varied significantly.
Solution: Added fallback logic:
```
if pdfplumber fails:
    OCR with pytesseract
Clean with regex (remove artifacts, normalize whitespace)
```
Success rate improved from 78% to 96% across diverse resume formats.
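The regex-cleaning stage can be sketched as follows. The artifact patterns are illustrative, not the exact production rules, though `(cid:N)` markers and form feeds are common pdfplumber leftovers:

```python
import re

def clean_extracted_text(raw):
    # Strip common PDF-extraction artifacts, then normalize whitespace.
    # (Patterns are illustrative, not the exact production rules.)
    text = raw.replace("\x0c", " ")          # form feeds from page breaks
    text = re.sub(r"\(cid:\d+\)", "", text)  # unmapped-glyph markers
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse blank-line runs
    return text.strip()

raw = "Experience\x0c(cid:12)  Built   APIs\n\n\n\nSkills: Python"
print(clean_extracted_text(raw))
```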
6. ChromaDB In-Memory Limitations
Problem: Vectorstore resets between sessions. For $n$ users, we need $O(n)$ storage that persists.
Temporary solution: For hackathon demo, we optimized cold-start time by caching embeddings in Supabase. Average initialization time: $t_{\text{init}} = 1.2$ seconds.
Production plan: Migrate to disk-backed ChromaDB or Pinecone for persistent vector storage.
Accomplishments that we're proud of
- First ASL-powered technical interview platform — we couldn't find any existing tool that supports sign language for technical interviews
- Resume-aware question generation that actually reads your CV instead of asking generic questions
- Full Spanish localization — questions, follow-ups, and TTS all work end-to-end in Spanish
- RAG pipeline with similarity threshold $\tau = 0.65$ that makes follow-ups feel natural and contextual, not robotic
- Clean UI/UX with smooth transitions, 3D backgrounds (React Three Fiber), and accessibility-first design
What we learned
ASL is harder than we thought
We initially assumed we could use a pre-trained model, but the variability in hand shapes, lighting, and motion required custom training and buffering logic. Key insight: temporal smoothing is essential:
$$\text{stability} \propto \text{window\_size}$$
We found the optimal window $w^* = 15$ frames (empirically tested $w \in \{5, 10, 15, 20, 30\}$).
RAG chunk quality matters more than model size
We spent hours tweaking our chunking strategy (section-aware vs. fixed-size) and realized that:
$$\text{Performance} \propto \text{Chunk Quality} \times \text{Model Capacity}$$
Better chunks + smaller model (LLaMA 3.3 70B) outperformed worse chunks + GPT-4.
Latency kills conversational AI
Every $\Delta t = 500$ms delay in TTS or transcription makes the interview feel robotic. We optimized aggressively:
- Streaming audio (start playback at $t_{\text{first_byte}}$ instead of $t_{\text{complete}}$)
- Prefetching common responses
- Reducing API round-trips from $O(n)$ to $O(1)$ per question
Accessibility is not an add-on
Building ASL support from day one shaped our entire architecture. It forced us to think about:
- Visual feedback (users need to see when the system is "listening")
- Alternative input methods (voice, sign, text)
- Internationalization (not just translating strings, but rethinking UX flow)
FastAPI + Next.js is a killer combo
- FastAPI's async support made handling $f = 10$ FPS ASL frames trivial
- Next.js 15's server actions simplified auth and API calls
- Total backend latency for ASL frame processing: $\mu = 45$ms (MediaPipe) + $\mu = 12$ms (RandomForest inference)
What's next for Smart Interview
1. Expand ASL Vocabulary
Current: Alphabet only ($|\Sigma| = 26$ letters)
Goal: Train models for common tech terms as single signs:
$$\text{Sign}(\text{"array"}) \neq \text{Sign}(A) + \text{Sign}(R) + \text{Sign}(R) + \text{Sign}(A) + \text{Sign}(Y)$$
Target vocabulary: $|\Sigma_{\text{tech}}| \approx 500$ terms (function, loop, class, database, API, etc.)
2. Speech-to-Sign TTS
Problem: ASL users currently read questions as text.
Solution: Generate sign language videos (avatar or human) for questions. Two approaches:
- Rule-based: Map sentence → gloss notation → avatar animation
- Neural: Train seq2seq model $\text{English} \to \text{Sign Sequence}$
3. Persistent Sessions
Schema design:
```sql
CREATE TABLE interview_sessions (
    id UUID PRIMARY KEY,
    user_id UUID REFERENCES users(id),
    questions JSONB,
    transcripts JSONB,
    scores JSONB,
    created_at TIMESTAMP
);
```
Users can review past interviews and track improvement: $\text{score}(t) = f(t)$, where $f$ is a fitted trend line.
4. Live Feedback Scoring
Use an LLM evaluator to score answers in real-time:
$$\text{Score}_{\text{behavioral}} = \text{LLM}(\text{answer}, \text{rubric}_{\text{STAR}})$$
$$\text{Score}_{\text{technical}} = \text{LLM}(\text{answer}, \text{rubric}_{\text{correctness}})$$
Provide suggestions: "Your answer lacks the 'Result' component of STAR."
5. More Languages
Add support for:
- Mandarin (zh-CN)
- Hindi (hi-IN)
- Arabic (ar-SA)
Challenge: TTS models with low latency for tonal languages (Mandarin: 4 tones $\times$ phonemes).
6. Code Execution
Feature: Monaco editor in-browser + test case runner.
Flow:
- User writes code: $C(x)$
- Run against test cases: $\{(x_1, y_1), \ldots, (x_k, y_k)\}$
- Evaluate: $\text{Passed} = \sum_{i=1}^k \mathbb{1}[C(x_i) = y_i]$
Similar to LeetCode but with personalized questions from resume.
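The evaluation step above is a straightforward indicator sum; a sketch with a hypothetical candidate function and test cases:

```python
def count_passed(candidate_fn, test_cases):
    # Passed = sum of 1[C(x_i) == y_i] over all test cases
    return sum(1 for x, y in test_cases if candidate_fn(x) == y)

# hypothetical candidate solution and test cases
square = lambda x: x * x
cases = [(1, 1), (2, 4), (3, 9), (4, 15)]  # last expected value is wrong
print(count_passed(square, cases))  # 3
```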
7. Collaborative Interviews
Support pair programming scenarios:
- Two users $(U_1, U_2)$ share same session
- Real-time code collaboration (CRDT-based sync)
- Useful for team fit assessments
8. Mobile App
Constraints:
- Camera access: iOS requires `AVCaptureSession`, Android requires the `Camera2` API
- Offline ASL recognition: Bundle a TFLite model ($\sim 15$ MB) for on-device inference
- Latency target: $\mu < 100$ms (currently $\mu = 157$ms on web)
Built for HackUSF 2026 🚀
Built With
- fastapi
- mediapipe
- python
- react
- scikit-learn
- supabase