Smart Interview
Inspiration
We noticed that most mock interview platforms rely purely on text-based interactions — you type your answers into a chatbot. The few that offer voice input are English-only, which excludes non-native English speakers who want to practice in their native language.
But the bigger problem? Zero platforms support American Sign Language. Deaf and hard-of-hearing candidates are completely locked out of AI-powered interview prep tools. Even with human interpreters in real interviews, there's often delay and loss of nuance.
We wanted to fix both problems: build a voice-first interview platform that supports multiple languages (starting with English and Spanish) and — for the first time — make it accessible to ASL users through real-time computer vision. No more typing. No more barriers. Just practice the way you'd actually interview.
What it does
Smart Interview reads your resume and conducts personalized mock interviews in English voice, Spanish voice, or American Sign Language.
Here's the flow:
- Upload your resume — we parse it with `pdfplumber`, extract sections (projects, skills, experience), and store embeddings in ChromaDB
- Choose your language — English, Spanish, or ASL
- Get personalized questions — Groq's LLaMA 3.3 70B generates 8 technical questions based on your actual resume content
- Answer in your preferred mode:
- Voice (EN/ES): Web Speech API transcribes, ElevenLabs speaks the questions
- ASL: Your camera captures hand landmarks via MediaPipe; our RandomForest classifier recognizes signs letter by letter and assembles them into words
- RAG-powered follow-ups — After each answer, we retrieve relevant context from your resume and generate a contextual follow-up question
- Resume screening — 4 ML models predict your resume category, recommend job roles, and extract skills/education
Every interview adapts to your experience — no generic LeetCode questions. If your resume mentions React, we ask about React. If you built a distributed system, we drill into that.
How we built it
Frontend (Next.js 15 + TypeScript)
- Auth flow with Supabase (email/password)
- Setup page: resume upload + language selection
- Three interview modes:
  - English/Spanish Voice: Web Speech API (`en-US`/`es-ES`) for input, ElevenLabs TTS for output
  - ASL: MediaStream API captures camera frames and sends base64-encoded images to the backend at $f = 10$ FPS
- Dashboard with session history and resume screening results
- Framer Motion animations, Tailwind styling, React Three Fiber for 3D visuals
Backend (FastAPI + Python 3.13)
Resume Parsing & RAG Pipeline
PDF Extraction:
Resume.pdf → pdfplumber → Text Extraction → Section Chunking
We split documents by semantic sections (Experience, Projects, Skills) rather than fixed-size windows. For each chunk $c_i$:
$$\text{embedding}(c_i) = \text{MiniLM}(c_i) \in \mathbb{R}^{384}$$
These embeddings are stored in ChromaDB for similarity search.
Question Generation
Given resume embeddings $\{e_1, e_2, \ldots, e_n\}$, we prompt Groq LLaMA 3.3 70B:
$$Q = \text{LLM}(\text{prompt} \mid \text{resume\_context})$$
The model generates:
- 8 technical questions based on projects/skills
- 5 behavioral questions (STAR method)
RAG-Powered Follow-ups
After user answers question $Q_i$ with response $R_i$:
Query ChromaDB for top-$k$ relevant chunks: $$\text{chunks} = \text{TopK}\left(\text{similarity}(\text{embedding}(R_i), \{e_1, \ldots, e_n\}), k=3\right)$$
Generate follow-up: $$Q_{i+1} = \text{LLM}(Q_i, R_i, \text{chunks})$$
This ensures follow-ups are contextually grounded in the candidate's actual experience.
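The retrieval step above can be sketched in plain Python. This uses toy 3-d vectors standing in for the 384-d MiniLM embeddings, and a hand-rolled cosine similarity in place of ChromaDB's index; the function names are illustrative:

```python
import math

def cosine(a, b):
    # cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k_chunks(answer_emb, chunk_embs, k=3):
    # indices of the k resume chunks most similar to the answer embedding
    ranked = sorted(range(len(chunk_embs)),
                    key=lambda i: cosine(answer_emb, chunk_embs[i]),
                    reverse=True)
    return ranked[:k]

# toy 3-d vectors standing in for 384-d MiniLM embeddings
chunks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
answer = [1.0, 0.05, 0.0]
print(top_k_chunks(answer, chunks))  # [0, 2, 1]
```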
ASL Recognition Pipeline
Hand Landmark Extraction:
MediaPipe detects 21 hand landmarks per frame. For landmark $j$:
$$\mathbf{p}_j = (x_j, y_j, z_j) \in \mathbb{R}^3$$
Feature vector for one frame:
$$\mathbf{f} = [x_1, y_1, z_1, x_2, y_2, z_2, \ldots, x_{21}, y_{21}, z_{21}] \in \mathbb{R}^{63}$$
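A minimal sketch of that flattening step, assuming landmarks arrive as a list of 21 $(x, y, z)$ tuples from MediaPipe:

```python
def landmarks_to_features(landmarks):
    # Flatten 21 (x, y, z) MediaPipe hand landmarks into the 63-dim
    # feature vector f = [x1, y1, z1, ..., x21, y21, z21]
    assert len(landmarks) == 21, "MediaPipe emits 21 landmarks per hand"
    return [coord for point in landmarks for coord in point]

# toy frame: 21 identical landmarks
frame = [(0.5, 0.5, 0.0)] * 21
features = landmarks_to_features(frame)
print(len(features))  # 63
```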
Sign Classification:
We trained a RandomForest classifier on 87,000 images from the Kaggle ASL Alphabet dataset.
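The training setup looks roughly like this sketch. The data here is synthetic (two well-separated toy classes); the real model trains on 63-dim landmark features extracted from the Kaggle images, and the hyperparameters shown are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for two well-separated sign classes; the real model
# trains on 63-dim landmark features from the Kaggle ASL Alphabet images.
X = np.vstack([rng.normal(0.2, 0.02, size=(100, 63)),
               rng.normal(0.8, 0.02, size=(100, 63))])
y = ["A"] * 100 + ["B"] * 100

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

print(clf.predict([[0.2] * 63, [0.8] * 63]))  # ['A' 'B']
```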
Text-to-Speech
ElevenLabs API with model selection:
- Spanish: `eleven_multilingual_v2`
- English: `eleven_turbo_v2`
Average latency: $\mu = 1.8$ seconds, $\sigma = 0.4$ seconds
Tech Stack Summary
| Layer | Technology |
|---|---|
| Frontend | Next.js 15, React 19, TypeScript, Tailwind CSS, Framer Motion |
| Backend | FastAPI, Python 3.13, Uvicorn |
| LLM | Groq — LLaMA 3.3 70B (via OpenRouter-compatible client) |
| RAG | ChromaDB (in-memory), MiniLM embeddings, pdfplumber |
| TTS | ElevenLabs (eleven_multilingual_v2 for Spanish, eleven_turbo_v2 for English) |
| Speech Input | Web Speech API (en-US / es-ES) |
| ASL | MediaPipe Hand Landmarks + scikit-learn RandomForest classifier |
| Auth & DB | Supabase (PostgreSQL + Auth) |
| 3D / Visuals | React Three Fiber, Three.js |
Challenges we ran into
1. ASL Recognition Accuracy
Problem: Initial attempts with pre-trained models failed — they couldn't handle real-world lighting, hand angles, and signing speed.
Solution: We trained a custom RandomForest on 87,000 images, then built a letter-buffering system with temporal smoothing.
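The letter-buffering idea can be sketched as a majority vote over a sliding window of per-frame predictions (the class name, window size, and vote threshold here are illustrative, not the production values):

```python
from collections import Counter, deque

class LetterBuffer:
    """Temporal smoothing: emit a letter only once it dominates a sliding
    window of per-frame predictions (window/threshold values illustrative)."""
    def __init__(self, window=15, min_votes=10):
        self.frames = deque(maxlen=window)
        self.min_votes = min_votes
        self.last = None

    def push(self, letter):
        self.frames.append(letter)
        top, votes = Counter(self.frames).most_common(1)[0]
        if votes >= self.min_votes and top != self.last:
            self.last = top
            return top  # stable letter: append it to the word being assembled
        return None

buf = LetterBuffer()
stream = ["H"] * 12 + ["I"] * 12  # noisy per-frame classifier output
word = "".join(l for l in (buf.push(p) for p in stream) if l)
print(word)  # HI
```

Smoothing like this trades a little latency (a letter must persist for several frames) for far fewer spurious detections.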
2. RAG Context Retrieval
Problem: Early versions retrieved irrelevant chunks (e.g., asking about Java when the answer mentioned JavaScript).
Mathematical formulation of the issue:
Given query $q$ and chunks $\{c_1, \ldots, c_n\}$, naive cosine similarity:
$$\text{score}(q, c_i) = \frac{\text{embedding}(q) \cdot \text{embedding}(c_i)}{|\text{embedding}(q)| |\text{embedding}(c_i)|}$$
was retrieving chunks with high lexical overlap but low semantic relevance.
Solution: We improved chunking strategy — splitting by resume section instead of fixed-size windows (previously $|\text{chunk}| = 512$ tokens) — and added a semantic similarity threshold $\tau = 0.65$:
$$\text{retrieved} = \{c_i \mid \text{score}(q, c_i) > \tau\}$$
3. Multilingual TTS Latency
Problem: ElevenLabs API calls had latency $\mu \approx 2.5$ seconds, breaking conversation flow.
Solution:
- Implemented audio streaming (start playback before full response is generated)
- Preloaded common phrases ("Great answer", "Let's move on")
- Reduced latency to $\mu \approx 1.2$ seconds
4. Web Speech API Reliability
Problem: SpeechRecognition would randomly stop listening in Chrome after $t \sim 30$ seconds.
Solution: Implemented exponential backoff retries with maximum wait time:
$$\text{retry\_delay} = \min(2^n \cdot 100\text{ms}, 5000\text{ms})$$
where $n$ is the retry attempt number. Added visual indicator so users know when the mic is active.
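The actual retry loop runs in the browser frontend; the delay schedule itself is easy to sketch (in Python for brevity):

```python
def retry_delay_ms(n, base_ms=100, cap_ms=5000):
    # exponential backoff with a cap: min(2^n * 100 ms, 5000 ms)
    return min((2 ** n) * base_ms, cap_ms)

print([retry_delay_ms(n) for n in range(7)])  # [100, 200, 400, 800, 1600, 3200, 5000]
```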
5. Resume PDF Parsing Edge Cases
Problem: Some resumes had tables, multi-column layouts, or scanned images. pdfplumber extraction quality varied significantly.
Solution: Added fallback logic:
```
if pdfplumber fails:
    OCR with pytesseract
Clean with regex (remove artifacts, normalize whitespace)
```
Success rate improved from 78% to 96% across diverse resume formats.
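The regex-cleaning stage can be sketched as follows. The artifact patterns are illustrative, not the exact production rules, though `(cid:N)` markers and form feeds are common pdfplumber leftovers:

```python
import re

def clean_extracted_text(raw):
    # Strip common PDF-extraction artifacts, then normalize whitespace.
    # (Patterns are illustrative, not the exact production rules.)
    text = raw.replace("\x0c", " ")          # form feeds from page breaks
    text = re.sub(r"\(cid:\d+\)", "", text)  # unmapped-glyph markers
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse blank-line runs
    return text.strip()

raw = "Experience\x0c(cid:12)  Built   APIs\n\n\n\nSkills: Python"
print(clean_extracted_text(raw))
```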
6. ChromaDB In-Memory Limitations
Problem: Vectorstore resets between sessions. For $n$ users, we need $O(n)$ storage that persists.
Temporary solution: For hackathon demo, we optimized cold-start time by caching embeddings in Supabase. Average initialization time: $t_{\text{init}} = 1.2$ seconds.
Production plan: Migrate to disk-backed ChromaDB or Pinecone for persistent vector storage.
Accomplishments that we're proud of
- First ASL-powered technical interview platform — we couldn't find any existing tool that supports sign language for technical interviews
- Resume-aware question generation that actually reads your CV instead of asking generic questions
- Full Spanish localization — questions, follow-ups, and TTS all work end-to-end in Spanish
- RAG pipeline with similarity threshold $\tau = 0.65$ that makes follow-ups feel natural and contextual, not robotic
- Clean UI/UX with smooth transitions, 3D backgrounds (React Three Fiber), and accessibility-first design
What we learned
ASL is harder than we thought
We initially assumed we could use a pre-trained model, but the variability in hand shapes, lighting, and motion required custom training and buffering logic. Key insight: temporal smoothing is essential:
$$\text{stability} \propto \text{window\_size}$$
We found the optimal window $w^* = 15$ frames (empirically tested $w \in \{5, 10, 15, 20, 30\}$).
RAG chunk quality matters more than model size
We spent hours tweaking our chunking strategy (section-aware vs. fixed-size) and realized that:
$$\text{Performance} \propto \text{Chunk Quality} \times \text{Model Capacity}$$
Better chunks + smaller model (LLaMA 3.3 70B) outperformed worse chunks + GPT-4.
Latency kills conversational AI
Every $\Delta t = 500$ms delay in TTS or transcription makes the interview feel robotic. We optimized aggressively:
- Streaming audio (start playback at $t_{\text{first_byte}}$ instead of $t_{\text{complete}}$)
- Prefetching common responses
- Reducing API round-trips from $O(n)$ to $O(1)$ per question
Accessibility is not an add-on
Building ASL support from day one shaped our entire architecture. It forced us to think about:
- Visual feedback (users need to see when the system is "listening")
- Alternative input methods (voice, sign, text)
- Internationalization (not just translating strings, but rethinking UX flow)
FastAPI + Next.js is a killer combo
- FastAPI's async support made handling $f = 10$ FPS ASL frames trivial
- Next.js 15's server actions simplified auth and API calls
- Total backend latency for ASL frame processing: $\mu = 45$ms (MediaPipe) + $\mu = 12$ms (RandomForest inference)
What's next for Smart Interview
1. Expand ASL Vocabulary
Current: Alphabet only ($|\Sigma| = 26$ letters)
Goal: Train models for common tech terms as single signs:
$$\text{Sign}(\text{"array"}) \neq \text{Sign}(A) + \text{Sign}(R) + \text{Sign}(R) + \text{Sign}(A) + \text{Sign}(Y)$$
Target vocabulary: $|\Sigma_{\text{tech}}| \approx 500$ terms (function, loop, class, database, API, etc.)
2. Speech-to-Sign TTS
Problem: ASL users currently read questions as text.
Solution: Generate sign language videos (avatar or human) for questions. Two approaches:
- Rule-based: Map sentence → gloss notation → avatar animation
- Neural: Train seq2seq model $\text{English} \to \text{Sign Sequence}$
3. Persistent Sessions
Schema design:
```sql
CREATE TABLE interview_sessions (
    id UUID PRIMARY KEY,
    user_id UUID REFERENCES users(id),
    questions JSONB,
    transcripts JSONB,
    scores JSONB,
    created_at TIMESTAMP
);
```
Users can review past interviews and track improvement: $\text{score}(t) = f(t)$, where $f$ is a fitted trend line.
4. Live Feedback Scoring
Use an LLM evaluator to score answers in real-time:
$$\text{Score}_{\text{behavioral}} = \text{LLM}(\text{answer}, \text{rubric}_{\text{STAR}})$$
$$\text{Score}_{\text{technical}} = \text{LLM}(\text{answer}, \text{rubric}_{\text{correctness}})$$
Provide suggestions: "Your answer lacks the 'Result' component of STAR."
5. More Languages
Add support for:
- Mandarin (zh-CN)
- Hindi (hi-IN)
- Arabic (ar-SA)
Challenge: TTS models with low latency for tonal languages (Mandarin: 4 tones $\times$ phonemes).
6. Code Execution
Feature: Monaco editor in-browser + test case runner.
Flow:
- User writes code: $C(x)$
- Run against test cases: $\{(x_1, y_1), \ldots, (x_k, y_k)\}$
- Evaluate: $\text{Passed} = \sum_{i=1}^k \mathbb{1}[C(x_i) = y_i]$
Similar to LeetCode but with personalized questions from resume.
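The evaluation step above is a straightforward indicator sum; a sketch with a hypothetical candidate function and test cases:

```python
def count_passed(candidate_fn, test_cases):
    # Passed = sum of 1[C(x_i) == y_i] over all test cases
    return sum(1 for x, y in test_cases if candidate_fn(x) == y)

# hypothetical candidate solution and test cases
square = lambda x: x * x
cases = [(1, 1), (2, 4), (3, 9), (4, 15)]  # last expected value is wrong
print(count_passed(square, cases))  # 3
```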
7. Collaborative Interviews
Support pair programming scenarios:
- Two users $(U_1, U_2)$ share same session
- Real-time code collaboration (CRDT-based sync)
- Useful for team fit assessments
8. Mobile App
Constraints:
- Camera access: iOS requires `AVCaptureSession`, Android requires the `Camera2` API
- Offline ASL recognition: Bundle a TFLite model ($\sim 15$ MB) for on-device inference
- Latency target: $\mu < 100$ms (currently $\mu = 157$ms on web)
Built for HackUSF 2026 🚀
Built With
- fastapi
- mediapipe
- python
- react
- scikit-learn
- supabase