Voice-first corporate training studio powered by Deepgram Voice Agent, Gemini (scenario design + assessment), and Browserbase (web research during creation). Learners practice through continuous voice roleplay scenarios with a single scene background and competency scoring at the end.
npm install
npm run dev

Opens the Vite client at http://localhost:5173 and the API server at http://localhost:3001.
| Variable | Used for |
|---|---|
| DEEPGRAM_API_KEY | Voice Agent WebSocket, speech-to-text, and TTS |
| GEMINI_API_KEY | Scenario generation, runtime agent brain, image generation, and competency scoring |
| GEMINI_MODEL | Agent brain (gemini-3.1-flash-lite) |
| GEMINI_IMAGE_MODEL | Scene backgrounds and marketing thumbnails |
| BROWSERBASE_API_KEY | Researches topics on the web when creating/editing simulations |
| BROWSERBASE_PROJECT_ID | Optional Browserbase project (inferred from API key if omitted) |
| ARIZE_API_KEY | Required: competency assessment traces to Arize AX (OTLP) |
| ARIZE_SPACE_ID | Required: Arize space ID for trace export |
| ARIZE_PROJECT_NAME | Optional project name in Arize (default: gamestudio-learning) |
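A startup check along the following lines can make the table actionable. This is a minimal sketch, not the project's actual code: the variable names come from the table, while `checkEnv`, `EnvReport`, and the choice to treat GEMINI_API_KEY as required are assumptions made for illustration.

```typescript
// Hypothetical helper: variable names are from the table above,
// the function and its return shape are illustrative only.
type Env = Record<string, string | undefined>;

interface EnvReport {
  missingRequired: string[];   // variables the server cannot run without
  voiceEnabled: boolean;       // false when DEEPGRAM_API_KEY is absent
  webResearchEnabled: boolean; // false when BROWSERBASE_API_KEY is absent
}

// Assumption: the Arize variables are required (as the table states) and
// GEMINI_API_KEY is treated as required because every flow relies on Gemini.
const REQUIRED = ["GEMINI_API_KEY", "ARIZE_API_KEY", "ARIZE_SPACE_ID"];

function checkEnv(env: Env): EnvReport {
  const has = (name: string): boolean => (env[name] ?? "").trim() !== "";
  return {
    missingRequired: REQUIRED.filter((name) => !has(name)),
    voiceEnabled: has("DEEPGRAM_API_KEY"),
    webResearchEnabled: has("BROWSERBASE_API_KEY"),
  };
}
```

On a Node server this could be called once at boot as `checkEnv(process.env)`, logging `missingRequired` and switching off the optional features instead of failing later inside a request.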
- Studio (voice or type): describe an enterprise training scenario
- Browserbase (creation): opens a cloud browser and gathers factual excerpts from the web (see below)
- Gemini (creation): designs a voice roleplay scenario with decision points and generates one scene background from the research
- Training (play): the learner has one continuous spoken conversation with a Deepgram-powered voice agent that roleplays the scenario (see below)
- Gemini + Arize (assessment): at the natural end of the scenario, the full transcript is scored by Gemini and the assessment traces are exported to Arize (see below)
Deepgram powers all real-time voice and spoken audio in the app. The server never streams raw audio to the client for storage — it mints short-lived tokens and builds agent configs; the browser connects directly to Deepgram via the @deepgram/agents SDK.
Studio (create / edit flow) — On the Studio page, the mic opens a Deepgram Voice Agent session (GET /api/deepgram/token, GET /api/deepgram/config). Deepgram handles speech-to-text and turn detection. Gemini is wired in as the agent’s “think” provider: it interprets what you said and decides which studio tool to call (create_game, edit_game, publish_game, etc.). Spoken replies are rendered client-side through POST /api/deepgram/speak (Aura TTS), not the agent’s built-in speak pipeline.
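To illustrate the tool-calling step, here is a minimal sketch of how a function call chosen by Gemini might be validated before the server acts on it. Only the tool names (create_game, edit_game, publish_game) come from this README; `parseToolCall` and the argument shapes (`description`, `gameId`, `instruction`) are hypothetical.

```typescript
// Tool names are from the README; argument shapes are assumptions.
type StudioToolCall =
  | { name: "create_game"; args: { description: string } }
  | { name: "edit_game"; args: { gameId: string; instruction: string } }
  | { name: "publish_game"; args: { gameId: string } };

// Turn a raw function call (tool name + JSON arguments) into a typed call,
// rejecting unknown tools and missing arguments.
function parseToolCall(name: string, argsJson: string): StudioToolCall {
  const args = JSON.parse(argsJson) as Record<string, unknown>;
  const str = (key: string): string => {
    const value = args[key];
    if (typeof value !== "string" || value === "") {
      throw new Error(`Tool ${name} is missing string argument "${key}"`);
    }
    return value;
  };
  switch (name) {
    case "create_game":
      return { name: "create_game", args: { description: str("description") } };
    case "edit_game":
      return {
        name: "edit_game",
        args: { gameId: str("gameId"), instruction: str("instruction") },
      };
    case "publish_game":
      return { name: "publish_game", args: { gameId: str("gameId") } };
    default:
      throw new Error(`Unknown studio tool: ${name}`);
  }
}
```

Validating at this boundary keeps a malformed or hallucinated tool call from reaching the code that creates or publishes a module.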
Training (roleplay flow) — On the Training page, a separate lesson agent config is built per session (POST /api/deepgram/lesson-config). Here Deepgram listens to the learner’s microphone (with end-of-turn detection) while Gemini drives an in-character roleplay: the agent speaks as the scenario character(s), pushes back on weak answers, and walks through generated decision points. When the scenario reaches a natural endpoint, the agent calls a complete_scenario tool. Agent lines are spoken via the same REST TTS endpoint so captions stay in sync with audio.
Where it runs
| Surface | Deepgram role |
|---|---|
| Studio mic | STT + turn taking; Gemini decides studio actions |
| Training mic | STT + turn taking; Gemini roleplays the scenario |
| All spoken lines | Aura TTS via POST /api/deepgram/speak |
If DEEPGRAM_API_KEY is missing, voice mode is disabled but typed Studio chat and non-voice flows still work.
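Since every spoken line goes through POST /api/deepgram/speak, the client-side call can be sketched as below. The endpoint path and the default API server address come from this README; the JSON body field `text`, the `buildSpeakRequest` helper, and the assumption that the response contains playable audio are illustrative only.

```typescript
interface SpeakRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Build the request for one spoken line. The `text` field name is an
// assumption; check the server route for the real payload shape.
function buildSpeakRequest(
  text: string,
  baseUrl = "http://localhost:3001",
): SpeakRequest {
  const trimmed = text.trim();
  if (trimmed === "") {
    throw new Error("Nothing to speak");
  }
  return {
    url: `${baseUrl}/api/deepgram/speak`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text: trimmed }),
    },
  };
}
```

In the browser the result would be passed to `fetch(req.url, req.init)`. Showing the caption at the moment playback of the returned audio starts is one way to keep captions and audio in step.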
Browserbase is used only during creation and editing, not while a learner is training. When you ask Studio to create or edit a module, the server starts a Browserbase cloud browser session and connects to it with Playwright over CDP (server/browserbase.ts).
The session searches the web for material related to your topic:
- Wikipedia — runs a search, opens the first relevant article, and extracts a text excerpt.
- MDN — same pattern for technical or product-adjacent topics.
Those excerpts (URL, title, and body text) are formatted into research notes and passed to Gemini when generating:
- The scenario outline (setting, persona, opening line, 3–4 decision points with 3–4 choices each)
- Edits to an existing module (edit_game)
Learners never browse the live web during training — Browserbase is a research step for accurate scenario content, similar to an instructional designer looking up source material before writing a script.
If BROWSERBASE_API_KEY is not set, generation falls back to Gemini’s general knowledge without web research.
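The hand-off from research to generation can be pictured with a small formatter. This is a sketch under assumptions: the excerpt fields (URL, title, body text) are described above, but `formatResearchNotes`, the note layout, and the `maxChars` truncation limit are hypothetical and not taken from server/browserbase.ts.

```typescript
interface Excerpt {
  url: string;
  title: string;
  body: string;
}

// Format excerpts into a notes string for the generation prompt.
// An empty result signals the fallback path: generate from the model's
// general knowledge with no web research attached.
function formatResearchNotes(excerpts: Excerpt[], maxChars = 1200): string {
  if (excerpts.length === 0) {
    return "";
  }
  return excerpts
    .map((excerpt, index) => {
      // Truncate long article bodies so the prompt stays bounded.
      const body =
        excerpt.body.length > maxChars
          ? excerpt.body.slice(0, maxChars) + "..."
          : excerpt.body;
      return `[${index + 1}] ${excerpt.title}\nSource: ${excerpt.url}\n${body}`;
    })
    .join("\n\n");
}
```

Keeping the source URL next to each excerpt lets the generated scenario be traced back to the material it was based on.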
Arize receives observability traces for competency assessments, not live voice audio or Browserbase sessions. Scoring itself is done by Gemini in POST /api/learning/evaluate.
When a training session ends:
- The full conversation transcript and scenario decision points are sent to the evaluate endpoint.
- Gemini maps each decision point to what the learner said and assigns quality (strong / adequate / weak).
- A weighted mastery score and verdict are computed and returned to the UI.
- OpenTelemetry spans are emitted to Arize AX via OTLP with OpenInference attributes (input.value, output.value, openinference.span.kind, session.id).
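The scoring step can be sketched as a weighted average over decision points. The quality labels (strong / adequate / weak) come from the steps above; the point values, the per-decision `weight` field, the verdict thresholds, and the function names are assumptions chosen for illustration, not the values used by POST /api/learning/evaluate.

```typescript
type Quality = "strong" | "adequate" | "weak";

interface DecisionResult {
  id: string;
  quality: Quality;
  weight: number; // hypothetical: relative importance of the decision point
}

// Hypothetical point values per quality label.
const QUALITY_POINTS: Record<Quality, number> = {
  strong: 1,
  adequate: 0.5,
  weak: 0,
};

// Weighted mastery score as an integer percentage (0 to 100).
function masteryScore(results: DecisionResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) {
    return 0;
  }
  const earned = results.reduce(
    (sum, r) => sum + r.weight * QUALITY_POINTS[r.quality],
    0,
  );
  return Math.round((earned * 100) / totalWeight);
}

// Hypothetical verdict thresholds.
function verdict(score: number): string {
  if (score >= 80) return "proficient";
  if (score >= 50) return "developing";
  return "needs practice";
}
```

For example, a strong answer with weight 2, an adequate answer with weight 2, and a weak answer with weight 1 earn 3 of 5 weighted points, which gives a score of 60 and the verdict "developing" under these assumed thresholds.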
Trace hierarchy per session:
| Span name | Kind | Purpose |
|---|---|---|
| training.session | CHAIN | Root trace for one assessment |
| llm.analyze_decisions | LLM | Internal Gemini decision-mapping call |
| llm.score_session | LLM | Internal Gemini mastery-scoring call |
| learning.decision.eval | LLM | One span per decision point; use these for the Q&A evaluator |
| training.assessment | CHAIN | Session summary |
Project filter in Arize: set ARIZE_PROJECT_NAME in .env (default gamestudio-learning) and select that project in the Arize UI.
LLM-as-a-Judge template: use Q&A with scope Span and this filter:
name = 'learning.decision.eval'
(Arize's default preview filter openinference.span.kind = LLM will match these spans.)
| Template variable | Map to attribute |
|---|---|
| {input} | attributes.input.value → JSON field input (reference context + choices) |
| {question} | attributes.evaluation.question |
| {output} | attributes.output.value (learner's spoken answer) |
Do not target llm.analyze_decisions or llm.score_session — those are internal Gemini prompts, not learner Q&A.
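To make the mapping concrete, the attributes on one learning.decision.eval span might be assembled as follows. The attribute keys are the ones named in this README; the `DecisionEval` input type, the `decisionEvalAttributes` helper, and the inner layout of the JSON `input` field (`context`, `choices`) are assumptions.

```typescript
interface DecisionEval {
  sessionId: string;
  question: string;         // the decision point posed to the learner
  referenceContext: string; // scenario context the judge should rely on
  choices: string[];        // the generated choices for this decision point
  learnerAnswer: string;    // what the learner actually said
}

// Build the span attributes for one decision point. input.value is JSON
// with an `input` field, matching the {input} mapping above.
function decisionEvalAttributes(d: DecisionEval) {
  return {
    "openinference.span.kind": "LLM",
    "session.id": d.sessionId,
    "input.value": JSON.stringify({
      input: { context: d.referenceContext, choices: d.choices },
    }),
    "evaluation.question": d.question,
    "output.value": d.learnerAnswer,
  };
}
```

With the OpenTelemetry JavaScript API, an object like this can be passed to `span.setAttributes(...)` on a span named learning.decision.eval, so that the filter and template variables above resolve to these values.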
Optional: add User Frustration on training.assessment spans using the transcript in input.value.
The assessment overlay in the app is the learner-facing result. Arize is required for competency assessment — traces are flushed on every evaluation so they appear in Arize AX within ~30 seconds.
- Tap the mic on Studio (or use Type mode) → connects to Deepgram Voice Agent or Gemini text chat
- Describe a scenario — agent transcribes/responds and calls studio functions
- Tap again to disconnect (voice mode)
Example commands:
- "Create a sales discovery call simulation for new enterprise AEs"
- "Add a branch for handling pricing objections"
- "Publish this training module"
- "Make a B2B ad for sales enablement leaders"
- Open a module from the Library → single background with voice UI overlay
- The agent opens with the scenario’s in-character opening line (spoken via Deepgram TTS)
- Learner unmutes and responds; the agent roleplays through decision points until a natural endpoint
- After the closing line finishes, the competency assessment overlay appears with score, feedback, transcript download, and restart
| Page | Generated by |
|---|---|
| Training | Browserbase research → scenario outline + one Gemini scene background |
| Published | Course catalog copy + thumbnail |
| Ads | B2B ad hook + vertical image |
Training assets are written to public/trainings/{id}/. Marketing assets are served from /api/assets/:id.