feat(parse): implement audio resource parser with Whisper transcription #707
mvanhorn wants to merge 3 commits into volcengine:main
Conversation
Replace the audio parser stub with a working implementation that:
- Extracts metadata (duration, sample rate, channels, bitrate) via mutagen
- Transcribes speech via Whisper API with timestamped segments
- Builds structured ResourceNode tree with L0/L1/L2 content tiers
- Falls back to metadata-only output when Whisper is unavailable
- Adds mutagen as optional dependency under [audio] extra
- Adds audio_summary prompt template for semantic indexing
- Includes unit tests with mocked Whisper API and mutagen
qin-ctx
left a comment
Review Summary
The core audio parser implementation is well-structured with good graceful degradation design. However, there are critical issues with the Whisper API integration that make transcription non-functional under default configuration.
Blocking Issues
- Default model name incompatible with OpenAI Whisper API
- No base_url support for custom Whisper endpoints
- audio_summary.yaml prompt template added but never used
Non-blocking
- Duplicate boilerplate across transcription methods
- ~13 files have unrelated ruff formatting changes mixed into this feature PR — consider splitting formatting into a separate PR for easier review and bisect
Note
This PR currently has merge conflicts that need to be resolved.
audio_file.name = f"audio{ext}"
response = await client.audio.transcriptions.create(
    model=model or "whisper-1",
[Bug] (blocking) The default transcription_model in AudioConfig is "whisper-large-v3", which is a HuggingFace/local model name. OpenAI's Whisper API only accepts "whisper-1". Since model is "whisper-large-v3" (truthy), model or "whisper-1" evaluates to "whisper-large-v3", and the API call will fail with an invalid model error.
The error is silently caught by the except block at line 344, so users get metadata-only output without any clear indication that transcription is broken.
Suggested fix: either change the AudioConfig default to "whisper-1", or add model name mapping logic for different Whisper providers.
import openai
client = openai.AsyncOpenAI(
    api_key=config.llm.api_key if hasattr(config, "llm") else None,
[Bug] (blocking) openai.AsyncOpenAI is created without a base_url parameter, so it always connects to OpenAI's default endpoint. The project's ProviderConfig supports custom API endpoints via api_base, but this code ignores it entirely. Users with custom Whisper deployments (Azure OpenAI, local Whisper server, etc.) cannot use transcription.
[Suggestion] (non-blocking) This direct client creation is also inconsistent with the project's patterns — ImageParser uses get_openviking_config().vlm for model calls, integrating with the provider infrastructure. Consider either reusing an existing provider client or at least reading api_base from the config to construct the client.
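A sketch of reading api_base from the config when constructing the client. The attribute names (`config.llm.api_key`, `config.llm.api_base`) mirror the snippet above and are assumptions about ProviderConfig's shape:

```python
def whisper_client_kwargs(config) -> dict:
    """Collect constructor kwargs for openai.AsyncOpenAI from the
    provider config, so custom Whisper deployments (Azure, local
    server) are reachable via api_base."""
    llm = getattr(config, "llm", None)
    return {
        "api_key": getattr(llm, "api_key", None),
        # base_url=None makes the client fall back to OpenAI's default endpoint
        "base_url": getattr(llm, "api_base", None),
    }
```

Usage would then be `openai.AsyncOpenAI(**whisper_client_kwargs(config))`, keeping the endpoint decision in one place.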
@@ -0,0 +1,44 @@
metadata:
[Design] (blocking) This prompt template (parsing.audio_summary) is defined but never referenced anywhere in the code. For comparison, ImageParser uses render_prompt("parsing.image_summary", ...) to generate semantic summaries via LLM, but AudioParser._generate_semantic_info() at audio.py:456 does simple string truncation instead of LLM-powered summarization.
Either integrate this template with render_prompt() in _generate_semantic_info (following the ImageParser pattern), or remove this file to avoid shipping dead code.
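A self-contained sketch of the ImageParser-style flow this suggests: render a named prompt template, then produce the summary via an LLM call. The TEMPLATES dict and the `llm` callable are stand-ins for the project's render_prompt infrastructure and provider client:

```python
# Stand-in template registry; the real project loads these from YAML files
# such as audio_summary.yaml under a dotted name like "parsing.audio_summary".
TEMPLATES = {
    "parsing.audio_summary": "Summarize this audio transcript:\n{transcript}",
}

def render_prompt(name: str, **kwargs) -> str:
    """Look up a template by dotted name and fill in its variables."""
    return TEMPLATES[name].format(**kwargs)

def generate_semantic_info(transcript: str, llm) -> str:
    """LLM-powered summarization instead of plain string truncation."""
    prompt = render_prompt("parsing.audio_summary", transcript=transcript)
    return llm(prompt)
```

The point is only the wiring: _generate_semantic_info should consume the template rather than leaving it as dead code.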
Transcript with timestamps in markdown format, or None if not available
List of segment dicts with keys: start, end, text
"""
try:
[Design] (non-blocking) _asr_transcribe_with_timestamps (lines 348-406) contains identical boilerplate to _asr_transcribe (lines 309-346): config loading, openai.AsyncOpenAI client creation, and BytesIO wrapping. Consider extracting shared helpers like _get_whisper_client() and _prepare_audio_file() to reduce duplication and make the API integration easier to update.
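One of the suggested helpers could look like this minimal sketch, based on the `audio_file.name` pattern visible in the diff above (the helper name and signature are illustrative):

```python
import io

def prepare_audio_file(data: bytes, ext: str) -> io.BytesIO:
    """Shared helper for both transcription paths: wrap raw audio bytes
    in a named BytesIO so the Whisper client can infer the file format
    from the .name attribute."""
    buf = io.BytesIO(data)
    buf.name = f"audio{ext}"
    return buf
```

Both _asr_transcribe and _asr_transcribe_with_timestamps would then call this instead of duplicating the BytesIO wrapping inline.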
The audio parser feature is unrelated to memory health stats and belongs in its own PR (volcengine#707). Reverts audio.py to pre-rewrite state, removes the unused audio_summary.yaml template and audio parser tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cation
- Change default transcription_model from "whisper-large-v3" (HuggingFace name) to "whisper-1" (OpenAI API compatible)
- Add base_url support via ProviderConfig.api_base so custom Whisper deployments (Azure, local server) work correctly
- Extract _get_whisper_client() and _prepare_audio_file() helpers to eliminate duplicate boilerplate between _asr_transcribe and _asr_transcribe_with_timestamps
- Remove unused audio_summary.yaml template (will integrate with render_prompt in a follow-up if LLM-powered summarization is desired)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addressed all feedback in a6a57ce:
The `audio` optional dependency group (`mutagen>=1.47.0`) was left over from the reverted audio parser commit (898cc8e). It belongs in volcengine#707 with the audio parser feature, not in this stats API PR.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat(server): add memory health statistics API endpoints
Add two new API endpoints for querying aggregate memory health:
- GET /stats/memories - global memory stats (counts by category,
hotness distribution, staleness metrics)
- GET /stats/sessions/{id} - per-session extraction statistics
The StatsAggregator reads from existing VikingDB indexes and the
hotness_score function without introducing new storage.
Includes unit tests with mocked VikingDB backend.
* feat(parse): implement audio resource parser with Whisper transcription
Replace the audio parser stub with a working implementation that:
- Extracts metadata (duration, sample rate, channels, bitrate) via mutagen
- Transcribes speech via Whisper API with timestamped segments
- Builds structured ResourceNode tree with L0/L1/L2 content tiers
- Falls back to metadata-only output when Whisper is unavailable
- Adds mutagen as optional dependency under [audio] extra
- Adds audio_summary prompt template for semantic indexing
- Includes unit tests with mocked Whisper API and mutagen
* style: format with ruff
* Revert audio parser changes from this branch
The audio parser feature is unrelated to memory health stats and
belongs in its own PR (#707). Reverts audio.py to pre-rewrite state,
removes the unused audio_summary.yaml template and audio parser tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Address review feedback: fix N+1 query, remove total_vectors, fix error handling
- Replace per-category _query_memories_by_category with single
_query_all_memories call, grouping by category in Python (1 DB
round-trip instead of 8)
- Remove misleading total_vectors field (was identical to
total_memories). Will add real vector count from VikingDB index
stats in a follow-up
- Distinguish KeyError (session not found) from other failures in
stats.py endpoint, returning INTERNAL_ERROR for unexpected exceptions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
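The single-round-trip grouping described in this commit can be sketched as follows; the dict shape of a memory record (a "category" key) is an assumption, not the PR's actual schema:

```python
from collections import defaultdict

def group_by_category(memories: list[dict]) -> dict[str, list[dict]]:
    """Group the results of one bulk query in Python rather than issuing
    one DB query per category (the 8-round-trip N+1 pattern this commit
    replaces)."""
    grouped = defaultdict(list)
    for memory in memories:
        grouped[memory["category"]].append(memory)
    return dict(grouped)
```

The aggregator then computes per-category counts from the grouped dict after a single _query_all_memories call.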
* fix(stats): align tests with review feedback changes
- Remove total_vectors assertions (field was removed per review)
- Use KeyError instead of Exception for session-not-found test
- Add test_session_internal_error to verify INTERNAL_ERROR for
unexpected exceptions (was previously swallowed as NOT_FOUND)
* fix(deps): remove leftover audio optional dependency
The `audio` optional dependency group (`mutagen>=1.47.0`) was left
over from the reverted audio parser commit (898cc8e). It belongs in
#707 with the audio parser feature, not in this stats API PR.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Replaces the stub in openviking/parse/parsers/media/audio.py with a working implementation
- Uses mutagen (optional dependency) for metadata extraction
This addresses the audio parser stub and aligns with the Q2 multimodal roadmap mentioned in #372. The TODO comments at audio.py:172 ("Integrate with actual ASR API (Whisper, etc.)") and audio.py:190 ("Integrate with ASR API") are resolved by this implementation.
Changes
- openviking/parse/parsers/media/audio.py
- openviking/prompts/templates/parsing/audio_summary.yaml
- tests/unit/parse/test_audio_parser.py
- pyproject.toml: adds mutagen>=1.47.0 as optional [audio] dependency
Design
- Follows the BaseParser interface and ImageParser patterns exactly
- Metadata extraction via mutagen (graceful fallback if not installed)
- Uses openai.AsyncOpenAI for Whisper API calls, reusing the existing provider config
Evidence
Test plan
- pytest tests/unit/parse/test_audio_parser.py -v
- ruff check openviking/parse/parsers/media/audio.py
Submitted via osc-newfeature