feat(parse): implement Whisper ASR integration for audio parser by mvanhorn · Pull Request #805 · volcengine/OpenViking

mvanhorn · 2026-03-20T05:05:17Z

Summary

Completes the audio parser stub by implementing the TODO items at audio.py:172 and audio.py:190, wiring up the OpenAI Whisper API for speech-to-text transcription.

Changes

In openviking/parse/parsers/media/audio.py:

_asr_transcribe(): Calls client.audio.transcriptions.create() with the model from AudioConfig.transcription_model (default: whisper-large-v3). Uses asyncio.get_event_loop().run_in_executor() for async wrapping of the sync OpenAI SDK call.
_asr_transcribe_with_timestamps(): Uses response_format="verbose_json" with timestamp_granularities=["segment"]. Formats segments as **[MM:SS - MM:SS]** text markdown.
parse_content(): Decodes base64 content and delegates to the existing parse() method instead of raising NotImplementedError.
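As a rough illustration of the timestamped path described above, the sketch below calls `client.audio.transcriptions.create()` with `response_format="verbose_json"` and `timestamp_granularities=["segment"]`, then renders each segment as `**[MM:SS - MM:SS]** text`. The function names `format_timestamp` and `transcribe_with_timestamps`, the injected `client` parameter, and the blank-line separator between segments are assumptions made for this example; it is not the code merged in this PR.

```python
def format_timestamp(seconds: float) -> str:
    """Render a second offset as MM:SS (minutes are not capped at 59)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def transcribe_with_timestamps(
    client, audio_path: str, model: str = "whisper-large-v3"
) -> str:
    """Transcribe a file and format segments as markdown lines.

    `client` is expected to look like an `openai.OpenAI` instance. It is passed
    in (an assumption of this sketch) so the function can be exercised with a stub.
    """
    with open(audio_path, "rb") as audio_file:
        response = client.audio.transcriptions.create(
            model=model,
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["segment"],
        )

    lines = []
    # Each segment is assumed to expose .start, .end (seconds) and .text.
    for seg in response.segments or []:
        start = format_timestamp(seg.start)
        end = format_timestamp(seg.end)
        lines.append(f"**[{start} - {end}]** {seg.text.strip()}")
    # Separator choice is an assumption; the PR only specifies the line format.
    return "\n\n".join(lines)
```

With a real client this would be called as `transcribe_with_timestamps(OpenAI(), "talk.mp3")`, where the file name is a placeholder. Because the SDK call is synchronous, an async caller would still need to wrap it in an executor as the PR describes.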

Why this matters

The AudioParser class, config, metadata extraction, and output format were already defined (audio.py). The two ASR methods were placeholder stubs with explicit TODOs. This PR wires up the actual API calls to complete the implementation.

Related: #372 (multimodal resource parsing), #695 (multimodal embedding - audio parser is a prerequisite).

Patterns followed

OpenAI SDK usage from openai_embedders.py
Temp file + cleanup pattern with finally blocks
Logger from openviking_cli.utils.logger
Error handling: catch, log, return fallback (don't crash the parser)
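The temp file, executor, and fallback patterns listed above combine roughly as in the sketch below. It is a minimal illustration under stated assumptions: `transcribe_bytes` and the injected `transcribe_file` callable are hypothetical names, the standard `logging` module stands in for the project logger, and returning `None` on failure is an assumed fallback value. It uses `asyncio.get_running_loop()`, the form recommended inside coroutines, where the PR description mentions `asyncio.get_event_loop()`.

```python
import asyncio
import logging
import os
import tempfile
from typing import Callable, Optional

logger = logging.getLogger(__name__)


async def transcribe_bytes(
    audio: bytes,
    transcribe_file: Callable[[str], str],
    suffix: str = ".mp3",
) -> Optional[str]:
    """Write audio to a temp file, run a blocking transcriber off the event loop.

    `transcribe_file` is a hypothetical synchronous callable that takes a file
    path and returns text (for example, a thin wrapper around an SDK call).
    """
    tmp_path = None
    try:
        # delete=False so the file can be reopened by path after it is closed.
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            tmp.write(audio)
            tmp_path = tmp.name

        loop = asyncio.get_running_loop()
        # None selects the loop's default thread pool executor.
        return await loop.run_in_executor(None, transcribe_file, tmp_path)
    except Exception as exc:
        # Catch, log, return a fallback so the parser does not crash.
        logger.warning("ASR transcription failed: %s", exc)
        return None
    finally:
        # Cleanup runs on both the success and the failure path.
        if tmp_path and os.path.exists(tmp_path):
            os.unlink(tmp_path)
```

Passing the blocking function in as a parameter keeps the sketch testable without network access; the temp file is removed whether the call succeeds or raises.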

This contribution was developed with AI assistance (Claude Code).

Completes the audio parser stub by wiring up OpenAI Whisper API for speech-to-text transcription:

- _asr_transcribe(): calls Whisper API via OpenAI SDK, returns text
- _asr_transcribe_with_timestamps(): uses verbose_json format with segment-level timestamps, formatted as **[MM:SS - MM:SS]** markdown
- parse_content(): decodes base64 audio and delegates to parse()

Follows the existing OpenAI SDK pattern from openai_embedders.py. Uses asyncio executor wrapping for sync API calls. Temp file cleanup in finally blocks. Graceful error handling with logger fallbacks.
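The `parse_content()` behavior mentioned in this commit message (decode base64, then delegate) could look something like the sketch below. The names `parse_base64_audio` and `parse_bytes` are hypothetical, and whether the real `parse()` accepts bytes or a file path is not stated here, so the delegate is passed in as a callable. Returning `None` on malformed input is likewise an assumption.

```python
import base64
import binascii
from typing import Callable, Optional


def parse_base64_audio(
    content: str, parse_bytes: Callable[[bytes], str]
) -> Optional[str]:
    """Decode base64 audio and hand the raw bytes to an existing parser."""
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        audio = base64.b64decode(content, validate=True)
    except (binascii.Error, ValueError):
        return None
    return parse_bytes(audio)
```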

…engine#805)

* feat(parse): implement Whisper ASR integration for audio parser

Completes the audio parser stub by wiring up OpenAI Whisper API for speech-to-text transcription:

- _asr_transcribe(): calls Whisper API via OpenAI SDK, returns text
- _asr_transcribe_with_timestamps(): uses verbose_json format with segment-level timestamps, formatted as **[MM:SS - MM:SS]** markdown
- parse_content(): decodes base64 audio and delegates to parse()

Follows the existing OpenAI SDK pattern from openai_embedders.py. Uses asyncio executor wrapping for sync API calls. Temp file cleanup in finally blocks. Graceful error handling with logger fallbacks.

* style: auto-format audio.py with ruff

---------

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

mvanhorn · 2026-03-21T03:39:36Z

Thanks for the reviews across these three PRs, @qin-ctx. The feedback on the audio parser especially helped tighten the implementation.

Replace the _ocr_extract() stub with a working Tesseract integration via pytesseract. Uses asyncio.run_in_executor() for the synchronous pytesseract call, matching the pattern from _asr_transcribe() in the audio parser (PR #805). Gracefully degrades when pytesseract is not installed by returning None with a warning.

Added as optional dependency: pip install openviking[ocr]

Relates to #372

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
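A minimal sketch of the optional-dependency pattern this commit message describes, assuming a standalone function named `ocr_extract` (hypothetical) and the standard `logging` module in place of the project logger. Note that `run_in_executor()` is a method on the event loop object rather than a module-level `asyncio` function, so the sketch obtains the loop first.

```python
import asyncio
import logging
from typing import Optional

logger = logging.getLogger(__name__)

try:
    # Optional dependency: only present when the OCR extra is installed.
    import pytesseract
    from PIL import Image
except ImportError:
    pytesseract = None
    Image = None


async def ocr_extract(image_path: str) -> Optional[str]:
    """Run Tesseract OCR off the event loop; return None if unavailable."""
    if pytesseract is None:
        logger.warning("pytesseract is not installed; skipping OCR")
        return None

    def _run() -> str:
        # Blocking work: open the image and invoke the Tesseract binary.
        with Image.open(image_path) as img:
            return pytesseract.image_to_string(img)

    loop = asyncio.get_running_loop()
    try:
        return await loop.run_in_executor(None, _run)
    except Exception as exc:
        # Covers unreadable files and a missing Tesseract binary alike.
        logger.warning("OCR failed for %s: %s", image_path, exc)
        return None
```

Guarding the import at module load keeps the parser importable without the extra installed, and callers only need to handle a `None` result.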

github-project-automation bot added this to OpenViking project Mar 20, 2026

github-project-automation bot moved this to Backlog in OpenViking project Mar 20, 2026

style: auto-format audio.py with ruff

e08ce7b

MaojiaSheng approved these changes Mar 20, 2026

MaojiaSheng merged commit 7dfebc8 into volcengine:main Mar 20, 2026
6 checks passed

github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 20, 2026

This was referenced Mar 25, 2026

feat(parse): implement OCR text extraction for image parser #942

Merged

feat(parse): implement video key frame extraction with metadata #943

Open

MaojiaSheng merged 2 commits into volcengine:main from mvanhorn:osc/audio-parser-whisper-asr


2 participants
