feat(parse): implement Whisper ASR integration for audio parser #805

Merged: MaojiaSheng merged 2 commits into volcengine:main from mvanhorn:osc/audio-parser-whisper-asr on Mar 20, 2026

@mvanhorn
Contributor

Summary

Completes the audio parser stub by implementing the TODO items at audio.py:172 and audio.py:190, wiring up the OpenAI Whisper API for speech-to-text transcription.

Changes

In openviking/parse/parsers/media/audio.py:

  • _asr_transcribe(): Calls client.audio.transcriptions.create() with the model from AudioConfig.transcription_model (default: whisper-large-v3). Wraps the synchronous OpenAI SDK call in asyncio.get_event_loop().run_in_executor() so it does not block the event loop.
  • _asr_transcribe_with_timestamps(): Uses response_format="verbose_json" with timestamp_granularities=["segment"]. Formats each segment as a Markdown line of the form **[MM:SS - MM:SS]** text.
  • parse_content(): Decodes base64 content and delegates to the existing parse() method instead of raising NotImplementedError.
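To illustrate the executor wrapping described for _asr_transcribe(), here is a minimal sketch, not the code from this PR. The function name asr_transcribe, the injected client parameter, and the None fallback are assumptions made for illustration; only client.audio.transcriptions.create() and the whisper-large-v3 default come from the description above. It uses asyncio.get_running_loop(), which returns the same loop as get_event_loop() when called inside a coroutine.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def asr_transcribe(client, audio_path, model="whisper-large-v3"):
    """Run a blocking transcription call in a worker thread.

    `client` is assumed to expose the OpenAI SDK shape
    client.audio.transcriptions.create(model=..., file=...).
    Returning None on failure is an assumed fallback, not the PR's.
    """

    def _call():
        # The file is opened inside the worker thread so that disk I/O
        # and the HTTP request both stay off the event loop.
        with open(audio_path, "rb") as audio_file:
            return client.audio.transcriptions.create(model=model, file=audio_file)

    loop = asyncio.get_running_loop()
    try:
        response = await loop.run_in_executor(None, _call)
    except Exception as exc:  # catch, log, return fallback
        logger.warning("ASR transcription failed: %s", exc)
        return None
    return response.text
```

Passing the client in as a parameter keeps the sketch independent of how the parser constructs its OpenAI client, and makes it easy to substitute a stub in tests.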

Why this matters

The AudioParser class, config, metadata extraction, and output format were already defined (audio.py). The two ASR methods were placeholder stubs with explicit TODOs. This PR wires up the actual API calls to complete the implementation.

Related: #372 (multimodal resource parsing), #695 (multimodal embedding - audio parser is a prerequisite).

Patterns followed

  • OpenAI SDK usage from openai_embedders.py
  • Temp file + cleanup pattern with finally blocks
  • Logger from openviking_cli.utils.logger
  • Error handling: catch, log, return fallback (don't crash the parser)
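The temp file and error-handling patterns listed above could look roughly like the following when applied to base64 input. This is a sketch under assumptions: parse_file is a hypothetical stand-in for the parser's parse() method, and the ".mp3" suffix and the None fallback are illustrative choices that the PR description does not specify.

```python
import asyncio
import base64
import binascii
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


async def parse_file(path):
    """Hypothetical stand-in for the parser's existing parse() method."""
    return f"{os.path.getsize(path)} bytes"


async def parse_content(content_b64, suffix=".mp3", parse=parse_file):
    """Decode base64 audio, write it to a temp file, and delegate to parse."""
    tmp_path = None
    try:
        # validate=True rejects characters outside the base64 alphabet.
        audio_bytes = base64.b64decode(content_b64, validate=True)
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            tmp_path = tmp.name
            tmp.write(audio_bytes)
        return await parse(tmp_path)
    except (binascii.Error, ValueError, OSError) as exc:
        # Catch, log, return a fallback instead of crashing the parser.
        logger.warning("Audio parsing failed: %s", exc)
        return None
    finally:
        # Cleanup runs on both the success and the failure path.
        if tmp_path and os.path.exists(tmp_path):
            os.unlink(tmp_path)
```

Creating the file with delete=False and removing it in the finally block means the path remains valid after the with block closes the handle, which matters when the downstream parser reopens the file by name.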

This contribution was developed with AI assistance (Claude Code).

Completes the audio parser stub by wiring up OpenAI Whisper API for
speech-to-text transcription:

- _asr_transcribe(): calls Whisper API via OpenAI SDK, returns text
- _asr_transcribe_with_timestamps(): uses verbose_json format with
  segment-level timestamps, formatted as **[MM:SS - MM:SS]** markdown
- parse_content(): decodes base64 audio and delegates to parse()

Follows the existing OpenAI SDK pattern from openai_embedders.py.
Uses asyncio executor wrapping for sync API calls. Temp file cleanup
in finally blocks. Graceful error handling with logger fallbacks.
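As an illustration of the **[MM:SS - MM:SS]** segment formatting mentioned in the commit message, a small sketch follows. It assumes each segment is a plain dict with start, end, and text keys (start and end in seconds), which mirrors the verbose_json payload; the SDK may return objects with attributes instead. The helper names and the blank-line separator between segments are hypothetical.

```python
def format_timestamp(seconds):
    """Render a time offset in seconds as MM:SS, truncating fractions."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segments(segments):
    """Format transcript segments as '**[MM:SS - MM:SS]** text' lines."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"**[{start} - {end}]** {seg['text'].strip()}")
    # Separator between segments is an assumption for this sketch.
    return "\n\n".join(lines)


segments = [
    {"start": 0.0, "end": 4.2, "text": " Hello"},
    {"start": 64.0, "end": 70.9, "text": "world "},
]
print(format_segments(segments))
# **[00:00 - 00:04]** Hello
#
# **[01:04 - 01:10]** world
```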
MaojiaSheng merged commit 7dfebc8 into volcengine:main on Mar 20, 2026
6 checks passed
github-project-automation (bot) moved this from Backlog to Done in the OpenViking project on Mar 20, 2026
zeattacker pushed a commit to zeattacker/OpenViking that referenced this pull request Mar 20, 2026
…engine#805)

* feat(parse): implement Whisper ASR integration for audio parser

* style: auto-format audio.py with ruff

---------

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
@mvanhorn
Contributor Author

Thanks for the reviews across these three PRs, @qin-ctx. The feedback on the audio parser especially helped tighten the implementation.

mvanhorn added a commit to mvanhorn/OpenViking that referenced this pull request Mar 25, 2026
Replace the _ocr_extract() stub with a working Tesseract integration
via pytesseract. Uses asyncio.run_in_executor() for the synchronous
pytesseract call, matching the pattern from _asr_transcribe() in the
audio parser (PR volcengine#805).

Gracefully degrades when pytesseract is not installed by returning None
with a warning. Added as optional dependency: pip install openviking[ocr]

Relates to volcengine#372
MaojiaSheng pushed a commit that referenced this pull request Mar 27, 2026
Replace the _ocr_extract() stub with a working Tesseract integration
via pytesseract. Uses asyncio.run_in_executor() for the synchronous
pytesseract call, matching the pattern from _asr_transcribe() in the
audio parser (PR #805).

Gracefully degrades when pytesseract is not installed by returning None
with a warning. Added as optional dependency: pip install openviking[ocr]

Relates to #372

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
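The graceful degradation described in the follow-up OCR commit can be sketched as below. This is an assumed shape, not the merged _ocr_extract(): the function name ocr_extract and the module_name parameter are hypothetical, the latter added only so that the missing-dependency path can be exercised. pytesseract.image_to_string() is the library call that accepts an image path.

```python
import asyncio
import importlib
import logging

logger = logging.getLogger(__name__)


async def ocr_extract(image_path, module_name="pytesseract"):
    """Return OCR text, or None when the optional dependency is missing."""
    try:
        # Import lazily so the parser still loads without the OCR extra.
        ocr = importlib.import_module(module_name)
    except ImportError:
        logger.warning(
            "%s is not installed; skipping OCR "
            "(install with: pip install openviking[ocr])",
            module_name,
        )
        return None

    # image_to_string is synchronous, so run it in a worker thread.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, ocr.image_to_string, image_path)
```

Importing inside the function rather than at module level is what lets callers treat a None result as "OCR unavailable" without any import-time failure.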
