fix: add "audio" to openai provider capabilities#12717
steipete merged 2 commits into openclaw:main from
Conversation
The openai provider implements `transcribeAudio` via `transcribeOpenAiCompatibleAudio` (Whisper API), but its `capabilities` array only declared `["image"]`. This caused the media-understanding runner to skip the openai provider when processing inbound audio messages, so raw audio files were passed to agents instead of transcribed text. Fix: add `"audio"` to the `capabilities` array so the runner correctly selects the openai provider for audio transcription.

Co-authored-by: Cursor <cursoragent@cursor.com>
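As a rough sketch of what the fix amounts to (the types and the stub `transcribeAudio` body here are illustrative, not the actual openclaw source):

```typescript
// Hypothetical shapes; the real provider lives in
// src/media-understanding/providers/openai/index.ts and may differ.
type Capability = "image" | "audio";

interface MediaProvider {
  id: string;
  capabilities: Capability[];
  transcribeAudio?: (filePath: string) => Promise<string>;
}

const openaiProvider: MediaProvider = {
  id: "openai",
  // Before this PR the array was ["image"], so the runner never
  // considered this provider for audio even though transcribeAudio exists.
  capabilities: ["image", "audio"],
  // Stand-in for the Whisper-backed transcribeOpenAiCompatibleAudio call.
  transcribeAudio: async (filePath) => `transcribed:${filePath}`,
};
```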
Pull request overview
This PR fixes media-provider selection for inbound audio by correctly declaring the OpenAI media-understanding provider’s support for audio transcription, ensuring audio attachments are routed to Whisper transcription instead of being passed through as raw files.
Changes:
- Add `"audio"` to the OpenAI media-understanding provider's `capabilities` so the runner can select it for audio inputs.
This pull request has been automatically marked as stale due to inactivity.
Landed via temp rebase onto main.
✓ src/media-understanding/providers/index.test.ts (3 tests) 2ms
Test Files  1 passed (1)
Checking formatting... ELIFECYCLE Command failed with exit code 2.

Thanks @openjay!
Summary
The openai media-understanding provider implements `transcribeAudio` via `transcribeOpenAiCompatibleAudio` (Whisper API), but its `capabilities` array only declared `["image"]`. This caused the media-understanding runner to skip the openai provider when processing inbound audio messages (e.g., voice messages on Discord/WhatsApp), resulting in raw audio files being passed directly to agents instead of transcribed text.
Fix
Add `"audio"` to the openai provider's `capabilities` array so the runner correctly selects the openai provider for audio transcription when configured with `tools.media.audio`.
Test
Before fix: agent received the raw `.ogg` file path instead of transcribed text.
After fix: audio messages are transcribed via the Whisper API before reaching the agent.
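The selection behavior described above can be sketched as follows (the `selectProvider` helper and provider shape are illustrative assumptions, not the actual runner code):

```typescript
// Hypothetical capability-based provider selection, mirroring how a
// media-understanding runner might pick a provider for an audio input.
interface Provider {
  id: string;
  capabilities: string[];
}

function selectProvider(
  providers: Provider[],
  needed: string,
): Provider | undefined {
  return providers.find((p) => p.capabilities.includes(needed));
}

const providers: Provider[] = [
  { id: "openai", capabilities: ["image", "audio"] },
];

// With "audio" declared, a provider is found and audio gets transcribed;
// with only ["image"], selectProvider returns undefined and the raw
// audio file would fall through to the agent untouched.
console.log(selectProvider(providers, "audio")?.id); // "openai"
```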
Greptile Overview
Greptile Summary
This PR updates the OpenAI media-understanding provider (`src/media-understanding/providers/openai/index.ts`) to declare support for audio by adding `"audio"` to its `capabilities` list. This aligns the provider's declared capabilities with its existing `transcribeAudio` implementation, allowing the media-understanding runner to select the OpenAI provider for inbound audio messages when `tools.media.audio` is enabled, so audio is transcribed before being passed to agents.
Confidence Score: 5/5
The change is a small capability declaration that matches an already-implemented method (`transcribeAudio`). No behavioral change beyond provider selection logic for audio, and it should unblock the intended transcription flow.