This ticket involves implementing the golem:stt interface for several major speech-to-text (STT) providers. This WIT interface provides a unified abstraction over transcription functionality, enabling developers to interact with a common API regardless of provider differences.
The interface supports batch and streaming transcription, word-level timestamps, speaker diarization, custom vocabularies, and confidence scores. It is designed to gracefully degrade when a provider does not support a particular feature.
This task is to implement the interface as a series of WASM components (WASI 0.23) in Rust, following Golem conventions for component development, durability integration, and structured error handling.
NOTE: The `golem:stt` interface was designed by analyzing common speech-to-text APIs; it may well admit a better design. You are welcome to make improvements, although you must make your case for why the API should be different.
Target Providers
The following providers are targeted for implementation:
- Google Cloud Speech-to-Text: enterprise-grade, widely adopted STT API with strong support for streaming, timestamps, diarization, and custom phrase hints.
- Microsoft Azure Speech: feature-rich speech service with batch and streaming modes, custom speech models, and comprehensive SDK support.
- Amazon Transcribe: a mature and widely adopted AWS service with support for streaming transcription, speaker diarization, custom vocabularies, and content redaction. Ideal for real-time pipelines and tightly integrated with the AWS ecosystem.
- Deepgram: developer-first API offering fast, high-accuracy transcription with support for streaming, word-level timing, diarization, and keyword boosting.
- OpenAI Whisper (open source): hugely popular open-source model known for multilingual accuracy. Does not natively support diarization or word confidence, but can emulate enough to support a degraded implementation of the interface.
Deliverables
Each provider should be implemented as a standalone WASM component, with full test coverage and integration with Golem durability APIs.
Component Artifacts
- `stt-google.wasm`
- `stt-azure.wasm`
- `stt-aws.wasm`
- `stt-deepgram.wasm`
- `stt-whisper.wasm`
Each must:
- Implement `golem:stt` as per the WIT spec
- Be compilable with `cargo component` to WASI 0.23
- Use environment variables for API keys and config (see below)
- Integrate Golem durability APIs for correct durable execution
Graceful Degradation
The WIT interface uses `option<T>` and error variants to support degraded implementations:
- Whisper does not support speaker diarization or word confidence, so these fields can be returned as `none`
- Streaming is not available in Whisper, so `transcribe-stream` should return `unsupported-operation` rather than being omitted (a component must export every function in the interface)
- Custom vocabularies and speech context can be ignored if unsupported
- Timestamps are supported via WhisperX or local inference tools
Implementations that skip or degrade a feature must still return valid values and document the behavior in the README or docs.
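As a sketch, a degraded Whisper backend might surface unsupported features like this. The Rust types below are illustrative mirrors of the WIT `word-segment` record and one `stt-error` case, not generated bindings:

```rust
// Illustrative Rust mirror of the WIT word-segment record and one
// error variant (not generated bindings). A Whisper backend degrades
// gracefully: unsupported per-word fields come back as None, and
// unsupported operations return a structured error.
#[derive(Debug)]
enum SttError {
    UnsupportedOperation(String),
}

#[derive(Debug)]
struct WordSegment {
    text: String,
    start_time: f32,
    end_time: f32,
    confidence: Option<f32>,    // Whisper: no word confidence -> None
    speaker_id: Option<String>, // Whisper: no diarization -> None
}

fn whisper_word(text: &str, start: f32, end: f32) -> WordSegment {
    WordSegment {
        text: text.to_string(),
        start_time: start,
        end_time: end,
        confidence: None,
        speaker_id: None,
    }
}

fn transcribe_stream() -> Result<(), SttError> {
    Err(SttError::UnsupportedOperation(
        "streaming is not supported by the Whisper backend".into(),
    ))
}
```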
Configuration
Each implementation should support the following environment variables:
Common
- `STT_PROVIDER_ENDPOINT`
- `STT_PROVIDER_TIMEOUT` (default: 30)
- `STT_PROVIDER_MAX_RETRIES` (default: 3)
- `STT_PROVIDER_LOG_LEVEL`
Provider-Specific
Google
- `GOOGLE_APPLICATION_CREDENTIALS`
- `GOOGLE_CLOUD_PROJEC`
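A minimal sketch of reading the common configuration, with the defaults from this ticket applied when a variable is unset or unparsable (the helper and struct names are illustrative):

```rust
use std::env;

// Illustrative loader for the common configuration variables.
// Variable names come from this ticket; defaults (30, 3) come from
// the text above. Unset or malformed values fall back to defaults.
fn env_u32(name: &str, default: u32) -> u32 {
    env::var(name).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

struct CommonConfig {
    endpoint: Option<String>,
    timeout_secs: u32,
    max_retries: u32,
}

fn load_common_config() -> CommonConfig {
    CommonConfig {
        endpoint: env::var("STT_PROVIDER_ENDPOINT").ok(),
        timeout_secs: env_u32("STT_PROVIDER_TIMEOUT", 30),
        max_retries: env_u32("STT_PROVIDER_MAX_RETRIES", 3),
    }
}
```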
Testing Requirements
Each implementation must be tested against:
- Basic transcription for common formats (WAV, MP3)
- Word-level timing and confidence (if available)
- Speaker diarization (where supported)
- Streaming transcription (where supported)
- Error mappings: invalid inputs, rate limits, network errors
- Edge cases: silence, overlapping speakers, long audio
- Quota behavior (real or simulated)
- Integration with Golem durability APIs
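For the error-mapping tests, each backend needs a deterministic translation from provider responses into `stt-error`. This sketch maps HTTP status codes onto a few of the variants; the groupings are an illustrative assumption, since each provider documents its own error semantics:

```rust
// Hypothetical mapping from provider HTTP status codes onto the
// stt-error variant. Variant names mirror the WIT spec; the
// status-code groupings are an assumption for illustration.
#[derive(Debug, PartialEq)]
enum SttError {
    Unauthorized(String),
    AccessDenied(String),
    RateLimited(u32),
    ServiceUnavailable(String),
    InternalError(String),
}

fn map_http_status(status: u16, retry_after_secs: u32) -> SttError {
    match status {
        401 => SttError::Unauthorized("invalid or expired credentials".into()),
        403 => SttError::AccessDenied("caller lacks permission".into()),
        429 => SttError::RateLimited(retry_after_secs),
        500..=599 => SttError::ServiceUnavailable(format!("provider returned {status}")),
        other => SttError::InternalError(format!("unexpected status {other}")),
    }
}
```

Tests can then assert on exact variants rather than string-matching provider messages.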
```wit
package golem:stt@1.0.0;

interface types {
  variant stt-error {
    invalid-audio(string),
    unsupported-format(string),
    unsupported-language(string),
    transcription-failed(string),
    unauthorized(string),
    access-denied(string),
    quota-exceeded(quota-info),
    rate-limited(u32),
    insufficient-credits,
    unsupported-operation(string),
    service-unavailable(string),
    network-error(string),
    internal-error(string),
  }

  record quota-info {
    used: u32,
    limit: u32,
    reset-time: u64,
    unit: quota-unit,
  }

  enum quota-unit {
    seconds,
    requests,
    credits,
  }

  type language-code = string;

  enum audio-format {
    wav,
    mp3,
    flac,
    ogg,
    aac,
    pcm,
  }

  record audio-config {
    format: audio-format,
    sample-rate: option<u32>,
    channels: option<u8>,
  }

  /// Only word-level timing is commonly supported
  enum timing-mark-type {
    word,
  }

  record timing-info {
    start-time-seconds: f32,
    end-time-seconds: option<f32>,
    mark-type: timing-mark-type,
  }

  record word-segment {
    text: string,
    start-time: f32,
    end-time: f32,
    confidence: option<f32>,
    speaker-id: option<string>,
  }

  record transcript-alternative {
    text: string,
    confidence: f32,
    words: list<word-segment>,
  }

  record transcription-metadata {
    duration-seconds: f32,
    audio-size-bytes: u32,
    request-id: string,
    model: option<string>,
    language: language-code,
  }

  record transcription-result {
    alternatives: list<transcript-alternative>,
    metadata: transcription-metadata,
  }
}

interface vocabularies {
  use types.{stt-error};

  resource vocabulary {
    get-name: func() -> string;
    get-phrases: func() -> list<string>;
    delete: func() -> result<_, stt-error>;
  }

  create-vocabulary: func(
    name: string,
    phrases: list<string>
  ) -> result<vocabulary, stt-error>;
}

interface languages {
  use types.{language-code, stt-error};

  record language-info {
    code: language-code,
    name: string,
    native-name: string,
  }

  list-languages: func() -> result<list<language-info>, stt-error>;
}

interface transcription {
  use types.{
    audio-config,
    transcription-result,
    stt-error,
    language-code,
    transcript-alternative,
  };
  use vocabularies.{vocabulary};

  record transcribe-options {
    enable-timestamps: option<bool>,
    enable-speaker-diarization: option<bool>,
    language: option<language-code>,
    model: option<string>,
    profanity-filter: option<bool>,
    vocabulary: option<borrow<vocabulary>>,
    speech-context: option<list<string>>,
    enable-word-confidence: option<bool>,
    enable-timing-detail: option<bool>,
  }

  transcribe: func(
    audio: list<u8>,
    config: audio-config,
    options: option<transcribe-options>
  ) -> result<transcription-result, stt-error>;

  transcribe-stream: func(
    config: audio-config,
    options: option<transcribe-options>
  ) -> result<transcription-stream, stt-error>;

  resource transcription-stream {
    send-audio: func(chunk: list<u8>) -> result<_, stt-error>;
    finish: func() -> result<_, stt-error>;
    receive-alternative: func() -> result<option<transcript-alternative>, stt-error>;
    close: func();
  }
}
```
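The `transcription-stream` resource implies a call order of `send-audio`* then `finish`, with `receive-alternative` draining results. This plain-Rust sketch models that lifecycle; the in-memory buffering is an illustrative assumption, since a real component would forward chunks to the provider's streaming API:

```rust
// Plain-Rust sketch of the transcription-stream lifecycle
// (send-audio* -> finish -> receive-alternative* -> close).
// In-memory buffering is an assumption for illustration only.
struct TranscriptionStream {
    buffer: Vec<u8>,
    finished: bool,
}

impl TranscriptionStream {
    fn new() -> Self {
        Self { buffer: Vec::new(), finished: false }
    }

    /// Rejects audio once `finish` has been called, matching the
    /// intent of the WIT resource's call ordering.
    fn send_audio(&mut self, chunk: &[u8]) -> Result<(), String> {
        if self.finished {
            return Err("stream already finished".into());
        }
        self.buffer.extend_from_slice(chunk);
        Ok(())
    }

    fn finish(&mut self) -> Result<(), String> {
        self.finished = true;
        Ok(())
    }
}
```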