Implement Durable Speech-to-Text Provider Components for golem:stt WIT Interface #30

@jdegoes

Description

This ticket involves implementing the golem:stt interface for several major speech-to-text (STT) providers. This WIT interface provides a unified abstraction over transcription functionality, enabling developers to interact with a common API regardless of provider differences.

The interface supports batch and streaming transcription, word-level timestamps, speaker diarization, custom vocabularies, and confidence scores. It is designed to gracefully degrade when a provider does not support a particular feature.

This task is to implement the interface as a series of WASM components (WASI 0.23) in Rust, following Golem conventions for component development, durability integration, and structured error handling.

NOTE: The golem:stt interface was designed by analyzing common speech-to-text APIs, but it may still have room for improvement. You are welcome to propose changes, although you must make a case for why the API should be different.

Target Providers

The following providers are targeted for implementation:

  • Google Cloud Speech-to-Text
    Enterprise-grade, widely adopted STT API with strong support for streaming, timestamps, diarization, and custom phrase hints.

  • Microsoft Azure Speech
    Feature-rich speech service with batch and streaming modes, custom speech models, and comprehensive SDK support.

  • Amazon Transcribe
    A mature and widely adopted AWS service with support for streaming transcription, speaker diarization, custom vocabularies, and content redaction. Ideal for real-time pipelines and tightly integrated with the AWS ecosystem.

  • Deepgram
    Developer-first API offering fast, high-accuracy transcription with support for streaming, word-level timing, diarization, and keyword boosting.

  • OpenAI Whisper (Open Source)
    Hugely popular open-source model known for multilingual accuracy. Does not natively support diarization or word confidence, but can emulate enough to support a degraded implementation of the interface.

Deliverables

Each provider should be implemented as a standalone WASM component, with full test coverage and integration with Golem durability APIs.

Component Artifacts

  • stt-google.wasm
  • stt-azure.wasm
  • stt-aws.wasm
  • stt-deepgram.wasm
  • stt-whisper.wasm

Each must:

  • Implement golem:stt as per the WIT spec
  • Be compilable with cargo component to WASI 0.23
  • Use environment variables for API keys and config (see below)
  • Integrate Golem durability APIs for correct durable execution

Graceful Degradation

The WIT interface uses option<T> and error variants to support degraded implementations:

  • Whisper does not support speaker diarization or word confidence, so these fields can be returned as none
  • Streaming is not available in Whisper, so transcribe-stream should return the unsupported-operation error
  • Custom vocabularies and speech context can be ignored if unsupported
  • Timestamps are supported via WhisperX or local inference tools

Implementations that skip or degrade a feature must still return valid values and document the behavior in the README or docs.
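As an illustration of "degrade, but still return valid values": a Whisper-backed component would populate the optional fields of word-segment with none rather than failing. The sketch below mirrors the WIT record as a plain Rust struct; the names are illustrative, not the actual wit-bindgen-generated bindings.

```rust
// Illustrative mirror of the WIT `word-segment` record; the real types come
// from the `cargo component` / wit-bindgen bindings and will differ in naming.
#[derive(Debug)]
struct WordSegment {
    text: String,
    start_time: f32,
    end_time: f32,
    confidence: Option<f32>,    // Whisper: None (no word confidence)
    speaker_id: Option<String>, // Whisper: None (no diarization)
}

/// Build a degraded word segment the way a Whisper backend might:
/// timing is available (e.g. via WhisperX), confidence and diarization are not.
fn whisper_word(text: &str, start: f32, end: f32) -> WordSegment {
    WordSegment {
        text: text.to_string(),
        start_time: start,
        end_time: end,
        confidence: None,
        speaker_id: None,
    }
}

fn main() {
    let w = whisper_word("hello", 0.0, 0.42);
    // Degraded fields are valid `none` values, not errors.
    assert!(w.confidence.is_none());
    assert!(w.speaker_id.is_none());
    println!("{:?}", w);
}
```

The point of the sketch is that callers written against the common API keep working: they already have to handle none for these fields, so a degraded provider is indistinguishable from a provider that simply did not produce the data.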

Configuration

Each implementation should support the following environment variables:

Common

  • STT_PROVIDER_ENDPOINT
  • STT_PROVIDER_TIMEOUT (default: 30)
  • STT_PROVIDER_MAX_RETRIES (default: 3)
  • STT_PROVIDER_LOG_LEVEL
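A minimal sketch of loading the common variables with the defaults stated above. The struct and function names, and the "info" fallback for the log level, are assumptions for illustration; only the variable names and the 30/3 defaults come from this spec.

```rust
use std::env;
use std::time::Duration;

/// Common configuration shared by all providers. Variable names and the
/// timeout/retry defaults follow the issue text; everything else here
/// (struct name, "info" log-level fallback) is illustrative.
#[derive(Debug, PartialEq)]
struct CommonConfig {
    endpoint: Option<String>,
    timeout: Duration,
    max_retries: u32,
    log_level: String,
}

fn load_common_config() -> CommonConfig {
    // Fall back to the documented default when the variable is unset
    // or does not parse as an integer.
    let parse_u64 = |key: &str, default: u64| {
        env::var(key).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
    };
    CommonConfig {
        endpoint: env::var("STT_PROVIDER_ENDPOINT").ok(),
        timeout: Duration::from_secs(parse_u64("STT_PROVIDER_TIMEOUT", 30)),
        max_retries: parse_u64("STT_PROVIDER_MAX_RETRIES", 3) as u32,
        log_level: env::var("STT_PROVIDER_LOG_LEVEL")
            .unwrap_or_else(|_| "info".to_string()), // assumed default
    }
}

fn main() {
    println!("{:?}", load_common_config());
}
```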

Provider-Specific

Google

  • GOOGLE_APPLICATION_CREDENTIALS
  • GOOGLE_CLOUD_PROJECT

Testing Requirements

Each implementation must be tested against:

  • Basic transcription for common formats (WAV, MP3)
  • Word-level timing and confidence (if available)
  • Speaker diarization (where supported)
  • Streaming transcription (where supported)
  • Error mappings: invalid inputs, rate limits, network errors
  • Edge cases: silence, overlapping speakers, long audio
  • Quota behavior (real or simulated)
  • Integration with Golem durability APIs
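For the error-mapping tests in particular, each provider needs a table from its transport-level failures to the stt-error variant. The sketch below shows one plausible shape of such a table for HTTP-based providers, using a Rust enum that mirrors a subset of the WIT variant; the concrete messages and the status-code choices are assumptions, and each provider's real mapping will differ.

```rust
/// Illustrative subset of the WIT `stt-error` variant as a Rust enum.
#[derive(Debug, PartialEq)]
enum SttError {
    Unauthorized(String),
    AccessDenied(String),
    RateLimited(u32), // retry-after, in seconds
    ServiceUnavailable(String),
    InternalError(String),
}

/// One plausible mapping from provider HTTP status codes to stt-error.
/// Each provider needs its own table; this only shows the shape.
fn map_http_status(status: u16, retry_after_secs: u32) -> SttError {
    match status {
        401 => SttError::Unauthorized("invalid or expired credentials".into()),
        403 => SttError::AccessDenied("caller lacks permission".into()),
        429 => SttError::RateLimited(retry_after_secs),
        500..=599 => SttError::ServiceUnavailable(format!("provider returned {status}")),
        other => SttError::InternalError(format!("unexpected status {other}")),
    }
}

fn main() {
    assert_eq!(map_http_status(429, 5), SttError::RateLimited(5));
    println!("{:?}", map_http_status(503, 0));
}
```

Tests can then assert on the variant directly (for example, that a 429 response surfaces as rate-limited with the provider's Retry-After value), independent of provider-specific message strings.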
WIT Specification

package golem:stt@1.0.0;

interface types {
  variant stt-error {
    invalid-audio(string),
    unsupported-format(string),
    unsupported-language(string),
    transcription-failed(string),
    unauthorized(string),
    access-denied(string),
    quota-exceeded(quota-info),
    rate-limited(u32),
    insufficient-credits,
    unsupported-operation(string),
    service-unavailable(string),
    network-error(string),
    internal-error(string),
  }

  record quota-info {
    used: u32,
    limit: u32,
    reset-time: u64,
    unit: quota-unit,
  }

  enum quota-unit {
    seconds,
    requests,
    credits,
  }

  type language-code = string;

  enum audio-format {
    wav,
    mp3,
    flac,
    ogg,
    aac,
    pcm,
  }

  record audio-config {
    format: audio-format,
    sample-rate: option<u32>,
    channels: option<u8>,
  }

  /// Only word-level timing is commonly supported
  enum timing-mark-type {
    word,
  }

  record timing-info {
    start-time-seconds: f32,
    end-time-seconds: option<f32>,
    mark-type: timing-mark-type,
  }

  record word-segment {
    text: string,
    start-time: f32,
    end-time: f32,
    confidence: option<f32>,
    speaker-id: option<string>,
  }

  record transcript-alternative {
    text: string,
    confidence: f32,
    words: list<word-segment>,
  }

  record transcription-metadata {
    duration-seconds: f32,
    audio-size-bytes: u32,
    request-id: string,
    model: option<string>,
    language: language-code,
  }

  record transcription-result {
    alternatives: list<transcript-alternative>,
    metadata: transcription-metadata,
  }
}

interface vocabularies {
  use types::stt-error;

  resource vocabulary {
    get-name: func() -> string;
    get-phrases: func() -> list<string>;
    delete: func() -> result<_, stt-error>;
  }

  create-vocabulary: func(
    name: string,
    phrases: list<string>
  ) -> result<vocabulary, stt-error>;
}

interface languages {
  use types::{language-code, stt-error};

  record language-info {
    code: language-code,
    name: string,
    native-name: string,
  }

  list-languages: func() -> result<list<language-info>, stt-error>;
}

interface transcription {
  use types::{
    audio-config,
    transcription-result,
    stt-error,
    language-code,
    transcript-alternative,
  };
  use vocabularies::vocabulary;

  record transcribe-options {
    enable-timestamps: option<bool>,
    enable-speaker-diarization: option<bool>,
    language: option<language-code>,
    model: option<string>,
    profanity-filter: option<bool>,
    vocabulary: option<borrow<vocabulary>>,
    speech-context: option<list<string>>,
    enable-word-confidence: option<bool>,
    enable-timing-detail: option<bool>,
  }

  transcribe: func(
    audio: list<u8>,
    config: audio-config,
    options: option<transcribe-options>
  ) -> result<transcription-result, stt-error>;

  transcribe-stream: func(
    config: audio-config,
    options: option<transcribe-options>
  ) -> result<transcription-stream, stt-error>;

  resource transcription-stream {
    send-audio: func(chunk: list<u8>) -> result<_, stt-error>;
    finish: func() -> result<_, stt-error>;
    receive-alternative: func() -> result<option<transcript-alternative>, stt-error>;
    close: func();
  }
}
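The transcription-stream resource implies a call order: send-audio zero or more times, then finish, then poll receive-alternative until it yields none, then close. The mock below sketches that protocol in plain Rust so the intended sequencing is unambiguous; it is not the real bindings, and the behavior of send-audio after finish (rejecting the chunk) is an assumption this spec leaves open.

```rust
/// Mock of the transcription-stream resource, illustrating the intended
/// call order: send-audio* -> finish -> receive-alternative until none.
/// Not the real wit-bindgen bindings; error strings are illustrative.
struct MockStream {
    finished: bool,
    pending: Vec<String>, // queued transcript alternatives
    bytes_received: usize,
}

impl MockStream {
    fn new() -> Self {
        MockStream { finished: false, pending: Vec::new(), bytes_received: 0 }
    }

    fn send_audio(&mut self, chunk: &[u8]) -> Result<(), String> {
        if self.finished {
            // Assumed behavior: sending after finish is an error.
            return Err("stream already finished".to_string());
        }
        self.bytes_received += chunk.len();
        Ok(())
    }

    fn finish(&mut self) -> Result<(), String> {
        self.finished = true;
        // A real component would flush buffered audio to the provider here.
        self.pending
            .push(format!("final transcript ({} bytes)", self.bytes_received));
        Ok(())
    }

    fn receive_alternative(&mut self) -> Result<Option<String>, String> {
        Ok(if self.pending.is_empty() { None } else { Some(self.pending.remove(0)) })
    }
}

fn main() {
    let mut s = MockStream::new();
    s.send_audio(&[0u8; 1600]).unwrap();
    s.send_audio(&[0u8; 1600]).unwrap();
    s.finish().unwrap();
    while let Some(alt) = s.receive_alternative().unwrap() {
        println!("{alt}");
    }
}
```

Providers without streaming support (e.g. Whisper) would return unsupported-operation from transcribe-stream itself rather than hand back a stream that fails later.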
