Implement Durable Speech-to-Text Provider Components for golem:stt WIT Interface #30

@jdegoes

Description

This ticket involves implementing the golem:stt interface for several major speech-to-text (STT) providers. This WIT interface provides a unified abstraction over transcription functionality, enabling developers to interact with a common API regardless of provider differences.

The interface supports batch and streaming transcription, word-level timestamps, speaker diarization, custom vocabularies, and confidence scores. It is designed to gracefully degrade when a provider does not support a particular feature.

This task is to implement the interface as a series of WASM components (WASI 0.23) in Rust, following Golem conventions for component development, durability integration, and structured error handling.

NOTE: The golem:stt interface was designed by analyzing common speech-to-text APIs, but it may still have room for improvement. You are welcome to propose changes, although you must make a case for why the API should be different.

Target Providers

The following providers are targeted for implementation:

  • Google Cloud Speech-to-Text
    Enterprise-grade, widely adopted STT API with strong support for streaming, timestamps, diarization, and custom phrase hints.

  • Microsoft Azure Speech
    Feature-rich speech service with batch and streaming modes, custom speech models, and comprehensive SDK support.

  • Amazon Transcribe
    A mature and widely adopted AWS service with support for streaming transcription, speaker diarization, custom vocabularies, and content redaction. Ideal for real-time pipelines and tightly integrated with the AWS ecosystem.

  • Deepgram
    Developer-first API offering fast, high-accuracy transcription with support for streaming, word-level timing, diarization, and keyword boosting.

  • OpenAI Whisper (Open Source)
    Hugely popular open-source model known for multilingual accuracy. Does not natively support diarization or word confidence, but can emulate enough to support a degraded implementation of the interface.

Deliverables

Each provider should be implemented as a standalone WASM component, with full test coverage and integration with Golem durability APIs.

Component Artifacts

  • stt-google.wasm
  • stt-azure.wasm
  • stt-aws.wasm
  • stt-deepgram.wasm
  • stt-whisper.wasm

Each must:

  • Implement golem:stt as per the WIT spec
  • Be compilable with cargo component to WASI 0.23
  • Use environment variables for API keys and config (see below)
  • Integrate Golem durability APIs for correct durable execution

Graceful Degradation

The WIT interface uses option<T> and error variants to support degraded implementations:

  • Whisper does not support speaker diarization or word confidence, so these fields can be returned as none
  • Streaming is not available in Whisper, so transcribe-stream should return the unsupported-operation error
  • Custom vocabularies and speech context can be ignored if unsupported
  • Timestamps are supported via WhisperX or local inference tools

Implementations that skip or degrade a feature must still return valid values and document the behavior in the README or docs.
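As an illustration of "degrade, but still return valid values": a Whisper-backed component would populate the optional fields of word-segment with none rather than failing. The sketch below mirrors the WIT record as a plain Rust struct; the names are illustrative, not the actual wit-bindgen-generated bindings.

```rust
// Illustrative mirror of the WIT `word-segment` record; the real types come
// from the `cargo component` / wit-bindgen bindings and will differ in naming.
#[derive(Debug)]
struct WordSegment {
    text: String,
    start_time: f32,
    end_time: f32,
    confidence: Option<f32>,    // Whisper: None (no word confidence)
    speaker_id: Option<String>, // Whisper: None (no diarization)
}

/// Build a degraded word segment the way a Whisper backend might:
/// timing is available (e.g. via WhisperX), confidence and diarization are not.
fn whisper_word(text: &str, start: f32, end: f32) -> WordSegment {
    WordSegment {
        text: text.to_string(),
        start_time: start,
        end_time: end,
        confidence: None,
        speaker_id: None,
    }
}

fn main() {
    let w = whisper_word("hello", 0.0, 0.42);
    // Degraded fields are valid `none` values, not errors.
    assert!(w.confidence.is_none());
    assert!(w.speaker_id.is_none());
    println!("{:?}", w);
}
```

The point of the sketch is that callers written against the common API keep working: they already have to handle none for these fields, so a degraded provider is indistinguishable from a provider that simply did not produce the data.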

Configuration

Each implementation should support the following environment variables:

Common

  • STT_PROVIDER_ENDPOINT
  • STT_PROVIDER_TIMEOUT (default: 30)
  • STT_PROVIDER_MAX_RETRIES (default: 3)
  • STT_PROVIDER_LOG_LEVEL
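A minimal sketch of loading the common variables with the defaults stated above. The struct and function names, and the "info" fallback for the log level, are assumptions for illustration; only the variable names and the 30/3 defaults come from this spec.

```rust
use std::env;
use std::time::Duration;

/// Common configuration shared by all providers. Variable names and the
/// timeout/retry defaults follow the issue text; everything else here
/// (struct name, "info" log-level fallback) is illustrative.
#[derive(Debug, PartialEq)]
struct CommonConfig {
    endpoint: Option<String>,
    timeout: Duration,
    max_retries: u32,
    log_level: String,
}

fn load_common_config() -> CommonConfig {
    // Fall back to the documented default when the variable is unset
    // or does not parse as an integer.
    let parse_u64 = |key: &str, default: u64| {
        env::var(key).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
    };
    CommonConfig {
        endpoint: env::var("STT_PROVIDER_ENDPOINT").ok(),
        timeout: Duration::from_secs(parse_u64("STT_PROVIDER_TIMEOUT", 30)),
        max_retries: parse_u64("STT_PROVIDER_MAX_RETRIES", 3) as u32,
        log_level: env::var("STT_PROVIDER_LOG_LEVEL")
            .unwrap_or_else(|_| "info".to_string()), // assumed default
    }
}

fn main() {
    println!("{:?}", load_common_config());
}
```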

Provider-Specific

Google

  • GOOGLE_APPLICATION_CREDENTIALS
  • GOOGLE_CLOUD_PROJECT

Testing Requirements

Each implementation must be tested against:

  • Basic transcription for common formats (WAV, MP3)
  • Word-level timing and confidence (if available)
  • Speaker diarization (where supported)
  • Streaming transcription (where supported)
  • Error mappings: invalid inputs, rate limits, network errors
  • Edge cases: silence, overlapping speakers, long audio
  • Quota behavior (real or simulated)
  • Integration with Golem durability APIs
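For the error-mapping tests in particular, each provider needs a table from its transport-level failures to the stt-error variant. The sketch below shows one plausible shape of such a table for HTTP-based providers, using a Rust enum that mirrors a subset of the WIT variant; the concrete messages and the status-code choices are assumptions, and each provider's real mapping will differ.

```rust
/// Illustrative subset of the WIT `stt-error` variant as a Rust enum.
#[derive(Debug, PartialEq)]
enum SttError {
    Unauthorized(String),
    AccessDenied(String),
    RateLimited(u32), // retry-after, in seconds
    ServiceUnavailable(String),
    InternalError(String),
}

/// One plausible mapping from provider HTTP status codes to stt-error.
/// Each provider needs its own table; this only shows the shape.
fn map_http_status(status: u16, retry_after_secs: u32) -> SttError {
    match status {
        401 => SttError::Unauthorized("invalid or expired credentials".into()),
        403 => SttError::AccessDenied("caller lacks permission".into()),
        429 => SttError::RateLimited(retry_after_secs),
        500..=599 => SttError::ServiceUnavailable(format!("provider returned {status}")),
        other => SttError::InternalError(format!("unexpected status {other}")),
    }
}

fn main() {
    assert_eq!(map_http_status(429, 5), SttError::RateLimited(5));
    println!("{:?}", map_http_status(503, 0));
}
```

Tests can then assert on the variant directly (for example, that a 429 response surfaces as rate-limited with the provider's Retry-After value), independent of provider-specific message strings.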
WIT Specification

package golem:stt@1.0.0;

interface types {
  variant stt-error {
    invalid-audio(string),
    unsupported-format(string),
    unsupported-language(string),
    transcription-failed(string),
    unauthorized(string),
    access-denied(string),
    quota-exceeded(quota-info),
    rate-limited(u32),
    insufficient-credits,
    unsupported-operation(string),
    service-unavailable(string),
    network-error(string),
    internal-error(string),
  }

  record quota-info {
    used: u32,
    limit: u32,
    reset-time: u64,
    unit: quota-unit,
  }

  enum quota-unit {
    seconds,
    requests,
    credits,
  }

  type language-code = string;

  enum audio-format {
    wav,
    mp3,
    flac,
    ogg,
    aac,
    pcm,
  }

  record audio-config {
    format: audio-format,
    sample-rate: option<u32>,
    channels: option<u8>,
  }

  /// Only word-level timing is commonly supported
  enum timing-mark-type {
    word,
  }

  record timing-info {
    start-time-seconds: f32,
    end-time-seconds: option<f32>,
    mark-type: timing-mark-type,
  }

  record word-segment {
    text: string,
    start-time: f32,
    end-time: f32,
    confidence: option<f32>,
    speaker-id: option<string>,
  }

  record transcript-alternative {
    text: string,
    confidence: f32,
    words: list<word-segment>,
  }

  record transcription-metadata {
    duration-seconds: f32,
    audio-size-bytes: u32,
    request-id: string,
    model: option<string>,
    language: language-code,
  }

  record transcription-result {
    alternatives: list<transcript-alternative>,
    metadata: transcription-metadata,
  }
}

interface vocabularies {
  use types::stt-error;

  resource vocabulary {
    get-name: func() -> string;
    get-phrases: func() -> list<string>;
    delete: func() -> result<_, stt-error>;
  }

  create-vocabulary: func(
    name: string,
    phrases: list<string>
  ) -> result<vocabulary, stt-error>;
}

interface languages {
  use types::{language-code, stt-error};

  record language-info {
    code: language-code,
    name: string,
    native-name: string,
  }

  list-languages: func() -> result<list<language-info>, stt-error>;
}

interface transcription {
  use types::{
    audio-config,
    transcription-result,
    stt-error,
    language-code,
    transcript-alternative,
  };
  use vocabularies::vocabulary;

  record transcribe-options {
    enable-timestamps: option<bool>,
    enable-speaker-diarization: option<bool>,
    language: option<language-code>,
    model: option<string>,
    profanity-filter: option<bool>,
    vocabulary: option<borrow<vocabulary>>,
    speech-context: option<list<string>>,
    enable-word-confidence: option<bool>,
    enable-timing-detail: option<bool>,
  }

  transcribe: func(
    audio: list<u8>,
    config: audio-config,
    options: option<transcribe-options>
  ) -> result<transcription-result, stt-error>;

  transcribe-stream: func(
    config: audio-config,
    options: option<transcribe-options>
  ) -> result<transcription-stream, stt-error>;

  resource transcription-stream {
    send-audio: func(chunk: list<u8>) -> result<_, stt-error>;
    finish: func() -> result<_, stt-error>;
    receive-alternative: func() -> result<option<transcript-alternative>, stt-error>;
    close: func();
  }
}
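The transcription-stream resource implies a call order: send-audio zero or more times, then finish, then poll receive-alternative until it yields none, then close. The mock below sketches that protocol in plain Rust so the intended sequencing is unambiguous; it is not the real bindings, and the behavior of send-audio after finish (rejecting the chunk) is an assumption this spec leaves open.

```rust
/// Mock of the transcription-stream resource, illustrating the intended
/// call order: send-audio* -> finish -> receive-alternative until none.
/// Not the real wit-bindgen bindings; error strings are illustrative.
struct MockStream {
    finished: bool,
    pending: Vec<String>, // queued transcript alternatives
    bytes_received: usize,
}

impl MockStream {
    fn new() -> Self {
        MockStream { finished: false, pending: Vec::new(), bytes_received: 0 }
    }

    fn send_audio(&mut self, chunk: &[u8]) -> Result<(), String> {
        if self.finished {
            // Assumed behavior: sending after finish is an error.
            return Err("stream already finished".to_string());
        }
        self.bytes_received += chunk.len();
        Ok(())
    }

    fn finish(&mut self) -> Result<(), String> {
        self.finished = true;
        // A real component would flush buffered audio to the provider here.
        self.pending
            .push(format!("final transcript ({} bytes)", self.bytes_received));
        Ok(())
    }

    fn receive_alternative(&mut self) -> Result<Option<String>, String> {
        Ok(if self.pending.is_empty() { None } else { Some(self.pending.remove(0)) })
    }
}

fn main() {
    let mut s = MockStream::new();
    s.send_audio(&[0u8; 1600]).unwrap();
    s.send_audio(&[0u8; 1600]).unwrap();
    s.finish().unwrap();
    while let Some(alt) = s.receive_alternative().unwrap() {
        println!("{alt}");
    }
}
```

Providers without streaming support (e.g. Whisper) would return unsupported-operation from transcribe-stream itself rather than hand back a stream that fails later.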
