Add speech to text support #85

Merged
vigoo merged 21 commits into golemcloud:main from kmatasfp:stt-support
Aug 22, 2025

Conversation

@kmatasfp
Contributor

@kmatasfp kmatasfp commented Aug 13, 2025

closes #30
/claim #30

Follows the ports and adapters pattern, meaning only the edge layer knows about WIT and wasm; the core knows nothing about them. This makes it possible to reuse the core for other targets and, more importantly, makes it unit testable. Every provider has unit test coverage, with over 100 unit tests in total across providers.

Also uses a more idiomatic way to convert from WIT types to domain types, via the From and TryFrom traits instead of a conversion.rs module.
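
The TryFrom approach can be sketched roughly like this (the module and type names below are illustrative stand-ins, not the actual generated bindings from this PR):

```rust
// Illustrative stand-in for a wit-bindgen generated record; the real
// generated names live only at the edge layer.
pub mod wit_edge {
    pub struct TimingInfo {
        pub start_time_seconds: f32,
        pub end_time_seconds: f32,
    }
}

// Core domain type: knows nothing about WIT or wasm.
#[derive(Debug, Clone, PartialEq)]
pub struct DomainTiming {
    pub start: f32,
    pub end: f32,
}

#[derive(Debug, PartialEq)]
pub enum DomainError {
    InvalidTiming,
}

// Edge-layer conversion via TryFrom instead of a conversion.rs module:
// validation happens once, at the boundary.
impl TryFrom<wit_edge::TimingInfo> for DomainTiming {
    type Error = DomainError;

    fn try_from(w: wit_edge::TimingInfo) -> Result<Self, Self::Error> {
        if w.end_time_seconds < w.start_time_seconds {
            return Err(DomainError::InvalidTiming);
        }
        Ok(DomainTiming {
            start: w.start_time_seconds,
            end: w.end_time_seconds,
        })
    }
}
```

With this shape the core can be unit tested against DomainTiming alone, and a different target only needs its own TryFrom impl at its own edge.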

Made some changes to the WIT. As discussed with @jdegoes and @vigoo, we currently cannot support WebSocket, HTTP/2, or gRPC, so we cannot provide streaming functionality. I therefore focused on the batch use case and made it possible to transcribe multiple chunks of audio concurrently using transcribe-many. It is also possible to transcribe multi-channel audio, which is useful when working with call center recordings, as these generally contain 2 channels. Finally, the interface was made more user friendly, e.g. by making it easy to access the full transcription.

Here is the changed WIT:

package golem:stt@1.0.0;

interface types {
  variant stt-error {
    invalid-audio(string),
    unsupported-format(string),
    unsupported-language(string),
    transcription-failed(string),
    unauthorized(string),
    access-denied(string),
    rate-limited(string),
    insufficient-credits,
    unsupported-operation(string),
    service-unavailable(string),
    network-error(string),
    internal-error(string),
  }

  type language-code = string;

  enum audio-format {
    wav,
    mp3,
    flac,
    ogg,
    aac,
    pcm,
  }

  record audio-config {
    format: audio-format,
    sample-rate: option<u32>,
    channels: option<u8>,
  }

  record timing-info {
    start-time-seconds: f32,
    end-time-seconds: f32,
  }

  record word-segment {
    text: string,
    timing-info: option<timing-info>,
    confidence: option<f32>,
    speaker-id: option<string>,
  }

  record transcription-metadata {
    duration-seconds: f32,
    audio-size-bytes: u32,
    request-id: string,
    model: option<string>,
    language: language-code,
  }

  record transcription-channel {
    id: string,
    transcript: string,
    segments: list<transcription-segment>
  }

  record transcription-segment {
    transcript: string,
    timing-info: option<timing-info>,
    speaker-id: option<string>,
    words: list<word-segment>,
  }

  record transcription-result {
    transcript-metadata: transcription-metadata,
    channels: list<transcription-channel>
  }
}

interface languages {
  use types.{language-code, stt-error};

  record language-info {
    code: language-code,
    name: string,
    native-name: string,
  }

  list-languages: func() -> result<list<language-info>, stt-error>;
}

interface transcription {
  use types.{
    audio-config,
    transcription-result,
    stt-error,
    language-code,
  };

  record phrase {
     value: string,
     boost: option<f32>
  }

  record vocabulary {
     phrases: list<phrase>
  }

  record diarization-options {
    enabled: bool,
    min-speaker-count: option<u32>,
    max-speaker-count: option<u32>,
  }

  record transcribe-options {
    language: option<language-code>,
    model: option<string>,
    profanity-filter: option<bool>,
    vocabulary: option<vocabulary>,
    diarization: option<diarization-options>,
    enable-multi-channel: option<bool>
  }

  record transcription-request {
    request-id: string,
    audio: list<u8>,
    config: audio-config,
    options: option<transcribe-options>
  }

  record failed-transcription {
    request-id: string,
    error: stt-error,
  }

  record multi-transcription-result {
    successes: list<transcription-result>,
    failures: list<failed-transcription>
  }

  transcribe: func(
    request: transcription-request
  ) -> result<transcription-result, stt-error>;

  transcribe-many: func(
    requests: list<transcription-request>
  ) -> result<multi-transcription-result, stt-error>;
}

world stt-library {
    export types;
    export languages;
    export transcription;
}
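
As a rough illustration of how a caller might recover the full transcript from the multi-channel result above, here is a sketch using plain-Rust mirrors of the WIT records (not the generated bindings, and the helper name is hypothetical):

```rust
// Plain-Rust mirrors of the transcription-channel / transcription-result
// records above, for illustration only.
pub struct TranscriptionChannel {
    pub id: String,
    pub transcript: String,
}

pub struct TranscriptionResult {
    pub channels: Vec<TranscriptionChannel>,
}

// One way to assemble the full transcript across channels, e.g. merging
// both legs of a two-channel call-center recording into one labeled text.
pub fn full_transcript(result: &TranscriptionResult) -> String {
    result
        .channels
        .iter()
        .map(|c| format!("[{}] {}", c.id, c.transcript))
        .collect::<Vec<_>>()
        .join("\n")
}
```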

Currently only Deepgram allows the user to set a custom endpoint, as it is the only self-hostable provider.

[dev-dependencies]
aws-sigv4 = "1.2.6"
aws-config = { version = "1.5.10", features = ["behavior-version-latest"] }
aws-credential-types = "1.2.1"
Contributor

Hey there! I was just wondering, do you think we could give aws-sdk-transcribe a try?

Contributor Author

@kmatasfp kmatasfp Aug 13, 2025

I tried to use the AWS SDK, but there are a few caveats why I did not end up using it:

  • cannot deploy the app to Golem with debug symbols enabled; the wasm file is over 50 MB, and Golem imposes a maximum file size for wasm files that can be deployed
  • the AWS SDK requires Send + Sync everywhere, so I needed to add unsafe impls all over the place
  • I still had to provide my own HTTP client that works with the wasm target, and make it so it does not copy the body, so it is not really an out-of-the-box solution
  • minor but still relevant: build time using the AWS SDK is noticeably slower

Contributor

@rutikthakre rutikthakre Aug 14, 2025

I’m a bit confused now.
For llm/bedrock, you reviewed and upvoted the PR using the Bedrock crate, even though it had the issues you mentioned above: a WASM build size of 86.36 MB and repeated use of unsafe code.

But you downvoted the alternative approach PR #50 that didn’t have these problems and was identical to other provider implementations, saying it wasn’t good for code maintainability. Now you seem to be using the same approach as PR #50, which is why I’m even more confused.

In case you don’t remember, here are the PRs I’m referring to:

  1. Using Bedrock crate → feat: integrates Aws bedrock into golem #27
  2. Using Reqwest HTTP client → feat: aws bedrock llm implementation #50

Contributor

@Rutik7066, if there had been no PR using the AWS library, your PR might have been accepted since it met the conditions of the bounty. But since there were options to choose from, other factors were considered, like maintainability, as you have mentioned.
@kmatasfp has spent a lot of time on his PR and I am sure he considered a lot of options and discovered certain limitations before settling on this approach.

If you think you can get a more maintainable version using the AWS SDKs then go for it @Rutik7066; it is an open bounty and the best man will win at the end of the day.

Contributor

@rutikthakre rutikthakre Aug 14, 2025

Hey @iambenkay, I think there’s a bit of a misunderstanding — I’m not trying to compete with him at all. I’m wrapping up my TTS work, but after seeing his PR, I realized I need to rethink my Polly implementation, so I started discussing it with him.

My earlier question was about your PR and mine for Bedrock. I agree the issue he found with the crate approach is a big concern for production, which is why I'm asking: why did he upvote the crate approach despite the repeated unsafe code and the 86.36 MB WASM size, but not use that same approach here? And why is he now following the approach he previously downvoted? Why weren't those concerns more important than code maintainability in the Bedrock case, but they are here?

I’m just trying to understand the reasoning, not argue. Now I will wait for @kmatasfp’s response to decide about my Polly implementation. Thanks!

Contributor Author

@Rutik7066 Can you point out when/where I down-voted your solution? The only thing I did was review #27.

@kmatasfp kmatasfp marked this pull request as ready for review August 14, 2025 20:24
@@ -0,0 +1,9 @@
{
Collaborator

Please remove this file

// * runtime_path: "wit_bindgen_rt"
// * with "golem:exec/executor@1.0.0" = "golem_exec::golem::exec::executor"
// * with "golem:exec/types@1.0.0" = "golem_exec::golem::exec::types"
// * with "golem:exec/executor@1.0.0" = "golem_exec::golem::exec::executor"
Collaborator

Please revert the unrelated bindings.rs changes (wit-bindgen prints these in a nondeterministic order, unfortunately)

@vigoo vigoo merged commit 49d374c into golemcloud:main Aug 22, 2025
7 checks passed
@devdairy699 devdairy699 mentioned this pull request Nov 10, 2025

Development

Successfully merging this pull request may close these issues.

Implement Durable Speech-to-Text Provider Components for golem:stt WIT Interface

4 participants