Add speech to text support #85

Merged
vigoo merged 21 commits into golemcloud:main from kmatasfp:stt-support
Aug 22, 2025

Conversation

@kmatasfp
Contributor

@kmatasfp kmatasfp commented Aug 13, 2025

closes #30
/claim #30

Follows the ports and adapters pattern, meaning only the edge layer knows about WIT and wasm; the core knows nothing about them. This makes it possible to reuse the core for other targets and, more importantly, makes it unit testable. Every provider has unit test coverage, with over 100 unit tests in total across providers.

Also uses a more idiomatic way to convert from WIT types to domain types, via the From and TryFrom traits instead of a conversion.rs module.
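
The TryFrom approach can be sketched roughly like this (the module and type names below are illustrative stand-ins, not the actual generated bindings from this PR):

```rust
// Illustrative stand-in for a wit-bindgen generated record; the real
// generated names live only at the edge layer.
pub mod wit_edge {
    pub struct TimingInfo {
        pub start_time_seconds: f32,
        pub end_time_seconds: f32,
    }
}

// Core domain type: knows nothing about WIT or wasm.
#[derive(Debug, Clone, PartialEq)]
pub struct DomainTiming {
    pub start: f32,
    pub end: f32,
}

#[derive(Debug, PartialEq)]
pub enum DomainError {
    InvalidTiming,
}

// Edge-layer conversion via TryFrom instead of a conversion.rs module:
// validation happens once, at the boundary.
impl TryFrom<wit_edge::TimingInfo> for DomainTiming {
    type Error = DomainError;

    fn try_from(w: wit_edge::TimingInfo) -> Result<Self, Self::Error> {
        if w.end_time_seconds < w.start_time_seconds {
            return Err(DomainError::InvalidTiming);
        }
        Ok(DomainTiming {
            start: w.start_time_seconds,
            end: w.end_time_seconds,
        })
    }
}
```

With this shape the core can be unit tested against DomainTiming alone, and a different target only needs its own TryFrom impl at its own edge.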

Made some changes to the WIT. As discussed with @jdegoes and @vigoo, we currently cannot support WebSocket, HTTP/2, or gRPC, so we cannot provide streaming functionality. I therefore focused on the batch use case and made it possible to transcribe multiple chunks of audio concurrently using transcribe-many. It is also possible to transcribe multi-channel audio, which is useful when working with call center recordings, as these generally contain 2 channels. Finally, the interface was made more user friendly, e.g. by making it easy to access the full transcription.

Here is the changed WIT:

package golem:stt@1.0.0;

interface types {
  variant stt-error {
    invalid-audio(string),
    unsupported-format(string),
    unsupported-language(string),
    transcription-failed(string),
    unauthorized(string),
    access-denied(string),
    rate-limited(string),
    insufficient-credits,
    unsupported-operation(string),
    service-unavailable(string),
    network-error(string),
    internal-error(string),
  }

  type language-code = string;

  enum audio-format {
    wav,
    mp3,
    flac,
    ogg,
    aac,
    pcm,
  }

  record audio-config {
    format: audio-format,
    sample-rate: option<u32>,
    channels: option<u8>,
  }

  record timing-info {
    start-time-seconds: f32,
    end-time-seconds: f32,
  }

  record word-segment {
    text: string,
    timing-info: option<timing-info>,
    confidence: option<f32>,
    speaker-id: option<string>,
  }

  record transcription-metadata {
    duration-seconds: f32,
    audio-size-bytes: u32,
    request-id: string,
    model: option<string>,
    language: language-code,
  }

  record transcription-channel {
    id: string,
    transcript: string,
    segments: list<transcription-segment>
  }

  record transcription-segment {
    transcript: string,
    timing-info: option<timing-info>,
    speaker-id: option<string>,
    words: list<word-segment>,
  }

  record transcription-result {
    transcript-metadata: transcription-metadata,
    channels: list<transcription-channel>
  }
}

interface languages {
  use types.{language-code, stt-error};

  record language-info {
    code: language-code,
    name: string,
    native-name: string,
  }

  list-languages: func() -> result<list<language-info>, stt-error>;
}

interface transcription {
  use types.{
    audio-config,
    transcription-result,
    stt-error,
    language-code,
  };

  record phrase {
     value: string,
     boost: option<f32>
  }

  record vocabulary {
     phrases: list<phrase>
  }

  record diarization-options {
    enabled: bool,
    min-speaker-count: option<u32>,
    max-speaker-count: option<u32>,
  }

  record transcribe-options {
    language: option<language-code>,
    model: option<string>,
    profanity-filter: option<bool>,
    vocabulary: option<vocabulary>,
    diarization: option<diarization-options>,
    enable-multi-channel: option<bool>
  }

  record transcription-request {
    request-id: string,
    audio: list<u8>,
    config: audio-config,
    options: option<transcribe-options>
  }

  record failed-transcription {
    request-id: string,
    error: stt-error,
  }

  record multi-transcription-result {
    successes: list<transcription-result>,
    failures: list<failed-transcription>
  }

  transcribe: func(
    request: transcription-request
  ) -> result<transcription-result, stt-error>;

  transcribe-many: func(
    requests: list<transcription-request>
  ) -> result<multi-transcription-result, stt-error>;
}

world stt-library {
    export types;
    export languages;
    export transcription;
}
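
As a rough illustration of how a caller might recover the full transcript from the multi-channel result above, here is a sketch using plain-Rust mirrors of the WIT records (not the generated bindings, and the helper name is hypothetical):

```rust
// Plain-Rust mirrors of the transcription-channel / transcription-result
// records above, for illustration only.
pub struct TranscriptionChannel {
    pub id: String,
    pub transcript: String,
}

pub struct TranscriptionResult {
    pub channels: Vec<TranscriptionChannel>,
}

// One way to assemble the full transcript across channels, e.g. merging
// both legs of a two-channel call-center recording into one labeled text.
pub fn full_transcript(result: &TranscriptionResult) -> String {
    result
        .channels
        .iter()
        .map(|c| format!("[{}] {}", c.id, c.transcript))
        .collect::<Vec<_>>()
        .join("\n")
}
```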

Currently only Deepgram allows the user to set a custom endpoint, as it is the only self-hostable provider.

[dev-dependencies]
aws-sigv4 = "1.2.6"
aws-config = { version = "1.5.10", features = ["behavior-version-latest"] }
aws-credential-types = "1.2.1"
Contributor

Hey there! I was just wondering, do you think we could give aws-sdk-transcribe a try?

Contributor Author

@kmatasfp kmatasfp Aug 13, 2025

I tried to use the AWS SDK, but there are a few caveats why I did not end up using it:

  • cannot deploy the app to Golem with debug symbols enabled; the wasm file is over 50 MB, and Golem imposes a maximum file size for wasm files that can be deployed
  • the AWS SDK requires Send + Sync everywhere, so I needed to add unsafe impls all over the place
  • I still had to provide my own HTTP client that works with the wasm target, and make it so it does not copy the body, so it is not really an out-of-the-box solution
  • minor but still relevant: build time using the AWS SDK is noticeably slower

Contributor

@rutikthakre rutikthakre Aug 14, 2025

I’m a bit confused now.
For llm/bedrock, you reviewed and upvoted the PR using the Bedrock crate, even though it had the issues you mentioned above: a WASM build size of 86.36 MB and repeated use of unsafe code.

But you downvoted the alternative approach PR #50 that didn’t have these problems and was identical to other provider implementations, saying it wasn’t good for code maintainability. Now you seem to be using the same approach as PR #50, which is why I’m even more confused.

In case you don’t remember, here are the PRs I’m referring to:

  1. Using Bedrock crate → feat: integrates Aws bedrock into golem #27
  2. Using Reqwest HTTP client → feat: aws bedrock llm implementation #50

Contributor

@Rutik7066, if there had been no PR using the AWS library, your PR might have been accepted since it met the conditions of the bounty. But since there were options to choose from, other factors were considered, like maintainability, as you have mentioned.
@kmatasfp has spent a lot of time on his PR and I am sure he considered a lot of options and discovered certain limitations before settling on this approach.

If you think you can get a more maintainable version using the AWS SDKs then go for it @Rutik7066; it is an open bounty and the best man will win at the end of the day.

Contributor

@rutikthakre rutikthakre Aug 14, 2025

Hey @iambenkay, I think there’s a bit of a misunderstanding — I’m not trying to compete with him at all. I’m wrapping up my TTS work, but after seeing his PR, I realized I need to rethink my Polly implementation, so I started discussing it with him.

My earlier question was about your PR and mine for Bedrock. I agree the issue he found with the crate approach is a big concern for production, which is why I'm asking: why did he upvote the crate approach despite the repeated unsafe code and the 86.36 MB WASM size, but not use that same approach here? And why is he now following the approach he previously downvoted? Why weren't those concerns more important than code maintainability in the Bedrock case, but they are here?

I’m just trying to understand the reasoning, not argue. Now I will wait for @kmatasfp’s response to decide about my Polly implementation. Thanks!

Contributor Author

@Rutik7066 Can you point out when/where I down-voted your solution? The only thing I did was review #27.

@kmatasfp kmatasfp marked this pull request as ready for review August 14, 2025 20:24
@@ -0,0 +1,9 @@
{
Collaborator

Please remove this file

// * runtime_path: "wit_bindgen_rt"
// * with "golem:exec/executor@1.0.0" = "golem_exec::golem::exec::executor"
// * with "golem:exec/types@1.0.0" = "golem_exec::golem::exec::types"
// * with "golem:exec/executor@1.0.0" = "golem_exec::golem::exec::executor"
Collaborator

Please revert the unrelated bindings.rs changes (wit-bindgen prints these in a nondeterministic order, unfortunately)

@vigoo vigoo merged commit 49d374c into golemcloud:main Aug 22, 2025
7 checks passed
@devdairy699 devdairy699 mentioned this pull request Nov 10, 2025

Development

Successfully merging this pull request may close these issues.

Implement Durable Speech-to-Text Provider Components for golem:stt WIT Interface

4 participants