
Durable video generation (golem:video)#51

Merged
mschuwalow merged 3 commits into golemcloud:main from
Nanashi-lab:video-generation
Jul 25, 2025

Conversation

@Nanashi-lab
Contributor

@Nanashi-lab Nanashi-lab commented Jun 27, 2025

/closes #44
/claim #44

There are 5 tests covering all 4 providers:

  1. Text to Video
  2. Image to Video; also tests polling durability, using raw bytes as the input image
  3. Image to Video; uses a URL image as input, and also tests role Last for Runway
  4. Video to Video (extend video); Veo only, uses the video from Test 1, passes raw bytes, output goes to a GCS bucket
  5. Video Upscale (advanced feature); Runway only, uses a video URL as input

Runway Test Video
Stability Test Video
Kling Test Video
Veo Test Video

Advanced tests for Kling only; there are 11 tests here, covering first and last image, advanced camera config, masking features, lip-sync, extend video, and multi-image to video.

Kling Advanced Test Video

There is no audio in the video, but since 4 of the tests (single effect, lip-sync1, lip-sync2, and lip-sync3) have audio, I have appended the actual output files with audio at the end.

Note - All Kling tests have been cut in the middle, as each generation/polling cycle takes 3-7 minutes. Some of the advanced tests do multiple generations one after another, for example text2video -> extend -> lip-sync.

Official documentation.

Kling
Veo
Runway
Stability

@Nanashi-lab
Contributor Author

Nanashi-lab commented Jun 27, 2025

@jdegoes hi, please help me with some clarifications and WIT changes. I have also proposed a WIT with the changes in the next comment.

Current State of PR (Completed parts)

  • Image-to-Video: all providers support this natively.
  • Text-to-Video: Runway and Stability lack native support, so we do a text-to-image generation followed by an image-to-video generation.
  • Durability and test component
  • Durability and test component

Wit Changes

Config

enum image-role {
general,
style,
character,
composition,
}

This enum does not align with any of the provider APIs.
Suggested replacement -

  1. [First, Last]. Runway, Veo, and Kling all support specifying whether the image is the first frame or the last frame.

record character-consistency {
reference-images: list<input-image>,
strength: option<f32>,
}

record style-consistency {
reference-images: list<input-image>,
strength: option<f32>,
}

This config is from Runway text-to-image; since I am doing text-to-image as part of text-to-video, I can fit this, but it feels out of place and better suited to golem:image. Character consistency and style consistency are maintained by default for all providers.
Suggested replacements -

  1. LastFrame (Kling, can accept both first and last frame)
  2. Multi-Image to video (Kling only, this is a separate endpoint, moved to the bottom)
  3. Advanced Kling camera and mask controls (moved to the bottom)

Minor changes -

  • Added model to the config.
  • Added an optional prompt to images; all providers (except Stability) accept a prompt as part of image-to-video.
  • Audio input and video input are not supported for video generation.
  • All generating functions now output -> result<string, video-error>. This passes the error much better than storing it internally and using a UUID to pass values.

Avatar

record avatar {
id: string,
name: string,
preview: option<string>,
}

This matches Kling's lip-sync; maybe they supported avatars in the past, but now Kling can do lip-sync on any input video (polling returns a failed(face-detection) error if no face is found).

text: string,
voice-id: option<string>,
background: option<string>
) -> string;

voice-id matches how Kling supports audio: in the speak function it is a choice of either [voice-id, text, speed] or [input audio file], with no background audio for either.
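The either/or choice above maps naturally onto a sum type with exhaustive matching. A minimal Rust sketch (all names here are illustrative, not the actual bindings):

```rust
// Hypothetical model of the Kling speak-function choice: either
// text-to-speech parameters or a raw audio file, never both.
enum AudioSource {
    FromText { text: String, voice_id: Option<String>, speed: u32 },
    FromAudio { bytes: Vec<u8> },
}

// Exhaustive matching guarantees every audio source kind is handled.
fn describe(src: &AudioSource) -> String {
    match src {
        AudioSource::FromText { text, voice_id, speed } => {
            format!("tts: {:?} voice={:?} speed={}", text, voice_id, speed)
        }
        AudioSource::FromAudio { bytes } => format!("audio: {} bytes", bytes.len()),
    }
}

fn main() {
    let a = AudioSource::FromText { text: "hello".into(), voice_id: None, speed: 1 };
    let b = AudioSource::FromAudio { bytes: vec![0; 3] };
    println!("{}", describe(&a));
    println!("{}", describe(&b));
}
```

The compiler rejects any code path that handles only one of the two sources, which is the safety the variant buys over separate optional fields.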


Effects

  1. extend-video - supported by both Veo and Kling

Both style-guide and background-removal are for image-to-image (supported by Runway and Stability).

Suggested Replacements -

  1. Separate extend-video into a new function
  2. Runway supports video upscaling

Others

  • Kling supports "video-effects": it takes one or two images plus an enum and outputs a video, e.g. two images of people and a "hug" effect to create a video of them hugging.
  • Kling supports multi-image to video (up to 4 images and a prompt). This is different from image-to-video, both by endpoint and in what it does: it uses the images to make a composite, and uses that as the starting frame, e.g. an image of a boy, a pegasus, and a castle with the prompt "a boy riding a pegasus in front of a castle".
  • Kling supports advanced camera configs (which cannot be neatly fit into provider options) and also supports masks to decide which parts not to animate (dynamic and static).

Template

I did not understand this at all; I could not find any API references. Am I meant to pre-create a template with an already-existing prompt/image so it can be used as a test?


I am fairly confident in my proposed changes, as I am familiar with the APIs now that I have implemented text-to-video and image-to-video.

Official documentation.

Kling
Veo
Runway
Stability

@Nanashi-lab Nanashi-lab marked this pull request as draft June 27, 2025 11:25
@Nanashi-lab
Contributor Author

Nanashi-lab commented Jun 27, 2025

This is my proposed WIT; it mirrors the features available from the providers while remaining consistent with the original WIT. It does not include the Kling advanced camera and mask options.

package golem:video-generation

interface types {
  variant video-error {
    invalid-input(string),
    unsupported-feature(string),
    quota-exceeded,
    generation-failed(string),
    cancelled,
    internal-error(string),
  }

  variant media-input {
    text(string),
    image(reference),
  }

// Added prompt
  record reference {
    data: input-image,
    prompt: option<string>,
    role: option<image-role>,
  }

// Changed to first and last
  enum image-role {
    first,
    last,
  }

  record input-image {
   data: media-data,
  }
  record base-video {
    data: media-data,
  }

  record narration {
    data: media-data,
  }

  variant media-data {
    url(string),
    bytes(list<u8>),
  }

  record generation-config {
    negative-prompt: option<string>,
    seed: option<u64>,
    scheduler: option<string>,
    guidance-scale: option<f32>,
    aspect-ratio: option<aspect-ratio>,
    duration-seconds: option<f32>,
    resolution: option<resolution>,
    enable-audio: option<bool>,
    enhance-prompt: option<bool>,
    provider-options: list<kv>,
    /// Added model and lastframe (Kling only)
    model: option<string>,
    lastframe: option<input-image>,
  }

  enum aspect-ratio {
    square,
    portrait,
    landscape,
    cinema,
  }

  enum resolution {
    sd,
    hd,
    fhd,
    uhd,
  }

  record kv {
    key: string,
    value: string,
  }

  record video {
    uri: option<string>,
    base64-bytes: option<list<u8>>,
    mime-type: string,
    width: option<u32>,
    height: option<u32>,
    fps: option<f32>,
    duration-seconds: option<f32>,
  }

  variant job-status {
    pending,
    running,
    succeeded,
    failed(string),
  }

  record video-result {
    status: job-status,
    videos: option<list<video>>,
    metadata: option<list<kv>>,
  }
}

interface video-generation {
  use types.{media-input, generation-config, video-result, video-error};
  
  // changed output from string to result<string, video-error>
  // easier to pass input-invalid, generation error
  // for all generate func
  generate: func(input: media-input, config: generation-config) -> result<string, video-error>;
  poll: func(job-id: string) -> result<video-result, video-error>;
  cancel: func(job-id: string) -> result<string, video-error>;
}

interface lip-sync {
  use types.{video-error, media-data};

// Define the two possible audio sources: voice-id/text or input audio.
// (WIT variant cases carry a single payload type, so the text case is a record.)
  record text-to-speech {
    text: string,
    voice-id: option<string>,
    speed: u32,
  }

  variant audio-source {
    from-text(text-to-speech),
    from-audio(media-data),
  }

  generate: func(
    base-video: media-data,
    audio: audio-source,
  ) -> result<string, video-error>;

  record voice-info {
    voice-id: string,
    name: string,
    language: string,
    gender: option<string>,
    preview-url: option<string>,
  }

  list-voices: func(language: option<string>) -> result<list<voice-info>, video-error>;
}

interface advanced {
  use types.{video-error, kv, base-video, input-image, generation-config};

  // Supported in Kling and Veo
  extend-video: func(
    input: base-video,
    prompt: option<string>,
    duration: option<f32>,
  ) -> result<string, video-error>;

  // Supported in Runway
  upscale-video: func(
    input: base-video,
  ) -> result<string, video-error>;

  // Supported in Kling only
  video-effects: func(
    input: input-image,
    second-image: option<input-image>,
    effect: string,
  ) -> result<string, video-error>;

  // Multi-image generation, Kling only
  multi-image-generation: func(
    input: input-image,
    other-images: list<input-image>, // up to a max of 3 more
    config: generation-config,
  ) -> result<string, video-error>;
}

// I have left this as is; I would like a clarification on it
// I also don't get why there is no introspection
interface templates {
  use types.{video-error, kv};
  generate-from-template: func(
    template-id: string,
    variables: list<kv>
  ) -> string;
}

world video-generation {
  import types;
  import video-generation;
  import lip-sync;
  import advanced;
  import templates;

  export api: video-generation;
  export lip-sync;
  export template-videos: templates;
  export video-effects: advanced;
}
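The generate/poll/cancel split in the WIT above can be exercised from the caller's side against a simulated backend. This is a hedged sketch only: `FakeJobStore` and all names in it are hypothetical stand-ins for a provider API, not part of the actual bindings.

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq, Debug)]
enum JobStatus { Pending, Running, Succeeded, Failed(String) }

// Simulated provider backend: each job finishes after a fixed number of polls.
struct FakeJobStore {
    jobs: HashMap<String, (JobStatus, u32)>, // status + polls remaining until done
    next_id: u32,
}

impl FakeJobStore {
    fn new() -> Self { Self { jobs: HashMap::new(), next_id: 0 } }

    // Mirrors `generate: func(...) -> result<string, video-error>`:
    // the job id comes back immediately, invalid input is surfaced up front.
    fn generate(&mut self, prompt: &str) -> Result<String, String> {
        if prompt.is_empty() {
            return Err("invalid-input: empty prompt".to_string());
        }
        self.next_id += 1;
        let id = format!("job-{}", self.next_id);
        self.jobs.insert(id.clone(), (JobStatus::Pending, 2));
        Ok(id)
    }

    // Mirrors `poll: func(job-id: string) -> result<video-result, video-error>`.
    fn poll(&mut self, job_id: &str) -> Result<JobStatus, String> {
        match self.jobs.get_mut(job_id) {
            None => Err(format!("invalid-input: unknown job {job_id}")),
            Some((status, remaining)) => {
                if *remaining > 0 {
                    *remaining -= 1;
                    *status = JobStatus::Running;
                } else {
                    *status = JobStatus::Succeeded;
                }
                Ok(status.clone())
            }
        }
    }
}

fn main() {
    let mut store = FakeJobStore::new();
    assert!(store.generate("").is_err()); // errors flow back through result
    let id = store.generate("a cat surfing").unwrap();
    let mut status = store.poll(&id).unwrap();
    while status == JobStatus::Running || status == JobStatus::Pending {
        status = store.poll(&id).unwrap();
    }
    assert_eq!(status, JobStatus::Succeeded);
    println!("final status: {:?}", status);
}
```

Note how the `result<string, video-error>` return on `generate` lets the empty-prompt error surface immediately, rather than being stored internally and fished out later via a UUID, which is the motivation given earlier in the thread.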

@jdegoes
Contributor

jdegoes commented Jun 27, 2025

@Nanashi-lab

I did not spend much time on this WIT so I am glad you took a closer look.

I like your proposed revisions and would suggest a few more:

  • Delete more strings, e.g. voice-info.gender and video-effect(effect: string). You can use enum or variant to encode the information much more precisely, in a way that is not "stringly-typed".
  • Instead of having a job-id (a pattern I used earlier), use a resource for the job so the user doesn't have to pass stringly-typed information.
  • Delete templates; it seems useless to me, as the same thing can be done in user-land.

@Nanashi-lab Nanashi-lab changed the title Durable video generation (golem:video-generation) Durable video generation (golem:video) Jul 10, 2025
@Nanashi-lab Nanashi-lab marked this pull request as ready for review July 17, 2025 14:52
@Nanashi-lab
Contributor Author

Nanashi-lab commented Jul 17, 2025

Changes to the WIT since last time -

  1. Left job-id as a string rather than a resource; I wasn't sure what a resource would look like and whether it can be passed. Even in testing I was polling with the job-id as a string input, rather than regenerating the whole request.

  2. Extend video: both Veo and Kling support extension. Veo supports it as part of image-to-video by taking a video as input (it must be a Veo-generated video).
     Kling also supports video extension, but using a video-id, which is a global identifier for each video generated by Kling, so I have separated the two:
     Veo extend is part of the generate-video function, taking a video as input,
     while the extend function in advanced is used for Kling, which takes a video-id and outputs the extended video.

  3. Since we are using video-id, lip-sync, which supports both video input and video-id, now also accepts a video-id.

There were other minor changes: added advanced camera-config (always the plan), made provider options optional rather than passing an empty KV, and minor adjustments in the advanced features to fit the APIs better. Multi-image is now just list<image>, rather than an image plus list<image>.
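The Veo-vs-Kling extend split described in point 2 can be sketched as a small routing enum; the names below are illustrative, not the actual WIT bindings:

```rust
// Hypothetical sketch: the two extend paths take different inputs,
// so a sum type makes the routing decision explicit at the type level.
enum ExtendInput {
    // Veo: extend takes the generated video itself (must be Veo-made)
    VeoVideo { bytes: Vec<u8> },
    // Kling: extend takes the provider-global video-id of a prior generation
    KlingVideoId(String),
}

// Which WIT function (per the proposal above) would handle this input.
fn route(input: &ExtendInput) -> &'static str {
    match input {
        ExtendInput::VeoVideo { .. } => "generate (video as media-input)",
        ExtendInput::KlingVideoId(_) => "advanced.extend-video (video-id)",
    }
}

fn main() {
    let kling = ExtendInput::KlingVideoId("some-video-id".into());
    let veo = ExtendInput::VeoVideo { bytes: Vec::new() };
    println!("{}", route(&kling));
    println!("{}", route(&veo));
}
```

A caller holding only a Kling video-id simply cannot construct the Veo path, which captures the constraint the comment describes.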

@jdegoes Thanks again for assigning this to me, this was a lot of fun. If golem:image is a possible bounty, I would love to be assigned to it. I can do my own research, make the WIT, confirm, and then implement.

@mschuwalow
Contributor

@Nanashi-lab could you resolve the conflicts please

@Nanashi-lab
Contributor Author

Removed all the minor changes; will make a separate PR for those.

  • There are also some minor unrelated edits: changed build-all and release-build-all back to manually copying files into components, as the method in latest does not work.
  • Made subfolders in test for the llm, video, and video-advanced golem apps.
  • Fixed the Ollama integration test BUG: Ollama test suite in CI is a false positive. It succeeds even though the tests don't run to completion #58
  • Split the contents of the README: test parts go in test, video parts go to video, and llm parts go to llm.
  • Moved build-test-components from each individual folder to the main makefile.toml

Contributor

@mschuwalow mschuwalow left a comment


Some minor cleanup, looks good to me otherwise!

Ok(PollResponse::Processing)
} else if status.is_success() {
// 200 - Complete, get video data
let video_bytes = response
Contributor


Reading the entire video into a byte array in memory is not great, as videos can be very large and workers should avoid using too much memory.

A better option would have been to read and write in a streaming fashion, e.g. to blob storage, and then return a blob-storage handle to the caller. We didn't spec the WIT that way though, so it is fine to leave as is.
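The streaming suggestion above can be sketched with a fixed-size chunk buffer, so memory stays bounded regardless of video size. The blob-storage sink is simulated with a `Vec` here; the real sink would be a blob-storage writer, which this sketch does not attempt to model.

```rust
use std::io::{self, Cursor, Read, Write};

// Copy from `src` to `sink` in fixed-size chunks; only `buf` (8 KiB)
// is held in memory at any moment, never the whole payload.
fn stream_copy<R: Read, W: Write>(mut src: R, mut sink: W) -> io::Result<u64> {
    let mut buf = [0u8; 8192];
    let mut total = 0u64;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 { break; } // EOF
        sink.write_all(&buf[..n])?;
        total += n as u64;
    }
    sink.flush()?;
    Ok(total)
}

fn main() {
    // Simulate a 1 MiB "video" body and a blob-storage sink.
    let video = vec![0xABu8; 1 << 20];
    let mut blob: Vec<u8> = Vec::new();
    let copied = stream_copy(Cursor::new(&video), &mut blob).unwrap();
    assert_eq!(copied, video.len() as u64);
    assert_eq!(blob, video);
    println!("copied {} bytes", copied);
}
```

In a real worker, `src` would be the HTTP response body reader and `sink` the blob-storage write stream, so peak memory is the chunk size rather than the full video.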

Contributor Author

@Nanashi-lab Nanashi-lab Jul 25, 2025


  • Moved video and video-advanced to the test project
  • Replaced infallible with persist and replay
  • Refactored logging into a single function in durability (similar to noise's PR)
  • There is a bug in test/video-advanced: it now needs cargo-component 0.21.1 or it errors (not sure why)
  • I also have a README.md for the video parts, which is not part of this PR.

@mschuwalow mschuwalow merged commit 9ac145d into golemcloud:main Jul 25, 2025
5 checks passed
@jdegoes
Contributor

jdegoes commented Jul 25, 2025

@Nanashi-lab

Thanks again for assigning this to me, this was a lot of fun. If golem:image is a possible bounty, I would love to be assigned to this. I can do my own research, make wit and confirm and then implement.

Please write up a ticket, and if I like it, I'll attach a bounty and give you a 3 week exclusive!



Development

Successfully merging this pull request may close these issues.

Implement Durable Video Generation for Multiple Providers (golem:video-generation)

4 participants