
Durable video generation (golem:video)#51

Merged
mschuwalow merged 3 commits into golemcloud:main from
Nanashi-lab:video-generation
Jul 25, 2025

Conversation

@Nanashi-lab
Contributor

@Nanashi-lab Nanashi-lab commented Jun 27, 2025

/closes #44
/claim #44

There are 5 tests covering all 4 providers:

  1. Text to Video
  2. Image to Video; also tests polling durability, using raw bytes as the input image
  3. Image to Video; uses a URL image as input, and also tests role Last for Runway
  4. Video to Video (extend video); Veo only, uses the video from Test 1, passes raw bytes, output goes to a GCS bucket
  5. Video Upscale (advanced feature); Runway only, uses a video URL as input

Runway Test Video
Stability Test Video
Kling Test Video
Veo Test Video

Advanced tests for Kling only; there are 11 tests here, covering first and last image, advanced camera config, masking features, lip-sync, extend video, and multi-image to video.

Kling Advanced Test Video

There is no audio in the video, but since 4 of the tests (single effect, lip-sync1, lip-sync2, and lip-sync3) have audio, I have appended the actual output files with audio at the end.

Note - All Kling tests have been cut in the middle, as each generation/polling cycle takes 3-7 minutes. Some of the advanced tests do multiple generations one after another, for example text2video -> extend -> lip-sync.

Official documentation.

Kling
Veo
Runway
Stability

@Nanashi-lab
Contributor Author

Nanashi-lab commented Jun 27, 2025

@jdegoes hi, please help me with some clarifications and WIT changes. I have also proposed a WIT with the changes in the next comment.

Current State of PR (Completed parts)

  • Image-to-Video: all providers support this natively.
  • Text-to-Video: Runway and Stability lack native support, so we do a text-to-image generation followed by an image-to-video generation.
  • Durability and test component
  • Durability and test component

Wit Changes

Config

enum image-role {
general,
style,
character,
composition,
}

This enum does not align with any of the provider APIs.
Suggested replacement -

  1. [First, Last]. Runway, Veo, and Kling all support specifying whether the image is the first frame or the last frame.

record character-consistency {
reference-images: list<input-image>,
strength: option<f32>,
}

record style-consistency {
reference-images: list<input-image>,
strength: option<f32>,
}

This config is from Runway text-to-image; since I am doing text-to-image as part of text-to-video, I can fit this, but it feels out of place and better suited to golem:image. Character consistency and style consistency are maintained by default for all providers.
Suggested replacements -

  1. LastFrame (Kling, can accept both first and last frame)
  2. Multi-Image to video (Kling only, this is a separate endpoint, moved to the bottom)
  3. Advanced Kling camera and mask controls (moved to the bottom)

Minor changes -

  • Added model to the config.
  • Added an optional prompt to images; all providers (except Stability) accept a prompt as part of image-to-video.
  • Audio input and video input are not supported for video generation.
  • All generating functions now output -> result<string, video-error>. This passes the error much better than storing it internally and using a UUID to pass values.

Avatar

record avatar {
id: string,
name: string,
preview: option<string>,
}

This matches Kling's lip-sync; maybe they supported avatars in the past, but now Kling can do lip-sync on any input video (polling returns a failed(face-detection) error if no face is found).

text: string,
voice-id: option<string>,
background: option<string>
) -> string;

voice-id matches how Kling supports audio: in the speak function it is a choice of either [voice-id, text, speed] or [input audio file], with no background audio for either.
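The either/or choice above maps naturally onto a sum type with exhaustive matching. A minimal Rust sketch (all names here are illustrative, not the actual bindings):

```rust
// Hypothetical model of the Kling speak-function choice: either
// text-to-speech parameters or a raw audio file, never both.
enum AudioSource {
    FromText { text: String, voice_id: Option<String>, speed: u32 },
    FromAudio { bytes: Vec<u8> },
}

// Exhaustive matching guarantees every audio source kind is handled.
fn describe(src: &AudioSource) -> String {
    match src {
        AudioSource::FromText { text, voice_id, speed } => {
            format!("tts: {:?} voice={:?} speed={}", text, voice_id, speed)
        }
        AudioSource::FromAudio { bytes } => format!("audio: {} bytes", bytes.len()),
    }
}

fn main() {
    let a = AudioSource::FromText { text: "hello".into(), voice_id: None, speed: 1 };
    let b = AudioSource::FromAudio { bytes: vec![0; 3] };
    println!("{}", describe(&a));
    println!("{}", describe(&b));
}
```

The compiler rejects any code path that handles only one of the two sources, which is the safety the variant buys over separate optional fields.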


Effects

  1. extend-video - supported by both Veo and Kling

Both style-guide and background-removal are for image-to-image (supported by Runway and Stability).

Suggested Replacements -

  1. Separate extend-video into a new function
  2. Runway supports video upscaling

Others

  • Kling supports "video-effects": it takes one or two images plus an enum and outputs a video, e.g. two images of people and a "hug" effect to create a video of them hugging.
  • Kling supports multi-image to video (up to 4 images and a prompt). This is different from image-to-video, both by endpoint and in what it does: it uses the images to make a composite, and uses that as the starting frame, e.g. an image of a boy, a pegasus, and a castle with the prompt "a boy riding a pegasus in front of a castle".
  • Kling supports advanced camera configs (which cannot be neatly fit into provider options) and also supports masks to decide which parts not to animate (dynamic and static).

Template

I did not understand this at all; I could not find any API references. Am I meant to pre-create a template with an already-existing prompt/image so it can be used as a test?


I am fairly confident in my proposed changes, as I am familiar with the APIs now that I have implemented text-to-video and image-to-video.

Official documentation.

Kling
Veo
Runway
Stability

@Nanashi-lab Nanashi-lab marked this pull request as draft June 27, 2025 11:25
@Nanashi-lab
Contributor Author

Nanashi-lab commented Jun 27, 2025

This is my proposed WIT; it mirrors the features available from the providers while remaining consistent with the original WIT. It does not include the Kling advanced camera and mask options.

package golem:video-generation

interface types {
  variant video-error {
    invalid-input(string),
    unsupported-feature(string),
    quota-exceeded,
    generation-failed(string),
    cancelled,
    internal-error(string),
  }

  variant media-input {
    text(string),
    image(reference),
  }

// Added prompt
  record reference {
    data: input-image,
    prompt: option<string>,
    role: option<image-role>,
  }

// Changed to first and last
  enum image-role {
    first,
    last,
  }

  record input-image {
   data: media-data,
  }
  record base-video {
    data: media-data,
  }

  record narration {
    data: media-data,
  }

  variant media-data {
    url(string),
    bytes(list<u8>),
  }

  record generation-config {
    negative-prompt: option<string>,
    seed: option<u64>,
    scheduler: option<string>,
    guidance-scale: option<f32>,
    aspect-ratio: option<aspect-ratio>,
    duration-seconds: option<f32>,
    resolution: option<resolution>,
    enable-audio: option<bool>,
    enhance-prompt: option<bool>,
    provider-options: list<kv>,
    /// Added model and lastframe (Kling only)
    model: option<string>,
    lastframe: option<input-image>,
  }

  enum aspect-ratio {
    square,
    portrait,
    landscape,
    cinema,
  }

  enum resolution {
    sd,
    hd,
    fhd,
    uhd,
  }

  record kv {
    key: string,
    value: string,
  }

  record video {
    uri: option<string>,
    base64-bytes: option<list<u8>>,
    mime-type: string,
    width: option<u32>,
    height: option<u32>,
    fps: option<f32>,
    duration-seconds: option<f32>,
  }

  variant job-status {
    pending,
    running,
    succeeded,
    failed(string),
  }

  record video-result {
    status: job-status,
    videos: option<list<video>>,
    metadata: option<list<kv>>,
  }
}

interface video-generation {
  use types.{media-input, generation-config, video-result, video-error};
  
  // changed output from string to result<string, video-error>
  // easier to pass input-invalid, generation error
  // for all generate func
  generate: func(input: media-input, config: generation-config) -> result<string, video-error>;
  poll: func(job-id: string) -> result<video-result, video-error>;
  cancel: func(job-id: string) -> result<string, video-error>;
}

interface lip-sync {
  use types.{video-error, media-data};

// Define the two possible audio sources: voice-id/text or input audio.
// (WIT variant cases carry a single payload type, so the text case is a record.)
  record text-to-speech {
    text: string,
    voice-id: option<string>,
    speed: u32,
  }

  variant audio-source {
    from-text(text-to-speech),
    from-audio(media-data),
  }

  generate: func(
    base-video: media-data,
    audio: audio-source,
  ) -> result<string, video-error>;

  record voice-info {
    voice-id: string,
    name: string,
    language: string,
    gender: option<string>,
    preview-url: option<string>,
  }

  list-voices: func(language: option<string>) -> result<list<voice-info>, video-error>;
}

interface advanced {
  use types.{video-error, kv, base-video, input-image, generation-config};

  // Supported in Kling and Veo
  extend-video: func(
    input: base-video,
    prompt: option<string>,
    duration: option<f32>,
  ) -> result<string, video-error>;

  // Supported in Runway
  upscale-video: func(
    input: base-video,
  ) -> result<string, video-error>;

  // Supported in Kling only
  video-effects: func(
    input: input-image,
    second-image: option<input-image>,
    effect: string,
  ) -> result<string, video-error>;

  // Multi-image generation, Kling only
  multi-image-generation: func(
    input: input-image,
    other-images: list<input-image>, // up to a max of 3 more
    config: generation-config,
  ) -> result<string, video-error>;
}

// I have left this as is; I would like a clarification on it
// I also don't get why there is no introspection
interface templates {
  use types.{video-error, kv};
  generate-from-template: func(
    template-id: string,
    variables: list<kv>
  ) -> string;
}

world video-generation {
  import types;
  import video-generation;
  import lip-sync;
  import advanced;
  import templates;

  export api: video-generation;
  export lip-sync;
  export template-videos: templates;
  export video-effects: advanced;
}
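The generate/poll/cancel split in the WIT above can be exercised from the caller's side against a simulated backend. This is a hedged sketch only: `FakeJobStore` and all names in it are hypothetical stand-ins for a provider API, not part of the actual bindings.

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq, Debug)]
enum JobStatus { Pending, Running, Succeeded, Failed(String) }

// Simulated provider backend: each job finishes after a fixed number of polls.
struct FakeJobStore {
    jobs: HashMap<String, (JobStatus, u32)>, // status + polls remaining until done
    next_id: u32,
}

impl FakeJobStore {
    fn new() -> Self { Self { jobs: HashMap::new(), next_id: 0 } }

    // Mirrors `generate: func(...) -> result<string, video-error>`:
    // the job id comes back immediately, invalid input is surfaced up front.
    fn generate(&mut self, prompt: &str) -> Result<String, String> {
        if prompt.is_empty() {
            return Err("invalid-input: empty prompt".to_string());
        }
        self.next_id += 1;
        let id = format!("job-{}", self.next_id);
        self.jobs.insert(id.clone(), (JobStatus::Pending, 2));
        Ok(id)
    }

    // Mirrors `poll: func(job-id: string) -> result<video-result, video-error>`.
    fn poll(&mut self, job_id: &str) -> Result<JobStatus, String> {
        match self.jobs.get_mut(job_id) {
            None => Err(format!("invalid-input: unknown job {job_id}")),
            Some((status, remaining)) => {
                if *remaining > 0 {
                    *remaining -= 1;
                    *status = JobStatus::Running;
                } else {
                    *status = JobStatus::Succeeded;
                }
                Ok(status.clone())
            }
        }
    }
}

fn main() {
    let mut store = FakeJobStore::new();
    assert!(store.generate("").is_err()); // errors flow back through result
    let id = store.generate("a cat surfing").unwrap();
    let mut status = store.poll(&id).unwrap();
    while status == JobStatus::Running || status == JobStatus::Pending {
        status = store.poll(&id).unwrap();
    }
    assert_eq!(status, JobStatus::Succeeded);
    println!("final status: {:?}", status);
}
```

Note how the `result<string, video-error>` return on `generate` lets the empty-prompt error surface immediately, rather than being stored internally and fished out later via a UUID, which is the motivation given earlier in the thread.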

@jdegoes
Contributor

jdegoes commented Jun 27, 2025

@Nanashi-lab

I did not spend much time on this WIT so I am glad you took a closer look.

I like your proposed revisions and would suggest a few more:

  • Delete more strings, e.g. voice-info.gender and video-effect(effect: string). You can use enum or variant to encode the information much more precisely, in a way that is not "stringly-typed".
  • Instead of having a job-id (a pattern I used earlier), use a resource for the job so the user doesn't have to pass stringly-typed information.
  • Delete templates; it seems useless to me, as the same thing can be done in user-land.

@Nanashi-lab Nanashi-lab changed the title Durable video generation (golem:video-generation) Durable video generation (golem:video) Jul 10, 2025
@Nanashi-lab Nanashi-lab marked this pull request as ready for review July 17, 2025 14:52
@Nanashi-lab
Contributor Author

Nanashi-lab commented Jul 17, 2025

Changes to the WIT since last time -

  1. Left job-id as a string rather than a resource; I wasn't sure what a resource would look like and whether it can be passed. Even in testing I was polling with the job-id as a string input, rather than regenerating the whole request.

  2. Extend video: both Veo and Kling support extension. Veo supports it as part of image-to-video by taking a video as input (it must be a Veo-generated video).
     Kling also supports video extension, but using a video-id, which is a global identifier for each video generated by Kling, so I have separated the two:
     Veo extend is part of the generate-video function, taking a video as input,
     while the extend function in advanced is used for Kling, which takes a video-id and outputs the extended video.

  3. Since we are using video-id, lip-sync, which supports both video input and video-id, now also accepts a video-id.

There were other minor changes: added advanced camera-config (always the plan), made provider options optional rather than passing an empty KV, and minor adjustments in the advanced features to fit the APIs better. Multi-image is now just list<image>, rather than an image plus list<image>.
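The Veo-vs-Kling extend split described in point 2 can be sketched as a small routing enum; the names below are illustrative, not the actual WIT bindings:

```rust
// Hypothetical sketch: the two extend paths take different inputs,
// so a sum type makes the routing decision explicit at the type level.
enum ExtendInput {
    // Veo: extend takes the generated video itself (must be Veo-made)
    VeoVideo { bytes: Vec<u8> },
    // Kling: extend takes the provider-global video-id of a prior generation
    KlingVideoId(String),
}

// Which WIT function (per the proposal above) would handle this input.
fn route(input: &ExtendInput) -> &'static str {
    match input {
        ExtendInput::VeoVideo { .. } => "generate (video as media-input)",
        ExtendInput::KlingVideoId(_) => "advanced.extend-video (video-id)",
    }
}

fn main() {
    let kling = ExtendInput::KlingVideoId("some-video-id".into());
    let veo = ExtendInput::VeoVideo { bytes: Vec::new() };
    println!("{}", route(&kling));
    println!("{}", route(&veo));
}
```

A caller holding only a Kling video-id simply cannot construct the Veo path, which captures the constraint the comment describes.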

@jdegoes Thanks again for assigning this to me, this was a lot of fun. If golem:image is a possible bounty, I would love to be assigned to it. I can do my own research, make the WIT, confirm, and then implement.

@mschuwalow
Contributor

@Nanashi-lab could you resolve the conflicts please

@Nanashi-lab
Contributor Author

Removed all the minor changes; will make a separate PR for those.

  • There are also some minor unrelated edits: changed build-all and release-build-all back to manually copying files into components, as the method in latest does not work.
  • Made subfolders in test for the llm, video, and video-advanced golem apps.
  • Fixed the Ollama integration test BUG: Ollama test suite in CI is a false positive. It succeeds even though the tests don't run to completion #58
  • Split the contents of the README: test parts go in test, video parts go to video, and llm parts go to llm.
  • Moved build-test-components from each individual folder to the main makefile.toml

Contributor

@mschuwalow mschuwalow left a comment


Some minor cleanup, looks good to me otherwise!

Ok(PollResponse::Processing)
} else if status.is_success() {
// 200 - Complete, get video data
let video_bytes = response
Contributor


Reading the entire video into a byte array in memory is not great, as videos can be very large and workers should avoid using too much memory.

A better option would have been to read and write in a streaming fashion, e.g. to blob storage, and then return a blob-storage handle to the caller. We didn't spec the WIT that way though, so it is fine to leave as is.
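The streaming suggestion above can be sketched with a fixed-size chunk buffer, so memory stays bounded regardless of video size. The blob-storage sink is simulated with a `Vec` here; the real sink would be a blob-storage writer, which this sketch does not attempt to model.

```rust
use std::io::{self, Cursor, Read, Write};

// Copy from `src` to `sink` in fixed-size chunks; only `buf` (8 KiB)
// is held in memory at any moment, never the whole payload.
fn stream_copy<R: Read, W: Write>(mut src: R, mut sink: W) -> io::Result<u64> {
    let mut buf = [0u8; 8192];
    let mut total = 0u64;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 { break; } // EOF
        sink.write_all(&buf[..n])?;
        total += n as u64;
    }
    sink.flush()?;
    Ok(total)
}

fn main() {
    // Simulate a 1 MiB "video" body and a blob-storage sink.
    let video = vec![0xABu8; 1 << 20];
    let mut blob: Vec<u8> = Vec::new();
    let copied = stream_copy(Cursor::new(&video), &mut blob).unwrap();
    assert_eq!(copied, video.len() as u64);
    assert_eq!(blob, video);
    println!("copied {} bytes", copied);
}
```

In a real worker, `src` would be the HTTP response body reader and `sink` the blob-storage write stream, so peak memory is the chunk size rather than the full video.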

Contributor Author

@Nanashi-lab Nanashi-lab Jul 25, 2025


  • Moved video and video-advanced to the test project
  • Replaced infallible with persist and replay
  • Refactored logging into a single function in durability (similar to noise's PR)
  • There is a bug in test/video-advanced: it now needs cargo-component 0.21.1 or it errors (not sure why)
  • I also have a README.md for the video parts, which is not part of this PR.

@mschuwalow mschuwalow merged commit 9ac145d into golemcloud:main Jul 25, 2025
5 checks passed
@jdegoes
Contributor

jdegoes commented Jul 25, 2025

@Nanashi-lab

Thanks again for assigning this to me, this was a lot of fun. If golem:image is a possible bounty, I would love to be assigned to this. I can do my own research, make wit and confirm and then implement.

Please write up a ticket, and if I like it, I'll attach a bounty and give you a 3 week exclusive!



Development

Successfully merging this pull request may close these issues.

Implement Durable Video Generation for Multiple Providers (golem:video-generation)

4 participants