Durable video generation (golem:video) #51
@jdegoes Hi, I need help with some clarifications and WIT changes. I have also proposed a WIT with changes in the next comment. Current state of the PR (completed parts):
WIT changes - Config
This enum does not align with any of the APIs. This config is from Runway's text-to-image; since I am doing text-to-image as part of text-to-video, I can fit it in, but it feels out of place and better in
Minor changes -
Avatar
This matches Kling's lip-sync. Maybe they supported avatars in the past, but now Kling can do lip-sync on any input video (polling returns a failed(face-detection) error if no face is found).
voice-id matches how Kling supports audio: in the speak function it is a choice of either [voice-id, text, speed] or [input audio file], with no background audio for either. Effects
Both style-guide and background removal are for image-to-image (supported by Runway and Stability). Suggested replacements -
Others
Template - I did not understand this at all; I could not find any API references. Am I meant to pre-create a template with an already existing prompt/image so it can be used as a test? I am fairly confident in my proposed changes, as I am familiar with the APIs now that I have implemented text-to-video and image-to-video. Official documentation.
This is my proposed WIT; it mirrors the features available from the providers while remaining consistent with the original WIT. It does not include Kling's advanced camera and mask options.
package golem:video-generation;
interface types {
variant video-error {
invalid-input(string),
unsupported-feature(string),
quota-exceeded,
generation-failed(string),
cancelled,
internal-error(string),
}
variant media-input {
text(string),
image(reference),
}
// Added prompt
record reference {
data: input-image,
prompt: option<string>,
role: option<image-role>,
}
// Changed to first and last
enum image-role {
first,
last,
}
record input-image {
data: media-data,
}
record base-video {
data: media-data,
}
record narration {
data: media-data,
}
variant media-data {
url(string),
bytes(list<u8>),
}
record generation-config {
negative-prompt: option<string>,
seed: option<u64>,
scheduler: option<string>,
guidance-scale: option<f32>,
aspect-ratio: option<aspect-ratio>,
duration-seconds: option<f32>,
resolution: option<resolution>,
enable-audio: option<bool>,
enhance-prompt: option<bool>,
provider-options: list<kv>,
/// Added model and lastframe (Kling only)
model: option<string>,
lastframe: option<input-image>,
}
enum aspect-ratio {
square,
portrait,
landscape,
cinema,
}
enum resolution {
sd,
hd,
fhd,
uhd,
}
record kv {
key: string,
value: string,
}
record video {
uri: option<string>,
base64-bytes: option<list<u8>>,
mime-type: string,
width: option<u32>,
height: option<u32>,
fps: option<f32>,
duration-seconds: option<f32>,
}
variant job-status {
pending,
running,
succeeded,
failed(string),
}
record video-result {
status: job-status,
videos: option<list<video>>,
metadata: option<list<kv>>,
}
}
interface video-generation {
use types.{media-input, generation-config, video-result, video-error};
// Changed output from string to result<string, video-error>;
// easier to surface invalid-input and generation errors.
// Applies to all generate funcs.
generate: func(input: media-input, config: generation-config) -> result<string, video-error>;
poll: func(job-id: string) -> result<video-result, video-error>;
cancel: func(job-id: string) -> result<string, video-error>;
}
interface lip-sync {
use types.{video-error, media-data};
// The two possible audio sources: text with a voice-id, or an input audio file.
// WIT variant cases carry a single payload type, so the text fields are grouped in a record.
record text-audio {
text: string,
voice-id: option<string>,
speed: u32,
}
variant audio-source {
from-text(text-audio),
from-audio(media-data),
}
generate: func(
input: media-data, // the base video
audio: audio-source,
) -> result<string, video-error>;
record voice-info {
voice-id: string,
name: string,
language: string,
gender: option<string>,
preview-url: option<string>,
}
list-voices: func(language: option<string>) -> result<list<voice-info>, video-error>;
}
interface advanced {
use types.{video-error, base-video, input-image, generation-config};
// Supported in Kling and veo
extend-video: func(
input: base-video,
prompt: option<string>,
duration: option<f32>,
) -> result<string, video-error>;
// Supported in runway
upscale-video: func(
input: base-video,
) -> result<string, video-error>;
// Supported in kling only
video-effects: func(
input: input-image,
second-image: option<input-image>,
effect: string,
) -> result<string, video-error>;
// Multi image generation, kling Only
multi-image-generation: func(
input: input-image,
other-images: list<input-image>, // up to 3 more
config: generation-config,
) -> result<string, video-error>;
}
// I have left this as-is; I would like clarification on it.
// I also don't get why there is no introspection.
interface templates {
use types.{video-error, kv};
generate-from-template: func(
template-id: string,
variables: list<kv>
) -> string;
}
world video-generation-world {
export video-generation;
export lip-sync;
export advanced;
export templates;
}
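To make the generate/poll/cancel contract above concrete, here is a minimal sketch of how a caller might drive a job to completion. This is not the PR's code: `MockClient`, its fixed `"job-1"` id, and the scripted poll queue are all hypothetical stand-ins for real provider bindings.

```rust
use std::collections::HashMap;

// Mirrors the job-status variant from the WIT proposal.
#[derive(Clone, Debug, PartialEq)]
enum JobStatus {
    Pending,
    Running,
    Succeeded,
    Failed(String),
}

// Hypothetical stand-in for a provider client: each job id maps to a
// queue of statuses returned by successive polls.
struct MockClient {
    polls: HashMap<String, Vec<JobStatus>>,
}

impl MockClient {
    // Providers return a job id immediately; generation happens asynchronously.
    fn generate(&self, _prompt: &str) -> Result<String, String> {
        Ok("job-1".to_string())
    }

    // Pop the next scripted status; repeat the last one once the queue drains.
    fn poll(&mut self, job_id: &str) -> Result<JobStatus, String> {
        let q = self
            .polls
            .get_mut(job_id)
            .ok_or_else(|| "unknown job".to_string())?;
        if q.len() > 1 { Ok(q.remove(0)) } else { Ok(q[0].clone()) }
    }
}

// Submit a job, then poll until it succeeds, fails, or the poll budget runs out.
fn run_to_completion(client: &mut MockClient, prompt: &str, max_polls: u32) -> Result<(), String> {
    let job_id = client.generate(prompt)?;
    for _ in 0..max_polls {
        match client.poll(&job_id)? {
            JobStatus::Succeeded => return Ok(()),
            JobStatus::Failed(e) => return Err(e),
            JobStatus::Pending | JobStatus::Running => continue, // a real caller would sleep here
        }
    }
    Err("timed out".to_string())
}
```

A real implementation would sleep between polls (the Kling tests below report 3-7 minutes per generation) and map provider errors onto the video-error variant.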
I did not spend much time on this WIT, so I am glad you took a closer look. I like your proposed revisions and would suggest a few more:
Changes to the wit since last time -
There were other minor changes: added advanced camera-config (always the plan), made provider options optional rather than passing an empty KV list, and minor adjustments in the advanced features to fit the APIs better. Multi-image is just
@jdegoes Thanks again for assigning this to me; this was a lot of fun. If golem:image is a possible bounty, I would love to be assigned to it. I can do my own research, write the WIT, confirm it, and then implement.
@Nanashi-lab could you resolve the conflicts please?
Removed all the minor changes; will make a separate PR for that.
mschuwalow left a comment:
Some minor cleanup, looks good to me otherwise!
    Ok(PollResponse::Processing)
} else if status.is_success() {
    // 200 - Complete, get video data
    let video_bytes = response
Reading the entire video into a byte array in memory is not great, as videos can be very large and workers should avoid using too much memory.
A better option would have been to read and write in a streaming fashion, e.g. to the blob storage, and then return a blob-storage handle to the caller. We didn't spec the WIT that way though, so it's fine to leave as is.
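The reviewer's suggestion can be sketched in plain Rust: copy the response body in fixed-size chunks from any reader to any writer (e.g. a blob-storage stream), so peak memory stays at the buffer size regardless of video size. The function name and chunk size here are illustrative, not from the PR.

```rust
use std::io::{Read, Write};

// Copy `reader` to `writer` in 64 KiB chunks; returns total bytes copied.
// Memory use is bounded by the buffer, not by the size of the video.
fn stream_copy<R: Read, W: Write>(reader: &mut R, writer: &mut W) -> std::io::Result<u64> {
    let mut buf = [0u8; 64 * 1024];
    let mut total = 0u64;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // end of stream
        }
        writer.write_all(&buf[..n])?;
        total += n as u64;
    }
    Ok(total)
}
```

In practice `reader` would be the HTTP response body and `writer` a handle into blob storage; the caller would then receive the storage key rather than the bytes.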
- Moved video and video-advanced to test project
- Replaced infallible with persist and replay
- Refactored Logging into a single function in durability (similar to noise's PR)
- There is a bug in test/video-advanced: it now needs cargo-component 0.21.1 or it errors (not sure why)
- I also have a README.md for the video parts, which is not part of this PR.
Please write up a ticket, and if I like it, I'll attach a bounty and give you a 3-week exclusive!
/closes #44
/claim #44
There are 5 tests covering all 4 providers:
Runway Test Video
Stability Test Video
Kling Test Video
Veo Test Video
Advanced tests for Kling only: there are 11 tests here, covering first and last image, advanced camera config, masking features, lip-sync, extend video, and multi-image to video.
Kling Advanced Test Video
There is no audio in the video, but since 4 of the tests (single effect, lip-sync1, lip-sync2, and lip-sync3) have audio, I have appended the actual output files, with audio, at the end.
Note: All Kling tests have been cut in the middle, as each generation/polling cycle takes 3-7 minutes. Some of the advanced tests do multiple generations one after another, e.g. text2video -> extend -> lip-sync.
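The chained generations mentioned above (each stage consuming the previous stage's output job) can be sketched as a simple pipeline fold. This is illustrative only: `run_stage` is a hypothetical placeholder that would, in the real tests, submit a job and poll it to completion.

```rust
// Placeholder for one generation stage: submit, poll to completion, and
// return the output reference. Here we just tag the input with the stage name.
fn run_stage(stage: &str, input: &str) -> Result<String, String> {
    Ok(format!("{input}->{stage}"))
}

// Run stages in sequence, feeding each stage's output into the next,
// stopping at the first failure.
fn run_pipeline(prompt: &str, stages: &[&str]) -> Result<String, String> {
    stages
        .iter()
        .try_fold(prompt.to_string(), |out, stage| run_stage(stage, &out))
}
```

This mirrors the text2video -> extend -> lip-sync sequence: a failure at any stage short-circuits the rest of the chain.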
Official documentation:
Kling
Veo
Runway
Stability