Conversation
@sayakpaul Could you give this a review? Note that I've left some todos for myself in future refactors, and we should prioritize getting the trainers out there. I will need some more time to complete the longer finetuning run I was trying. I accidentally set:

```
dataloader_cmd="--dataloader_num_workers 0"
```
```diff
 # Diffusion arguments
-diffusion_cmd="--flow_resolution_shifting"
+diffusion_cmd=""
```

Removing this because I've yet to test which option is better, since we don't know exactly how Hunyuan was trained.
sayakpaul left a comment:
Thanks for getting this in quickly!
```diff
 --caption_column $CAPTION_COLUMN \
 --id_token BW_STYLE \
---video_resolution_buckets 17x512x768 49x512x768 61x512x768 129x512x768 \
+--video_resolution_buckets 49x512x768 \
```

This was incorrect when I merged LTX: I copied the bucket values from my multiresolution run, but the validation prompts and other settings from the single-resolution run.
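As an aside, each bucket string is `framesxheightxwidth`. A small parsing helper shows the format; this is just a sketch, and the `parse_bucket` name is hypothetical, not part of the trainer:

```python
def parse_bucket(spec: str) -> tuple[int, int, int]:
    """Parse a 'framesxheightxwidth' bucket string like '49x512x768'."""
    frames, height, width = (int(part) for part in spec.split("x"))
    return frames, height, width

# The multiresolution run used several buckets; the fixed script keeps one.
buckets = [parse_bucket(s) for s in ["17x512x768", "49x512x768", "129x512x768"]]
```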
```
--seed 42 \
--mixed_precision bf16 \
--batch_size 1 \
--train_steps 2000 \
```

So, train shorter with a smaller LR? 👁️

A higher learning rate seems to make the model worse somehow when doing stylistic training :/ I've yet to find the optimal training configuration for LTXV, but ~1000-1500 steps seems to be okay.
```diff
 ).to("cuda")
 pipe.load_lora_weights("my-awesome-name/my-awesome-lora", adapter_name="ltxv-lora")
-pipe.set_adapters(["ltxv-lora"], [1.0])
+pipe.set_adapters(["ltxv-lora"], [0.75])
```

Since I haven't found the optimal training settings for the LoRA yet, using it at full strength (1.0) leads to slightly worse-quality outputs. 0.75 seems to strike a nice balance, but ideally the scale should be explored by the person who trained the LoRA.
```
--max_grad_norm 1.0"

# Validation arguments
validation_cmd="--validation_prompts \"afkx A baker carefully cuts a green bell pepper cake on a white plate against a bright yellow background, followed by a strawberry cake with a similar slice of cake being cut before the interior of the bell pepper cake is revealed with the surrounding cake-to-object sequence.@@@49x512x768:::afkx A cake shaped like a Nutella container is carefully sliced, revealing a light interior, amidst a Nutella-themed setup, showcasing deliberate cutting and preserved details for an appetizing dessert presentation on a white base with accompanying jello and cutlery, highlighting culinary skills and creative cake designs.@@@49x512x768:::afkx A cake shaped like a Nutella container is carefully sliced, revealing a light interior, amidst a Nutella-themed setup, showcasing deliberate cutting and preserved details for an appetizing dessert presentation on a white base with accompanying jello and cutlery, highlighting culinary skills and creative cake designs.@@@61x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@61x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@97x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@129x512x768:::A person with gloved hands carefully cuts a cake shaped like a Skittles bottle, beginning with a precise incision at the lid, followed by careful sequential cuts around the neck, eventually detaching the lid from the body, revealing the chocolate interior of the cake while showcasing the layered design's detail.@@@61x512x768:::afkx A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage@@@61x512x768\" \
```

🧠 @@@129x512x768. I kid you not, I thought it was something else completely.
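For anyone else squinting at the separators: each validation entry is `prompt@@@FxHxW`, and entries are joined with `:::`. A hedged sketch of how such a string could be split (the helper name is made up for illustration):

```python
def split_validation_prompts(blob: str) -> list[tuple[str, str]]:
    """Split ':::'-separated entries of the form 'prompt@@@FxHxW'."""
    pairs = []
    for entry in blob.split(":::"):
        # partition keeps the prompt intact even if it contains no '@@@'
        prompt, _, bucket = entry.partition("@@@")
        pairs.append((prompt, bucket))
    return pairs

pairs = split_validation_prompts(
    "afkx A baker cuts a cake@@@49x512x768:::afkx A Nutella cake@@@61x512x768"
)
```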
```bash
cmd="accelerate launch --config_file accelerate_configs/uncompiled_8.yaml --gpu_ids $GPU_IDS train.py \
  $model_cmd \
  $dataset_cmd \
  $dataloader_cmd \
  $diffusion_cmd \
  $training_cmd \
  $optimizer_cmd \
  $validation_cmd \
  $miscellaneous_cmd"
```

Wow, very neat way to segregate the commands!
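The same segregation pattern can be mirrored in Python if the launcher ever moves off bash; a rough sketch using `shlex` (the fragment names mirror the script's variables, but the values here are invented for illustration):

```python
import shlex

# Hypothetical fragment values; in the real script these are bash variables.
fragments = {
    "model_cmd": "--model_name ltx_video",
    "dataloader_cmd": "--dataloader_num_workers 0",
    "training_cmd": "--train_steps 2000 --batch_size 1",
}

cmd = ["accelerate", "launch", "train.py"]
for fragment in fragments.values():
    # shlex.split handles quoting correctly, unlike str.split
    cmd += shlex.split(fragment)
```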
```diff
 def _add_model_arguments(parser: argparse.ArgumentParser) -> None:
-    parser.add_argument("--model_name", type=str, required=True, choices=["ltx_video"], help="Name of model to train.")
+    parser.add_argument(
+        "--model_name", type=str, required=True, choices=["hunyuan_video", "ltx_video"], help="Name of model to train."
+    )
```

We could determine the choices automatically from the config map we have right now. TODO
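A minimal sketch of that TODO, assuming the config map is a plain dict keyed by model name (`MODEL_CONFIGS` is a hypothetical stand-in for the actual registry):

```python
import argparse

# Hypothetical stand-in for the trainer's model config registry.
MODEL_CONFIGS = {"hunyuan_video": {}, "ltx_video": {}}

def _add_model_arguments(parser: argparse.ArgumentParser) -> None:
    parser.add_argument(
        "--model_name",
        type=str,
        required=True,
        choices=sorted(MODEL_CONFIGS),  # derived from the registry, no hardcoded list
        help="Name of model to train.",
    )

parser = argparse.ArgumentParser()
_add_model_arguments(parser)
args = parser.parse_args(["--model_name", "ltx_video"])
```

New models then only need a registry entry, and the CLI choices stay in sync automatically.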
```python
    return pipe


def prepare_conditions(
```

Should this be decorated with torch.no_grad()?
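For context, `torch.no_grad()` does work as a decorator, so decorating the function would run its whole body with autograd disabled; a minimal sketch (the `embed` function is illustrative, not the actual `prepare_conditions`):

```python
import torch

@torch.no_grad()  # everything inside runs with autograd disabled
def embed(x: torch.Tensor) -> torch.Tensor:
    return x * 2.0

out = embed(torch.ones(3, requires_grad=True))
# no graph is retained for these activations, which saves memory
```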
```python
    if isinstance(prompt, str):
        prompt = [prompt]

    conditions = {}
    conditions.update(
        _get_llama_prompt_embeds(tokenizer, text_encoder, prompt, prompt_template, device, dtype, max_sequence_length)
    )
    conditions.update(_get_clip_prompt_embeds(tokenizer_2, text_encoder_2, prompt, device, dtype))

    guidance = torch.tensor([guidance], device=device, dtype=dtype) * 1000.0
    conditions["guidance"] = guidance

    return conditions
```

Wonder if it's possible to leverage the encode_prompt() from the pipeline itself. TODO

I think it's better to keep custom implementations here per model, because it's easier to understand and debug without jumping to the diffusers codebase. Also, our pipelines sometimes contain additional processing and checks. Let's revisit this idea later.
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
All yours @sayakpaul for the initial designing and refactors 🪄 I'm still trying to figure out how best to implement precomputation, because the current approach just loads all the models, which is not really ideal. I will have a refactor out in a few hours.
Script:
Slurm: