The workflow has two steps: Step 1 extracts VLM embeddings offline; Step 2 trains on the extracted embeddings. Multi-dataset training with configurable sampling weights is supported.
Use `src/extract_vlm_embeds.py` to read image–text pairs from a JSONL file, extract multimodal embeddings with Qwen2-VL, and save them to disk for later training. Extracting embeddings offline reduces GPU memory usage during training and improves utilization.
Data format: Input is JSONL, one sample per line. Required fields (names can be overridden via arguments; below are the defaults):
- `source_image`: Path or URL of the source image(s) (can be a list for multiple source images)
- `target_image`: Path or URL of the target image
- Text / instruction fields (used to build the VLM dialogue and embeddings; treated as an empty string if missing):
  - `instruction`: Main instruction (English), describing how to go from the source to the target image
  - `instruction_cn`: Main instruction (Chinese)
  - `inverse_instruction`: (Optional) Inverse instruction (English), a text description inferred from the target image; only needed when not using `--disable_inverse`
  - `inverse_instruction_cn`: (Optional) Inverse instruction (Chinese)

Note: With `--disable_inverse`, only `instruction`/`instruction_cn` are required. With `--t2i_mode` (text-to-image), only the main instruction is needed, and inverse is implicitly disabled.
JSONL example:
{"source_image": "/data/img/001.png", "target_image": "/data/img/001_edit.png", "instruction": "Change the sky to sunset.", "instruction_cn": "把天空改成日落。", "inverse_instruction": "A photo of a landscape with a blue sky.", "inverse_instruction_cn": "一张蓝天下的风景照。"}
{"source_image": ["/data/ref1.png", "/data/ref2.png"], "target_image": "/data/out.png", "instruction": "Merge the two characters into one scene.", "instruction_cn": "把两个角色合成到一个场景里。"}
{"source_image": null, "target_image": "/data/generated.png", "instruction": "A cat sitting on a windowsill.", "instruction_cn": "一只猫坐在窗台上。"}In the above: the first line is standard image editing (with inverse instruction); the second uses multiple source images; the third is T2I (no source image; use with --t2i_mode --disable_inverse).
Single-node multi-GPU example (run from project root):
```bash
# Single node, 8 GPUs
export GPUS_PER_NODE=8
export NNODES=1
export NODE_RANK=0
export MASTER_ADDR=localhost
export MASTER_PORT=6003

torchrun --nproc_per_node=$GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK \
  --master_port $MASTER_PORT --master_addr $MASTER_ADDR \
  src/extract_vlm_embeds.py \
  /path/to/your/data.jsonl \
  --output_jsonl_dir /path/to/output_jsonl \
  --embeddings_save_dir /path/to/embeddings \
  --model_path /path/to/FireRed-Image-Edit-1.0 \
  --batch_size 4
```

Common arguments:

| Argument | Description | Default |
|---|---|---|
| `jsonl_path` | Input JSONL path | Required |
| `--output_jsonl_dir` | Output JSONL directory (one file per rank) | Required |
| `--embeddings_save_dir` | Directory to save embeddings | Required |
| `--model_path` | FireRed-Image-Edit-1.0 model path | `/dev/shm/FireRed-Image-Edit-1.0` |
| `--batch_size` | Batch size | 4 |
| `--image_sample_size` | Image sampling size | 512 |
| `--disable_inverse` | Disable the inverse prompt | - |
| `--t2i_mode` | T2I mode (text-to-image) | - |
For multi-node runs, set `WORLD_SIZE`, `RANK`, `MASTER_ADDR`, and `MASTER_PORT`, and launch with `torchrun` using `--nnodes`, `--node_rank`, etc. More examples are in `examples/extract_vlm_embeds.sh`.
The training script reads the directory layout produced in Step 1: each subdirectory is treated as one task (e.g. one dataset) containing its JSONL and embedding files. Use `--train_data_weights` to set a sampling weight per task for mixed training.
Directory convention:
- Under `train_data_meta_dir`, each top-level subdirectory = one task (e.g. `dataset_a`, `dataset_b`).
- Each task directory contains that task's JSONL and embeddings (i.e. Step 1's `--output_jsonl_dir` and `--embeddings_save_dir` organized per task under this directory).
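As a quick check that a meta directory follows this convention, the sketch below (a hypothetical helper, not part of this repo) lists each top-level subdirectory as a task and counts its JSONL and embedding files; the embedding extensions are assumptions and may differ in your setup:

```python
# Hypothetical helper (not part of the repo): list the tasks the trainer would
# see under --train_data_meta_dir and count each task's JSONL and embedding
# files. The embedding extensions (.pt / .npy) are assumptions; adjust them to
# match what Step 1 actually wrote.
from pathlib import Path


def list_tasks(train_data_meta_dir: str) -> None:
    root = Path(train_data_meta_dir)
    for task_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        jsonl_files = list(task_dir.rglob("*.jsonl"))
        embed_files = list(task_dir.rglob("*.pt")) + list(task_dir.rglob("*.npy"))
        print(f"{task_dir.name}: {len(jsonl_files)} jsonl file(s), "
              f"{len(embed_files)} embedding file(s)")


list_tasks("/path/to/your_meta_dir")
```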
Single-node multi-GPU training example:
```bash
export GPUS_PER_NODE=8
export NNODES=1
export NODE_RANK=0
export MASTER_ADDR=localhost
export MASTER_PORT=6003

accelerate launch --mixed_precision="bf16" --use_fsdp \
  --fsdp_auto_wrap_policy TRANSFORMER_BASED_WRAP \
  --fsdp_transformer_layer_cls_to_wrap=QwenImageTransformerBlock \
  --fsdp_state_dict_type=SHARDED_STATE_DICT \
  --num_processes $((GPUS_PER_NODE * NNODES)) \
  --num_machines $NNODES \
  --machine_rank $NODE_RANK \
  --main_process_ip $MASTER_ADDR \
  --main_process_port $MASTER_PORT \
  -m src.sft \
  --pretrained_model_name_or_path="/path/to/FireRed-Image-Edit-1.0" \
  --train_data_meta_dir="/path/to/your_meta_dir" \
  --train_data_weights="dataset_a=0.5,dataset_b=1.2,dataset_c=1.0" \
  --train_src_img_num_weights="0=1,1=1,2=1,3=0" \
  --train_batch_size=1 \
  --image_sample_size=512 \
  --gradient_accumulation_steps=1 \
  --num_train_epochs=1 \
  --max_train_steps=512 \
  --learning_rate=2e-05 \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=100 \
  --checkpointing_steps=100 \
  --output_dir="/path/to/ckpts" \
  --gradient_checkpointing \
  --mixed_precision="bf16" \
  --adam_weight_decay=3e-2 \
  --max_grad_norm=0.05 \
  --uniform_sampling \
  --trainable_modules "." \
  --vae_mini_batch=1
```

Mixed-training related arguments:

| Argument | Description |
|---|---|
| `--train_data_meta_dir` | Root directory for training meta; its top-level subdirectories are tasks (datasets) |
| `--train_data_weights` | Sampling weight per task, format: `task1=w1,task2=w2`; tasks not listed are excluded |
| `--train_src_img_num_weights` | Weight by number of source images, format: `0=w0,1=w1,2=w2,3=w3` (0/1/2/3 source images) |
Customize `--train_data_weights` with your own task names (the same as the subdirectory names under `train_data_meta_dir`) and the sampling weights you want; only the tasks listed there are included in training. For more runnable scripts, see `examples/`; for the full list of training arguments, see `src/arguments.py`.
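To make the weighting behavior concrete, here is a small conceptual sketch, an illustration of the sampling idea rather than the trainer's actual implementation:

```python
# Conceptual illustration of per-task sampling weights (not the trainer's actual
# implementation): tasks listed in --train_data_weights are drawn roughly in
# proportion to their weights; unlisted tasks are never drawn.
import random


def parse_weights(spec: str) -> dict:
    # "dataset_a=0.5,dataset_b=1.2,dataset_c=1.0" -> {"dataset_a": 0.5, ...}
    return {name: float(w) for name, w in (item.split("=") for item in spec.split(","))}


weights = parse_weights("dataset_a=0.5,dataset_b=1.2,dataset_c=1.0")
tasks, ws = zip(*weights.items())
draws = random.choices(tasks, weights=ws, k=10_000)
for task in tasks:
    print(task, round(draws.count(task) / len(draws), 3))  # ~ weight / sum(weights)
```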
For parameter-efficient fine-tuning, you can use PEFT LoRA instead of full-parameter training. Add `--use_peft_lora` and set `--lora_r` (e.g. 8, 16, 32, 64), `--lora_alpha` (often the same as `lora_r`, or 2×), and optionally `--lora_dropout` and `--lora_target_modules`. With LoRA, only the adapter weights are trained; checkpoints save the adapter only (when using FSDP, the full model is not merged). To resume or load a pretrained adapter, use `--lora_path`. Example: `examples/train_lora.sh`.
| Argument | Description |
|---|---|
| `--use_peft_lora` | Enable PEFT LoRA fine-tuning |
| `--lora_r` | LoRA rank (e.g. 8, 16, 32, 64) |
| `--lora_alpha` | LoRA alpha (often the same as `lora_r`) |
| `--lora_dropout` | LoRA dropout |
| `--lora_target_modules` | Comma-separated module names to apply LoRA to |
| `--lora_path` | Path to a pretrained LoRA adapter (for resume or inference) |
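For reference, these flags correspond to standard PEFT `LoraConfig` fields; the sketch below shows the mapping, with target module names that are illustrative assumptions:

```python
# Sketch of how the flags above map onto a standard peft.LoraConfig. The target
# module names below are illustrative assumptions and depend on the transformer
# being fine-tuned (see --lora_target_modules).
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # --lora_r
    lora_alpha=16,             # --lora_alpha (often equal to r, or 2x r)
    lora_dropout=0.05,         # --lora_dropout
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # assumed module names
)
```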
- Python 3.12
- PyTorch, Transformers, Accelerate, Diffusers, etc. (see the project requirements or your environment setup)