OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

Official implementation of "Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging".

Checkpoints

MLLM checkpoints are available in our 🤗 Hugging Face collection. The weights can also be downloaded automatically when running the model merging scripts below.
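
If you prefer to fetch weights ahead of time, a minimal sketch using huggingface_hub is shown below; the repo id is a placeholder, so substitute a checkpoint from the collection:

    # Optional: pre-download a checkpoint from the Hugging Face Hub.
    # The repo id is a placeholder -- pick one from the collection above.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(repo_id="org/mllm-checkpoint")
    print(f"Checkpoint available at {local_dir}")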

QwenVL Merging

  1. Install LLaMA-Factory in editable mode along with the required dependencies:

    cd LLaMA-Factory
    pip install -e ".[torch,metrics]" --no-build-isolation
    pip install qwen_vl_utils torchvision
  2. Select and modify the merge_method as needed, then run the merging script (an illustrative merge sketch follows this list):

    python model_merging.py
  3. To evaluate QwenVL on RefCOCO, RefCOCO+, and RefCOCOg:

    • Prepare the evaluation environment:
      cd lmms-eval
      pip install -e .
      conda install openjdk=8
    • Download the RefCOCO, RefCOCO+, and RefCOCOg datasets from Hugging Face.
    • Run the evaluation:
      accelerate launch --num_processes=8 --main_process_port=12345 -m lmms_eval \
          --model qwen2_vl \
          --model_args=pretrained=merged_model_path,max_pixels=2359296 \
          --tasks refcoco_bbox_rec_val,refcoco+_bbox_rec_val,refcocog_bbox_rec_val \
          --batch_size 1 --log_samples --log_samples_suffix reproduce --output_path ./logs
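
As referenced in step 2, the sketch below illustrates the general shape of a merge: load two fine-tuned checkpoints of the same architecture, combine their parameters (here by simple uniform averaging), and save the result to the path used in the evaluation command. This is an illustration under assumed checkpoint paths and a transformers-compatible model class, not the repository's model_merging.py:

    # Illustrative sketch only: a simple weight-average merge of two
    # fine-tuned checkpoints sharing one architecture. Checkpoint paths
    # are placeholders; model_merging.py implements the actual methods.
    import torch
    from transformers import AutoModelForVision2Seq

    model_a = AutoModelForVision2Seq.from_pretrained("ckpt_task_a", torch_dtype=torch.float16)
    model_b = AutoModelForVision2Seq.from_pretrained("ckpt_task_b", torch_dtype=torch.float16)

    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    merged = {}
    for name, tensor_a in state_a.items():
        if name in state_b and state_b[name].shape == tensor_a.shape:
            merged[name] = (tensor_a + state_b[name]) / 2  # uniform average
        else:
            merged[name] = tensor_a  # keep A's parameter if shapes differ

    model_a.load_state_dict(merged)
    model_a.save_pretrained("merged_model_path")  # path used by lmms-eval above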

InternVL Merging

  1. Install dependencies:

    cd InternVL
    pip install -r requirements.txt
    pip install timm
  2. Run the merging script:

    cd internvl_chat
    python model_merging.py
  3. Prepare datasets for RefCOCO, RefCOCO+, and RefCOCOg (a layout sanity check follows this list):

    # Create data directory and download annotation files
    mkdir -p data/refcoco && cd data/refcoco
    wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_val.jsonl
    wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_val.jsonl
    wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcocog/refcocog_val.jsonl
    
    # Return to the repository root, then download and unzip the COCO images
    cd ../..
    mkdir -p data/coco && cd data/coco
    wget http://images.cocodataset.org/zips/train2014.zip && unzip train2014.zip
  4. Run evaluation:

    GPUS=8 bash evaluate.sh merged_model_path/ refcoco --dynamic
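
As mentioned in step 3, a quick sanity check of the resulting data layout, assuming the download locations above and that you run it from the same directory:

    # Verify the RefCOCO annotations and COCO images are where the
    # evaluation script expects them (paths follow the commands above).
    from pathlib import Path

    for name in ["refcoco_val.jsonl", "refcoco+_val.jsonl", "refcocog_val.jsonl"]:
        path = Path("data/refcoco") / name
        assert path.is_file(), f"missing annotation file: {path}"

    images = Path("data/coco/train2014")
    assert images.is_dir(), "COCO train2014 images not found"
    print("RefCOCO evaluation data looks complete.")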

Evaluating the Merged Model

  1. Install VLMEvalKit and configure evaluation:

    cd VLMEvalKit
    pip install -e .
    • All VLMs are configured in vlmeval/config.py; update the merged model path there.
    • Select the model and the evaluation datasets in eval.sh.
  2. Run evaluation:

    bash eval.sh
  3. Summarize all evaluation results:

    python results.py outputs/merge_model_name

Note: For reproducibility, use eager attention and load the model in float16.
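
With 🤗 Transformers, this typically means passing attn_implementation="eager" and torch_dtype=torch.float16 at load time. The loader class and path below are placeholders for your merged model; the exact class depends on the architecture:

    # Load the merged model for reproducible evaluation: eager attention
    # (no fused/flash kernels) and float16 weights, per the note above.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "merged_model_path",           # placeholder: your merged checkpoint
        torch_dtype=torch.float16,     # load weights in float16
        attn_implementation="eager",   # use the eager attention path
    )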


Modality Merging

  1. Install dependencies:

    cd ModelCompose
    pip install -r requirements.txt
  2. Download the required models and encoders.

  3. Merge models:

    python scripts/model_composition/merge_unimodal_modelcompose.py \
        checkpoints/multimodal-vicuna-7b-v1.5-video-naivemc \
        checkpoints/multimodal-vicuna-7b-v1.5-audio-naivemc \
        checkpoints/multimodal-vicuna-7b-v1.5-vision-naivemc \
        -o multimodal-checkpoint-name --strategy merge-ties
    • You can change the merging method with the --strategy argument; a sketch of TIES-style merging follows this list.
  4. Evaluate the merged three-modality model:

    • AVQA:
      bash scripts/model_composition/test/avqa.sh 0,1,2,3,4,5,6,7 multimodal-checkpoint-name video+image+audio checkpoints/vicuna-7b-v1.5
    • MUSIC-AVQA:
      bash scripts/model_composition/test/music_avqa_video+image+audio.sh 0,1,2,3,4,5,6,7 multimodal-checkpoint-name checkpoints/vicuna-7b-v1.5
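
As referenced above, merge-ties follows the TIES-Merging recipe: trim low-magnitude task-vector entries, elect a per-entry sign by majority, then average only the agreeing values. The sketch below illustrates this for a single parameter tensor; it is a simplified illustration with assumed density/scaling values, not ModelCompose's implementation:

    # Simplified TIES-style merge for one parameter tensor. Illustration
    # only; ModelCompose's merge-ties strategy may differ in detail.
    import torch

    def ties_merge(base, finetuned, density=0.2, lam=1.0):
        # Task vectors: difference between each fine-tuned model and the base.
        task_vectors = [ft - base for ft in finetuned]

        # 1) Trim: zero out all but the top-`density` fraction by magnitude.
        trimmed = []
        for tv in task_vectors:
            k = max(1, int(density * tv.numel()))
            thresh = tv.abs().flatten().kthvalue(tv.numel() - k + 1).values
            trimmed.append(torch.where(tv.abs() >= thresh, tv, torch.zeros_like(tv)))

        stacked = torch.stack(trimmed)
        # 2) Elect sign: majority sign of the summed trimmed vectors.
        sign = torch.sign(stacked.sum(dim=0))
        # 3) Disjoint mean: average only entries agreeing with the elected sign.
        agree = (torch.sign(stacked) == sign) & (stacked != 0)
        counts = agree.sum(dim=0).clamp(min=1)
        merged_tv = (stacked * agree).sum(dim=0) / counts
        return base + lam * merged_tv

Applied per parameter tensor across the video, audio, and vision checkpoints above, this kind of procedure produces the merged multimodal backbone; --strategy selects among such methods.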

Acknowledgement

This project builds on the following open-source projects: LLaMA-Factory, lmms-eval, VLMEvalKit, InternVL, and ModelCompose. Thanks to these communities for their contributions to model training and evaluation tools!
