You can find MLLM checkpoints in the 🤗 Hugging Face collection. The weights can also be downloaded automatically when you run the model merging scripts below.
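If you want to fetch and inspect a checkpoint by hand instead of relying on the scripts, here is a minimal sketch, assuming a recent `transformers` with Qwen2-VL support; the repo id is a placeholder, so take the real ids from the Hugging Face collection above:

```python
# Minimal manual-download sketch. The repo id is a placeholder -- use the ids
# from the Hugging Face collection. from_pretrained() downloads and caches the
# weights from the Hub on first use.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "your-org/merged-qwen2-vl-7b"  # placeholder
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(model_id)
```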
- Install the development version and dependencies:

  ```bash
  cd LLaMA-Factory
  pip install -e ".[torch,metrics]" --no-build-isolation
  pip install qwen_vl_utils torchvision
  ```
- Select and modify the `merge_method` as needed, then run the merging script:

  ```bash
  python model_merging.py
  ```
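  For intuition, a simple "linear" merge just interpolates checkpoints parameter-by-parameter. The sketch below is only an illustration of that idea, not the implementation in `model_merging.py`; the available `merge_method` options are defined in the script itself.

  ```python
  # Conceptual sketch of a linear (weighted-average) merge of two checkpoints.
  # Illustrative only -- see model_merging.py for the methods actually supported.
  def linear_merge(state_dict_a, state_dict_b, alpha=0.5):
      """Interpolate two state dicts parameter-by-parameter."""
      merged = {}
      for name, tensor_a in state_dict_a.items():
          tensor_b = state_dict_b[name]
          merged[name] = alpha * tensor_a + (1.0 - alpha) * tensor_b
      return merged
  ```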
- To evaluate QwenVL on RefCOCO, RefCOCO+, and RefCOCOg:

  - Prepare the evaluation environment:

    ```bash
    cd lmms-eval
    pip install -e .
    conda install openjdk=8
    ```

  - Download the datasets from Huggingface (see the pre-fetch sketch after this list):

  - Run the evaluation:

    ```bash
    accelerate launch --num_processes=8 --main_process_port=12345 -m lmms_eval \
      --model qwen2_vl \
      --model_args=pretrained=merged_model_path,max_pixels=2359296 \
      --tasks refcoco_bbox_rec_val,refcoco+_bbox_rec_val,refcocog_bbox_rec_val \
      --batch_size 1 --log_samples --log_samples_suffix reproduce --output_path ./logs
    ```
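  The dataset step above carries no command here; lmms-eval normally fetches its task data from the Hub on first run. A minimal pre-fetch sketch, assuming the `lmms-lab` dataset repos (the repo ids are assumptions, so confirm them in the lmms-eval task configs):

  ```python
  # Optional pre-fetch of the evaluation datasets. The repo ids are assumptions --
  # check the lmms-eval task configs for the authoritative ones.
  from huggingface_hub import snapshot_download

  for repo_id in ("lmms-lab/RefCOCO", "lmms-lab/RefCOCOplus", "lmms-lab/RefCOCOg"):
      snapshot_download(repo_id, repo_type="dataset")
  ```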
- Install dependencies:

  ```bash
  cd InternVL
  pip install -r requirements.txt
  pip install timm
  ```
- Run the merging script:

  ```bash
  cd internvl_chat
  python model_merging.py
  ```
- Prepare datasets for RefCOCO, RefCOCO+, and RefCOCOg:

  ```bash
  # Create data directory and download annotation files
  mkdir -p data/refcoco && cd data/refcoco
  wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_val.jsonl
  wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_val.jsonl
  wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcocog/refcocog_val.jsonl
  cd ../..  # return to the repo root so data/coco is not nested under data/refcoco

  # Download and unzip COCO images
  mkdir -p data/coco && cd data/coco
  wget http://images.cocodataset.org/zips/train2014.zip && unzip train2014.zip
  ```
- Run evaluation:

  ```bash
  GPUS=8 bash evaluate.sh merged_model_path/ refcoco --dynamic
  ```
- Install VLMEvalKit and configure evaluation:

  ```bash
  cd VLMEvalKit
  pip install -e .
  ```

  - All VLMs are configured in `vlmeval/config.py`.
  - Update the model path in `vlmeval/config.py` and select the model and evaluation datasets in `eval.sh`.
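  For orientation, a model entry in `vlmeval/config.py` is roughly of this form; the names and keyword arguments here are illustrative, not the exact entries in the file:

  ```python
  # Illustrative shape of a VLMEvalKit model entry (names and kwargs are examples).
  # Models are registered in a dict that maps an evaluation name to a constructor
  # pre-bound to a local checkpoint path.
  from functools import partial
  from vlmeval.vlm import Qwen2VLChat  # class name is an assumption; check vlmeval/vlm

  supported_VLM = {
      "merged-qwen2-vl-7b": partial(Qwen2VLChat, model_path="/path/to/merged_model"),
  }
  ```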
- Run evaluation:

  ```bash
  bash eval.sh
  ```
- Summarize evaluation results:

  To quickly summarize all evaluation results, you can run:

  ```bash
  python results.py outputs/merge_model_name
  ```

  Note: For reproducibility, use eager attention and load the model in float16.
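  A minimal loading sketch matching that note, assuming a `transformers`-style checkpoint; `AutoModelForCausalLM` is a stand-in, so substitute the class that fits your merged model (e.g. the Qwen2-VL class for Qwen checkpoints):

  ```python
  # Load with eager attention and float16 for reproducible evaluation numbers.
  # AutoModelForCausalLM is a stand-in; use the class matching your merged VLM.
  import torch
  from transformers import AutoModelForCausalLM

  model = AutoModelForCausalLM.from_pretrained(
      "merged_model_path",              # path produced by the merging script
      torch_dtype=torch.float16,        # load in float16
      attn_implementation="eager",      # disable SDPA/FlashAttention kernels
  )
  ```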
- Install dependencies:

  ```bash
  cd ModelCompose
  pip install -r requirements.txt
  ```
- Download required models and encoders:

  - Pretrained LLM: vicuna-7b-v1.5
  - Finetuned LoRAs for different modalities: ModelCompose (put in `checkpoints/`)
  - Encoders for different modalities (see the download sketch after this list):
    - `modelcompose/model/multimodal_encoder/beats`: beats
    - `modelcompose/model/multimodal_encoder/clip-vit-large-patch14-336`: clip-vit-large-patch14-336
    - `modelcompose/model/multimodal_encoder/LanguageBind_Video_merge`: LanguageBind_Video_merge
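  A minimal download sketch using `huggingface_hub`; the Hub repo ids below are assumptions (and BEATs is distributed separately), so substitute the sources referenced by ModelCompose:

  ```python
  # Fetch encoders into the directories the code expects. Repo ids are assumptions.
  from huggingface_hub import snapshot_download

  encoders = {
      "openai/clip-vit-large-patch14-336": "modelcompose/model/multimodal_encoder/clip-vit-large-patch14-336",
      "LanguageBind/LanguageBind_Video_merge": "modelcompose/model/multimodal_encoder/LanguageBind_Video_merge",
  }
  for repo_id, local_dir in encoders.items():
      snapshot_download(repo_id, local_dir=local_dir)

  # The base LLM (e.g. lmsys/vicuna-7b-v1.5) can be fetched the same way.
  ```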
- Merge models:

  ```bash
  python scripts/model_composition/merge_unimodal_modelcompose.py \
    checkpoints/multimodal-vicuna-7b-v1.5-video-naivemc \
    checkpoints/multimodal-vicuna-7b-v1.5-audio-naivemc \
    checkpoints/multimodal-vicuna-7b-v1.5-vision-naivemc \
    -o multimodal-checkpoint-name --strategy merge-ties
  ```

  You can change the merging method with the `--strategy` argument.
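  For intuition, TIES-style merging trims small task-vector entries, elects a per-parameter sign, and averages only the deltas that agree with it. The sketch below is a conceptual illustration of that idea, not the repo's `merge-ties` implementation:

  ```python
  # Conceptual TIES-style merge of several fine-tuned checkpoints around a base.
  # Illustrative only -- see merge_unimodal_modelcompose.py for the real strategies.
  import torch

  def ties_merge(base_sd, finetuned_sds, density=0.2, lam=1.0):
      merged = {}
      for name, base in base_sd.items():
          # Task vectors: difference of each fine-tuned model from the base.
          deltas = [sd[name].float() - base.float() for sd in finetuned_sds]
          # Trim: keep only the top-`density` fraction of entries by magnitude.
          k = max(1, int(density * base.numel()))
          trimmed = []
          for delta in deltas:
              flat = delta.reshape(-1)
              kept = torch.zeros_like(flat)
              idx = flat.abs().topk(k).indices
              kept[idx] = flat[idx]
              trimmed.append(kept.reshape(delta.shape))
          deltas = torch.stack(trimmed)
          # Elect a per-parameter sign, then average only the agreeing entries.
          sign = torch.sign(deltas.sum(dim=0))
          agree = (torch.sign(deltas) == sign) & (deltas != 0)
          summed = (deltas * agree).sum(dim=0)
          count = agree.sum(dim=0).clamp(min=1)
          merged[name] = (base.float() + lam * summed / count).to(base.dtype)
      return merged
  ```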
- Evaluate the merged three-modality model:

  - AVQA:

    ```bash
    bash scripts/model_composition/test/avqa.sh 0,1,2,3,4,5,6,7 multimodal-checkpoint-name video+image+audio checkpoints/vicuna-7b-v1.5
    ```

  - MUSIC-AVQA:

    ```bash
    bash scripts/model_composition/test/music_avqa_video+image+audio.sh 0,1,2,3,4,5,6,7 multimodal-checkpoint-name checkpoints/vicuna-7b-v1.5
    ```
This project thanks the open-source communities behind the model training and evaluation tools used here, including LLaMA-Factory, lmms-eval, InternVL, VLMEvalKit, and ModelCompose, for their contributions!