This is the official repository of the multi-modal learning via multi-objective optimization (MIMO) algorithm, designed for efficient mitigation of modality imbalance (see the paper "Mitigating Modality Imbalance in Multi-modal Learning via Multi-objective Optimization").
Multi-modal learning (MML) aims to integrate information from multiple modalities, which is expected to lead to superior performance over single-modality learning. However, recent studies have shown that MML can underperform, even compared to single-modality approaches, due to imbalanced learning across modalities. Methods have been proposed to alleviate this imbalance issue using different heuristics, which often lead to computationally intensive subroutines. In this paper, we reformulate the MML problem as a multi-objective optimization (MOO) problem that overcomes the imbalanced learning issue among modalities and propose a gradient-based algorithm to solve the modified MML problem. The resulting algorithm shows improved performance on popular MML benchmarks compared to existing baselines, while demonstrating up to ∼20× reduction in subroutine computation time.
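To make the multi-objective view concrete, below is a minimal, self-contained sketch (assuming PyTorch) of a generic MGDA-style min-norm combination of two per-modality gradients. It only illustrates the idea of treating each modality's loss as a separate objective; it is not the exact MIMO update from the paper, and the function combined_gradient and the toy tensors are introduced here purely for illustration.

# Hypothetical illustration (not the MIMO update from the paper): a min-norm
# convex combination of two per-modality gradient vectors, as used in generic
# MGDA-style multi-objective optimization.
import torch

def combined_gradient(g_audio: torch.Tensor, g_visual: torch.Tensor) -> torch.Tensor:
    """Return the min-norm convex combination gamma*g_audio + (1-gamma)*g_visual."""
    diff = g_audio - g_visual
    denom = diff.dot(diff).clamp_min(1e-12)
    # Closed-form minimizer of ||gamma*g_a + (1-gamma)*g_v||^2 over gamma in [0, 1]
    gamma = ((g_visual - g_audio).dot(g_visual) / denom).clamp(0.0, 1.0)
    return gamma * g_audio + (1.0 - gamma) * g_visual

# Toy usage: random vectors stand in for the flattened per-modality gradients
# of the shared parameters obtained from two separate backward passes.
g_a, g_v = torch.randn(10), torch.randn(10)
shared_update_direction = combined_gradient(g_a, g_v)

In MGDA-style methods, the negative of this min-norm combination is a common descent direction for both objectives whenever the current point is not Pareto-stationary, which is what allows both modalities to keep improving.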
Below we give some representative results; for complete results, please see the paper. The left figure compares the training and testing performance of MIMO with vanilla MML (joint training with sum fusion) on the CREMA-D dataset. The middle and right figures compare the loss landscapes of vanilla MML and MIMO after 1500 iterations on CREMA-D. The black contours denote the multi-modal training loss, the yellow dashed contours denote the multi-modal testing loss, and the red star marks the convergence point of each method. The heatmap color encodes the difference between the uni-modal training accuracies at each point of the loss landscape: blue means the audio modality dominates, green means the visual modality dominates, and higher color intensity indicates a larger accuracy gap. As illustrated by the training curves and loss landscapes, MIMO achieves lower multi-modal test loss (i.e., better generalization) by balancing the learning of the two modalities.
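For readers curious how such contour plots are typically produced, here is a hypothetical sketch (assuming PyTorch and matplotlib) that evaluates a toy model's loss over a grid spanned by two random directions around a fixed parameter vector. It does not reproduce the paper's figures (in particular, the uni-modal accuracy heatmap is omitted), and the data, model, and direction choices are placeholders.

# Hypothetical sketch of a 2-D loss-landscape contour plot: evaluate the loss
# over a grid spanned by two random directions around a (toy) trained model.
# Illustrative only; this is not the plotting code used for the paper's figures.
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

torch.manual_seed(0)
X = torch.randn(256, 8)                # dummy features standing in for fused modality features
y = torch.randint(0, 2, (256,))        # dummy binary labels
w = torch.randn(8, 2)                  # weights of a toy linear classifier ("convergent point")
d1, d2 = torch.randn_like(w), torch.randn_like(w)   # two random plotting directions

def loss_at(a, b):
    """Cross-entropy loss of the toy classifier displaced by a*d1 + b*d2."""
    return F.cross_entropy(X @ (w + a * d1 + b * d2), y).item()

alphas = torch.linspace(-1.0, 1.0, 25)
grid = [[loss_at(a.item(), b.item()) for a in alphas] for b in alphas]
plt.contour(alphas.numpy(), alphas.numpy(), grid, levels=20, colors="black")
plt.xlabel("direction 1")
plt.ylabel("direction 2")
plt.savefig("toy_loss_landscape.png")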
Please follow the instructions in the README file of each codebase environment in the src folder to set up the required datasets and environments for running the experiments. We use src/agm_base for experiments on the CREMA-D, UR-Funny, AV-MNIST, CMU-MOSEI, and AVE datasets, and src/ogm_ge_base for experiments on the VGGSound and Kinetics-Sound datasets.
We demonstrate training with MIMO using the CREMA-D dataset. An example experiment parameter configuration is given below:
dir=cremad_logs
data_root=data/cremad
dataset=CREMAD
epochs=100
seed=1000
cuda=0
methods=MTL-MIMO
device=cuda:$cuda
lr="1e-3"
lambd_mimo=100.0
mu_mimo=0.01
modulation_ends=$epochs
fusion_type=late_fusion
modality=Multimodal

Then, run the following commands to launch the experiment and log the results:
cd src/agm_base
mkdir -p $dir
logname=$dir/$dataset-$methods-lr-$lr-epochs-$epochs-$fusion_type-$modality-$seed.out
echo "python -u main.py --data_root $data_root --dataset $dataset --device $device --methods $methods --lambd_mimo $lambd_mimo --mu_mimo $mu_mimo --modality $modality --fusion_type $fusion_type --random_seed $seed --expt_dir checkpoint --expt_name test --batch_size 64 --EPOCHS $epochs --modulation_ends $modulation_ends --learning_rate $lr --lr_decay_ratio 0.9 > $logname 2>&1 "
python -u main.py --data_root $data_root --dataset $dataset --device $device --methods $methods --lambd_mimo $lambd_mimo --mu_mimo $mu_mimo --modality $modality --fusion_type $fusion_type --random_seed $seed --expt_dir checkpoint --expt_name test --batch_size 64 --EPOCHS $epochs --modulation_ends $modulation_ends --learning_rate $lr --lr_decay_ratio 0.9 > $logname 2>&1
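To repeat the run over several random seeds, a small launcher such as the hypothetical Python sketch below can be used from inside src/agm_base. It only reuses the flags shown above; the seed list and hard-coded values are illustrative choices, not recommendations from the paper.

# Hypothetical convenience launcher (not part of the released codebase): repeat
# the CREMA-D command above for several random seeds, writing one log per run.
# Run from inside src/agm_base; seeds and file names are illustrative only.
import os
import subprocess

os.makedirs("cremad_logs", exist_ok=True)
for seed in [1000, 2000, 3000]:  # arbitrary example seeds
    logname = f"cremad_logs/CREMAD-MTL-MIMO-lr-1e-3-epochs-100-late_fusion-Multimodal-{seed}.out"
    cmd = [
        "python", "-u", "main.py",
        "--data_root", "data/cremad", "--dataset", "CREMAD",
        "--device", "cuda:0", "--methods", "MTL-MIMO",
        "--lambd_mimo", "100.0", "--mu_mimo", "0.01",
        "--modality", "Multimodal", "--fusion_type", "late_fusion",
        "--random_seed", str(seed), "--expt_dir", "checkpoint",
        "--expt_name", "test", "--batch_size", "64", "--EPOCHS", "100",
        "--modulation_ends", "100", "--learning_rate", "1e-3",
        "--lr_decay_ratio", "0.9",
    ]
    with open(logname, "w") as log:  # mirrors the "> $logname 2>&1" redirection above
        subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)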
We would like to thank the authors of the OGM-GE_CVPR2022 and AGM codebases, upon which this codebase is built!