LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance
Zhang Li, Biao Yang, Qiang Liu, Shuo Zhang, Zhiyin Ma, Liang Yin, Linger Deng, Yabo Sun, Yuliang Liu, Xiang Bai
While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) the Semantic-Enhanced Feature Extractor (SEFE), which improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; and (2) Interleaved Local Visual Coupling (ILVC), which autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision that mitigates hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the token. To quantify this relationship and the model's potential semantic inference ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks.
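For intuition only, the sketch below illustrates the kind of semantic-and-pixel feature fusion that SEFE performs. The class name, parameter names, and gating scheme are illustrative assumptions, not the actual LIRA implementation.

```python
# Illustrative sketch (not the LIRA implementation): gated fusion of a semantic
# feature stream with a pixel-level feature stream, in the spirit of SEFE.
import torch
import torch.nn as nn


class ToySemanticPixelFusion(nn.Module):
    def __init__(self, sem_dim: int, pix_dim: int, out_dim: int):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, out_dim)  # project semantic features
        self.pix_proj = nn.Linear(pix_dim, out_dim)  # project pixel-level features
        self.gate = nn.Linear(2 * out_dim, out_dim)  # learn a per-channel mixing weight

    def forward(self, sem_feat: torch.Tensor, pix_feat: torch.Tensor) -> torch.Tensor:
        # sem_feat: (B, N, sem_dim); pix_feat: (B, N, pix_dim); aligned on N tokens.
        s = self.sem_proj(sem_feat)
        p = self.pix_proj(pix_feat)
        g = torch.sigmoid(self.gate(torch.cat([s, p], dim=-1)))
        return g * s + (1 - g) * p  # fused features passed to downstream segmentation


if __name__ == "__main__":
    fuse = ToySemanticPixelFusion(sem_dim=1024, pix_dim=256, out_dim=512)
    fused = fuse(torch.randn(2, 196, 1024), torch.randn(2, 196, 256))
    print(fused.shape)  # torch.Size([2, 196, 512])
```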
Our installation process follows OMG-LLaVA, with adjustments to the versions of packages such as transformers to ensure compatibility with InternVL2.

```bash
conda create --name lira python=3.10 -y
conda activate lira

# install pytorch with cuda 11.8
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118

# install omg-seg requirements
python -m pip install https://github.com/open-mmlab/mmengine/archive/refs/tags/v0.8.5.zip

TORCH_CUDA_ARCH_LIST="8.0" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" CUDA_HOME=$(dirname $(dirname $(which nvcc))) LD_LIBRARY_PATH=$(dirname $(dirname $(which nvcc)))/lib MMCV_WITH_OPS=1 FORCE_CUDA=1 python -m pip install git+https://github.com/open-mmlab/mmcv.git@4f65f91db6502d990ce2ee5de0337441fb69dd10

python -m pip install \
    https://github.com/open-mmlab/mmdetection/archive/refs/tags/v3.1.0.zip \
    https://github.com/open-mmlab/mmsegmentation/archive/refs/tags/v1.1.1.zip \
    https://github.com/open-mmlab/mmpretrain/archive/refs/tags/v1.0.1.zip

# If the installation fails, you can try adding `--no-build-isolation`:
# python -m pip install \
#     https://github.com/open-mmlab/mmdetection/archive/refs/tags/v3.1.0.zip \
#     https://github.com/open-mmlab/mmsegmentation/archive/refs/tags/v1.1.1.zip \
#     https://github.com/open-mmlab/mmpretrain/archive/refs/tags/v1.0.1.zip \
#     --no-build-isolation

# install other requirements
pip install -e '.[all]'
```
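After installation, you can optionally run a quick sanity check like the one below (this snippet is not part of the repository; it only confirms that the CUDA build of PyTorch and the OpenMMLab packages import correctly):

```python
# Optional sanity check (not part of the repository): verify core dependencies.
import torch
import mmengine
import mmcv
import mmdet
import mmseg
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("mmengine:", mmengine.__version__)
print("mmcv:", mmcv.__version__)
print("mmdet:", mmdet.__version__)
print("mmseg:", mmseg.__version__)
print("transformers:", transformers.__version__)
```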
You can download the corresponding version of flash_attention from https://github.com/Dao-AILab/flash-attention/releases/ and install it with:

```bash
pip install flash_attn-2.3.6+cu118torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl --no-build-isolation
```
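If you want to confirm that the wheel matches your PyTorch/CUDA build, a minimal check (not part of the repository) is:

```python
# Optional check (not part of the repository): confirm flash-attn imports cleanly.
import flash_attn
import torch

print("flash-attn:", flash_attn.__version__)
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
```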
- Download data from data.

```bash
cat data.zip.part* > data.zip
unzip data.zip
```

- Download model
```bash
python download_model.py -n echo840/LIRA
```

- Download InternVL

```bash
python download_model.py -n OpenGVLab/InternVL2-2B  # OpenGVLab/InternVL2-8B
```

- Demo

```bash
python ./omg_llava/tools/app_lira.py ./omg_llava/configs/finetune/LIRA-2B.py ./model_weight/LIRA-2B.pth
```

- Pretrain
```bash
bash ./scripts/pretrain.sh
```

- After training, please use the tools to convert the DeepSpeed checkpoint to pth format:

```bash
python omg_llava/tools/convert_deepspeed2pth.py \
    ${PATH_TO_CONFIG} \
    ${PATH_TO_DeepSpeed_PTH} \
    --save-path ./pretrained/${PTH_NAME.pth}
```
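After conversion, you can optionally verify that the resulting pth file loads as a plain PyTorch checkpoint (an illustrative check, not part of the repository; the path is a placeholder):

```python
# Optional check (not part of the repository): make sure the converted checkpoint loads.
import torch

state = torch.load("./pretrained/your_checkpoint.pth", map_location="cpu")  # placeholder path
if isinstance(state, dict):
    print("top-level keys:", list(state.keys())[:5], "...")
else:
    print("loaded object of type:", type(state))
```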
- Finetune

```bash
bash ./scripts/finetune.sh
```

- Evaluation

```bash
bash ./scripts/eval_gcg.sh     # Evaluation on Grounded Conversation Generation Tasks.
bash ./scripts/eval_refseg.sh  # Evaluation on Referring Segmentation Tasks.
bash ./scripts/eval_vqa.sh     # Evaluation on Comprehension Tasks.
```

Our code is built upon OMG-LLaVA and InternVL2, and we sincerely thank them for providing the code and base models. We also thank OPERA for providing the evaluation code for CHAIR.
If you have any questions, please feel free to contact us at zhangli123@hust.edu.cn.
If you wish to refer to the results published here, please use the following BibTeX entry:
@inproceedings{li2025lirainferringsegmentationlarge,
title={LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance},
author={Zhang Li and Biao Yang and Qiang Liu and Shuo Zhang and Zhiyin Ma and Liang Yin and Linger Deng and Yabo Sun and Yuliang Liu and Xiang Bai},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2025}
}
