
NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding

Wei Xu1*, Cheng Wang1*, Dingkang Liang1, Zongchuang Zhao1, Xingyu Jiang1, Peng Zhang2, Xiang Bai1

1 Huazhong University of Science & Technology, 2 National University of Defense Technology

(*) Equal contribution.

arXiv Github Project Code License

News

  • [2025/10/30] Release the official version of NAUTILUS.
  • [2025/09/18] NAUTILUS is accepted to NeurIPS 2025! 🥳🥳🥳


Introduction

Contributions

This work makes three contributions:

  1. We construct NautData, a large-scale underwater instruction-following dataset containing 1.45M image-text pairs, enabling the development and evaluation of underwater LMMs.
  2. We build NAUTILUS, the first eight-task underwater LMM, which performs underwater scene understanding at the image, region, and object levels and aggregates hierarchical scene information for comprehensive understanding.
  3. We design a plug-and-play Vision Feature Enhancement (VFE) module motivated by a physical underwater imaging model; it explicitly restores degraded information in the feature space. Experiments on well-known baselines demonstrate its effectiveness on all annotated tasks.

Pipeline

NAUTILUS comprises an image encoder, a depth encoder, a vision-to-language projector, a Vision Feature Enhancement (VFE) module, and an LLM. The proposed VFE module performs feature-space enhancement guided by physical priors through two sequential steps: (1) removing backscattering and (2) restoring light absorption.
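For intuition, the underwater imaging model is commonly written as I = J·T + B·(1 − T), where I is the observed image, J the clear scene, T the transmission, and B the backscattered light; undoing the degradation therefore means subtracting the backscatter term and then dividing by the transmission. The sketch below mirrors these two steps on visual features. It is only an illustration under that assumption; the module structure and parameter names are ours, not the released VFE implementation.

# Minimal sketch (not the released implementation) of feature-space enhancement guided
# by the imaging model I = J*T + B*(1 - T): step 1 removes a predicted backscatter
# component, step 2 compensates light absorption via a predicted transmission map.
import torch
import torch.nn as nn

class VFESketch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.backscatter = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.occlusion = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())    # ~ (1 - T)
        self.transmission = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid()) # ~ T

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Step 1: remove backscattering, roughly I - B * (1 - T).
        feat = feat - self.backscatter(feat) * self.occlusion(feat)
        # Step 2: restore light absorption, roughly J = (I - B*(1 - T)) / T.
        return feat / self.transmission(feat).clamp(min=0.1)

# Plug-and-play usage: enhance patch tokens from the image encoder before the projector.
tokens = torch.randn(2, 576, 1024)            # (batch, patches, dim), illustrative sizes
enhanced = VFESketch(1024)(tokens)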

NAUTILUS Performance

Performance of our NAUTILUS.
| Method                | Classification: Coarse (acc) | Classification: Fine (acc) | Caption: Image (METEOR) | Caption: Region (METEOR) | Grounding (mIoU) | Grounding (PR@0.5) | Detection (mAP) | Detection (mAP@0.5) | VQA (METEOR) | Counting (MAE ↓) |
|-----------------------|------------------------------|----------------------------|-------------------------|--------------------------|------------------|--------------------|-----------------|---------------------|--------------|------------------|
| NAUTILUS (LLaVA-1.5)  | 91.0 | 89.9 | 0.208 | 0.191 | 46.2 | 52.2 | 11.1 | 20.9 | 0.365 | 51.2 |
| NAUTILUS (Qwen2.5-VL) | 90.3 | 93.8 | 0.223 | 0.199 | 53.8 | 58.8 | 25.8 | 45.3 | 0.381 | 30.9 |

Install

  1. Clone this repository and navigate to the NAUTILUS folder
git clone https://github.com/Chengnotwang/NAUTILUS.git
cd NAUTILUS
  2. Environment Setup for NAUTILUS (LLaVA)

    Recommended Environment

  • CUDA 12.1
  • Python 3.10
# 1. Create a new conda environment
conda create -n nautilus_llava python=3.10 -y
conda activate nautilus_llava
# 2. Install requirements
cd LLaVA
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
# 3. Install flash-attn
pip install flash-attn==2.6.3 --no-build-isolation
# If installation fails, install the pre-built wheel from [https://github.com/Dao-AILab/flash-attention/releases/tag/v2.6.3] instead:
pip install flash_attn-2.6.3+cu123torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
  3. Environment Setup for NAUTILUS (Qwen)

    Recommended Environment

  • CUDA 12.4
  • Python 3.10
# 1. Create a new conda environment
conda create -n nautilus_qwen python=3.10 -y
conda activate nautilus_qwen
# 2. Install requirements
cd qwen-vl-finetune
pip install -r requirements.txt
# 3. Install flash-attn-2.7.3
pip install flash-attn==2.7.3 --no-build-isolation
# If installation fails, install the pre-built wheel from [https://github.com/Dao-AILab/flash-attention/releases/tag/v2.7.3] instead:
pip install flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
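
After either environment is created, a quick import check helps verify that the installation succeeded. This is only a convenience sketch, assuming the packages installed above; flash_attn exposes a __version__ attribute in recent releases.

# sanity_check.py -- run inside the activated nautilus_llava or nautilus_qwen environment
import torch
import flash_attn  # raises ImportError if the flash-attn wheel did not install correctly

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)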

Dataset

For details on NautData and training / evaluation preparation, please refer to NautData.

Train

Dataset Preparation and Preprocessing

Ensure that NautData is fully downloaded and the corresponding annotation files are correctly organized as described in the previous section. Once prepared, you can proceed with the training instructions below.

NAUTILUS (LLaVA)

  1. Download llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 and depth_anything_v2_vitl.pth.

  2. (Optional) Preprocess the depth_anything_v2_vitl.pth weights into dino_vitl.pth by running process_vitl_weight.py (a sketch of this conversion follows the commands below). Then place the generated file in the directory specified in the fine-tuning script.

  3. Fine-tune NAUTILUS (LLaVA) with the provided script, following the instructions below.

# Remove unused parameters and rename the remaining ones
python utils/process_vitl_weight.py --dav2-vitl "path to depth_anything_v2_vitl.pth" --dinov2-vitl "dinov2 only pth file"
conda activate nautilus_llava
cd LLaVA
# Before training, make sure to modify the checkpoint paths and other hyper-parameters in the script
# Start training
bash scripts/nautilus_finetune/finetune_nautilus_lora.sh
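
For intuition, step 2 essentially strips the depth-decoder weights from the Depth Anything V2 checkpoint and keeps only the DINOv2-style backbone. The snippet below is a rough sketch of that idea under an assumed key layout (the "pretrained." prefix and file names are illustrative); the actual mapping is implemented in utils/process_vitl_weight.py.

# Illustrative sketch only: keep the DINOv2 backbone tensors and drop the depth head.
# The key prefix is an assumption; see utils/process_vitl_weight.py for the real mapping.
import torch

ckpt = torch.load("depth_anything_v2_vitl.pth", map_location="cpu")
state = ckpt.get("model", ckpt)               # some checkpoints nest weights under "model"

backbone = {}
for name, tensor in state.items():
    if name.startswith("pretrained."):        # assumed prefix of the DINOv2 encoder
        backbone[name[len("pretrained."):]] = tensor   # drop the prefix, keep the weight
    # everything else (e.g. the depth decoder) is discarded

torch.save(backbone, "dino_vitl.pth")
print(f"kept {len(backbone)} tensors")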

NAUTILUS (Qwen)

  1. We use the Qwen2.5-VL-7B-Instruct model in our experiments. You may also choose other Qwen2.5-VL variants if desired.

  2. Verify that the annotation_path and data_path for Nautilus_Instruct are correctly configured in this file.

  3. Run the NAUTILUS (Qwen) fine-tuning script.

conda activate nautilus_qwen
cd qwen-vl-finetune
# Make sure to modify the paths and other hyper-parameters in the script
# Start training
bash scripts/nautilus_finetune/nautilus_sft_7b_lora.sh
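
Both fine-tuning scripts train LoRA adapters on top of the frozen base model. For readers unfamiliar with LoRA, the snippet below shows what such a setup looks like with the peft library; the rank, target modules, and other values are generic examples, not the settings used in nautilus_sft_7b_lora.sh.

# Generic LoRA example with the peft library; values are illustrative only and do not
# reproduce the configuration of the released fine-tuning scripts.
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto"
)
lora_cfg = LoraConfig(
    r=16,                      # low-rank dimension
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()   # only the adapter weights are trainable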

Model Weights

| Model                 | Config | Download    | Train log | Eval json |
|-----------------------|--------|-------------|-----------|-----------|
| NAUTILUS (LLaVA-1.5)  | config | Huggingface | log       | json      |
| NAUTILUS (Qwen2.5-VL) | config | Huggingface | log       | json      |

Local Inference

NAUTILUS (LLaVA)

Single-Sample Inference for NAUTILUS (LLaVA)

cd LLaVA
CUDA_VISIBLE_DEVICES=0 python scripts/inference/inference.py --model-path "path to checkpoint" --dinov2-weight "path to dinov2" --image "path to image" --prompt "question"
# prompt default is "Describe the image"

NAUTILUS (Qwen)

Single-Sample Inference for NAUTILUS (Qwen)

cd qwen-vl-finetune
CUDA_VISIBLE_DEVICES=0 python scripts/inference.py --checkpoint "path to checkpoint" --image "path to image" --prompt "question"
# prompt default is "Describe the image"
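
If you prefer calling the model from Python rather than through scripts/inference.py, the standard Hugging Face interface for Qwen2.5-VL can be used. The sketch below assumes the LoRA weights have already been merged into a regular Qwen2.5-VL checkpoint; paths are placeholders, and it follows the upstream Qwen2.5-VL usage rather than anything NAUTILUS-specific.

# Minimal sketch of direct inference with a merged Qwen2.5-VL-style checkpoint.
# Paths are placeholders; this mirrors the upstream Qwen2.5-VL usage.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

checkpoint = "path/to/merged_nautilus_qwen_checkpoint"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    checkpoint, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(checkpoint)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/underwater_image.jpg"},
        {"type": "text", "text": "Describe the image"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])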

Evaluation

We provide simple scripts for evaluating the performance of NAUTILUS.

# for evaluation
bash eval/eval_nautilus_llava.sh # for Nautilus (LLaVA)
bash eval/eval_nautilus_qwen.sh # for Nautilus (Qwen)
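
Most tasks use standard metrics (accuracy, METEOR, mIoU, mAP); the counting task reports the mean absolute error (MAE) between predicted and ground-truth counts, where lower is better. A tiny sketch of that computation is shown below; the JSON record layout is a made-up example, not the format produced by the evaluation scripts.

# Mean absolute error for the counting task; the record layout below is illustrative only.
import json

with open("counting_predictions.json") as f:
    records = json.load(f)                    # e.g. [{"pred": 7, "gt": 9}, ...]

mae = sum(abs(r["pred"] - r["gt"]) for r in records) / len(records)
print(f"Counting MAE: {mae:.1f}")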

Acknowledgement

Sincere thanks to the amazing open-source community for their great contributions:

  • LLaVA: for the excellent work and inspiring codebase.
  • Qwen-2.5-VL: for the powerful open-source model and codebase (now updated to Qwen-3-VL).
  • Depth AnythingV2: for the impressive work and for sharing their model weights with the community.

Citation

If you find this repository useful in your research, please consider giving a star ⭐ and a citation.

@inproceedings{xu2025nautilus,
        title={NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding},
        author={Xu, Wei and Wang, Cheng and Liang, Dingkang and Zhao, Zongchuang and Jiang, Xingyu and Zhang, Peng and Bai, Xiang},
        booktitle={Advances in Neural Information Processing Systems},
        year={2025}
  }