
NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding

Wei Xu1*, Cheng Wang1*, Dingkang Liang1, Zongchuang Zhao1, Xingyu Jiang1, Peng Zhang2, Xiang Bai1

1 Huazhong University of Science & Technology, 2 National University of Defense Technology

(*) Equal contribution.

arXiv Github Project Code License

News

  • [2025/10/30] Release the official version of NAUTILUS.
  • [2025/09/18] NAUTILUS is accepted to NeurIPS 2025! 🥳🥳🥳


Introduction

Contributions

This work makes three contributions:

  1. We construct NautData, a large-scale underwater instruction-following dataset containing 1.45M image-text pairs, enabling the development and evaluation of underwater LMMs.
  2. We build NAUTILUS, the first eight-task underwater LMM, which performs underwater scene understanding at the image, region, and object levels and aggregates hierarchical scene information for comprehensive understanding.
  3. We design a plug-and-play Vision Feature Enhancement (VFE) module motivated by a physical underwater imaging model; it explicitly restores degraded information in the feature space. Experiments on well-known baselines demonstrate its effectiveness on all annotated tasks.

Pipeline

NAUTILUS comprises an image encoder, a depth encoder, a vision-to-language projector, a Vision Feature Enhancement (VFE) module, and an LLM. The proposed VFE module performs feature-space enhancement guided by physical priors through two sequential steps: (1) removing backscattering and (2) restoring light absorption.
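For intuition, the underwater imaging model is commonly written as I = J·T + B·(1 − T), where I is the observed image, J the clear scene, T the transmission, and B the backscattered light; undoing the degradation therefore means subtracting the backscatter term and then dividing by the transmission. The sketch below mirrors these two steps on visual features. It is only an illustration under that assumption; the module structure and parameter names are ours, not the released VFE implementation.

# Minimal sketch (not the released implementation) of feature-space enhancement guided
# by the imaging model I = J*T + B*(1 - T): step 1 removes a predicted backscatter
# component, step 2 compensates light absorption via a predicted transmission map.
import torch
import torch.nn as nn

class VFESketch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.backscatter = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.occlusion = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())    # ~ (1 - T)
        self.transmission = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid()) # ~ T

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Step 1: remove backscattering, roughly I - B * (1 - T).
        feat = feat - self.backscatter(feat) * self.occlusion(feat)
        # Step 2: restore light absorption, roughly J = (I - B*(1 - T)) / T.
        return feat / self.transmission(feat).clamp(min=0.1)

# Plug-and-play usage: enhance patch tokens from the image encoder before the projector.
tokens = torch.randn(2, 576, 1024)            # (batch, patches, dim), illustrative sizes
enhanced = VFESketch(1024)(tokens)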

NAUTILUS Performance

Performance of our NAUTILUS.
| Method                | Classification: Coarse (acc) | Classification: Fine (acc) | Caption: Image (METEOR) | Caption: Region (METEOR) | Grounding (mIoU) | Grounding (PR@0.5) | Detection (mAP) | Detection (mAP@0.5) | VQA (METEOR) | Counting (MAE ↓) |
|-----------------------|------------------------------|----------------------------|-------------------------|--------------------------|------------------|--------------------|-----------------|---------------------|--------------|------------------|
| NAUTILUS (LLaVA-1.5)  | 91.0 | 89.9 | 0.208 | 0.191 | 46.2 | 52.2 | 11.1 | 20.9 | 0.365 | 51.2 |
| NAUTILUS (Qwen2.5-VL) | 90.3 | 93.8 | 0.223 | 0.199 | 53.8 | 58.8 | 25.8 | 45.3 | 0.381 | 30.9 |

Install

  1. Clone this repository and navigate to the NAUTILUS folder
git clone https://github.com/Chengnotwang/NAUTILUS.git
cd NAUTILUS
  2. Environment Setup for NAUTILUS (LLaVA)

    Recommended Environment

  • CUDA 12.1
  • Python 3.10
# 1. Create a new conda environment
conda create -n nautilus_llava python=3.10 -y
conda activate nautilus_llava
# 2. Install requirements
cd LLaVA
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
# 3. Install flash-attn
pip install flash-attn==2.6.3 --no-build-isolation
# If installation fails, install the pre-built wheel from [https://github.com/Dao-AILab/flash-attention/releases/tag/v2.6.3] instead:
pip install flash_attn-2.6.3+cu123torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
  3. Environment Setup for NAUTILUS (Qwen)

    Recommended Environment

  • CUDA 12.4
  • Python 3.10
# 1. Create a new conda environment
conda create -n nautilus_qwen python=3.10 -y
conda activate nautilus_qwen
# 2. Install requirements
cd qwen-vl-finetune
pip install -r requirements.txt
# 3. Install flash-attn-2.7.3
pip install flash-attn==2.7.3 --no-build-isolation
# If installation fails, install the pre-built wheel from [https://github.com/Dao-AILab/flash-attention/releases/tag/v2.7.3] instead:
pip install flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
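
After either environment is created, a quick import check helps verify that the installation succeeded. This is only a convenience sketch, assuming the packages installed above; flash_attn exposes a __version__ attribute in recent releases.

# sanity_check.py -- run inside the activated nautilus_llava or nautilus_qwen environment
import torch
import flash_attn  # raises ImportError if the flash-attn wheel did not install correctly

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)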

Dataset

For details on NautData and training / evaluation preparation, please refer to NautData.

Train

Dataset Preparation and Preprocessing

Ensure that NautData is fully downloaded and the corresponding annotation files are correctly organized as described in the previous section. Once prepared, you can proceed with the training instructions below.

NAUTILUS (LLaVA)

  1. Download llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 and depth_anything_v2_vitl.pth.

  2. (Optional) Preprocess the depth_anything_v2_vitl.pth weights into dino_vitl.pth by running process_vitl_weight.py (a sketch of this conversion follows the commands below). Then place the generated file in the directory specified in the fine-tuning script.

  3. Fine-tune NAUTILUS (LLaVA) with the provided script, following the instructions below.

# Remove unused parameters and rename the remaining ones
python utils/process_vitl_weight.py --dav2-vitl "path to depth_anything_v2_vitl.pth" --dinov2-vitl "dinov2 only pth file"
conda activate nautilus_llava
cd LLaVA
# Before training, make sure to modify the checkpoint paths and other hyper-parameters in the script
# Start training
bash scripts/nautilus_finetune/finetune_nautilus_lora.sh
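
For intuition, step 2 essentially strips the depth-decoder weights from the Depth Anything V2 checkpoint and keeps only the DINOv2-style backbone. The snippet below is a rough sketch of that idea under an assumed key layout (the "pretrained." prefix and file names are illustrative); the actual mapping is implemented in utils/process_vitl_weight.py.

# Illustrative sketch only: keep the DINOv2 backbone tensors and drop the depth head.
# The key prefix is an assumption; see utils/process_vitl_weight.py for the real mapping.
import torch

ckpt = torch.load("depth_anything_v2_vitl.pth", map_location="cpu")
state = ckpt.get("model", ckpt)               # some checkpoints nest weights under "model"

backbone = {}
for name, tensor in state.items():
    if name.startswith("pretrained."):        # assumed prefix of the DINOv2 encoder
        backbone[name[len("pretrained."):]] = tensor   # drop the prefix, keep the weight
    # everything else (e.g. the depth decoder) is discarded

torch.save(backbone, "dino_vitl.pth")
print(f"kept {len(backbone)} tensors")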

NAUTILUS (Qwen)

  1. We use the Qwen2.5-VL-7B-Instruct model in our experiments. You may also choose other Qwen2.5-VL variants if desired.

  2. Verify that the annotation_path and data_path for Nautilus_Instruct are correctly configured in this file.

  3. Run the NAUTILUS (Qwen) fine-tuning script.

conda activate nautilus_qwen
cd qwen-vl-finetune
# Make sure to modify the paths and other hyper-parameters in the script
# Start training
bash scripts/nautilus_finetune/nautilus_sft_7b_lora.sh
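
Both fine-tuning scripts train LoRA adapters on top of the frozen base model. For readers unfamiliar with LoRA, the snippet below shows what such a setup looks like with the peft library; the rank, target modules, and other values are generic examples, not the settings used in nautilus_sft_7b_lora.sh.

# Generic LoRA example with the peft library; values are illustrative only and do not
# reproduce the configuration of the released fine-tuning scripts.
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto"
)
lora_cfg = LoraConfig(
    r=16,                      # low-rank dimension
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()   # only the adapter weights are trainable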

Model Weights

| Model                 | Config | Download    | Train log | Eval json |
|-----------------------|--------|-------------|-----------|-----------|
| NAUTILUS (LLaVA-1.5)  | config | Huggingface | log       | json      |
| NAUTILUS (Qwen2.5-VL) | config | Huggingface | log       | json      |

Local Inference

NAUTILUS (LLaVA)

Single-Sample Inference for NAUTILUS (LLaVA)

cd LLaVA
CUDA_VISIBLE_DEVICES=0 python scripts/inference/inference.py --model-path "path to checkpoint" --dinov2-weight "path to dinov2" --image "path to image" --prompt "question"
# prompt default is "Describe the image"

NAUTILUS (Qwen)

Single-Sample Inference for NAUTILUS (Qwen)

cd qwen-vl-finetune
CUDA_VISIBLE_DEVICES=0 python scripts/inference.py --checkpoint "path to checkpoint" --image "path to image" --prompt "question"
# prompt default is "Describe the image"
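
If you prefer calling the model from Python rather than through scripts/inference.py, the standard Hugging Face interface for Qwen2.5-VL can be used. The sketch below assumes the LoRA weights have already been merged into a regular Qwen2.5-VL checkpoint; paths are placeholders, and it follows the upstream Qwen2.5-VL usage rather than anything NAUTILUS-specific.

# Minimal sketch of direct inference with a merged Qwen2.5-VL-style checkpoint.
# Paths are placeholders; this mirrors the upstream Qwen2.5-VL usage.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

checkpoint = "path/to/merged_nautilus_qwen_checkpoint"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    checkpoint, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(checkpoint)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/underwater_image.jpg"},
        {"type": "text", "text": "Describe the image"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])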

Evaluation

We provide simple scripts for evaluating the performance of NAUTILUS.

# for evaluation
bash eval/eval_nautilus_llava.sh # for Nautilus (LLaVA)
bash eval/eval_nautilus_qwen.sh # for Nautilus (Qwen)
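
Most tasks use standard metrics (accuracy, METEOR, mIoU, mAP); the counting task reports the mean absolute error (MAE) between predicted and ground-truth counts, where lower is better. A tiny sketch of that computation is shown below; the JSON record layout is a made-up example, not the format produced by the evaluation scripts.

# Mean absolute error for the counting task; the record layout below is illustrative only.
import json

with open("counting_predictions.json") as f:
    records = json.load(f)                    # e.g. [{"pred": 7, "gt": 9}, ...]

mae = sum(abs(r["pred"] - r["gt"]) for r in records) / len(records)
print(f"Counting MAE: {mae:.1f}")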

Acknowledgement

Sincere thanks to the amazing open-source community for their great contributions:

  • LLaVA: for the excellent work and inspiring codebase.
  • Qwen-2.5-VL: for the powerful open-source model and codebase (now updated to Qwen-3-VL).
  • Depth AnythingV2: for the impressive work and for sharing their model weights with the community.

Citation

If you find this repository useful in your research, please consider giving a star ⭐ and a citation.

@inproceedings{xu2025nautilus,
        title={NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding},
        author={Xu, Wei and Wang, Cheng and Liang, Dingkang and Zhao, Zongchuang and Jiang, Xingyu and Zhang, Peng and Bai, Xiang},
        booktitle={Advances in Neural Information Processing Systems},
        year={2025}
  }