Accepted at the IEEE/CVF International Conference on Computer Vision (ICCV) 2025.
2025.09🌟 We released SC-Captioner, a reinforcement learning method to improve image captioning with self-correction.
We propose SC-Captioner, a reinforcement learning framework that enables the self-correcting capability of image caption models. The crucial technique lies in the design of the reward function that incentivizes accurate caption corrections. Specifically, the predicted and reference captions are decomposed into object, attribute, and relation sets using scene-graph parsing algorithms. We calculate the set difference between the sets of the initial and self-corrected captions to identify added and removed elements. These elements are matched against the reference sets to assign correctness bonuses for accurate refinements and mistake punishments for wrong additions and removals, which together form the final reward. For image caption quality assessment, we propose a set of metrics refined from CAPTURE that alleviate its incomplete precision evaluation and inefficient relation matching. Furthermore, we collect RefinedCaps, a fine-grained annotated image caption dataset consisting of 6.5K diverse images from the COCO dataset. Experiments show that applying SC-Captioner to large vision-language models generates better image captions across various scenarios, significantly outperforming the direct preference optimization training strategy.
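For intuition, here is a minimal sketch of the correction reward in Python. It is not our actual implementation: the flat-set representation, the exact-match test against the reference, and the unit bonus/penalty weights are illustrative assumptions.

```python
# Minimal sketch of the self-correction reward (illustrative only).
# Assumptions: captions are already parsed into flat sets of object /
# attribute / relation strings; exact-match against the reference stands
# in for the matching procedure; unit bonus and penalty weights.

def correction_reward(initial: set, corrected: set, reference: set,
                      bonus: float = 1.0, penalty: float = 1.0) -> float:
    added = corrected - initial      # elements introduced by self-correction
    removed = initial - corrected    # elements deleted by self-correction

    reward = 0.0
    for e in added:
        # correct addition: the new element appears in the reference
        reward += bonus if e in reference else -penalty
    for e in removed:
        # correct removal: the deleted element is absent from the reference
        reward += bonus if e not in reference else -penalty
    return reward

# Example: the model removes a hallucinated object and adds a correct one.
initial = {"dog", "red ball", "man holding dog"}
corrected = {"dog", "frisbee", "man holding dog"}
reference = {"dog", "frisbee", "man holding dog"}
print(correction_reward(initial, corrected, reference))  # 2.0
```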
Our code is built upon a specific version of LLaMA-Factory and is based on Qwen2-VL; other versions are not tested.
```
git clone https://github.com/zl2048/SC-Captioner.git
cd SC-Captioner
conda create -n xxx python=3.10
conda activate xxx
```

Then install PyTorch.

```
pip install transformers==4.45.0
pip install -e .
pip install openai==1.45.0
pip install trl==0.12.0 --no-deps
pip install capture_metric rouge-chinese jieba
```

Download images from DOCCI and COCO.
Training and testing labels for DOCCI, RefinedCaps (coco6k), DOCCI500, and COCO-LN500 can be processed as follows:

- Download the json file from the Hugging Face space.
- Rewrite the image paths in the json file using `process_json.py` (please change `json_file_path` and `image_directory` in `process_json.py`; a minimal sketch of this step follows the list).
- Put it under the `data/` folder.
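For reference, here is a minimal sketch of the path-rewriting step. The `images` field name and the list-of-records layout are assumptions; check the actual json schema in the repo before adapting it.

```python
# Hypothetical sketch of what process_json.py does: point each image path
# at your local image directory. The "images" field and the list-of-records
# layout are assumptions about the label format.
import json
import os

json_file_path = "data/docci_train.json"  # placeholder: your label file
image_directory = "/path/to/images"       # placeholder: your image root

with open(json_file_path, "r", encoding="utf-8") as f:
    records = json.load(f)

for record in records:
    # Keep only the original file name and prepend the local directory.
    record["images"] = [
        os.path.join(image_directory, os.path.basename(p))
        for p in record.get("images", [])
    ]

with open(json_file_path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```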
Download the checkpoints of Qwen2-VL and change `model_name_or_path` in the config files, then run:

```
llamafactory-cli train config/qwen2vl_train_lora_sft.yaml
llamafactory-cli export config/qwen2vl_merge.yaml
llamafactory-cli train config/qwen2vl_train_lora_sc.yaml
llamafactory-cli train config/qwen2vl_test_lora_sc_docci500.yaml
./run_metrics_docci500.sh saves/eval_qwen2vl/sc/docci500
```

Because relation evaluation relies on external API calls, we provide only the evaluation questions.
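To illustrate how the provided relation questions could be sent to an API, here is a minimal sketch using the OpenAI Python client; the question-file name and format, the prompt, and the model name are placeholders rather than our exact evaluation pipeline.

```python
# Hypothetical sketch: answer relation-evaluation questions via a chat API.
# The jsonl layout (one {"question": ...} per line) and the model name are
# assumptions; substitute the files and model your setup actually uses.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("relation_questions.jsonl", "r", encoding="utf-8") as f:
    questions = [json.loads(line) for line in f]

for q in questions:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": q["question"]}],
    )
    print(response.choices[0].message.content)
```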
This repo benefits from LLaMA-Factory, Qwen2-VL, and TRL. Thanks for their wonderful work.
If you find the provided code or models useful for your research, please consider citing:
@InProceedings{zhang2025sc,
author = {Zhang, Lin and Zeng, Xianfang and Li, Kangcong and Yu, Gang and Chen, Tao},
title = {SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {23145-23155}
}
