Accepted at the IEEE/CVF International Conference on Computer Vision (ICCV) 2025.
2025.09🌟 We released SC-Captioner, a reinforcement learning method to improve image captioning with self-correction.
We propose SC-Captioner, a reinforcement learning framework that enables the self-correcting capability of image caption models. The crucial technique lies in the design of the reward function that incentivizes accurate caption corrections. Specifically, the predicted and reference captions are decomposed into object, attribute, and relation sets using scene-graph parsing algorithms. We calculate the set difference between the sets of the initial and self-corrected captions to identify added and removed elements. These elements are matched against the reference sets to assign correctness bonuses for accurate refinements and mistake punishments for wrong additions and removals, which together form the final reward. For image caption quality assessment, we propose a set of metrics refined from CAPTURE that alleviate its incomplete precision evaluation and inefficient relation matching. Furthermore, we collect RefinedCaps, a fine-grained annotated image caption dataset consisting of 6.5K diverse images from the COCO dataset. Experiments show that applying SC-Captioner to large vision-language models generates better image captions across various scenarios, significantly outperforming the direct preference optimization training strategy.
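For intuition, here is a minimal sketch of the correction reward in Python. It is not our actual implementation: the flat-set representation, the exact-match test against the reference, and the unit bonus/penalty weights are illustrative assumptions.

```python
# Minimal sketch of the self-correction reward (illustrative only).
# Assumptions: captions are already parsed into flat sets of object /
# attribute / relation strings; exact-match against the reference stands
# in for the matching procedure; unit bonus and penalty weights.

def correction_reward(initial: set, corrected: set, reference: set,
                      bonus: float = 1.0, penalty: float = 1.0) -> float:
    added = corrected - initial      # elements introduced by self-correction
    removed = initial - corrected    # elements deleted by self-correction

    reward = 0.0
    for e in added:
        # correct addition: the new element appears in the reference
        reward += bonus if e in reference else -penalty
    for e in removed:
        # correct removal: the deleted element is absent from the reference
        reward += bonus if e not in reference else -penalty
    return reward

# Example: the model removes a hallucinated object and adds a correct one.
initial = {"dog", "red ball", "man holding dog"}
corrected = {"dog", "frisbee", "man holding dog"}
reference = {"dog", "frisbee", "man holding dog"}
print(correction_reward(initial, corrected, reference))  # 2.0
```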
Our code is built upon a specific version of LLaMA-Factory and is based on Qwen2-VL; other versions are not tested.
```
git clone https://github.com/zl2048/SC-Captioner.git
cd SC-Captioner
conda create -n xxx python=3.10
conda activate xxx
```

Then install PyTorch.

```
pip install transformers==4.45.0
pip install -e .
pip install openai==1.45.0
pip install trl==0.12.0 --no-deps
pip install capture_metric rouge-chinese jieba
```

Download images from DOCCI and COCO.
Training and testing labels for DOCCI, RefinedCaps (coco6k), DOCCI500, and COCO-LN500 can be processed as follows:

- Download the json file from the Hugging Face space.
- Rewrite the image paths in the json file using `process_json.py` (please change `json_file_path` and `image_directory` in `process_json.py`; a minimal sketch of this step follows the list).
- Put it under the `data/` folder.
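For reference, here is a minimal sketch of the path-rewriting step. The `images` field name and the list-of-records layout are assumptions; check the actual json schema in the repo before adapting it.

```python
# Hypothetical sketch of what process_json.py does: point each image path
# at your local image directory. The "images" field and the list-of-records
# layout are assumptions about the label format.
import json
import os

json_file_path = "data/docci_train.json"  # placeholder: your label file
image_directory = "/path/to/images"       # placeholder: your image root

with open(json_file_path, "r", encoding="utf-8") as f:
    records = json.load(f)

for record in records:
    # Keep only the original file name and prepend the local directory.
    record["images"] = [
        os.path.join(image_directory, os.path.basename(p))
        for p in record.get("images", [])
    ]

with open(json_file_path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```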
Download the checkpoints of Qwen2-VL and change `model_name_or_path` in the config files, then run:

```
llamafactory-cli train config/qwen2vl_train_lora_sft.yaml
llamafactory-cli export config/qwen2vl_merge.yaml
llamafactory-cli train config/qwen2vl_train_lora_sc.yaml
llamafactory-cli train config/qwen2vl_test_lora_sc_docci500.yaml
./run_metrics_docci500.sh saves/eval_qwen2vl/sc/docci500
```

Because relation evaluation relies on external API calls, we provide only the evaluation questions.
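To illustrate how the provided relation questions could be sent to an API, here is a minimal sketch using the OpenAI Python client; the question-file name and format, the prompt, and the model name are placeholders rather than our exact evaluation pipeline.

```python
# Hypothetical sketch: answer relation-evaluation questions via a chat API.
# The jsonl layout (one {"question": ...} per line) and the model name are
# assumptions; substitute the files and model your setup actually uses.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("relation_questions.jsonl", "r", encoding="utf-8") as f:
    questions = [json.loads(line) for line in f]

for q in questions:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": q["question"]}],
    )
    print(response.choices[0].message.content)
```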
This repo benefits from LLaMA-Factory, Qwen2-VL, and TRL. Thanks for their wonderful work.
If you find the provided code or models useful for your research, please consider citing:
@InProceedings{zhang2025sc,
author = {Zhang, Lin and Zeng, Xianfang and Li, Kangcong and Yu, Gang and Chen, Tao},
title = {SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {23145-23155}
}
