IAD-R1: Reinforcing Consistent Reasoning in Industrial Anomaly Detection

We propose IAD-R1, a universal post-training framework that enhances Vision-Language Models for industrial anomaly detection through a two-stage training strategy.

📔 Paper │ 🤖 Models │ 💾 Datasets

πŸ—žοΈ More Experiment Results

  • We further refine the captions in the Expert-AD dataset and retrain the model, obtaining IAD-R1-UPDATE. Compared with IAD-R1, IAD-R1-UPDATE achieves consistently better zero-shot detection performance.
    • IAD-R1-UPDATE (Qwen2.5-VL-Instruct-3B) improves average accuracy by 2.7% over IAD-R1 (Qwen2.5-VL-Instruct-3B).
    • IAD-R1-UPDATE (Qwen2-VL-Instruct-2B) improves average accuracy by 1.8% over IAD-R1 (Qwen2-VL-Instruct-2B).
    • IAD-R1-UPDATE (Qwen2.5-VL-Instruct-7B) improves average accuracy by 1.2% over IAD-R1 (Qwen2.5-VL-Instruct-7B).
    • Additional experimental results are provided in supplementary_results/IAD-R1-UPDATE_Results.
  • We additionally report the zero-shot results of Qwen-VL-MAX on the six evaluation benchmarks; see supplementary_results/Commercial-model_Results/qwen_vl_max.
  • We further include 2-shot and 4-shot results for both the pretrained models and their IAD-R1–finetuned counterparts; see supplementary_results/Few-Shot-Settings_Results.

🔥 Overview

Industrial anomaly detection is a critical component of modern manufacturing, yet the scarcity of defective samples restricts traditional detection methods to scenario-specific applications. Although Vision-Language Models (VLMs) demonstrate significant advantages in generalization, their performance on industrial anomaly detection remains limited. To address this challenge, we propose IAD-R1, a universal post-training framework applicable to VLMs of different architectures and parameter scales that substantially enhances their anomaly detection capabilities. IAD-R1 employs a two-stage training strategy: the Perception Activation Supervised Fine-Tuning (PA-SFT) stage trains on a meticulously constructed high-quality Chain-of-Thought dataset (Expert-AD), enhancing anomaly perception and establishing reasoning-to-answer correlations; the Structured Control Group Relative Policy Optimization (SC-GRPO) stage employs carefully designed reward functions to achieve a capability leap from "Anomaly Perception" to "Anomaly Interpretation". Experimental results demonstrate that IAD-R1 achieves significant improvements across 7 VLMs; the largest gain is on the DAGM dataset, where average accuracy is 43.3% higher than the 0.5B baseline. Notably, the 0.5B model trained with IAD-R1 surpasses commercial models including GPT-4.1 and Claude-Sonnet-4 in zero-shot settings, demonstrating the effectiveness and superiority of IAD-R1.
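
To make the SC-GRPO idea concrete, the sketch below shows one plausible shape of a structured reward: it checks that a completion follows the <think>/<location>/<type>/<answer> tag format used by the training data and that the final verdict matches the label. This is an illustrative assumption only, not the reward implementation released with the paper; the function name and weights are hypothetical.

import re

# Illustrative sketch of a structured format + accuracy reward in the spirit
# of SC-GRPO. The tag layout mirrors the Expert-AD training data; the weights
# and exact checks are assumptions, not the released reward design.
def structured_reward(completion: str, gt_answer: str) -> float:
    reward = 0.0
    answer = re.search(r"<answer>(Yes|No)</answer>", completion)
    # Format component: reason inside <think>, commit inside <answer>.
    if re.search(r"<think>.+?</think>", completion, re.DOTALL) and answer:
        reward += 0.5
    # Anomalous samples must also localize and name the defect.
    if gt_answer == "Yes" and not (
        re.search(r"<location>.+?</location>", completion, re.DOTALL)
        and re.search(r"<type>.+?</type>", completion, re.DOTALL)
    ):
        return reward  # structured fields missing: no accuracy credit
    # Accuracy component: the final verdict matches the ground truth.
    if answer and answer.group(1) == gt_answer:
        reward += 1.0
    return reward

# A well-formed, correct "No" earns both components (1.5 here).
print(structured_reward("<think>Surface looks uniform.</think><answer>No</answer>", "No"))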

👾 IAD-R1 Models

For each base model, both IAD-R1 and IAD-R1-UPDATE checkpoints are available:

  • LLaVA-OneVision-SI-0.5B: IAD-R1(LLaVA-OneVision-SI-0.5B), IAD-R1-UPDATE(LLaVA-OneVision-SI-0.5B)
  • Qwen2-VL-Instruct-2B: IAD-R1(Qwen2-VL-Instruct-2B), IAD-R1-UPDATE(Qwen2-VL-Instruct-2B)
  • Qwen2.5-VL-Instruct-3B: IAD-R1(Qwen2.5-VL-Instruct-3B), IAD-R1-UPDATE(Qwen2.5-VL-Instruct-3B)
  • Qwen2.5-VL-Instruct-7B: IAD-R1(Qwen2.5-VL-Instruct-7B), IAD-R1-UPDATE(Qwen2.5-VL-Instruct-7B)
  • LLaVA-1.5-7B: IAD-R1(LLaVA-1.5-7B), IAD-R1-UPDATE(LLaVA-1.5-7B)
  • LLaVA-OneVision-SI-7B: IAD-R1(LLaVA-OneVision-SI-7B), IAD-R1-UPDATE(LLaVA-OneVision-SI-7B)
  • LLaVA-1.6-8B: IAD-R1(LLaVA-1.6-8B), IAD-R1-UPDATE(LLaVA-1.6-8B)

🚀 Getting Started

πŸ› οΈ Environment Configuration

# Clone the project.
git clone https://github.com/Yanhui-Lee/IAD-R1.git
cd IAD-R1

# Build and activate the conda environment, then install dependencies.
conda create -n IAD-R1 python=3.10
conda activate IAD-R1
pip install -r requirements.txt
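
After installation, a quick sanity check run inside the activated environment can confirm that the core stack imports and the GPU is visible. This assumes requirements.txt pins PyTorch and Transformers; adjust to whatever your environment actually installs.

# Post-install sanity check (assumes torch and transformers come from
# requirements.txt; adapt if your pinned packages differ).
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)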

🧰 Data Preparation

Training data

  • Construct the training datasets for PA-SFT and SC-GRPO (a builder sketch follows the two format examples below).

For PA-SFT training:

[
  {
      "images": "image_path",
      "messages": [
        {
          "role": "user",
          "content": "<image>\nAre there any defects in the query image?"
        },
        {
          "role": "assistant",
          "content": "<think>[Thinking process]</think><answer>No</answer>"
        }
      ]
  },
  {
      "images": "image_path",
      "messages": [
        {
          "role": "user",
          "content": "<image>\nAre there any defects in the query image?"
        },
        {
          "role": "assistant",
          "content": "<think>[Thinking process]</think><location>[Location]</location><type>[Type]</type><answer>Yes</answer>"
        }
      ]
  }
]

For SC-GRPO training:

[
  {
      "id": "xxx",
      "image": "image_path",
      "problem": "Are there any defects in the query image?",
      "solution": "<answer>No</answer>"
  },
  {
      "id": "xxx",
      "image": "image_path",
      "problem": "Are there any defects in the query image?",
      "solution": "<location>[Location]</location><type>[Type]</type><answer>Yes</answer>"
  }
]
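
If you generate these files programmatically, a minimal sketch along the following lines works, assuming you already hold per-image annotations (a reasoning caption plus, for defective images, the defect location and type). The build_entries helper is hypothetical and not part of the repository:

import json

QUESTION = "Are there any defects in the query image?"

# Hypothetical builder for the two formats above; not shipped with the repo.
# Each sample is assumed to look like:
#   {"id": "0001", "image": "path.png", "defect": bool,
#    "think": "...", "location": "...", "type": "..."}
def build_entries(samples):
    sft, grpo = [], []
    for s in samples:
        if s["defect"]:
            answer = ("<think>{think}</think><location>{location}</location>"
                      "<type>{type}</type><answer>Yes</answer>").format(**s)
            solution = ("<location>{location}</location>"
                        "<type>{type}</type><answer>Yes</answer>").format(**s)
        else:
            answer = "<think>{think}</think><answer>No</answer>".format(**s)
            solution = "<answer>No</answer>"
        sft.append({"images": s["image"], "messages": [
            {"role": "user", "content": "<image>\n" + QUESTION},
            {"role": "assistant", "content": answer},
        ]})
        grpo.append({"id": s["id"], "image": s["image"],
                     "problem": QUESTION, "solution": solution})
    return sft, grpo

sft, grpo = build_entries([{"id": "0001", "image": "images/bottle_000.png",
                            "defect": False,
                            "think": "The surface appears uniform and intact."}])
with open("pa_sft.json", "w") as f:
    json.dump(sft, f, indent=2)
with open("sc_grpo.json", "w") as f:
    json.dump(grpo, f, indent=2)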
  • We adopt a zero-shot setup during training. If you want a one-shot training setup instead, construct the data as shown below (a conversion sketch appears after the note following these examples).

For PA-SFT training:

[
  {
      "images": [
        "ref_image_path",   
        "query_image_path" 
      ],
      "messages": [
        {
          "role": "user",
          "content": "'few-shot prompt' + <image>\n<image>\nAre there any defects in the query image?"
        },
        {
          "role": "assistant",
          "content": "<think>[Thinking process]</think><location>[Location]</location><type>[Type]</type><answer>Yes</answer>"
        }
      ]
  }
]

For SC-GRPO training:

[
  {
      "id": "xxx",
      "image": [
          "ref_image_path", 
          "query_image_path"
      ],
      "problem": "Are there any defects in the query image?",
      "solution": "<location>[Location]</location><type>[Type]</type><answer>Yes</answer>"
  }
]

Don't forget to set '--single_img 0' in the SC-GRPO stage scripts so that the prompts for the 1-shot setting are used.
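
For the SC-GRPO side, turning zero-shot entries into 1-shot entries amounts to pairing each query with a defect-free reference image from the same category (for PA-SFT, additionally prepend the few-shot prompt and a second <image> token as shown above). The helper below is an assumed illustration; reference_for is a lookup you would supply yourself.

# Hypothetical zero-shot -> 1-shot conversion for SC-GRPO entries; not part
# of the repository. `reference_for` maps a query image to a defect-free
# reference image from the same category.
def to_one_shot(entries, reference_for):
    return [{
        "id": e["id"],
        "image": [reference_for(e["image"]), e["image"]],  # [reference, query]
        "problem": e["problem"],
        "solution": e["solution"],
    } for e in entries]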

Inference data

  • Download the test data we compiled and organize it according to the following format.
# Inference images
Industrial_test
├── DAGM/
│   ├── Class1
│   ├── Class2
│   └── ...
├── DS-MVTec/
│   ├── bottle
│   ├── cable
│   └── ...
├── DTD/
│   ├── Blotchy_099
│   ├── Fibrous_183
│   └── ...
├── MPDD/
│   ├── bracket_black
│   ├── bracket_brown
│   └── ...
├── SDD/
│   └── test
└── VisA/
    ├── candle
    ├── capsules
    └── ...

# Inference json
data/Test/
├── test_DAGM_format.json
├── test_MVTec_format.json
├── test_DTD_format.json
├── test_MPDD_format.json
├── test_SDD_format.json
└── test_VisA_format.json
  • Construct your test dataset.

Prepare a meta.json file for your dataset:

{
    "category1":
    [
        {
            "img_path": "your image path",
            "mask_path": "",
            "cls_name": "object1",
            "specie_name": "good",
            "anomaly": 0
        },
        {
            "img_path": "your image path",
            "mask_path": "",
            "cls_name": "object2",
            "specie_name": "good",
            "anomaly": 0
        }
    ]
}
# Convert meta.json into the unified test JSON format.
cd Industrial_test
python convert.py
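
If you prefer to generate meta.json automatically, a directory walk along these lines works, assuming a <category>/<specie>/<image> layout in which the specie folder named "good" holds normal samples. The builder below is a sketch under those assumptions, not part of the repository:

import json
from pathlib import Path

# Sketch of a meta.json builder; assumes a <category>/<specie>/<image> layout
# with the specie folder "good" marking normal samples. Adjust the glob
# pattern and the mask lookup to match your data.
def build_meta(root):
    meta = {}
    for category in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        entries = []
        for img in sorted(category.rglob("*.png")):
            specie = img.parent.name
            entries.append({
                "img_path": str(img),
                "mask_path": "",  # fill in if ground-truth masks exist
                "cls_name": category.name,
                "specie_name": specie,
                "anomaly": 0 if specie == "good" else 1,
            })
        meta[category.name] = entries
    return meta

with open("meta.json", "w") as f:
    json.dump(build_meta("your_dataset_root"), f, indent=4)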

🧠 Training

Model Preparation

Download these models into your local project (final_model/Pretrain).

Perception Activation Supervised Fine-Tuning Stage Scripts

# PA-SFT Stage

## LLaVA-OneVision-SI-0.5B
bash scripts/train/PA_SFT/PA_SFT_LLaVA_OneVision_SI_0.5B.sh

## Qwen2-VL-Instruct
bash scripts/train/PA_SFT/PA_SFT_Qwen_Instruct_2_VL.sh

## Qwen2.5-VL-Instruct-3B
bash scripts/train/PA_SFT/PA_SFT_Qwen_Instruct_2_5_VL_3B.sh

## Qwen2.5-VL-Instruct-7B
bash scripts/train/PA_SFT/PA_SFT_Qwen_Instruct_2_5_VL_7B.sh

## LLaVA-1.5
bash scripts/train/PA_SFT/PA_SFT_LLaVA_1_5.sh

## LLaVA-OneVision-SI-7B
bash scripts/train/PA_SFT/PA_SFT_LLaVA_OneVision_SI_7B.sh

## LLaVA-1.6 (LLaVA-NeXT)
bash scripts/train/PA_SFT/PA_SFT_LLaVA_1_6.sh

Structured Control Group Relative Policy Optimization Stage Scripts

# SC-GRPO Stage

## LLaVA-OneVision-SI-0.5B
bash scripts/train/SC_GRPO/SC_GRPO_LLaVA_OneVision_SI_0.5B.sh

## Qwen2-VL-Instruct
bash scripts/train/SC_GRPO/SC_GRPO_Qwen_Instruct_2_VL.sh

## Qwen2.5-VL-Instruct-3B
bash scripts/train/SC_GRPO/SC_GRPO_Qwen_Instruct_2_5_VL_3B.sh

## Qwen2.5-VL-Instruct-7B
bash scripts/train/SC_GRPO/SC_GRPO_Qwen_Instruct_2_5_VL_7B.sh

## LLaVA-1.5
bash scripts/train/SC_GRPO/SC_GRPO_LLaVA_1_5.sh

## LLaVA-OneVision-SI-7B
bash scripts/train/SC_GRPO/SC_GRPO_LLaVA_OneVision_SI_7B.sh

## LLaVA-1.6 (LLaVA-NeXT)
bash scripts/train/SC_GRPO/SC_GRPO_LLaVA_1_6.sh

📊 Inference

  • Commercial Model Inference Scripts.
# Commercial Model Inference

## OpenAI
### GPT 4.1
bash scripts/Inference/Commercial-Inference/ChatGPT_4.1_Inference.sh

### GPT 4.1-mini
bash scripts/Inference/Commercial-Inference/ChatGPT_4.1_mini_Inference.sh

### GPT 4.1-nano
bash scripts/Inference/Commercial-Inference/ChatGPT_4.1_nano_Inference.sh

### GPT 4o
bash scripts/Inference/Commercial-Inference/Chatgpt_4o_Inference.sh

### GPT 4o-mini
bash scripts/Inference/Commercial-Inference/ChatGPT_4o_mini_Inference.sh

## Claude-Sonnet-4
bash scripts/Inference/Commercial-Inference/Claude_Sonnet_4_Inference.sh

## Qwen-VL-MAX
bash scripts/Inference/Commercial-Inference/Qwen_VL_MAX_Inference.sh
  • Pretrained Model Inference Scripts.
# Pretrain Model Inference

## LLaVA-OneVision-SI-0.5B
bash scripts/Inference/Pretrain-Inference/vLLM_LLaVA_OneVision_SI_0_5_B_Inference.sh

## Qwen2-VL-Instruct-2B
bash scripts/Inference/Pretrain-Inference/vLLM_Qwen2_VL_Instruct_2B_Inference.sh

## Qwen2.5-VL-Instruct-3B
bash scripts/Inference/Pretrain-Inference/vLLM_Qwen2_5_VL_Instruct_3B_Inference.sh

## InternVL-2.5-4B
bash scripts/Inference/Pretrain-Inference/vLLM_InternVL_2_5_4B_Inference.sh

## Qwen2.5-VL-Instruct-7B
bash scripts/Inference/Pretrain-Inference/vLLM_Qwen2_5_VL_Instruct_7B_Inference.sh

## LLaVA-OneVision-SI-7B
bash scripts/Inference/Pretrain-Inference/vLLM_LLaVA_OneVision_SI_7B_Inference.sh

## LLaVA-1.5-7B
bash scripts/Inference/Pretrain-Inference/vLLM_LLaVA_1_5_7B_Inference.sh

## LLaVA-1.6(mistral)-8B
bash scripts/Inference/Pretrain-Inference/vLLM_LLaVA_1_6_8B_Inference.sh

## LLaVA-1.5-13B
bash scripts/Inference/Pretrain-Inference/vLLM_LLaVA_1_5_13B_Inference.sh

## LLaVA-1.6-34B
bash scripts/Inference/Pretrain-Inference/vLLM_LLaVA_1_6_34B_Inference.sh

## Qwen2.5-VL-Instruct-72B
bash scripts/Inference/Pretrain-Inference/vLLM_Qwen2_5_VL_Instruct_72B_Inference.sh
  • IAD-R1 Model Inference Scripts.
# IAD-R1 Model Inference 

## LLaVA-OneVision-SI-0.5B
bash scripts/Inference/IAD-R1-Inference/vLLM_LLaVA_OneVision_SI_0_5_B_Inference.sh

## Qwen2-VL-Instruct-2B
bash scripts/Inference/IAD-R1-Inference/vLLM_Qwen2_VL_Instruct_Inference.sh

## Qwen2.5-VL-Instruct-3B
bash scripts/Inference/IAD-R1-Inference/vLLM_Qwen2_5_VL_Instruct_3B_Inference.sh

## Qwen2.5-VL-Instruct-7B
bash scripts/Inference/IAD-R1-Inference/vLLM_Qwen2_5_VL_Instruct_7B_Inference.sh

## LLaVA-1.5-7B
bash scripts/Inference/IAD-R1-Inference/vLLM_LLaVA_1_5_Inference.sh

## LLaVA-OneVision-SI-7B
bash scripts/Inference/IAD-R1-Inference/vLLM_LLaVA_OneVision_SI_7B_Inference.sh

## LLaVA-1.6(mistral)-8B
bash scripts/Inference/IAD-R1-Inference/vLLM_LLaVA_1_6_Inference.sh
  • Open-source Model Inference Script.
# Open-source Model Inference

## Anomaly-R1
bash scripts/Inference/Anomaly-R1-Inference/Anomaly-R1-Inference.sh
