Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

ICML 2026 | Official Implementation

Jiaqi Tang★, Jianmin Chen★, Youyang Zhai★, Wei Wei‡, Runtao Liu, Mengjie Zhao, Xiangyu Wu, Qingfa Xiao, Qifeng Chen†

★ Equal contribution   † Corresponding author   ‡ Co-corresponding author


Paper | arXiv | Models | Demo | License: MIT

TL;DR: Robust-U1 is a unified MLLM that self-recovers corrupted visual content and reasons over it, enabling robust visual understanding under real-world image degradations.


📰 News

  • 2026-06-11 🔥 We release the code, pretrained models, and the online demo of Robust-U1!
  • 2026-05-07 🎉 Robust-U1 is accepted to ICML 2026!

📑 Table of Contents

🔭 Motivation · 📦 Installation · 🤖 Models · 💻 Demo · 🧠 Training · 📊 Evaluation · ⭐ Citation · 📬 Contact


🔭 Motivation

Existing approaches to robust visual understanding face two key limitations:

  • 🚩 Black-Box Alignment β€” Feature-alignment methods lack interpretability and fail to explicitly model the corruption process.
  • 🚩 Text-Only Compensation β€” Text-based reasoning cannot recover lost pixel-level visual details for faithful visual understanding.

This motivates a key question: Can MLLMs recover corrupted visual content by themselves?

Figure: Motivation overview.

📦 Installation

1. Clone the repository

git clone https://github.com/jqtangust/Robust-U1.git
cd Robust-U1

2. Create the environment

conda create -n Robust-U1 python=3.10
conda activate Robust-U1
pip install -r requirements.txt
pip install -e .

🤖 Models

| Model | Link | Description |
| --- | --- | --- |
| BAGEL-7B-MoT | ByteDance-Seed/BAGEL-7B-MoT | Base model used as the initial weights for training. |
| Robust-U1 | Jiaqi-hkust/Robust-U1 | Final model for visual self-recovery and multimodal reasoning. |
| Robust-U1-SFT | Jiaqi-hkust/Robust-U1-SFT | Stage-I supervised fine-tuned checkpoint. |
| Robust-U1-RL | Jiaqi-hkust/Robust-U1-RL | Stage-II reinforcement-learning checkpoint. |
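
The checkpoints listed above can be fetched ahead of time. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that the entries in the Link column are Hugging Face Hub repository IDs. The short names, the `checkpoints` folder, and the helpers `local_dir_for` and `download` are illustrative and not part of this repository.

```python
from pathlib import Path

# Repository IDs copied from the Models table; the short keys are hypothetical.
CHECKPOINTS = {
    "base": "ByteDance-Seed/BAGEL-7B-MoT",
    "final": "Jiaqi-hkust/Robust-U1",
    "sft": "Jiaqi-hkust/Robust-U1-SFT",
    "rl": "Jiaqi-hkust/Robust-U1-RL",
}


def local_dir_for(repo_id: str, root: str = "checkpoints") -> Path:
    """Map a Hub repo ID to a local folder, e.g. checkpoints/Robust-U1."""
    return Path(root) / repo_id.split("/")[-1]


def download(name: str, root: str = "checkpoints") -> Path:
    """Download one checkpoint by its short name and return the local path."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    repo_id = CHECKPOINTS[name]
    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download("final")` should return a folder that can be used as the local model path for the demos, and `download("base")` one that can serve as the initial weights for training.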

💻 Demo

🌐 Online demo: try Robust-U1 directly on Hugging Face Spaces.

🖥️ CLI

Run the command-line demo with a local model path and an output directory for recovered images:

export MODEL_PATH="/path/to/Robust-U1"
export OUTPUT_DIR="./outputs"

python demo.py \
  --model-path "$MODEL_PATH" \
  --output-dir "$OUTPUT_DIR"

🪟 GUI

Set the model path and start the local Gradio demo (available at http://localhost:7860 by default):

export MODEL_PATH="/path/to/Robust-U1"
python app.py --model-path "$MODEL_PATH"

Figure: Robust-U1 demo.

🧠 Training

Robust-U1 is trained with a three-stage pipeline:

Stage Goal Framework
I. Visual Self-Recovery Recover clean images from corrupted inputs (SFT) MathCanvas
II. Visual Quality Alignment Align recovery with pixel-level fidelity & semantics (RL) Flow-GRPO
III. Multimodal Reasoning Reason over corrupted & recovered images MathCanvas

🎓 Stage I & III: Self-Recovery & Reasoning

We use MathCanvas for both supervised fine-tuning and multimodal reasoning training. Stage I adapts the base unified MLLM to recover clean images from corrupted inputs, while Stage III trains the model to reason over both corrupted and recovered images.

  1. Prepare the MathCanvas training framework:

    git clone https://github.com/shiwk24/MathCanvas.git
    cd MathCanvas/BAGEL-Canvas
  2. Download the base model BAGEL-7B-MoT.

  3. Prepare the training data:

    • For Stage I, prepare paired corrupted-clean image data for visual self-recovery.
    • For Stage III, prepare reasoning data with corrupted images, recovered images, questions, and reasoning-chain annotations.
  4. Modify the dataset paths in data/dataset_info.py and configure the corresponding training scripts with your local paths.

  5. Run Stage-I supervised fine-tuning to obtain the SFT checkpoint:

    bash scripts/train/stage1.sh
  6. After completing Stage-II reinforcement learning (see the Stage II section below), run Stage-III multimodal reasoning training:

    bash scripts/train/stage2.sh
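
Stage I relies on paired corrupted-clean images, so it can help to confirm that every corrupted image has a clean counterpart before training. The sketch below assumes, purely for illustration, that pairs share the same relative path under two separate folders. That layout, the suffix list, and the helper names are hypothetical and not prescribed by this repository or by MathCanvas.

```python
from pathlib import Path

# Hypothetical set of image extensions to consider.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def relative_images(root: Path) -> set[str]:
    """Collect image paths under `root`, relative to it, as POSIX strings."""
    return {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    }


def check_pairs(corrupted_dir: str, clean_dir: str) -> tuple[list[str], list[str], list[str]]:
    """Return (paired, missing_clean, missing_corrupted) relative paths."""
    corrupted = relative_images(Path(corrupted_dir))
    clean = relative_images(Path(clean_dir))
    return (
        sorted(corrupted & clean),   # usable training pairs
        sorted(corrupted - clean),   # corrupted images without a clean target
        sorted(clean - corrupted),   # clean images without a corrupted input
    )
```

If the second or third list is non-empty, the corresponding files would need to be fixed or dropped before the dataset paths are configured.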

🎓 Stage II: Visual Quality Alignment (RL)

We use Flow-GRPO to further align the recovery model with pixel-level structural fidelity and semantic consistency. The Robust-U1 rewards are packaged in rewards/ and can be registered directly in Flow-GRPO.
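
To make the idea of a reference-based reward concrete, here is a minimal sketch that scores a restored image against its clean reference with PSNR and then combines several scores by a weighted sum. This is not the reward shipped in rewards/: the PSNR choice, the 50 dB normalization, the weighted-sum combination, and the names `psnr_reward` and `combine_rewards` are all assumptions made for illustration.

```python
import numpy as np


def psnr_reward(restored: np.ndarray, reference: np.ndarray, max_db: float = 50.0) -> float:
    """Map the PSNR between two uint8 images to a reward in [0, 1]."""
    diff = restored.astype(np.float64) - reference.astype(np.float64)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return 1.0  # identical images get the maximum reward
    psnr = 10.0 * np.log10(255.0 ** 2 / mse)
    # max_db is a hypothetical normalization constant, not a repository setting.
    return float(np.clip(psnr / max_db, 0.0, 1.0))


def combine_rewards(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum over the reward names listed in `weights` (assumed scheme)."""
    return sum(weights[name] * scores[name] for name in weights)
```

A pixel-level term like this only captures fidelity; a semantic term would need a separate model-based score.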

  1. Prepare Flow-GRPO and expose Robust-U1 rewards:

    git clone https://github.com/yifan123/flow_grpo.git
    cd flow_grpo
  2. Register the Robust-U1 reward adapter in flow_grpo/rewards.py:

    from rewards import FLOW_GRPO_REFERENCE_REWARD_NAMES, register_flow_grpo_rewards
    
    # After Flow-GRPO builds score_functions, add the Robust-U1 rewards to it.
    register_flow_grpo_rewards(score_functions)
    
    # In the branch chain that dispatches on score_name, add this case:
    # reference-based rewards are scored against the clean target images.
    elif score_name in FLOW_GRPO_REFERENCE_REWARD_NAMES:
        scores, rewards = score_fns[score_name](images, ref_images)
  3. Prepare restoration data with corrupted images and clean references. Each JSONL record should contain:

    {"prompt": "Please restore this corrupted image to its clean version.", "image": "corrupted/000001.png", "target_image": "clean/000001.png"}
  4. Configure config/grpo.py:

    config.dataset = "/path/to/dataset/restoration"
    config.pretrained.model = "/path/to/Robust-U1-SFT"
    config.reward_fn = {
        "restoration": 1.0,
        "tinyclip": 0.2,
    }
  5. Run reinforcement learning:

    bash scripts/multi_node/bagel/main.sh 0

    The launcher should point to the restoration config, for example:

    accelerate launch --config_file scripts/accelerate_configs/fsdp.yaml \
      --num_processes 8 \
      scripts/train_bagel.py \
      --config config/grpo.py:restoration_bagel
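
The JSONL format from step 3 can be produced with a few lines of Python. This is a sketch only: the prompt text and the three keys come from the example record above, while the `corrupted/` and `clean/` prefixes, the shared file names, and the helpers `build_records` and `write_jsonl` are assumptions.

```python
import json

# Prompt copied from the example record in step 3.
PROMPT = "Please restore this corrupted image to its clean version."


def build_records(names: list[str], corrupted_prefix: str = "corrupted",
                  clean_prefix: str = "clean") -> list[dict]:
    """Build one record per file name, assuming pairs share the same name."""
    return [
        {
            "prompt": PROMPT,
            "image": f"{corrupted_prefix}/{name}",
            "target_image": f"{clean_prefix}/{name}",
        }
        for name in names
    ]


def write_jsonl(records: list[dict], path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

For example, `write_jsonl(build_records(["000001.png"]), "train.jsonl")` writes a single line matching the record shown in step 3. The output file name here is arbitrary.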

📊 Evaluation

We use VLMEvalKit for anti-degradation evaluation.

  1. Clone the VLMEvalKit repository and install dependencies:

    git clone https://github.com/open-compass/VLMEvalKit.git
    cd VLMEvalKit
    pip install -e .
  2. Prepare the evaluation datasets according to VLMEvalKit requirements.

  3. Image Degradation Pipeline: generate corrupted images for robustness evaluation.

    Navigate to the degradation pipeline directory and process images:

    cd add_degradation
    python generate_pipeline_open_source.py --input_dir <input_dir> --output_base_dir <output_base_dir> --dataset_name <dataset_name> --verbose

    The script will generate three output directories with different degradation intensities for each image.

  4. Configure the model path and evaluation settings in the VLMEvalKit configuration file.

  5. Run the evaluation command:

    python run.py --model <your_model_name_or_path> --data <dataset_name>
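
For intuition about what step 3 produces, the sketch below degrades one image at three intensities using additive Gaussian noise. It is not the logic of generate_pipeline_open_source.py: the noise model, the three sigma values, the level names, and the helper names are hypothetical stand-ins chosen for illustration.

```python
import numpy as np

# Hypothetical noise levels (standard deviation on a 0-255 scale).
NOISE_LEVELS = {"low": 5.0, "mid": 15.0, "high": 30.0}


def add_gaussian_noise(image: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise to a uint8 image and clip back to [0, 255]."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)


def degrade_all(image: np.ndarray) -> dict[str, np.ndarray]:
    """Return one degraded copy of `image` per intensity level."""
    return {name: add_gaussian_noise(image, sigma) for name, sigma in NOISE_LEVELS.items()}
```

Each returned array could then be saved under a folder named after its level, giving one directory per intensity.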

🔬 R-Bench Evaluation

We use R-Bench to assess model performance under real-world corruptions.

  1. Clone the R-Bench repository:

    git clone https://github.com/Q-Future/R-Bench.git
  2. Evaluate using VLMEvalKit with the R-Bench dataset:

    cd VLMEvalKit
    python run.py --data R-Bench-Dis --model <your_model_name_or_path> --verbose
  3. For full dataset evaluation, follow the R-Bench pipeline as described in the R-Bench repository.


⭐ Citation

If you find this repository useful, please consider citing our paper:

@inproceedings{tang2026robustu1,
      title={Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?},
      author={Tang, Jiaqi and Chen, Jianmin and Zhai, Youyang and Wei, Wei and Liu, Runtao and Zhao, Mengjie and Wu, Xiangyu and Xiao, Qingfa and Chen, Qifeng},
      booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
      year={2026},
}

📬 Contact

For questions about the paper or code, feel free to open a GitHub issue or reach out:


🀝 Acknowledgements

We thank the authors of BAGEL, MathCanvas, and Flow-GRPO for their excellent open-source contributions.
