The code repository for "Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering" in PyTorch.
📣 Published as a conference paper at ICCV 2025
- Abstract
- Key Features
- Getting Started
- Dataset Preparation
- Repository Structure
- Training
- Evaluation
- Citation
- Acknowledgments
- License
Continual Learning in Visual Question Answering (VQACL) requires models to acquire new visual-linguistic skills (plasticity) while preserving previously learned knowledge (stability). The inherent multimodality of VQACL exacerbates this challenge, as models must balance stability across visual and textual domains while adapting to novel objects and reasoning tasks. Existing methods, primarily designed for unimodal settings, often fall short in addressing this dual requirement.
In this work, we present QUestion-only replay with Attention Distillation (QUAD), a novel approach for VQACL that leverages only past-task questions for regularization. By eliminating the need to store visual data, QUAD not only reduces memory overhead but also alleviates privacy concerns. Our method introduces a Question-only Replay mechanism that selectively reuses prior-task questions to counteract overfitting to the answer space of the current task, addressing the out-of-answer-set problem. Complementing this, we propose Attention Consistency Distillation to enforce both intra-modal and inter-modal attention consistency across tasks, preserving essential visual-linguistic associations.
Extensive experiments on VQAv2 and NExT-QA demonstrate that QUAD significantly outperforms state-of-the-art methods, achieving robust performance in continual VQA.
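As a rough illustration of how the two components combine, here is a minimal PyTorch sketch. It is not the repository implementation (which is generative and lives in `VL-T5/src/models/ours.py`, with the modified backbone in `VL-T5/src/backbones/modeling_t5_our.py`); it assumes classification-style answer logits for the replayed questions and per-layer attention maps from the current model and a frozen previous-task snapshot.

```python
# Illustrative sketch of QUAD's two extra loss terms (not the repo implementation).
import torch
import torch.nn.functional as F

def attention_consistency_loss(curr_attn, prev_attn):
    """L1 consistency between attention maps of the current model and a frozen
    previous-task snapshot (the 'l1_reg_attention' idea). Each list holds one
    tensor per layer of shape (batch, heads, query_len, key_len)."""
    return sum(F.l1_loss(c, p.detach()) for c, p in zip(curr_attn, prev_attn)) / len(curr_attn)

def quad_loss(vqa_loss, replay_logits, replay_targets,
              curr_attn, prev_attn, weight_ce=0.1, lwf_lambda=0.1):
    """Current-task VQA loss
       + weight_ce  * CE on replayed past-task questions (no stored images)
       + lwf_lambda * attention consistency distillation."""
    replay_ce = F.cross_entropy(replay_logits, replay_targets)
    attn_dist = attention_consistency_loss(curr_attn, prev_attn)
    return vqa_loss + weight_ce * replay_ce + lwf_lambda * attn_dist

# Toy shapes for illustration: 2 layers, batch 4, 8 heads, 20x20 attention.
curr = [torch.rand(4, 8, 20, 20, requires_grad=True) for _ in range(2)]
prev = [torch.rand(4, 8, 20, 20) for _ in range(2)]
logits = torch.randn(4, 100, requires_grad=True)     # 100 candidate answers (classification-style for brevity)
targets = torch.randint(0, 100, (4,))
print(quad_loss(torch.tensor(1.0), logits, targets, curr, prev).item())
```

The weight names `weight_ce` and `lwf_lambda` mirror the training flags documented in the hyperparameter table below.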
- Python 3.7+
- PyTorch 1.6.0+
- CUDA 10.2 or higher (for GPU support)
- 16GB+ RAM recommended
- GPU with 11GB+ VRAM recommended for training
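A quick sanity check (a hypothetical helper, not part of the repository) can confirm that your environment meets the requirements above:

```python
# check_env.py -- quick environment check for the requirements above (not shipped with the repo).
import torch

print(f"PyTorch version : {torch.__version__}")
print(f"CUDA available  : {torch.cuda.is_available()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU             : {props.name}")
    print(f"VRAM            : {props.total_memory / 1024**3:.1f} GB (11+ GB recommended)")
```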
Set up the environment with either conda or a plain Python virtual environment.

Option 1 (conda):

```bash
# Clone the repository
git clone https://github.com/yourusername/QUAD.git
cd QUAD

# Create conda environment from environment.yml
conda env create -f environment.yml
conda activate vqa

# Install additional dependencies
pip install -r requirements.txt

# Download pretrained backbone models (T5-base, BART-base)
python download_backbones.py
```

Option 2 (virtual environment):

```bash
# Clone the repository
git clone https://github.com/yourusername/QUAD.git
cd QUAD

# Create a virtual environment
python3 -m venv quad_env
source quad_env/bin/activate  # On Windows: quad_env\Scripts\activate

# Install dependencies
pip install torch==1.6.0 torchvision==0.7.0
pip install -r requirements.txt

# Download pretrained backbone models
python download_backbones.py
```

- Download the VQACL partition of VQA v2 from Google Drive and put it into `datasets/vqa/Partition_Q`.
- Download the VQACL partition of NExT-QA from Google Drive and put it into `datasets/nextqa/Partition_Q`.
- Download `datasets/COCO` from Google Drive.
- Download the video features of NExT-QA from Google Drive and put them into `datasets/nextqa/`.
Directory structure after preparation:
```
datasets/
├── COCO/
│   ├── images/
│   │   ├── train2014/
│   │   └── val2014/
│   └── features/
│       ├── train2014_obj36.h5
│       └── val2014_obj36.h5
├── vqa/
│   └── Partition_Q/
│       ├── G-1/
│       ├── G-2/
│       └── ...
└── nextqa/
    ├── Partition_Q/
    │   ├── G-1/
    │   └── ...
    └── video_features/
```
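After downloading, a small check script such as the following (illustrative only, not shipped with the repository) can confirm that the expected layout is in place before training:

```python
# verify_datasets.py -- checks the layout shown above (illustrative helper, not part of the repo).
from pathlib import Path

ROOT = Path("datasets")
EXPECTED = [
    ROOT / "COCO" / "images" / "train2014",
    ROOT / "COCO" / "images" / "val2014",
    ROOT / "COCO" / "features" / "train2014_obj36.h5",
    ROOT / "COCO" / "features" / "val2014_obj36.h5",
    ROOT / "vqa" / "Partition_Q",
    ROOT / "nextqa" / "Partition_Q",
    ROOT / "nextqa" / "video_features",
]

missing = [p for p in EXPECTED if not p.exists()]
if missing:
    print("Missing paths:")
    for p in missing:
        print(f"  {p}")
else:
    print("Dataset layout looks complete.")
```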
Repository layout:

```
QUAD/
├── VL-T5/                         # Main codebase
│   ├── src/                       # VQAv2 experiments
│   │   ├── models/                # Model implementations
│   │   │   ├── ours.py            # QUAD model
│   │   │   ├── ours_ce_buffer.py  # QUAD with CE buffer
│   │   │   └── vqacl.py           # Baseline methods
│   │   ├── backbones/             # Backbone architectures
│   │   │   ├── modeling_t5_our.py # Modified T5 for QUAD
│   │   │   └── ...
│   │   ├── vqa_data.py            # VQA data loading
│   │   ├── vqa_model.py           # VQA model base
│   │   └── param.py               # Configuration parameters
│   ├── nextqa/                    # NExT-QA experiments
│   │   ├── models/                # NExT-QA models
│   │   │   ├── ours.py            # QUAD for NExT-QA
│   │   │   ├── feat_dist.py       # Feature distillation baseline
│   │   │   └── ...
│   │   ├── scripts/               # Training scripts
│   │   │   └── ours.sh            # QUAD training script
│   │   └── nextqa_data.py         # NExT-QA data loading
│   └── scripts/                   # VQAv2 training scripts
│       ├── VQACL_train.sh         # Standard training
│       ├── VQACL_COMP_train.sh    # Composition training
│       ├── vqacl.sh               # Standard evaluation
│       └── VQACL_COMP.sh          # Composition evaluation
├── feature_extraction/            # Feature extraction tools
│   ├── README.md                  # Feature extraction guide
│   ├── coco_proposal.py           # COCO feature extraction
│   └── ...
├── datasets/                      # Dataset storage (not in repo)
├── requirements.txt               # Python dependencies
├── environment.yml                # Conda environment
├── download_backbones.py          # Download pretrained models
└── Question_type.py               # Question type utilities
```
To train QUAD on VQAv2:

```bash
cd VL-T5

# Train QUAD on VQAv2 with 1 GPU
bash scripts/VQACL_train.sh 1

# Train with multiple GPUs (e.g., 4 GPUs)
bash scripts/VQACL_train.sh 4
```

To train for compositional-generalization evaluation (holding out one group):
```bash
cd VL-T5

# Train for composition testing (Group-1 held out)
bash scripts/VQACL_COMP_train.sh 1
```

To train QUAD and the baselines on NExT-QA:

```bash
cd VL-T5/nextqa

# Train QUAD on NExT-QA
bash scripts/ours.sh

# Other baseline methods
bash scripts/buffer_ce.sh   # Buffer with CE loss
bash scripts/feat_dist.sh   # Feature distillation
bash scripts/mas.sh         # Memory Aware Synapses
```

Key hyperparameters in the training scripts:
| Parameter | Description | Default Value |
|---|---|---|
| `--m_size` | Memory buffer size (number of questions) | 5000 |
| `--lwf_lambda` | Weight for attention distillation loss | 0.1 |
| `--weight_ce` | Weight for question-only replay CE loss | 0.1 |
| `--type_dist` | Distillation type | `l1_reg_attention` |
| `--backbone` | Backbone model architecture | `t5-base` |
| `--batch_size` | Training batch size | 40 |
| `--epochs` | Number of epochs per task | 3 |
| `--lr` | Learning rate | 1e-4 |
| `--comp_cate` | Composition category (for comp. eval) | G-1 |
Distillation types:
- `l1_reg_attention`: L1 regularization on attention weights
- `ce_l1_reg_attention`: Combined CE loss with L1 attention regularization
- `asymetric_reg_attention`: Asymmetric attention regularization
To customize training, edit the corresponding shell script in VL-T5/scripts/ or VL-T5/nextqa/scripts/.
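For reference, the flags in the table roughly correspond to argparse options like the sketch below; names and defaults are taken from the table above, while the authoritative definitions live in `VL-T5/src/param.py`.

```python
# Sketch of how the documented flags could be declared with argparse
# (illustrative; see VL-T5/src/param.py for the real definitions).
import argparse

parser = argparse.ArgumentParser(description="QUAD training options (subset)")
parser.add_argument("--m_size", type=int, default=5000,
                    help="memory buffer size (number of questions)")
parser.add_argument("--lwf_lambda", type=float, default=0.1,
                    help="weight of the attention distillation loss")
parser.add_argument("--weight_ce", type=float, default=0.1,
                    help="weight of the question-only replay CE loss")
parser.add_argument("--type_dist", type=str, default="l1_reg_attention",
                    choices=["l1_reg_attention", "ce_l1_reg_attention", "asymetric_reg_attention"],
                    help="attention distillation variant")
parser.add_argument("--backbone", type=str, default="t5-base")
parser.add_argument("--batch_size", type=int, default=40)
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--lr", type=float, default=1e-4)
parser.add_argument("--comp_cate", type=str, default="G-1",
                    help="held-out composition group for compositional evaluation")

args = parser.parse_args([])  # parse defaults; pass a real argument list to override
print(args)
```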
To evaluate on VQAv2 under the standard continual learning protocol:

```bash
cd VL-T5

# Evaluate on the standard continual learning protocol
bash scripts/vqacl.sh 1
```

The script will:
- Load trained checkpoints from `VL-T5/snap/`
- Evaluate on each task sequentially
- Report average accuracy, forgetting metrics, and per-task performance
To evaluate the trained model on NExT-QA:

```bash
cd VL-T5/nextqa

# Evaluate the trained model
python eval_nextqa_CL.py \
    --checkpoint snap/ours/BEST \
    --test karpathy_test
```

To evaluate compositional generalization (novel skill-concept combinations):
```bash
cd VL-T5

# Test on the held-out composition group
bash scripts/VQACL_COMP.sh 1
```

This evaluates the model's ability to answer questions with visual concepts not seen during training for specific question types.
The evaluation reports:
- Average Accuracy: Overall performance across all tasks
- Forgetting: Performance degradation on previous tasks
- Composition Accuracy: Performance on novel combinations
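Average accuracy and forgetting follow the standard continual-learning definitions; the reference sketch below shows how they are typically computed from a per-task accuracy matrix (the repository's evaluation scripts compute them internally).

```python
# Standard continual-learning metrics (reference sketch only).
import numpy as np

def average_accuracy(acc):
    """acc[i, j] = accuracy on task j after training on task i.
    Average accuracy = mean over all tasks after training on the last task."""
    return float(acc[-1].mean())

def forgetting(acc):
    """Average drop from each earlier task's best accuracy to its final accuracy."""
    n_tasks = acc.shape[0]
    drops = [acc[:-1, j].max() - acc[-1, j] for j in range(n_tasks - 1)]
    return float(np.mean(drops))

# Toy example with 3 tasks:
acc = np.array([[0.70, 0.00, 0.00],
                [0.65, 0.72, 0.00],
                [0.60, 0.68, 0.75]])
print(f"Average accuracy: {average_accuracy(acc):.3f}")   # 0.677
print(f"Forgetting      : {forgetting(acc):.3f}")         # 0.070
```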
If you use this codebase in your research, please cite our paper:
```bibtex
@inproceedings{marouf2025askrememberquestionsonlyreplay,
  title={Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering},
  author={Marouf, Imad Eddine and Tartaglione, Enzo and Lathuilière, Stéphane and van de Weijer, Joost},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}
```

This codebase is heavily based on the VQACL framework by Zhang et al. (CVPR 2023). We extend our sincere gratitude to the authors for their foundational work on Visual Question Answering Continual Learning and for making their code publicly available.
We also acknowledge the following projects that contributed to this work:
- VL-T5: Unified vision-and-language framework by Cho et al.
- VQAv2: Visual Question Answering dataset by Goyal et al.
- NExT-QA: Video question answering benchmark by Xiao et al.
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or issues, please:
- Open an issue on GitHub
- Contact: Imad Eddine Marouf
Note: This is a research project. If you encounter any issues or have suggestions for improvements, we welcome contributions through pull requests or issue reports.
