The code repository for "Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering" in PyTorch.
📣 Published as a conference paper at ICCV 2025
- Abstract
- Key Features
- Getting Started
- Dataset Preparation
- Repository Structure
- Training
- Evaluation
- Citation
- Acknowledgments
- License
Continual Learning in Visual Question Answering (VQACL) requires models to acquire new visual-linguistic skills (plasticity) while preserving previously learned knowledge (stability). The inherent multimodality of VQACL exacerbates this challenge, as models must balance stability across visual and textual domains while adapting to novel objects and reasoning tasks. Existing methods, primarily designed for unimodal settings, often fall short in addressing this dual requirement.
In this work, we present QUestion-only replay with Attention Distillation (QUAD), a novel approach for VQACL that leverages only past-task questions for regularization. By eliminating the need to store visual data, QUAD not only reduces memory overhead but also alleviates privacy concerns. Our method introduces a Question-only Replay mechanism that selectively reuses prior-task questions to counteract overfitting to the answer space of the current task, addressing the out-of-answer-set problem. Complementing this, we propose Attention Consistency Distillation to enforce both intra-modal and inter-modal attention consistency across tasks, preserving essential visual-linguistic associations.
Extensive experiments on VQAv2 and NExT-QA demonstrate that QUAD significantly outperforms state-of-the-art methods, achieving robust performance in continual VQA.
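As a rough illustration of how the two components combine, here is a minimal PyTorch sketch. It is not the repository implementation (which is generative and lives in `VL-T5/src/models/ours.py`, with the modified backbone in `VL-T5/src/backbones/modeling_t5_our.py`); it assumes classification-style answer logits for the replayed questions and per-layer attention maps from the current model and a frozen previous-task snapshot.

```python
# Illustrative sketch of QUAD's two extra loss terms (not the repo implementation).
import torch
import torch.nn.functional as F

def attention_consistency_loss(curr_attn, prev_attn):
    """L1 consistency between attention maps of the current model and a frozen
    previous-task snapshot (the 'l1_reg_attention' idea). Each list holds one
    tensor per layer of shape (batch, heads, query_len, key_len)."""
    return sum(F.l1_loss(c, p.detach()) for c, p in zip(curr_attn, prev_attn)) / len(curr_attn)

def quad_loss(vqa_loss, replay_logits, replay_targets,
              curr_attn, prev_attn, weight_ce=0.1, lwf_lambda=0.1):
    """Current-task VQA loss
       + weight_ce  * CE on replayed past-task questions (no stored images)
       + lwf_lambda * attention consistency distillation."""
    replay_ce = F.cross_entropy(replay_logits, replay_targets)
    attn_dist = attention_consistency_loss(curr_attn, prev_attn)
    return vqa_loss + weight_ce * replay_ce + lwf_lambda * attn_dist

# Toy shapes for illustration: 2 layers, batch 4, 8 heads, 20x20 attention.
curr = [torch.rand(4, 8, 20, 20, requires_grad=True) for _ in range(2)]
prev = [torch.rand(4, 8, 20, 20) for _ in range(2)]
logits = torch.randn(4, 100, requires_grad=True)     # 100 candidate answers (classification-style for brevity)
targets = torch.randint(0, 100, (4,))
print(quad_loss(torch.tensor(1.0), logits, targets, curr, prev).item())
```

The weight names `weight_ce` and `lwf_lambda` mirror the training flags documented in the hyperparameter table below.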
- Python 3.7+
- PyTorch 1.6.0+
- CUDA 10.2 or higher (for GPU support)
- 16GB+ RAM recommended
- GPU with 11GB+ VRAM recommended for training
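A quick sanity check (a hypothetical helper, not part of the repository) can confirm that your environment meets the requirements above:

```python
# check_env.py -- quick environment check for the requirements above (not shipped with the repo).
import torch

print(f"PyTorch version : {torch.__version__}")
print(f"CUDA available  : {torch.cuda.is_available()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU             : {props.name}")
    print(f"VRAM            : {props.total_memory / 1024**3:.1f} GB (11+ GB recommended)")
```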
Set up the environment with either conda or a plain Python virtual environment.

Option 1 (conda):

```bash
# Clone the repository
git clone https://github.com/yourusername/QUAD.git
cd QUAD

# Create conda environment from environment.yml
conda env create -f environment.yml
conda activate vqa

# Install additional dependencies
pip install -r requirements.txt

# Download pretrained backbone models (T5-base, BART-base)
python download_backbones.py
```

Option 2 (virtual environment):

```bash
# Clone the repository
git clone https://github.com/yourusername/QUAD.git
cd QUAD

# Create a virtual environment
python3 -m venv quad_env
source quad_env/bin/activate  # On Windows: quad_env\Scripts\activate

# Install dependencies
pip install torch==1.6.0 torchvision==0.7.0
pip install -r requirements.txt

# Download pretrained backbone models
python download_backbones.py
```

- Download the VQACL partition of VQA v2 from Google Drive and put it into `datasets/vqa/Partition_Q`.
- Download the VQACL partition of NExT-QA from Google Drive and put it into `datasets/nextqa/Partition_Q`.
- Download `datasets/COCO` from Google Drive.
- Download the video features of NExT-QA from Google Drive and put them into `datasets/nextqa/`.
Directory structure after preparation:
```
datasets/
├── COCO/
│   ├── images/
│   │   ├── train2014/
│   │   └── val2014/
│   └── features/
│       ├── train2014_obj36.h5
│       └── val2014_obj36.h5
├── vqa/
│   └── Partition_Q/
│       ├── G-1/
│       ├── G-2/
│       └── ...
└── nextqa/
    ├── Partition_Q/
    │   ├── G-1/
    │   └── ...
    └── video_features/
```
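After downloading, a small check script such as the following (illustrative only, not shipped with the repository) can confirm that the expected layout is in place before training:

```python
# verify_datasets.py -- checks the layout shown above (illustrative helper, not part of the repo).
from pathlib import Path

ROOT = Path("datasets")
EXPECTED = [
    ROOT / "COCO" / "images" / "train2014",
    ROOT / "COCO" / "images" / "val2014",
    ROOT / "COCO" / "features" / "train2014_obj36.h5",
    ROOT / "COCO" / "features" / "val2014_obj36.h5",
    ROOT / "vqa" / "Partition_Q",
    ROOT / "nextqa" / "Partition_Q",
    ROOT / "nextqa" / "video_features",
]

missing = [p for p in EXPECTED if not p.exists()]
if missing:
    print("Missing paths:")
    for p in missing:
        print(f"  {p}")
else:
    print("Dataset layout looks complete.")
```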
Repository layout:

```
QUAD/
├── VL-T5/                         # Main codebase
│   ├── src/                       # VQAv2 experiments
│   │   ├── models/                # Model implementations
│   │   │   ├── ours.py            # QUAD model
│   │   │   ├── ours_ce_buffer.py  # QUAD with CE buffer
│   │   │   └── vqacl.py           # Baseline methods
│   │   ├── backbones/             # Backbone architectures
│   │   │   ├── modeling_t5_our.py # Modified T5 for QUAD
│   │   │   └── ...
│   │   ├── vqa_data.py            # VQA data loading
│   │   ├── vqa_model.py           # VQA model base
│   │   └── param.py               # Configuration parameters
│   ├── nextqa/                    # NExT-QA experiments
│   │   ├── models/                # NExT-QA models
│   │   │   ├── ours.py            # QUAD for NExT-QA
│   │   │   ├── feat_dist.py       # Feature distillation baseline
│   │   │   └── ...
│   │   ├── scripts/               # Training scripts
│   │   │   └── ours.sh            # QUAD training script
│   │   └── nextqa_data.py         # NExT-QA data loading
│   └── scripts/                   # VQAv2 training scripts
│       ├── VQACL_train.sh         # Standard training
│       ├── VQACL_COMP_train.sh    # Composition training
│       ├── vqacl.sh               # Standard evaluation
│       └── VQACL_COMP.sh          # Composition evaluation
├── feature_extraction/            # Feature extraction tools
│   ├── README.md                  # Feature extraction guide
│   ├── coco_proposal.py           # COCO feature extraction
│   └── ...
├── datasets/                      # Dataset storage (not in repo)
├── requirements.txt               # Python dependencies
├── environment.yml                # Conda environment
├── download_backbones.py          # Download pretrained models
└── Question_type.py               # Question type utilities
```
To train QUAD on VQAv2:

```bash
cd VL-T5

# Train QUAD on VQAv2 with 1 GPU
bash scripts/VQACL_train.sh 1

# Train with multiple GPUs (e.g., 4 GPUs)
bash scripts/VQACL_train.sh 4
```

To train for compositional-generalization evaluation (holding out one group):
```bash
cd VL-T5

# Train for composition testing (Group-1 held out)
bash scripts/VQACL_COMP_train.sh 1
```

To train QUAD and the baselines on NExT-QA:

```bash
cd VL-T5/nextqa

# Train QUAD on NExT-QA
bash scripts/ours.sh

# Other baseline methods
bash scripts/buffer_ce.sh   # Buffer with CE loss
bash scripts/feat_dist.sh   # Feature distillation
bash scripts/mas.sh         # Memory Aware Synapses
```

Key hyperparameters in the training scripts:
| Parameter | Description | Default Value |
|---|---|---|
| `--m_size` | Memory buffer size (number of questions) | 5000 |
| `--lwf_lambda` | Weight for attention distillation loss | 0.1 |
| `--weight_ce` | Weight for question-only replay CE loss | 0.1 |
| `--type_dist` | Distillation type | `l1_reg_attention` |
| `--backbone` | Backbone model architecture | `t5-base` |
| `--batch_size` | Training batch size | 40 |
| `--epochs` | Number of epochs per task | 3 |
| `--lr` | Learning rate | 1e-4 |
| `--comp_cate` | Composition category (for comp. eval) | G-1 |
Distillation types:
- `l1_reg_attention`: L1 regularization on attention weights
- `ce_l1_reg_attention`: Combined CE loss with L1 attention regularization
- `asymetric_reg_attention`: Asymmetric attention regularization
To customize training, edit the corresponding shell script in VL-T5/scripts/ or VL-T5/nextqa/scripts/.
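For reference, the flags in the table roughly correspond to argparse options like the sketch below; names and defaults are taken from the table above, while the authoritative definitions live in `VL-T5/src/param.py`.

```python
# Sketch of how the documented flags could be declared with argparse
# (illustrative; see VL-T5/src/param.py for the real definitions).
import argparse

parser = argparse.ArgumentParser(description="QUAD training options (subset)")
parser.add_argument("--m_size", type=int, default=5000,
                    help="memory buffer size (number of questions)")
parser.add_argument("--lwf_lambda", type=float, default=0.1,
                    help="weight of the attention distillation loss")
parser.add_argument("--weight_ce", type=float, default=0.1,
                    help="weight of the question-only replay CE loss")
parser.add_argument("--type_dist", type=str, default="l1_reg_attention",
                    choices=["l1_reg_attention", "ce_l1_reg_attention", "asymetric_reg_attention"],
                    help="attention distillation variant")
parser.add_argument("--backbone", type=str, default="t5-base")
parser.add_argument("--batch_size", type=int, default=40)
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--lr", type=float, default=1e-4)
parser.add_argument("--comp_cate", type=str, default="G-1",
                    help="held-out composition group for compositional evaluation")

args = parser.parse_args([])  # parse defaults; pass a real argument list to override
print(args)
```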
To evaluate on VQAv2 under the standard continual learning protocol:

```bash
cd VL-T5

# Evaluate on the standard continual learning protocol
bash scripts/vqacl.sh 1
```

The script will:
- Load trained checkpoints from `VL-T5/snap/`
- Evaluate on each task sequentially
- Report average accuracy, forgetting metrics, and per-task performance
To evaluate the trained model on NExT-QA:

```bash
cd VL-T5/nextqa

# Evaluate the trained model
python eval_nextqa_CL.py \
    --checkpoint snap/ours/BEST \
    --test karpathy_test
```

To evaluate compositional generalization (novel skill-concept combinations):
```bash
cd VL-T5

# Test on the held-out composition group
bash scripts/VQACL_COMP.sh 1
```

This evaluates the model's ability to answer questions with visual concepts not seen during training for specific question types.
The evaluation reports:
- Average Accuracy: Overall performance across all tasks
- Forgetting: Performance degradation on previous tasks
- Composition Accuracy: Performance on novel combinations
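Average accuracy and forgetting follow the standard continual-learning definitions; the reference sketch below shows how they are typically computed from a per-task accuracy matrix (the repository's evaluation scripts compute them internally).

```python
# Standard continual-learning metrics (reference sketch only).
import numpy as np

def average_accuracy(acc):
    """acc[i, j] = accuracy on task j after training on task i.
    Average accuracy = mean over all tasks after training on the last task."""
    return float(acc[-1].mean())

def forgetting(acc):
    """Average drop from each earlier task's best accuracy to its final accuracy."""
    n_tasks = acc.shape[0]
    drops = [acc[:-1, j].max() - acc[-1, j] for j in range(n_tasks - 1)]
    return float(np.mean(drops))

# Toy example with 3 tasks:
acc = np.array([[0.70, 0.00, 0.00],
                [0.65, 0.72, 0.00],
                [0.60, 0.68, 0.75]])
print(f"Average accuracy: {average_accuracy(acc):.3f}")   # 0.677
print(f"Forgetting      : {forgetting(acc):.3f}")         # 0.070
```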
If you use this codebase in your research, please cite our paper:
```bibtex
@inproceedings{marouf2025askrememberquestionsonlyreplay,
  title={Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering},
  author={Marouf, Imad Eddine and Tartaglione, Enzo and Lathuilière, Stéphane and van de Weijer, Joost},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}
```

This codebase is heavily based on the VQACL framework by Zhang et al. (CVPR 2023). We extend our sincere gratitude to the authors for their foundational work on Visual Question Answering Continual Learning and for making their code publicly available.
We also acknowledge the following projects that contributed to this work:
- VL-T5: Unified vision-and-language framework by Cho et al.
- VQAv2: Visual Question Answering dataset by Goyal et al.
- NExT-QA: Video question answering benchmark by Xiao et al.
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or issues, please:
- Open an issue on GitHub
- Contact: Imad Eddine Marouf
Note: This is a research project. If you encounter any issues or have suggestions for improvements, we welcome contributions through pull requests or issue reports.
