Generating Captions for Visual Stimuli Out of fMRI Scans
By Yoav Tsoran and Roey Shafran
This repository is part of a final project in the Technion's course 046211 - Deep Learning
This project proposes a method for generating a text description of the visual stimuli presented to a subject during an fMRI scan. The method combines two previous works, MinD-Vis and ClipCap. MinD-Vis is used to generate meaningful embeddings for fMRI scans, while ClipCap creates an image embedding that is used as a prefix to the pre-trained GPT2 language model. Our work builds on these methods by applying the ClipCap approach in an fMRI-to-caption setup and training a simple mapping network between the MinD-Vis fMRI encoder embedding space and the GPT2 embedding space. This approach can help improve our understanding of the brain's visual system and explore potential technological applications.
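To make the prefix idea concrete, here is a minimal PyTorch sketch of a mapping network in the spirit of ClipCap's MLP mapper. It is not the code from this repository: the class name `PrefixMapper`, the fMRI embedding size (1024), the prefix length (10), and the hidden width (512) are illustrative assumptions. Only the token-embedding width of 768 for the small GPT2 model is a known value.

```python
import torch
import torch.nn as nn


class PrefixMapper(nn.Module):
    """Hypothetical MLP mapping one fMRI embedding to a GPT2 prefix."""

    def __init__(self, fmri_dim=1024, gpt_dim=768, prefix_len=10, hidden=512):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        self.net = nn.Sequential(
            nn.Linear(fmri_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, gpt_dim * prefix_len),
        )

    def forward(self, fmri_embedding):
        # (batch, fmri_dim) -> (batch, prefix_len, gpt_dim)
        out = self.net(fmri_embedding)
        return out.view(-1, self.prefix_len, self.gpt_dim)


mapper = PrefixMapper()
fmri_embedding = torch.randn(4, 1024)     # stand-in for the encoder output
caption_embeds = torch.randn(4, 12, 768)  # stand-in for GPT2 token embeddings
prefix = mapper(fmri_embedding)
# The prefix is prepended to the caption embeddings before the language model.
gpt_inputs = torch.cat([prefix, caption_embeds], dim=1)
print(prefix.shape, gpt_inputs.shape)
```

The language model then sees the prefix as if it were the first few tokens of the sentence, which is why the mapper's output width has to match the GPT2 embedding width.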
Outlines
- After cloning into the repository, please run:
  conda env create -f environment.yml
  conda activate brain-cap
- As an alternative, using pip you can run:
  pip install -r requirements.yml
Due to size limits, the data and pretrains folders aren't included in this repository and need to be downloaded separately. The data folder includes the fMRI-image datasets used in the MinD-Vis work, the captions for the included images, and the checkpoint files for our model. The data folder structure is as follows:
data
├── BOLD5000
│   ├── BOLD5000_GLMsingle_ROI_betas
│   ├── BOLD5000_Stimuli
│   ├── COCO-captions
│   │   └── annotations
│   ├── CSI1_no_duplicates.pth
│   └── ImageNet-captions
│       └── imagenet_captions.json
└── Checkpoints
The MinD-Vis repository provides download links for the fMRI-image datasets. The data.zip file needs to be extracted to this repository's data folder as shown above. The COCO dataset captions can be downloaded from the official COCO dataset website. The ImageNet dataset captions can be downloaded from the mlfoundations/imagenet-captions GitHub repository. We also provide a download for our Checkpoints and for the MinD-Vis pretrained encoder. The fMRI_encoder_pretrain_metafile.pth file should be copied to the pretrains folder.
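Since a missing folder is an easy mistake after several manual downloads, a short check like the one below can confirm the layout before training. It is a sketch, not part of the repository: the function name `missing_paths` is hypothetical, and the expected entries are copied from the folder tree above. `CSI1_no_duplicates.pth` is left out on the assumption that it is produced later by the dataset script.

```python
from pathlib import Path

# Entries taken from the folder tree in this README, relative to the data folder.
EXPECTED = [
    "BOLD5000/BOLD5000_GLMsingle_ROI_betas",
    "BOLD5000/BOLD5000_Stimuli",
    "BOLD5000/COCO-captions/annotations",
    "BOLD5000/ImageNet-captions/imagenet_captions.json",
    "Checkpoints",
]


def missing_paths(data_root):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [entry for entry in EXPECTED if not (root / entry).exists()]


if __name__ == "__main__":
    # "data" assumes the script is run from the repository root.
    for entry in missing_paths("data"):
        print("missing:", entry)
```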
- To speed up the training we use only one caption for each training sample and save the preprocessed dataset for faster loading.
- A script for creating the dataset file is provided.
- For example, to create the dataset file only for the first subject (CSI1), as used for our training, please run the following line from the code folder:
  python create_dataset_no_dup.py --path ../data/BOLD5000 --save-path ../data/BOLD5000/CSI1_no_duplicates.pth --subjects CSI1 --batch_size 8
- If you saved the BOLD5000 folder at a different location, want to train the model on more subjects, or want to use a larger batch size for preprocessing (which might speed up the script), run the script with the --help flag to see the available options.
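The single-caption preprocessing mentioned above can be illustrated with a small sketch. This is not the logic of `create_dataset_no_dup.py`, whose selection rule is not described here; it assumes the first caption encountered is kept and that annotations follow the COCO captions layout (a list of records with `image_id` and `caption` fields). The function name and the sample data are made up.

```python
def first_caption_per_image(annotations):
    """Keep only the first caption seen for each image_id."""
    selected = {}
    for ann in annotations:
        # setdefault stores a caption only if the image has none yet.
        selected.setdefault(ann["image_id"], ann["caption"].strip())
    return selected


# Tiny made-up sample in the COCO captions layout.
sample = [
    {"image_id": 1, "id": 10, "caption": "A dog on a beach. "},
    {"image_id": 1, "id": 11, "caption": "A dog running near water."},
    {"image_id": 2, "id": 12, "caption": "Two bikes by a wall."},
]
captions = first_caption_per_image(sample)
print(captions)
```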
- We trained an MLP between the latent spaces while keeping the fMRI encoder and the GPT2 decoder frozen.
- To train the model, use the train.ipynb notebook. This notebook allows training our model on the CSI1_no_duplicates dataset from scratch or continuing training from our last checkpoint.
- At the beginning of the notebook, you can change the locations of the different folders if you have downloaded the datasets or checkpoints to a directory outside the repo.
- Example for the training process:
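The frozen-encoder, frozen-decoder setup described above can be sketched in PyTorch as follows. This is a toy illustration, not the contents of `train.ipynb`: the tiny `nn.Linear` stand-ins, the tensor sizes, the MSE loss, and the learning rate are all assumptions chosen so the snippet runs in isolation. The actual pipeline presumably uses the MinD-Vis encoder, GPT2, and a language-modeling loss instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins only: the real fMRI encoder and GPT2 are far larger models.
encoder = nn.Linear(32, 16)  # pretend fMRI encoder
mapper = nn.Sequential(nn.Linear(16, 16), nn.Tanh(), nn.Linear(16, 8))
decoder = nn.Linear(8, 8)    # pretend language model

# Freeze everything except the mapping network.
for module in (encoder, decoder):
    module.eval()
    for param in module.parameters():
        param.requires_grad_(False)

# Only the mapper's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(mapper.parameters(), lr=1e-3)

scans = torch.randn(4, 32)   # random data in place of fMRI scans
targets = torch.randn(4, 8)  # random targets in place of caption tokens

encoder_before = encoder.weight.clone()
mapper_before = mapper[0].weight.clone()

# One optimization step; gradients flow through the frozen decoder
# back to the mapper, but no frozen weight is modified.
loss = nn.functional.mse_loss(decoder(mapper(encoder(scans))), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print("encoder unchanged:", torch.equal(encoder.weight, encoder_before))
print("mapper unchanged:", torch.equal(mapper[0].weight, mapper_before))
```

Freezing both ends keeps the number of trainable parameters small, which is what makes a simple MLP a workable bridge between two large pretrained models.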
- Chen, Z., Qing, J., Xiang, T., Yue, W., & Zhou, J. (2022). Seeing Beyond the Brain: Masked Modeling Conditioned Diffusion Model for Human Vision Decoding. In arXiv.
- Mokady, R., Hertz, A., & Bermano, A. (2021). Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734.



