MuKA incorporates multimodal passages into the knowledge retrieval and answer generation stages of a RAG pipeline to answer visual information-seeking questions.
We used the M2KR suite for our experiments, which was released alongside the PreFLMR models in LinWeizheDragon/FLMR.
The train split of InfoSeek is sub-sampled to 100,000 examples to match the reported statistics: data.shuffle(seed=42).select(range(100000)).
The URLs for the entity images are provided in the data_preparation folder. If image_url is null, a black image is used as the placeholder. Large images were scaled proportionally until the shorter side was 512px.
After downloading the entity images, an index JSON file is required by the following scripts. It contains title: image_path key-value pairs. The titles are provided in the URLs file.
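The image handling described above can be sketched as follows. This is a minimal illustration, not the script used in this repo: the function names, the 512x512 placeholder size, and the choice to leave images whose shorter side is already at most 512px untouched are all assumptions. Pillow is assumed to be installed for the loading step.

```python
def target_size(width, height, shorter_side=512):
    """Return (w, h) after proportional downscaling so the shorter side equals `shorter_side`.

    Assumption: images whose shorter side is already <= `shorter_side` are kept as-is.
    """
    short = min(width, height)
    if short <= shorter_side:
        return width, height
    scale = shorter_side / short
    return max(1, round(width * scale)), max(1, round(height * scale))


def load_entity_image(path, placeholder_size=(512, 512)):
    """Load an entity image, or return a black placeholder when `path` is None.

    `placeholder_size` is a hypothetical default; the source only says the
    placeholder is a black image.
    """
    from PIL import Image  # imported lazily so target_size works without Pillow

    if path is None:
        return Image.new("RGB", placeholder_size, (0, 0, 0))
    img = Image.open(path).convert("RGB")
    new_size = target_size(*img.size)
    return img if new_size == img.size else img.resize(new_size)
```

For example, a 1024x2048 image would be resized to 512x1024, while a 300x400 image would be left unchanged under these assumptions.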
Note: the title is actually the title column for the InfoSeek passages in M2KR and passage_id for EVQA. Change the title_key accordingly in the scripts.
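As a rough sketch of how such an index could be built: the row layout of the URLs file and the file_name field below are hypothetical, and only the title: image_path mapping and the title_key switch come from the description above.

```python
import json
from pathlib import Path


def build_image_index(rows, image_dir, title_key="title", file_key="file_name"):
    """Map each title to the local path of its downloaded image.

    `rows` is an iterable of dicts parsed from the URLs file (layout assumed).
    `file_key` is a hypothetical field holding the downloaded file name.
    Use title_key="title" for InfoSeek and title_key="passage_id" for EVQA.
    """
    image_dir = Path(image_dir)
    index = {}
    for row in rows:
        index[row[title_key]] = str(image_dir / row[file_key])
    return index


def save_index(index, out_path):
    """Write the title -> image_path mapping as a JSON file."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(index, f, ensure_ascii=False, indent=2)
```

A call such as build_image_index(rows, "images", title_key="passage_id") would produce the EVQA variant of the index.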
The knowledge retrieval process is intended to provide retrieval results for building examples to train/test answer generators.
Follow the instructions in FLMR/how-to-use-this-package to clone and install FLMR first. We recommend installing it in editable mode with pip install -e for development purposes.
Our modifications are provided as diffs. The flmr_diffs_mask folder is for the MuKA retriever, and the flmr_diffs_nomask folder is for the MuKA retriever without the mask. The mask is only implemented for the PreFLMR-G model; modify it accordingly for other models.
We implemented the training in preflmr_train.py using the Hugging Face Trainer, which is paired with the preflmr_train.sh script.
For indexing and testing, please refer to preflmr_build_index.py, which is paired with the preflmr_test.sh script.
Note: preflmr_build_index.py was adapted from examples/example_use_preflmr.py in the LinWeizheDragon/FLMR repo.
The answer generators are first trained on the reading examples and then tested. A reading example provides the question, the retrieved documents, and a short instruction for the model to generate an answer.
For a fair comparison, the reading examples for training are derived from the same knowledge retrieval results across models. These retrieval results are obtained by zero-shot inference with the PreFLMR-G model using its official codebase.
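To make the structure of a reading example concrete, here is a minimal sketch. The prompt layout, the field names, and the instruction text passed in are hypothetical and do not reproduce the exact template used in our scripts.

```python
def build_reading_example(question, documents, instruction, answer=None):
    """Assemble one reading example from a question, retrieved documents, and an instruction.

    The layout (numbered documents, then the question, then the instruction)
    is an illustrative assumption, not the template used in this repo.
    """
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    prompt = f"{context}\n\nQuestion: {question}\n{instruction}"
    example = {"prompt": prompt}
    if answer is not None:
        # Training examples carry the gold answer; test examples do not.
        example["answer"] = answer
    return example
```

For training, the gold answer is passed as answer; for testing, it is omitted and the model output is compared against the reference afterwards.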
The LLaVA-1.5 models are trained to leverage a single image as the visual context.
We trained LLaVA-1.5 models using the official LoRA fine-tuning script finetune_lora.sh, and tested with the official eval script model_vqa_loader.py.
The VILA-1.5 models are trained to handle visual contexts with multiple images.
Follow the VILA-1.5/Installation instructions to clone and install it first. We recommend using a virtual environment because the installation modifies the transformers package. We also recommend installing it in editable mode with pip install -e for development purposes.
Since the official VILA-1.5 codebase does not provide a script for LoRA fine-tuning, we implemented our own. Please refer to the vila_diffs folder.
Please refer to vila_train.py, which is paired with vila_train_lora.sh.
Please refer to vila_model_vqa_loader.py, which is paired with vila_eval_lora.sh.
To test VILA with multiple images, use <image> as the placeholder in the text, and provide the image paths in a list.
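The placeholder convention above can be sketched as follows. The helper names and the returned dictionary layout are hypothetical and are not the input schema of vila_model_vqa_loader.py; the sketch only shows pairing one <image> placeholder with each image path.

```python
IMAGE_TOKEN = "<image>"


def prepend_image_tokens(text, image_paths):
    """Put one <image> placeholder per image in front of the text (one per line)."""
    return "\n".join([IMAGE_TOKEN] * len(image_paths) + [text])


def build_multi_image_input(text, image_paths):
    """Pair a prompt with its image paths after checking the placeholder count.

    The returned keys ("text", "images") are illustrative assumptions.
    """
    n_slots = text.count(IMAGE_TOKEN)
    if n_slots != len(image_paths):
        raise ValueError(
            f"{n_slots} {IMAGE_TOKEN} placeholder(s) but {len(image_paths)} image path(s)"
        )
    return {"text": text, "images": list(image_paths)}
```

Checking the counts up front catches a mismatch between placeholders and images before the example reaches the model.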
We extend our sincere thanks to the authors who created the resources mentioned above, which made this project possible.
The images we have collected are for research purposes only, and we shall not be held responsible for any issues arising from their use.
The scripts have been cleaned up but not tested. Please check them before running; we shall not be held responsible for any issues arising from their use.