Project Webpage, Paper, Datasets
Official implementation of my work in the PixFoundation direction.

PixMMVP and PixCV-Bench augment recent benchmarks with referring expression annotations and their corresponding segmentation masks, paired with the object of interest in the question of the original visual question answering task. The goal is to evaluate the pixel-level visual grounding and visual question answering capabilities of recent pixel-level MLLMs, e.g., OMG-LLaVA, LLaVA-G, GLaMM and LISA. We also provide an interpretability mechanism for MLLMs that identifies when visual grounding emerges w.r.t. the output tokens, using an MLLM as a judge on the output segmentation.
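To illustrate the selection step behind this interpretability mechanism, here is a minimal sketch (hypothetical helper names, not the repo's actual code): candidate segmentation masks, one per noun phrase in the MLLM's answer, are ranked by IoU against the annotated referring-expression mask, and the best-overlapping phrase is reported.

```python
# Hedged sketch: pick the noun phrase whose candidate mask best matches
# the ground-truth referring-expression mask. Masks are flat 0/1 lists
# for simplicity; the real pipeline operates on image-sized masks.

def iou(mask_a, mask_b):
    """IoU of two binary masks given as flat 0/1 lists."""
    inter = sum(a & b for a, b in zip(mask_a, mask_b))
    union = sum(a | b for a, b in zip(mask_a, mask_b))
    return inter / union if union else 0.0

def best_grounded_phrase(candidates, gt_mask):
    """candidates: dict mapping noun phrase -> binary mask.
    Returns (phrase, iou) with the highest overlap with gt_mask."""
    return max(((p, iou(m, gt_mask)) for p, m in candidates.items()),
               key=lambda t: t[1])

# Toy example: "the hands" overlaps the ground truth more than "the hat".
gt = [1, 1, 0, 0]
cands = {"the hat": [1, 0, 0, 0], "the hands": [1, 1, 0, 1]}
phrase, score = best_grounded_phrase(cands, gt)
# phrase == "the hands", score == 2/3
```

In the actual benchmark an MLLM-as-a-judge can replace the IoU oracle when no ground-truth mask is available; the ranking logic stays the same.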
- Run the evaluation script after modifying it to select the models needed; it includes two examples:

```bash
bash pixmmvp/scripts/run_all.sh
```
- The inference code for each pixel-level MLLM is based on its respective Gradio demo code and is not customized for a specific task.
- This repository includes example visualizations used in the automatic selection for LLaVA variants:

```bash
git clone https://github.com/MSiam/AutoGPTImages
```
- Run the following standalone script:

```bash
python interpretability_demo/demo.py --image_path image2.jpg --ref_expr "the hands holding the hat" --openai_api_key API_KEY
python interpretability_demo/demo.py --image_path image1.jpeg --ref_expr "the closed kitten's eyes" --openai_api_key API_KEY
```
- First example output:

| Noun Phrase | Image #1 | Image #2 | Image #3 |
|---|---|---|---|
| The hands | ![]() | ![]() | ![]() |
| the hat | ![]() | ![]() | ![]() |
| the scene | ![]() | ![]() | ![]() |
| a pair | ![]() | ![]() | ![]() |
| human hands | ![]() | ![]() | ![]() |
- Second example output:

| Noun Phrase | Image #1 | Image #2 | Image #3 |
|---|---|---|---|
| the image | ![]() | ![]() | ![]() |
| three kittens | ![]() | ![]() | ![]() |
| them | ![]() | ![]() | ![]() |
| its eyes | ![]() | ![]() | ![]() |
Our finding is that grounding can emerge coinciding with output text that describes the object of interest in terms of its color, location or state, and not necessarily the exact name of the object. We consistently find that the most frequent emergence occurs in the last 40-60% of the output text in MLLMs not trained with pixel-level grounding supervision (e.g., LLaVA-1.5 and Cambrian-1). We also show a histogram of the concept categories of the output text that coincide with the best emerging segmentation in such MLLMs.
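The "last 40-60% of the output text" measurement above can be sketched as follows (hypothetical helper names, assuming the emergence point is identified as a token index in the generated answer):

```python
# Hedged sketch: locate where grounding emerges relative to the length
# of the MLLM's generated answer, to test whether it falls in the late
# portion of the output as reported above.

def relative_emergence(token_index, num_tokens):
    """Fraction of the output generated before the grounding token."""
    return token_index / num_tokens

def in_late_window(token_index, num_tokens, start=0.4):
    """True if grounding emerges after the first `start` fraction,
    i.e. within the last (1 - start) of the output text."""
    return relative_emergence(token_index, num_tokens) >= start

# e.g., emergence at token 45 of a 60-token answer -> 0.75 (late window)
```

Aggregating `relative_emergence` over a benchmark yields the per-model emergence distribution; binning the associated output text by concept category (color, location, state, object name) gives the histogram mentioned above.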
These repositories were used as part of our work:
Please cite my paper if you find it useful in your research:

```bibtex
@article{siam2025pixfoundation,
  title={PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?},
  author={Siam, Mennatullah},
  journal={arXiv preprint arXiv:2502.04192},
  year={2025}
}
```