Project Webpage, Paper, Datasets
Official implementation of my work in the PixFoundation direction.

PixMMVP and PixCV-Bench augment recent benchmarks with referring expression annotations and their corresponding segmentation masks, paired with the object of interest in the question of the original visual question answering task. The goal is to evaluate the pixel-level visual grounding and visual question answering capabilities of recent pixel-level MLLMs, e.g., OMG-LLaVA, LLaVA-G, GLaMM and LISA. We also provide an interpretability mechanism for MLLMs that identifies when visual grounding emerges w.r.t. the output tokens, using an MLLM as a judge on the output segmentation.
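To illustrate the selection step behind this interpretability mechanism, here is a minimal sketch (hypothetical helper names, not the repo's actual code): candidate segmentation masks, one per noun phrase in the MLLM's answer, are ranked by IoU against the annotated referring-expression mask, and the best-overlapping phrase is reported.

```python
# Hedged sketch: pick the noun phrase whose candidate mask best matches
# the ground-truth referring-expression mask. Masks are flat 0/1 lists
# for simplicity; the real pipeline operates on image-sized masks.

def iou(mask_a, mask_b):
    """IoU of two binary masks given as flat 0/1 lists."""
    inter = sum(a & b for a, b in zip(mask_a, mask_b))
    union = sum(a | b for a, b in zip(mask_a, mask_b))
    return inter / union if union else 0.0

def best_grounded_phrase(candidates, gt_mask):
    """candidates: dict mapping noun phrase -> binary mask.
    Returns (phrase, iou) with the highest overlap with gt_mask."""
    return max(((p, iou(m, gt_mask)) for p, m in candidates.items()),
               key=lambda t: t[1])

# Toy example: "the hands" overlaps the ground truth more than "the hat".
gt = [1, 1, 0, 0]
cands = {"the hat": [1, 0, 0, 0], "the hands": [1, 1, 0, 1]}
phrase, score = best_grounded_phrase(cands, gt)
# phrase == "the hands", score == 2/3
```

In the actual benchmark an MLLM-as-a-judge can replace the IoU oracle when no ground-truth mask is available; the ranking logic stays the same.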
- Run the evaluation script after modifying it to select the models needed; it includes two examples:

```bash
bash pixmmvp/scripts/run_all.sh
```
- The inference code for each pixel-level MLLM is based on its respective Gradio demo code and is not customized for a specific task.
- This repository includes example visualizations used in the automatic selection for LLaVA variants:

```bash
git clone https://github.com/MSiam/AutoGPTImages
```
- Run the following standalone script:

```bash
python interpretability_demo/demo.py --image_path image2.jpg --ref_expr "the hands holding the hat" --openai_api_key API_KEY
python interpretability_demo/demo.py --image_path image1.jpeg --ref_expr "the closed kitten's eyes" --openai_api_key API_KEY
```
- First example output:

| Noun Phrase | Image #1 | Image #2 | Image #3 |
|---|---|---|---|
| The hands | ![]() | ![]() | ![]() |
| the hat | ![]() | ![]() | ![]() |
| the scene | ![]() | ![]() | ![]() |
| a pair | ![]() | ![]() | ![]() |
| human hands | ![]() | ![]() | ![]() |
- Second example output:

| Noun Phrase | Image #1 | Image #2 | Image #3 |
|---|---|---|---|
| the image | ![]() | ![]() | ![]() |
| three kittens | ![]() | ![]() | ![]() |
| them | ![]() | ![]() | ![]() |
| its eyes | ![]() | ![]() | ![]() |
Our finding is that grounding can emerge coinciding with output text that describes the object of interest in terms of its color, location or state, and not necessarily the exact name of the object. We consistently find that the most frequent emergence occurs in the last 40-60% of the output text in MLLMs not trained with pixel-level grounding supervision (e.g., LLaVA-1.5 and Cambrian-1). We also show a histogram of the concept categories of the output text that coincide with the best emerging segmentation in such MLLMs.
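The "last 40-60% of the output text" measurement above can be sketched as follows (hypothetical helper names, assuming the emergence point is identified as a token index in the generated answer):

```python
# Hedged sketch: locate where grounding emerges relative to the length
# of the MLLM's generated answer, to test whether it falls in the late
# portion of the output as reported above.

def relative_emergence(token_index, num_tokens):
    """Fraction of the output generated before the grounding token."""
    return token_index / num_tokens

def in_late_window(token_index, num_tokens, start=0.4):
    """True if grounding emerges after the first `start` fraction,
    i.e. within the last (1 - start) of the output text."""
    return relative_emergence(token_index, num_tokens) >= start

# e.g., emergence at token 45 of a 60-token answer -> 0.75 (late window)
```

Aggregating `relative_emergence` over a benchmark yields the per-model emergence distribution; binning the associated output text by concept category (color, location, state, object name) gives the histogram mentioned above.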
These repositories were used as part of our work:
Please cite my paper if you find it useful in your research:

```bibtex
@article{siam2025pixfoundation,
  title={PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?},
  author={Siam, Mennatullah},
  journal={arXiv preprint arXiv:2502.04192},
  year={2025}
}
```