PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?

Project Webpage, Paper, Datasets

Official implementation of my work on the PixFoundation direction.



Benchmarking Pixel-level MLLMs on PixMMVP & PixCV-Bench

PixMMVP and PixCV-Bench augment recent benchmarks with referring expression annotations and their corresponding segmentations. These are paired with the object of interest in the question from the original visual question answering task. The goal is to evaluate the pixel-level visual grounding and visual question answering capabilities of recent pixel-level MLLMs, e.g., OMG-Llava, Llava-G, GLAMM and LISA. We also provide an interpretability mechanism for MLLMs that identifies when visual grounding emerges with respect to the output tokens, using an MLLM as a judge on the output segmentation.
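As an illustration of the pixel-level grounding evaluation, segmentation quality is commonly scored with mask intersection-over-union; the sketch below is illustrative only, not the repository's actual evaluation code.

```python
# Illustrative sketch (not the repository's evaluation code): mask IoU
# between a predicted and a ground-truth binary segmentation mask.
def mask_iou(pred, gt):
    """pred, gt: 2D nested lists of 0/1 pixel labels of equal shape."""
    inter = union = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            inter += p & g   # pixel in both masks
            union += p | g   # pixel in either mask
    return inter / union if union else 0.0

pred = [[1, 1, 0],
        [0, 1, 0]]
gt   = [[1, 0, 0],
        [0, 1, 1]]
print(mask_iou(pred, gt))  # 2 shared pixels / 4 covered pixels = 0.5
```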



Dataset Setup

Data

Installation

PixMMVP Evaluation

  • Run the evaluation script after modifying it for the models needed; it includes two examples:
bash pixmmvp/scripts/run_all.sh
  • The inference code for each pixel-level MLLM is based on its respective Gradio demo code and is not customized for a particular task.
  • This repository includes example visualizations used in the automatic selection for LLaVA variants:
git clone https://github.com/MSiam/AutoGPTImages

Demo Interpretability Mechanism

  • Run the following standalone script:
python interpretability_demo/demo.py --image_path image2.jpg --ref_expr "the hands holding the hat" --openai_api_key API_KEY
python interpretability_demo/demo.py --image_path image1.jpeg --ref_expr "the closed kitten's eyes" --openai_api_key API_KEY
  • First example output (noun phrases; the Image #1–#3 columns contain visualizations, omitted here):
The hands, the hat, the scene, a pair, human hands
  • Second example output:
the image, three kittens, them, its eyes
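The MLLM-as-a-judge step can be pictured as constructing a query that asks the judge model whether a segmented region matches a noun phrase. The sketch below is hypothetical: the prompt wording and the helper name are illustrative, not taken from `interpretability_demo/demo.py`, which performs the actual API call.

```python
# Hypothetical sketch of an MLLM-as-a-judge query for one noun phrase.
# The real demo calls an external API with the image and mask; here we
# only show prompt construction for clarity.
def build_judge_prompt(ref_expr, noun_phrase):
    return (
        f"You are shown an image with a highlighted segmentation mask. "
        f"The referring expression is: '{ref_expr}'. "
        f"Does the highlighted region correspond to '{noun_phrase}'? "
        f"Answer yes or no."
    )

print(build_judge_prompt("the hands holding the hat", "the hat"))
```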

Changes to Evaluation Protocols

Changes

When does grounding emerge in MLLMs?

We find that grounding can emerge coinciding with output text that describes the object of interest in terms of color, location, or state, and not necessarily with the exact output text naming that object. The most frequent emergence consistently occurs in the last 40-60% of the output text in MLLMs not trained with pixel-level grounding supervision (e.g., Llava 1.5 & Cambrian-1). We also show a histogram of the concept categories of the output text that coincide with the best segmentation emerging in such MLLMs.
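The "when" in this finding can be quantified by locating the output token whose associated segmentation attains the best IoU and reporting its relative position in the generated text. A minimal sketch, assuming a precomputed per-token IoU list as input (the scores below are made-up values, not model outputs):

```python
# Sketch: given one (hypothetical) IoU score per generated output token,
# report where in the output text the best segmentation emerges, as a
# fraction of the output length (1.0 = last token).
def emergence_position(token_ious):
    best_idx = max(range(len(token_ious)), key=token_ious.__getitem__)
    return (best_idx + 1) / len(token_ious)

ious = [0.1, 0.2, 0.15, 0.3, 0.25, 0.4, 0.35, 0.7, 0.5, 0.2]
print(emergence_position(ious))  # 0.8 -> emerges late in the output text
```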







Acknowledgements

These repositories were used as part of our work:

References

Please cite my paper if you find it useful in your research.

@article{siam2025pixfoundation,
  title={PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?},
  author={Siam, Mennatullah},
  journal={arXiv preprint arXiv:2502.04192},
  year={2025}
}
