Sai Akhil Kogilathota¹, Sripadha Vallabha E G¹, Luzhe Sun², Jiawei Zhou¹
¹Stony Brook University, ²Toyota Technological Institute at Chicago
saiakhil.kogilathota@stonybrook.edu | sripadhavallab.eg@stonybrook.edu | luzhesun@ttic.edu | jiawei.zhou.1@stonybrook.edu
HALP predicts whether a Vision-Language Model will hallucinate before generating a single token by probing internal representations. Using lightweight MLP probes on pre-generation features, we achieve up to 0.93 AUROC across 8 modern VLMs including Gemma-3, Phi-4-VL, LLaVA, and Llama-3.2-Vision—enabling real-time risk assessment without costly decoding.
| Contribution | Description |
|---|---|
| Pre-Generation Detection | Detect hallucination risk from internal VLM states in a single forward pass—no token generation required |
| Three Probe Types | Systematically analyze Visual Features (VF), Vision Tokens (VT), and Query Tokens (QT) across decoder layers |
| 8 State-of-the-Art VLMs | Comprehensive evaluation on Gemma-3-12B, Phi-4-VL, LLaVA-Next, Molmo, Qwen2.5-VL, Llama-3.2-Vision, SmolVLM, and FastVLM |
| Diverse Benchmark | 10,000-sample dataset from 6 established benchmarks covering object, attribute, relationship, and reasoning hallucinations |
HALP extracts three types of internal representations from a single forward pass:
| Representation | Symbol | Description | Extraction Point |
|---|---|---|---|
| Visual Features | VF | Mean-pooled vision encoder output | Before multimodal projection |
| Vision Token | VT | Hidden states at final vision token position | Decoder layers {1, L/4, L/2, 3L/4, L} |
| Query Token | QT | Hidden states at final query token position | Decoder layers {1, L/4, L/2, 3L/4, L} |
Each representation is fed to a lightweight MLP probe with three hidden layers (512→256→128) and a single sigmoid output unit, trained with binary cross-entropy to predict whether the model's answer will be a hallucination.
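The layer set {1, L/4, L/2, 3L/4, L} and the "final token position" can be made concrete with two small helpers. This is an illustrative sketch, not the repository's code: the function names and the floor-division rounding are assumptions. With HuggingFace Transformers' `output_hidden_states=True`, entry 0 of the returned tuple is the embedding output, so entry k holds the output of decoder layer k.

```python
def probe_layer_indices(num_layers: int) -> dict[str, int]:
    """Map the symbolic layers {1, L/4, L/2, 3L/4, L} to 1-based layer numbers.

    Assumption: fractional positions are rounded down with floor division.
    """
    if num_layers < 4:
        raise ValueError("need at least 4 decoder layers")
    return {
        "1": 1,
        "L/4": num_layers // 4,
        "L/2": num_layers // 2,
        "3L/4": (3 * num_layers) // 4,
        "L": num_layers,
    }


def last_position(is_target: list[bool]) -> int:
    """Return the index of the last True entry (e.g. the final vision or query token)."""
    for idx in range(len(is_target) - 1, -1, -1):
        if is_target[idx]:
            return idx
    raise ValueError("no target token in sequence")


if __name__ == "__main__":
    # Hypothetical 32-layer decoder and a toy 6-token sequence.
    print(probe_layer_indices(32))
    vision_mask = [False, True, True, True, False, False]
    print(last_position(vision_mask))
```

The VT and QT features would then be the hidden state at the selected layer and position, while VF is a mean over the vision encoder's output tokens.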
| Model | VF | VT | QT | Best |
|---|---|---|---|---|
| Gemma3-12B | 0.6736 | 0.5956 | 0.9349 | QT Layer L |
| FastVLM-7B | 0.6830 | 0.7028 | 0.6136 | VT Layer L |
| LLaVA-Next-8B | 0.6108 | 0.6270 | 0.9026 | QT Layer 3L/4 |
| Molmo-7B | 0.6830 | 0.6867 | 0.9193 | QT Layer L/2 |
| Qwen2.5-VL-7B | 0.7873 | 0.6683 | 0.9150 | QT Layer 3L/4 |
| Llama-3.2-11B-Vision | 0.7703 | 0.7377 | 0.8959 | QT Layer L/2 |
| Phi4-VL-5.6B | 0.6166 | 0.7738 | 0.9033 | QT Layer 3L/4 |
| SmolVLM2-2.2B | 0.7238 | 0.6894 | 0.9014 | QT Layer 3L/4 |
| Average | 0.6935 | 0.6852 | 0.8733 | — |
- Query tokens dominate: QT probes achieve the highest AUROC in 7 of 8 models (average 0.87, versus 0.69 for VF and 0.69 for VT)
- Mid-to-late layers are most predictive: among the 7 QT-led models, the best layer is 3L/4 or L for five and L/2 for two (Molmo, Llama-3.2)
- Architectural heterogeneity: Qwen2.5-VL and Llama-3.2 show comparatively strong VF performance (0.79 and 0.77 AUROC), suggesting vision-centric grounding
- FastVLM is the exception: it is the only model where VT (0.70) outperforms QT (0.61), indicating different fusion dynamics
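The summary statistics in these findings can be recomputed directly from the results table. The snippet below is a small check written for this README (the variable and function names are illustrative); the AUROC values are copied from the table above.

```python
# Test AUROC per model, copied from the results table (VF, VT, QT).
AUROC = {
    "Gemma3-12B": {"VF": 0.6736, "VT": 0.5956, "QT": 0.9349},
    "FastVLM-7B": {"VF": 0.6830, "VT": 0.7028, "QT": 0.6136},
    "LLaVA-Next-8B": {"VF": 0.6108, "VT": 0.6270, "QT": 0.9026},
    "Molmo-7B": {"VF": 0.6830, "VT": 0.6867, "QT": 0.9193},
    "Qwen2.5-VL-7B": {"VF": 0.7873, "VT": 0.6683, "QT": 0.9150},
    "Llama-3.2-11B-Vision": {"VF": 0.7703, "VT": 0.7377, "QT": 0.8959},
    "Phi4-VL-5.6B": {"VF": 0.6166, "VT": 0.7738, "QT": 0.9033},
    "SmolVLM2-2.2B": {"VF": 0.7238, "VT": 0.6894, "QT": 0.9014},
}


def column_means(table: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each representation's AUROC over all models."""
    keys = ("VF", "VT", "QT")
    return {k: sum(row[k] for row in table.values()) / len(table) for k in keys}


def best_representation(table: dict[str, dict[str, float]]) -> dict[str, str]:
    """For each model, return the representation with the highest AUROC."""
    return {model: max(row, key=row.get) for model, row in table.items()}


if __name__ == "__main__":
    for name, mean in column_means(AUROC).items():
        print(f"{name}: {mean:.4f}")
    winners = best_representation(AUROC)
    qt_wins = sum(1 for rep in winners.values() if rep == "QT")
    print(f"QT is best for {qt_wins} of {len(winners)} models")
```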
A diverse 10,000-sample benchmark assembled from 6 established VQA datasets:
| Dataset | Focus | Samples | % |
|---|---|---|---|
| AMBER | Discriminative tasks, attributes | 3,926 | 39.3% |
| HaloQuest | Adversarial challenges | 2,784 | 27.8% |
| POPE | Object hallucination | 1,230 | 12.3% |
| MME | Multimodal reasoning | 885 | 8.9% |
| HallusionBench | Visual illusions | 617 | 6.2% |
| MathVista | Mathematical reasoning | 558 | 5.6% |
| Total | | 10,000 | 100% |
[Figures: benchmark breakdown by Task Domains, Answer Types, and Hallucination Types]
| Model | Parameters | Vision Encoder | HuggingFace |
|---|---|---|---|
| Gemma3-12B | 12.2B | SigLIP | google/gemma-3-12b-it |
| FastVLM-7B | 7B | FastViT | apple/FastVLM-7B |
| LLaVA-1.5-8B | 7.6B | CLIP ViT-L/14 | llava-hf/llava-1.5-7b-hf |
| Molmo-7B | 7.2B | OpenAI CLIP | allenai/Molmo-7B-O-0924 |
| Qwen2.5-VL-7B | 7B | ViT (window attn) | Qwen/Qwen2.5-VL-7B-Instruct |
| Llama-3.2-11B-Vision | 10.6B | ViT-H/14 | meta-llama/Llama-3.2-11B-Vision-Instruct |
| Phi4-VL-5.6B | 5.6B | SigLIP-400M | microsoft/Phi-4-multimodal-instruct |
| SmolVLM2-2.2B | 2.2B | SigLIP-400M | HuggingFaceTB/SmolVLM2-2.2B-Instruct |
- Python 3.10+
- PyTorch 2.0+ with CUDA 11.8+
- GPU: NVIDIA RTX 4090 (24GB VRAM) recommended
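Before installing, it can help to confirm that the current environment meets these requirements. The following is an optional sketch, not part of the repository; `check_environment` is a hypothetical helper name, and it only reports whether PyTorch is importable and sees a CUDA device.

```python
import importlib.util
import sys


def check_environment(min_python: tuple[int, int] = (3, 10)) -> dict[str, bool]:
    """Report whether Python, PyTorch, and CUDA look usable for HALP."""
    report = {
        "python_ok": sys.version_info >= min_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check also runs without PyTorch

        report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'yes' if ok else 'no'}")
```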
# Clone the repository
git clone https://github.com/Zesearch/HALP.git
cd HALP
# Create conda environment
conda create -n halp python=3.10
conda activate halp
# Install PyTorch (CUDA 11.8)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# Install dependencies
pip install transformers accelerate h5py pandas numpy scikit-learn matplotlib seaborn tqdm
# Model-specific dependencies
pip install num2words sentencepiece pillow
Extract internal representations from a VLM (example: SmolVLM):
cd Extraction_Script
python run_smol_extraction.py \
--csv-path ../FInal_CSV_Hallucination/smolvlm_manually_reviewed.csv \
--images-dir ../HALP_Bench \
--output-dir ../Model_Outputs/Smol/smolvlm_output \
--checkpoint-interval 1000
Train hallucination detection probes on extracted embeddings:
cd Secondary_Scripts/probe_training_scripts_results/smolvlm_model_probe
# Train all 11 probes (1 VF + 5 VT + 5 QT)
python run_all_probes.py
# Compile AUROC results
python compile_auroc_results.py
# View summary
cat results/test_auroc_summary.csv
HALP/
├── Extraction_Script/ # Embedding extraction scripts
│ ├── run_gemma3_extraction.py
│ ├── run_fastvlm_extraction.py
│ ├── run_llava_extraction.py
│ ├── run_llama_extract.py
│ ├── run_molmo_extraction_rtx4090.py
│ ├── run_qwen25vl_extraction.py
│ ├── run_phi4_extract.py
│ └── run_smol_extraction.py
│
├── Model_Outputs/ # Per-model embeddings (HDF5)
│ ├── Gemma_3/
│ ├── FastVLM/
│ ├── LLaVa/
│ ├── LLama_32/
│ ├── Molmo_V1/
│ ├── Phi4_VL/
│ ├── Qwen25_VL/
│ └── Smol/
│
├── FInal_CSV_Hallucination/ # Manually reviewed labels
│ ├── gemma3_manually_reviewed.csv
│ ├── fastvlm_manually_reviewed.csv
│ ├── llava_manually_reviewed.csv
│ ├── molmo_manually_reviewed.csv
│ ├── qwen25vl_manually_reviewed.csv
│ ├── llama32_manually_reviewed.csv
│ ├── phi4vl_manually_reviewed.csv
│ └── smolvlm_manually_reviewed.csv
│
├── HALP_Bench/ # Benchmark images (4,852 files)
│
├── Secondary_Scripts/
│ ├── probe_training_scripts_results/ # Probe training per model
│ │ ├── gemma_model_probe/
│ │ ├── fastvlm_model_probe/
│ │ ├── llava_model_probe/
│ │ ├── molmo_model_probe/
│ │ ├── qwen25vl_model_probe/
│ │ ├── llama32_model_probe/
│ │ ├── phi4vl_model_probe/
│ │ └── smolvlm_model_probe/
│ ├── probe_analysis/ # Cross-model analysis
│ └── detailed_probe_analysis/ # Per-category analysis
│
├── assets/ # Figures for README
│ ├── halp_pipeline.png
│ └── qualitative_examples.png
│
├── EACL_HALP_Camera_Ready.pdf # Research paper
├── project.md # Detailed project documentation
└── README.md # This file
Each model has a dedicated extraction script. The general workflow:
# Example: Extract embeddings from Gemma-3-12B
python Extraction_Script/run_gemma3_extraction.py \
--csv-path FInal_CSV_Hallucination/gemma3_manually_reviewed.csv \
--images-dir HALP_Bench \
--output-dir Model_Outputs/Gemma_3/gemma_output \
--checkpoint-interval 1000
Output format (HDF5):
sample_id/
├── vision_only_representation # [D_vision]
├── vision_token_layer_0 # [D_hidden]
├── vision_token_layer_L4 # [D_hidden]
├── vision_token_layer_L2 # [D_hidden]
├── vision_token_layer_3L4 # [D_hidden]
├── vision_token_layer_L # [D_hidden]
├── query_token_layer_0 # [D_hidden]
├── query_token_layer_L4 # [D_hidden]
├── query_token_layer_L2 # [D_hidden]
├── query_token_layer_3L4 # [D_hidden]
├── query_token_layer_L # [D_hidden]
└── metadata (question, gt_answer, model_answer, image_name)
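The layout above yields 11 feature datasets per sample, one for each probe. The sketch below shows one way to enumerate those names and stack a single representation into a matrix for probe training. It is a hedged example, not the repository's loader: `probe_dataset_names` and `load_features` are hypothetical names, and it assumes each top-level HDF5 group is one sample containing the datasets listed above.

```python
# Layer suffixes as they appear in the dataset names of the output format.
LAYER_TAGS = ("0", "L4", "L2", "3L4", "L")


def probe_dataset_names() -> list[str]:
    """Return the 11 per-sample dataset names (1 VF + 5 VT + 5 QT)."""
    names = ["vision_only_representation"]
    names += [f"vision_token_layer_{tag}" for tag in LAYER_TAGS]
    names += [f"query_token_layer_{tag}" for tag in LAYER_TAGS]
    return names


def load_features(h5_path: str, dataset_name: str):
    """Stack one representation across all samples into an [N, D] array.

    Assumption: every top-level group is a sample and holds `dataset_name`.
    """
    import h5py  # imported lazily; only needed when a file is actually read
    import numpy as np

    with h5py.File(h5_path, "r") as f:
        sample_ids = sorted(f.keys())
        features = np.stack([f[sid][dataset_name][()] for sid in sample_ids])
    return sample_ids, features


if __name__ == "__main__":
    for name in probe_dataset_names():
        print(name)
```

Labels for each sample would still need to be joined in from the manually reviewed CSV; how the metadata fields are stored inside the HDF5 file is not specified here, so the sketch does not read them.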
Navigate to the model-specific probe directory:
cd Secondary_Scripts/probe_training_scripts_results/gemma_model_probe
# Train individual probes
python 01_vision_only_probe.py # Visual Features probe
python 02-06_vision_token_probes.py # Vision Token probes (5 layers)
python 07-11_query_token_probes.py # Query Token probes (5 layers)
# Or train all probes at once
python run_all_probes.py
Probe Architecture:
Input [D_hidden] → Linear(512) → ReLU → BN → Dropout(0.3)
→ Linear(256) → ReLU → BN → Dropout(0.3)
→ Linear(128) → ReLU → BN → Dropout(0.3)
→ Linear(1) → Sigmoid
Training Config: Adam (lr=0.001), batch size 32, 50 epochs, BCE loss
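The architecture and training configuration above can be written down compactly in PyTorch. This is a minimal sketch rather than the repository's implementation: the class and function names are hypothetical, "BN" is assumed to mean `BatchNorm1d`, and the toy input dimension and random data exist only to make the example runnable.

```python
import torch
from torch import nn


class HallucinationProbe(nn.Module):
    """MLP probe: Linear -> ReLU -> BN -> Dropout, three times, then a sigmoid output."""

    def __init__(self, input_dim: int, hidden_dims=(512, 256, 128), dropout: float = 0.3):
        super().__init__()
        layers = []
        prev = input_dim
        for width in hidden_dims:
            layers += [
                nn.Linear(prev, width),
                nn.ReLU(),
                nn.BatchNorm1d(width),  # assumption: "BN" is 1-D batch norm
                nn.Dropout(dropout),
            ]
            prev = width
        layers += [nn.Linear(prev, 1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Returns one hallucination probability per sample, shape [batch].
        return self.net(x).squeeze(-1)


def train_step(probe, optimizer, features, labels) -> float:
    """Run one optimization step with binary cross-entropy and return the loss."""
    probe.train()
    optimizer.zero_grad()
    loss = nn.functional.binary_cross_entropy(probe(features), labels)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    probe = HallucinationProbe(input_dim=64)  # toy dimension, not a real D_hidden
    optimizer = torch.optim.Adam(probe.parameters(), lr=0.001)
    x = torch.randn(32, 64)                   # one batch of 32 random "embeddings"
    y = torch.randint(0, 2, (32,)).float()    # random binary labels
    print(f"loss after one step: {train_step(probe, optimizer, x, y):.4f}")
```

In a full run this step would be repeated over mini-batches of 32 for 50 epochs, with `probe.eval()` set before computing test AUROC so that dropout and batch-norm statistics are frozen.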
cd Secondary_Scripts/probe_analysis
# Run comprehensive analysis across all models
python analyze_all_models.py
# Category-specific analysis
python analyze_probe_by_category.py
| Task | GPU | VRAM | Time (per model) |
|---|---|---|---|
| Embedding Extraction | RTX 4090 | 24GB | 3-6 hours |
| Probe Training (11 probes) | RTX 4090 | 4GB | 10-15 minutes |
| Total (8 models) | — | — | ~10 GPU-hours |
We thank the developers of:
- PyTorch and HuggingFace Transformers for the deep learning infrastructure
- Google (Gemma), Meta (LLaMA), the LLaVA team (LLaVA), Alibaba (Qwen), Microsoft (Phi), Allen AI (Molmo), Apple (FastVLM), and HuggingFace (SmolVLM) for open-sourcing their vision-language models
- The creators of AMBER, HaloQuest, POPE, MME, HallusionBench, and MathVista benchmarks
This project is licensed under the MIT License.
Note: Each VLM retains its original license. Please refer to individual model cards on HuggingFace for specific licensing terms.
If you find this work useful, please cite our paper:
@inproceedings{kogilathota2026halp,
title={{HALP}: Detecting Hallucinations in Vision-Language Models without Generating a Single Token},
author={Kogilathota, Sai Akhil and Vallabha E G, Sripadha and Sun, Luzhe and Zhou, Jiawei},
booktitle={Proceedings of the 2026 Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
year={2026}
}
For questions or issues, please open a GitHub Issue.

