Phi-4-reasoning-vision-15B

Microsoft | Hugging Face | Model Foundry | Paper

Overview

Phi-4-reasoning-vision-15B is a broadly capable model that can be used for a wide array of vision-language tasks, such as image captioning, answering questions about images, reading documents and receipts, helping with homework, inferring changes across sequences of images, and much more. Beyond these general capabilities, it excels at math and science reasoning and at understanding and grounding elements on computer and mobile screens.

Benchmark Results

We've tested on the following benchmarks:

| Benchmark | Dataset (Hugging Face) | Score |
|---|---|---|
| AI2D_TEST | lmms-lab/ai2d | 84.8 |
| HallusionBench | lmms-lab/HallusionBench | 64.4 |
| MathVerse_MINI | AI4Math/MathVerse | 44.9 |
| MathVision_MINI | MathLLMs/MathVision | 36.2 |
| MathVista_MINI | AI4Math/MathVista | 75.2 |
| MMMU_VAL | MMMU/MMMU | 54.3 |
| MMStar | Lin-Chen/MMStar | 64.5 |
| ScreenSpot_v2_Desktop | Voxel51/ScreenSpot-v2 | 87.1 |
| ScreenSpot_v2_Mobile | Voxel51/ScreenSpot-v2 | 88.6 |
| ScreenSpot_v2_Web | Voxel51/ScreenSpot-v2 | 88.8 |
| WeMath | We-Math/We-Math | 50.1 |
| ZEROBench_sub | jonathan-roberts1/zerobench | 17.7 |

Hosting the Model

Recommended: The easiest way to get started is using Azure Foundry hosting, which requires no GPU hardware or model downloads. Alternatively, you can self-host with vLLM if you have GPU resources available.

Azure Foundry Hosting (Recommended)

Deploy Phi-4-Reasoning-Vision-15B on Azure Foundry without needing to download weights or manage GPU infrastructure.

Setup:

  1. Deploy the model on Azure Foundry and obtain your endpoint URL, API key, and deployment name.
  2. Run the sample script below, making sure to replace the following:

  • IMAGE_PATH, ENDPOINT_BASE, API_KEY, DEPLOYMENT_NAME
  • Optional: the content of the payload message
```python
import base64
import requests

IMAGE_PATH = "<replace_with_your_image>.jpg"

ENDPOINT_BASE = "<your_base_endpoint_url>"
API_KEY = "<your_api_key_here>"
DEPLOYMENT_NAME = "Phi-4-Reasoning-Vision-15B"  # replace with your deployment name


def main():
    # Encode the image as base64 so it can be embedded in the request payload.
    with open(IMAGE_PATH, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    payload = {
        "model": "Phi-4-Reasoning-Vision-15B",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in detail."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_b64}"
                        },
                    },
                ],
            }
        ],
        "max_tokens": 4096,
        "temperature": 0.0,
    }

    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
        "azureml-model-deployment": DEPLOYMENT_NAME,
    }

    url = f"{ENDPOINT_BASE}/v1/chat/completions"
    print(f"Requesting: {url}")

    resp = requests.post(url, json=payload, headers=headers, timeout=120)
    resp.raise_for_status()

    result = resp.json()
    print("\n--- Response ---")
    print(result["choices"][0]["message"]["content"])


if __name__ == "__main__":
    main()
```

That's it! No GPU or model downloads required.
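If you plan to send more than one request, the request-building logic from the script above can be factored into small reusable helpers. This is a minimal sketch; the function names `image_to_data_url` and `build_vision_message` are ours, not part of any Azure or OpenAI API, and it assumes the same OpenAI-style chat payload shape shown above:

```python
import base64


def image_to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL for an `image_url` content part."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{b64}"


def build_vision_message(prompt: str, image_bytes: bytes) -> dict:
    """Build one user message pairing a text prompt with an inline image,
    matching the payload shape used in the sample script above."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": image_to_data_url(image_bytes)},
            },
        ],
    }
```

You can then drop the returned dict straight into the `messages` list of the payload.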

Self-hosting with Transformers

If you have access to GPU resources, you can run inference using the Hugging Face Transformers library. This requires a GPU machine with sufficient VRAM (e.g., 40 GB or more).

Instructions available here.

Self-hosting with vLLM

If you have access to GPU resources, you can self-host using vLLM. This requires a GPU machine with sufficient VRAM (e.g., 40GB or more).

Instructions available here.

Citation

If you use Phi-4-Reasoning-Vision in your research, please use the following BibTeX entry.

```bibtex
@article{phi4vr14b2026,
  title={Phi-4-Vision-Reasoning Technical Report},
  author={Aneja, Jyoti and Harrison, Michael and Joshi, Neel and LaBonte, Tyler and Langford, John and Salinas, Eduardo and Ward, Rachel},
  journal={arXiv:2511.19663},
  year={2026}
}
```
