{"id":40595,"date":"2025-03-17T06:00:00","date_gmt":"2025-03-17T00:30:00","guid":{"rendered":"https:\/\/debuggercafe.com\/?p=40595"},"modified":"2025-06-02T19:41:23","modified_gmt":"2025-06-02T14:11:23","slug":"moondream","status":"publish","type":"post","link":"https:\/\/debuggercafe.com\/moondream\/","title":{"rendered":"Moondream &#8211; One Model for Captioning, Pointing, and Detection"},"content":{"rendered":"\n<p>Vision Language Models (VLMs) are undoubtedly one of the most innovative components of Generative AI. With AI organizations pouring millions into building them, large proprietary architectures are all the hype. All this comes with a bigger caveat: VLMs (even the largest) models cannot do all the tasks that a standard vision model can do. These include pointing and detection. With all this said, <strong><em>Moondream (Moondream2)<\/em><\/strong><span style=\"box-sizing: border-box; margin: 0px; padding: 0px;\">,<em><strong>&nbsp;a sub 2B parameter model,<\/strong><\/em>&nbsp;can do<\/span> four tasks &#8211; <strong><em>image captioning, visual querying, pointing to objects, and object detection<\/em><\/strong>.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-horizontal is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-499968f5 wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-outline is-style-outline--1\"><a class=\"wp-block-button__link has-black-color has-luminous-vivid-orange-background-color has-text-color has-background wp-element-button\" href=\"#download-code\"><strong>Jump to Download Code<\/strong><\/a><\/div>\n<\/div>\n\n\n\n<p><\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/moondream-tasks-outputs.gif\" target=\"_blank\" rel=\" noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"960\" height=\"640\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/moondream-tasks-outputs.gif\" alt=\"Results of all the tasks that we can perform using Moondream. They are image captioning, visual query, object pointing, and object detection.\" class=\"wp-image-40667\"\/><\/a><figcaption class=\"wp-element-caption\">Figure 1. Results of all the tasks that we can perform using Moondream. They are image captioning, visual query, object pointing, and object detection.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>This might seem like a small feat. However, given that the model is exactly 1.93B parameters with a Phi 1.5 decoder and SigLIP vision encoder, this is impressive. 
Moreover, even the free version of ChatGPT cannot detect objects at the time of writing this article.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/chatgpt-object-detection-inability.png\" target=\"_blank\" rel=\" noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"778\" height=\"948\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/chatgpt-object-detection-inability.png\" alt=\"ChatGPT is unable to carry out object detection natively.\" class=\"wp-image-40669\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/chatgpt-object-detection-inability.png 778w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/chatgpt-object-detection-inability-246x300.png 246w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/chatgpt-object-detection-inability-768x936.png 768w\" sizes=\"auto, (max-width: 778px) 100vw, 778px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 2. ChatGPT is unable to carry out object detection natively.<\/figcaption><\/figure>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\"><em>What will we cover using Moondream<\/em>?<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>What is Moondream, who created it, and what can we do using it?<\/em><\/li>\n\n\n\n<li><em>Covering all the Moondream tasks:<\/em>\n<ul class=\"wp-block-list\">\n<li><em>Image captioning and visual querying.<\/em><\/li>\n\n\n\n<li><em>Pointing to objects in images and videos using Moondream.<\/em><\/li>\n\n\n\n<li><em>Detecting objects in images and videos using Moondream.<\/em><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What is Moondream?<\/h2>\n\n\n\n<p>Moondream is a Small Vision Language Model (SVLM). Created by the user <a href=\"https:\/\/huggingface.co\/vikhyatk\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Vik Korrapati<\/strong><\/a> on Hugging Face, it is primarily meant for edge devices.<\/p>\n\n\n\n<p>Moondream has two versions, version 1 and version 2. We will use the Moondream2 model in this article which is referred to as Moondream from here on for simplicity.<\/p>\n\n\n\n<p>It is extremely accurate and uses just around 5.2GB of VRAM to load in FP16. 
We can easily load it using Hugging Face Transformers using the following syntax:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from transformers import AutoModelForCausalLM, AutoTokenizer\n\nmodel = AutoModelForCausalLM.from_pretrained(\n    \"vikhyatk\/moondream2\",\n    revision=\"2025-01-09\",\n    trust_remote_code=True,\n    device_map={\"\": \"cuda\"}\n)<\/pre>\n\n\n\n<p>It is a great rival to the <strong><a href=\"https:\/\/debuggercafe.com\/introduction-to-molmo\/\" target=\"_blank\" rel=\"noreferrer noopener\">Molmo VLM<\/a><\/strong> whose smallest model is 7B parameters and cannot perform object detection yet.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What Tasks Can We Perform Using MoonDream?<\/h3>\n\n\n\n<p>We can carry out image captioning, visual querying, pointing to objects, and object detection using Moondream.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/moondream-tasks.png\" target=\"_blank\" rel=\" noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"800\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/moondream-tasks.png\" alt=\"All the tasks that Moondream can perform.\" class=\"wp-image-40672\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/moondream-tasks.png 800w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/moondream-tasks-300x300.png 300w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/moondream-tasks-150x150.png 150w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/moondream-tasks-768x768.png 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 3. 
All the tasks that Moondream can perform.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>Following is an example syntax showing each of the tasks in code:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Captioning\nprint(\"Short caption:\")\nprint(model.caption(image, length=\"short\")[\"caption\"])\n\nprint(\"\\nNormal caption:\")\nfor t in model.caption(image, length=\"normal\", stream=True)[\"caption\"]:\n    # Streaming generation example, supported for caption() and detect()\n    print(t, end=\"\", flush=True)\nprint(model.caption(image, length=\"normal\"))\n\n# Visual Querying\nprint(\"\\nVisual query: 'How many people are in the image?'\")\nprint(model.query(image, \"How many people are in the image?\")[\"answer\"])\n\n# Object Detection\nprint(\"\\nObject detection: 'face'\")\nobjects = model.detect(image, \"face\")[\"objects\"]\nprint(f\"Found {len(objects)} face(s)\")\n\n# Pointing\nprint(\"\\nPointing: 'person'\")\npoints = model.point(image, \"person\")[\"points\"]\nprint(f\"Found {len(points)} person(s)\")<\/pre>\n\n\n\n<p>For each task we either use the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">caption<\/code>, <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">query<\/code>, <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">point<\/code>, or <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">detect <\/code>methods respectively. Further in the article, we will cover these tasks in more detail with complete examples.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Directory Structure<\/h2>\n\n\n\n<p>Let&#8217;s take a look at the directory structure.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\u251c\u2500\u2500 input\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 giraffes.jpg\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 people.jpg\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 video_1.mp4\n\u251c\u2500\u2500 moondream_caption.py\n\u251c\u2500\u2500 moondream_object_detection.py\n\u251c\u2500\u2500 moondream_object_detection_video.py\n\u251c\u2500\u2500 moondream_pointing.py\n\u251c\u2500\u2500 moondream_pointing_video.py\n\u251c\u2500\u2500 moondream_visual_query.py\n\u251c\u2500\u2500 outputs\n\u251c\u2500\u2500 README.md\n\u251c\u2500\u2500 requirements.txt\n\u2514\u2500\u2500 tree.txt<\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">input<\/code> directory contains the images and videos that we will run inference on.<\/li>\n\n\n\n<li>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">outputs<\/code> directory contains the inference results.<\/li>\n\n\n\n<li>We have six Python scripts for running inference using various tasks on images and videos. 
<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"has-background\" style=\"background-color:#ffb76a\"><strong><em>You can download all the code and requirements files from the download section.<\/em><\/strong><\/p>\n\n\n\n<h3 class=\"wp-block-heading has-text-align-center\" id=\"download-code\">Download Code<\/h3>\n\n\n\n<div class=\"wp-block-button is-style-outline center\"><a data-sumome-listbuilder-id=\"ef8ed309-0677-4a90-bedb-50c25b111bd3\" class=\"wp-block-button__link has-black-color has-luminous-vivid-orange-background-color has-text-color has-background\"><b>Download the Source Code for this Tutorial<\/b><\/a><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Installing Requirements<\/h3>\n\n\n\n<p>Along with the Hugging Face <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">transformers<\/code> library, we also need specific versions of <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">pyvips<\/code> and <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">pyvips-binary<\/code> for image processing. We can install these using the requirements file.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">pip install -r requirements.txt<\/pre>\n\n\n\n<p>With this, we are done with the initial discussion and the setup. Let&#8217;s jump into the coding part now.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Moondream for Image Captioning, Visual Querying, Object Pointing, and Object Detection<\/h2>\n\n\n\n<p>Starting from this section, we will tackle each task Moondream can perform.<\/p>\n\n\n\n<p>Along the way, we will also discover, how easy it is to carry out each task that makes using the Moondream model a breeze.<\/p>\n\n\n\n<p>While covering the image-specific inference runs, we will use the following image throughout.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/giraffes.jpg\" target=\"_blank\" rel=\" noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"555\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/giraffes.jpg\" alt=\"An image showing a group of five giraffes that we will use for inference.\" class=\"wp-image-40674\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/giraffes.jpg 640w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/giraffes-300x260.jpg 300w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 4. 
An image showing a group of five giraffes that we will use for inference.<\/figcaption><\/figure>\n<\/div>\n\n\n<h3 class=\"wp-block-heading\">Image Captioning using Moondream<\/h3>\n\n\n\n<p>We will start with image captioning.<\/p>\n\n\n\n<p>The code for image captioning is present in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">moondream_caption.py<\/code> file.<\/p>\n\n\n\n<p>Let&#8217;s start with importing the modules and loading the model.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"moondream_caption.py\" data-enlighter-group=\"moondream_caption_1\">from transformers import AutoModelForCausalLM\nfrom PIL import Image\n\nmodel = AutoModelForCausalLM.from_pretrained(\n    'vikhyatk\/moondream2',\n    revision='2025-01-09',\n    trust_remote_code=True,\n    device_map={'': 'cuda'}\n)<\/pre>\n\n\n\n<p>We will use the <strong><a href=\"https:\/\/huggingface.co\/vikhyatk\/moondream2\" target=\"_blank\" rel=\"noreferrer noopener\">revision of the model<\/a><\/strong> as provided in the official Hugging Face repository.<\/p>\n\n\n\n<p>To generate captions for images, we simply need to read the image using <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">PIL<\/code> and pass it through the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">caption<\/code> method of the model.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"10\" data-enlighter-title=\"moondream_caption.py\" data-enlighter-group=\"moondream_caption_2\">image = Image.open('input\/giraffes.jpg')\n\n# Normal caption.\nprint('NORMAL CAPTION:\\n')\nfor t in model.caption(image, length='normal', stream=True)['caption']:\n    print(t, end='', flush=True)\nprint('\\n')\n\n# Short Captioning\nprint('SHORT CAPTION:\\n')\nfor t in model.caption(image, length='short', stream=True)['caption']:\n    print(t, end='', flush=True)\nprint('\\n')<\/pre>\n\n\n\n<p>Moondream supports generating two types of captions: <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">short<\/code> and <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">normal<\/code>. The former generates concise captions, almost like an alternate tag for an image. The latter generates a more detailed caption, just like we may need for any website&#8217;s figure captions. 
The results are stored in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">'caption'<\/code> key of the output dictionary.<\/p>\n\n\n\n<p>Additionally, for image captioning, the model supports text streaming, which we are using above.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python moondream_caption.py<\/pre>\n\n\n\n<p>Following are the results that we get for the image.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">NORMAL CAPTION:\n\n In a dry savanna landscape, a group of at least five giraffes with distinctive brown and white coats are walking along a railway track. The giraffes are moving towards the right side of the image, their long necks reaching towards the clear blue sky dotted with fluffy white clouds. The railway track, composed of rusty metal rails, cuts through the landscape, disappearing into the distance. The savanna is a light tan or beige color, with sparse vegetation. In the distance, a mountain range stretches across the horizon, its peaks partially obscured by the clouds. A small section of a fence or barrier is also visible along the right edge of the railway track.\n\nSHORT CAPTION:\n\n Four giraffes walk along a railway track in the African savanna, their long necks reaching towards the clear blue sky.<\/pre>\n\n\n\n<p>The descriptions seem good, however, there are a few nuances that catch our eye. In both captions, the count of the giraffes differs. The normal caption mentions five giraffes which is correct while the short caption mentions four giraffes. Furthermore, the model describes that the giraffes are moving toward the right while they are moving toward the left from the perspective of the viewer.<\/p>\n\n\n\n<p>These subtle details will probably be addressed in future versions of the Moondream.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Visual Querying Using Moondream<\/h3>\n\n\n\n<p>Visual Querying refers to asking nuanced questions about an image. Moondream supports the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">query<\/code> method for this task.<\/p>\n\n\n\n<p>The code for visual querying is present in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">moondream_visual_query.py<\/code> script.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"moondream_visual_query.py\" data-enlighter-group=\"moondream_visual_query_1\">from transformers import AutoModelForCausalLM\nfrom PIL import Image\n\nmodel = AutoModelForCausalLM.from_pretrained(\n    'vikhyatk\/moondream2',\n    revision='2025-01-09',\n    trust_remote_code=True,\n    device_map={'': 'cuda'}\n)\n\nimage = Image.open('input\/giraffes.jpg')\n\nquery = 'How many giraffes are there in the image?'\n\nprint(query)\nprint(model.query(image, query)['answer'])<\/pre>\n\n\n\n<p>We invoke the query method of the model asking about the number of giraffes in the image. 
The final answer is present in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">'answer'<\/code> key of the output dictionary.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python moondream_visual_query.py<\/pre>\n\n\n\n<p>We get the following answer as output.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"> There are five giraffes in the image.<\/pre>\n\n\n\n<p>The model answers the question correctly. Of course, exploring more images will provide us with more information regarding its strengths and limitations, however, here we will cover all the tasks that it can perform.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Object Pointing using Moondream<\/h3>\n\n\n\n<p>In object pointing, the Moondream model points to an object with a single (x, y) coordinate pair. If there are multiple objects of similar kind, it will point to all the objects it can perceive.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Object Pointing in Images<\/h4>\n\n\n\n<p>Let&#8217;s start with pointing in images. The code for this is present in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">moondream_pointing.py<\/code> file.<\/p>\n\n\n\n<p>Starting with the imports, argument parser to pass input images, create an output directory, and load the model.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"moondream_pointing.py\" data-enlighter-group=\"moondream_pointing_1\">from transformers import AutoModelForCausalLM\nfrom PIL import Image\n\nimport cv2\nimport numpy as np\nimport argparse\nimport os\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\n    '--input',\n    help='path to the input image'\n)\nargs = parser.parse_args()\n\nout_dir = 'outputs'\nos.makedirs(out_dir, exist_ok=True)\n\nmodel = AutoModelForCausalLM.from_pretrained(\n    'vikhyatk\/moondream2',\n    revision='2025-01-09',\n    trust_remote_code=True,\n    device_map={'': 'cuda'}\n)<\/pre>\n\n\n\n<p>Next, we read the input image using PIL and pass it through the model.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"25\" data-enlighter-title=\"moondream_pointing.py\" data-enlighter-group=\"moondream_pointing_2\">image = Image.open(args.input)\n\npointing = 'giraffes'\nprint(f\"Pointing to: {pointing}\")\npoints = model.point(image, pointing)['points']\nprint(points)\n\nimage_array = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)<\/pre>\n\n\n\n<p>We call the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">point<\/code> method of the model with a string name indicating the objects that we want to point to. 
We are trying to get the coordinates of giraffes in the images.<\/p>\n\n\n\n<p>The output from the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">'points'<\/code> key of the dictionary is in the following format.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">[{'x': 0.30078125, 'y': 0.5}, {'x': 0.4951171875, 'y': 0.50390625}, \n{'x': 0.595703125, 'y': 0.5}, {'x': 0.654296875, 'y': 0.505859375}, \n{'x': 0.689453125, 'y': 0.4970703125}]<\/pre>\n\n\n\n<p>We have five pairs of (x, y) coordinates for the five giraffes, normalized according to the image size. We need to denormalize them and annotate them on the original image for better visualization.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"33\" data-enlighter-title=\"moondream_pointing.py\" data-enlighter-group=\"moondream_pointing_3\">image_array = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)\n\nfor point in points:\n    h, w = image_array.shape[:2]\n    x = point['x']*w\n    y = point['y']*h\n    \n    # Create a small blue circle.\n    cv2.circle(\n        image_array,\n        center=(int(x), int(y)),\n        radius=2,\n        color=(255, 0, 0),\n        thickness=-1,\n        lineType=cv2.LINE_AA\n    )\n    # Create a larger white circle.\n    cv2.circle(\n        image_array,\n        center=(int(x), int(y)),\n        radius=4,\n        color=(255, 255, 255),\n        thickness=2,\n        lineType=cv2.LINE_AA\n    )\n    # Create a larger red circle.\n    cv2.circle(\n        image_array,\n        center=(int(x), int(y)),\n        radius=6,\n        color=(0, 0, 255),\n        thickness=2,\n        lineType=cv2.LINE_AA\n    )\n\nfile_name = args.input.split(os.path.sep)[-1]\ncv2.imwrite(os.path.join(out_dir, 'pointing_'+file_name), image_array)\n\ncv2.imshow('Image', image_array)\ncv2.waitKey(0)<\/pre>\n\n\n\n<p>For each coordinate pair, we annotate the image with three concentric circles (purely for cosmetic reasons) for better visualization. We then save the result to disk and visualize it on screen.<\/p>\n\n\n\n<p>We can execute the script with the following command.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python moondream_pointing.py --input input\/giraffes.jpg<\/pre>\n\n\n\n<p>We get the following output.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/pointing_giraffes.jpg\" target=\"_blank\" rel=\" noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"555\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/pointing_giraffes.jpg\" alt=\"Result of object pointing using Moondream. 
The model is able to point to all five giraffes accurately.\" class=\"wp-image-40677\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/pointing_giraffes.jpg 640w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/pointing_giraffes-300x260.jpg 300w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 5. Result of object pointing using Moondream. The model is able to point to all five giraffes accurately.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>The model is able to point to the giraffes successfully.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Object Pointing in Videos<\/h4>\n\n\n\n<p>We can also point to objects in videos. The process is very similar to that of images. Instead of a single image, we loop through all the frames of the video.<\/p>\n\n\n\n<p>The following code is present in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">moondream_pointing_video.py<\/code> file.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"moondream_pointing_video.py\" data-enlighter-group=\"moondream_pointing_video_1\">from transformers import AutoModelForCausalLM\nfrom PIL import Image\n\nimport cv2\nimport time\nimport argparse\nimport os\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\n    '--input',\n    help='path to the input video'\n)\nargs = parser.parse_args()\n\nout_dir = 'outputs'\nos.makedirs(out_dir, exist_ok=True)\n\nmodel = AutoModelForCausalLM.from_pretrained(\n    'vikhyatk\/moondream2',\n    revision='2025-01-09',\n    trust_remote_code=True,\n    device_map={'': 'cuda'}\n)\n\ncap = cv2.VideoCapture(args.input)\n\nframe_width = int(cap.get(3))\nframe_height = int(cap.get(4))\n\nsave_name = args.input.split(os.path.sep)[-1]\n# Define codec and create VideoWriter object.\nout = cv2.VideoWriter(f\"{out_dir}\/pointing_{save_name}\", \n                    cv2.VideoWriter_fourcc(*'mp4v'), 30, \n                    (frame_width, frame_height))\n\npointing = 'giraffes'\n\nframe_count = 0 # To count total frames.\ntotal_fps = 0 # To get the final frames per second.\n\nwhile cap.isOpened():\n    ret, frame = cap.read()\n\n    if ret:\n        frame_pil = Image.fromarray(frame).convert('RGB')\n\n        print(f\"Pointing to: {pointing}\")\n        start_time = time.time()\n\n        points = model.point(frame_pil, pointing)['points']\n\n        end_time = time.time()\n\n        fps = 1 \/ (end_time - start_time)\n        total_fps += fps\n        frame_count += 1\n\n        for point in points:\n            h, w = frame.shape[:2]\n            x = point['x']*w\n            y = point['y']*h\n            \n            # Create a small blue circle.\n            cv2.circle(\n                frame,\n                center=(int(x), int(y)),\n                radius=2,\n                color=(255, 0, 0),\n                thickness=-1,\n                lineType=cv2.LINE_AA\n            )\n            # Create a larger white circle.\n            cv2.circle(\n                frame,\n                center=(int(x), int(y)),\n                radius=4,\n                color=(255, 255, 255),\n                thickness=2,\n                lineType=cv2.LINE_AA\n            )\n            # Create a larger red circle.\n            cv2.circle(\n                frame,\n                center=(int(x), int(y)),\n                radius=6,\n 
               color=(0, 0, 255),\n                thickness=2,\n                lineType=cv2.LINE_AA\n            )\n\n        out.write(frame)\n\n        cv2.imshow('Image [Press Q to exit]', frame)\n        # Press `q` to exit\n        if cv2.waitKey(1) &amp; 0xFF == ord('q'):\n            break\n    else:\n        break\n\n# Release VideoCapture().\ncap.release()\n# Close all frames and video windows.\ncv2.destroyAllWindows()\n\n# Calculate and print the average FPS.\navg_fps = total_fps \/ frame_count\nprint(f\"Average FPS: {avg_fps:.3f}\")<\/pre>\n\n\n\n<p>We pass the video path through the argument parser and extract the frame width and height, which we need for saving the annotated output video to disk later. One thing to note is that we need to pass the object string that we want to point to for each frame (<strong>lines 48 to 53<\/strong>).<\/p>\n\n\n\n<p>Let&#8217;s execute the script.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python moondream_pointing_video.py --input input\/video_1.mp4<\/pre>\n\n\n\n<p>We have the following video output.<\/p>\n\n\n\n<figure class=\"wp-block-video aligncenter\"><video height=\"360\" style=\"aspect-ratio: 640 \/ 360;\" width=\"640\" autoplay controls loop muted src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/pointing_video_1.mp4\"><\/video><figcaption class=\"wp-element-caption\">Video 1. Object pointing in video using Moondream. Here we are pointing to giraffes.<\/figcaption><\/figure>\n\n\n\n<p>It runs at an <strong>average of 4.8 FPS<\/strong> on an RTX 3080 GPU. Of course, this is not real-time. However, we can expect such models to get better and faster from here on.<\/p>\n\n\n\n<p>The model is able to point to the giraffes in almost all the frames. The giraffe at the back loses its point when it gets occluded later on. <strong><em>Can we use point tracking to solve this and make the pointing process more stable?<\/em><\/strong><\/p>\n\n\n\n
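<p>One simple direction is to match the points of each frame to those of the previous frame by nearest neighbour and to keep a recently lost point alive for a few frames. The following is a minimal sketch of that idea and is not part of Moondream itself. It assumes the normalized point dictionaries returned by <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">model.point()<\/code> for the current frame (<code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">points<\/code>) and the list kept from the previous frame (<code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">prev_points<\/code>); the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">smooth_points<\/code> name and its thresholds are hypothetical choices for this sketch.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Illustrative heuristic only: nearest-neighbour matching of points across\n# frames in normalized coordinates, keeping recently lost points alive.\ndef smooth_points(prev_points, points, max_dist=0.05, max_missed=5):\n    tracked, used = [], set()\n    for prev in prev_points:\n        best, best_d = None, max_dist\n        for i, cur in enumerate(points):\n            if i in used:\n                continue\n            d = ((cur['x'] - prev['x'])**2 + (cur['y'] - prev['y'])**2) ** 0.5\n            if d &lt; best_d:\n                best, best_d = i, d\n        if best is not None:\n            # Matched: take the fresh coordinates from the current frame.\n            used.add(best)\n            tracked.append({**points[best], 'missed': 0})\n        elif prev.get('missed', 0) &lt; max_missed:\n            # Unmatched (possibly occluded): keep the last known position alive.\n            tracked.append({**prev, 'missed': prev.get('missed', 0) + 1})\n    for i, cur in enumerate(points):\n        if i not in used:\n            # A new point that did not match anything from the previous frame.\n            tracked.append({**cur, 'missed': 0})\n    return tracked<\/pre>\n\n\n\n<p>Inside the video loop, we could call <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">points = smooth_points(prev_points, points)<\/code> right after the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">point<\/code> call (starting from an empty <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">prev_points<\/code> list), draw the returned list instead of the raw output, and set <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">prev_points = points<\/code> before reading the next frame.<\/p>\n\n\n\n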
<h3 class=\"wp-block-heading\">Object Detection in Moondream<\/h3>\n\n\n\n<p>The process of object detection in Moondream is very similar to that of pointing. We call the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">detect<\/code> method of the model here.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Object Detection in Images<\/h4>\n\n\n\n<p>We start with object detection in images. The code is present in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">moondream_object_detection.py<\/code> file.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"moondream_object_detection.py\" data-enlighter-group=\"moondream_object_detection_1\">from transformers import AutoModelForCausalLM\nfrom PIL import Image\n\nimport cv2\nimport numpy as np\nimport argparse\nimport os\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\n    '--input',\n    help='path to the input image'\n)\nargs = parser.parse_args()\n\nout_dir = 'outputs'\nos.makedirs(out_dir, exist_ok=True)\n\nmodel = AutoModelForCausalLM.from_pretrained(\n    'vikhyatk\/moondream2',\n    revision='2025-01-09',\n    trust_remote_code=True,\n    device_map={'': 'cuda'}\n)\n\n# Read the image passed through the --input argument.\nimage = Image.open(args.input)\n\ndetecting = 'giraffes'\nprint(f\"Object detection: {detecting}\")\nobjects = model.detect(image, detecting)['objects']\nprint(objects)\n\nimage_array = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)\n\nfor object in objects:\n    h, w = image_array.shape[:2]\n    xmin = object['x_min']*w\n    ymin = object['y_min']*h\n    xmax = object['x_max']*w\n    ymax = object['y_max']*h\n    \n    cv2.rectangle(\n        image_array,\n        pt1=(int(xmin), int(ymin)),\n        pt2=(int(xmax), int(ymax)),\n        color=(0, 0, 255),\n        thickness=4,\n        lineType=cv2.LINE_AA\n    )\n    cv2.rectangle(\n        image_array,\n        pt1=(int(xmin), int(ymin)),\n        pt2=(int(xmax), int(ymax)),\n        color=(255, 255, 255),\n        thickness=1,\n        lineType=cv2.LINE_AA\n    )\n\nfile_name = args.input.split(os.path.sep)[-1]\ncv2.imwrite(os.path.join(out_dir, 'detection_'+file_name), image_array)\n\ncv2.imshow('Image', image_array)\ncv2.waitKey(0)<\/pre>\n\n\n\n<p>The only parts that change are the method we call and how we visualize the results. 
Instead of (x, y) coordinate pairs, the model returns (x_min, y_min, x_max, y_max) coordinates for each object, normalized to the image dimensions.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">[{'x_min': 0.2080078125, 'y_min': 0.349609375, 'x_max': 0.3720703125, 'y_max': 0.6875},\n {'x_min': 0.57421875, 'y_min': 0.3955078125, 'x_max': 0.728515625, 'y_max': 0.6787109375},\n {'x_min': 0.546875, 'y_min': 0.4013671875, 'x_max': 0.701171875, 'y_max': 0.6689453125},\n {'x_min': 0.4306640625, 'y_min': 0.40625, 'x_max': 0.5927734375, 'y_max': 0.673828125}, \n{'x_min': 0.666015625, 'y_min': 0.43408203125, 'x_max': 0.78125, 'y_max': 0.69677734375}]<\/pre>\n\n\n\n<p>Let&#8217;s execute the script and visualize the results.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python moondream_object_detection.py --input input\/giraffes.jpg<\/pre>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/detection_giraffes.jpg\" target=\"_blank\" rel=\" noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"555\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/detection_giraffes.jpg\" alt=\"Object detection using Moondream. The model can detect all the five giraffes, including the ones that are partially occluded.\" class=\"wp-image-40680\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/detection_giraffes.jpg 640w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/detection_giraffes-300x260.jpg 300w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 6. Object detection using Moondream. The model can detect all the five giraffes, including the ones that are partially occluded.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>The model is successful in detecting all the giraffes in this case.<\/p>\n\n\n\n
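<p>Because the boxes are plain normalized coordinates, they are easy to reuse beyond drawing rectangles. The following is a small, optional sketch (not part of the scripts above) that crops each detected giraffe out of the same PIL image; it assumes the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">image<\/code> and <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">objects<\/code> variables from the detection script, and the output file names are arbitrary.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import os\n\n# Illustrative sketch: crop each detected region from the PIL image.\n# Assumes `image` (PIL.Image) and `objects` (from model.detect) as above.\nw, h = image.size\nfor i, obj in enumerate(objects):\n    box = (\n        int(obj['x_min'] * w),\n        int(obj['y_min'] * h),\n        int(obj['x_max'] * w),\n        int(obj['y_max'] * h),\n    )\n    # PIL expects (left, upper, right, lower) in pixels.\n    crop = image.crop(box)\n    crop.save(os.path.join('outputs', f'giraffe_crop_{i}.jpg'))<\/pre>\n\n\n\n<p>Such crops could then, for example, be passed back to the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">caption<\/code> or <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">query<\/code> methods for per-object descriptions.<\/p>\n\n\n\n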
<h4 class=\"wp-block-heading\">Object Detection in Videos<\/h4>\n\n\n\n<p>Similarly, we can detect objects in videos. The following code is present in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">moondream_object_detection_video.py<\/code> file.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"moondream_object_detection_video.py\" data-enlighter-group=\"moondream_object_detection_video_1\">from transformers import AutoModelForCausalLM\nfrom PIL import Image\n\nimport cv2\nimport os\nimport time\nimport argparse\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\n    '--input',\n    help='path to the input video'\n)\nargs = parser.parse_args()\n\nout_dir = 'outputs'\nos.makedirs(out_dir, exist_ok=True)\n\nmodel = AutoModelForCausalLM.from_pretrained(\n    'vikhyatk\/moondream2',\n    revision='2025-01-09',\n    trust_remote_code=True,\n    device_map={'': 'cuda'},\n)\n\n# Read the video passed through the --input argument.\ncap = cv2.VideoCapture(args.input)\n\nframe_width = int(cap.get(3))\nframe_height = int(cap.get(4))\n\nsave_name = args.input.split(os.path.sep)[-1]\n# Define codec and create VideoWriter object.\nout = cv2.VideoWriter(f\"{out_dir}\/detection_{save_name}\", \n                    cv2.VideoWriter_fourcc(*'mp4v'), 30, \n                    (frame_width, frame_height))\n\ndetecting = 'giraffe'\n\nframe_count = 0 # To count total frames.\ntotal_fps = 0 # To get the final frames per second.\n\nwhile cap.isOpened():\n    ret, frame = cap.read()\n\n    if ret:\n        frame_pil = Image.fromarray(frame).convert('RGB')\n\n        print(f\"Object detection: {detecting}\")\n        start_time = time.time()\n\n        objects = model.detect(frame_pil, detecting)['objects']\n\n        end_time = time.time()\n\n        fps = 1 \/ (end_time - start_time)\n        total_fps += fps\n        frame_count += 1\n\n        for object in objects:\n            h, w = frame.shape[:2]\n            xmin = object['x_min']*w\n            ymin = object['y_min']*h\n            xmax = object['x_max']*w\n            ymax = object['y_max']*h\n            \n            cv2.rectangle(\n                frame,\n                pt1=(int(xmin), int(ymin)),\n                pt2=(int(xmax), int(ymax)),\n                color=(0, 0, 255),\n                thickness=4,\n                lineType=cv2.LINE_AA\n            )\n            cv2.rectangle(\n                frame,\n                pt1=(int(xmin), int(ymin)),\n                pt2=(int(xmax), int(ymax)),\n                color=(255, 255, 255),\n                thickness=1,\n                lineType=cv2.LINE_AA\n            )\n\n        out.write(frame)\n\n        cv2.imshow('Image [Press Q to exit]', frame)\n        # Press `q` to exit\n        if cv2.waitKey(1) &amp; 0xFF == ord('q'):\n            break\n    else:\n        break\n\n# Release VideoCapture().\ncap.release()\n# Close all frames and video windows.\ncv2.destroyAllWindows()\n\n# Calculate and print the average FPS.\navg_fps = total_fps \/ frame_count\nprint(f\"Average FPS: {avg_fps:.3f}\")<\/pre>\n\n\n\n<p>Let&#8217;s execute the script.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python moondream_object_detection_video.py --input input\/video_1.mp4<\/pre>\n\n\n\n<figure class=\"wp-block-video aligncenter\"><video height=\"360\" 
style=\"aspect-ratio: 640 \/ 360;\" width=\"640\" autoplay controls loop muted src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/detection_video_1.mp4\"><\/video><figcaption class=\"wp-element-caption\">Video 2. Object detection in video using Moondream. The model is able to consistently detect the two giraffes in the front.<\/figcaption><\/figure>\n\n\n\n<p>As we can see, the model can detect the giraffes, still, we have the same occlusion issue that we had in the case of pointing. The average FPS for video inference using Moondream detection was 4.1 FPS.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Summary and Conclusion<\/h3>\n\n\n\n<p>In this article, we covered inference and several tasks using Moondream. We started with a short discussion about the model, then covered all the tasks including caption, visual query, pointing, and detection. In future articles, we will dive deeper into these while trying to address some of the issues that we discussed. I hope this article was worth your time.<\/p>\n\n\n\n<p>If you have anu questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.<\/p>\n\n\n\n<p>You can contact me using the <strong><a aria-label=\"Contact (opens in a new tab)\" href=\"https:\/\/debuggercafe.com\/contact-us\/\" target=\"_blank\" rel=\"noreferrer noopener\">Contact<\/a><\/strong> section. You can also find me on <strong><a aria-label=\"LinkedIn (opens in a new tab)\" href=\"https:\/\/www.linkedin.com\/in\/sovit-rath\/\" target=\"_blank\" rel=\"noreferrer noopener\">LinkedIn<\/a><\/strong>, and <strong><a href=\"https:\/\/x.com\/SovitRath5\" target=\"_blank\" rel=\"noreferrer noopener\">X<\/a><\/strong>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this article, we cover the Moondream model which is a VLM (Vision Language Model) that can be used for image captioning, visual querying, object pointing, and object detection.<\/p>\n","protected":false},"author":1,"featured_media":40686,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[76,1154,820,1329,1062],"tags":[1337,1186,1188,1189,1187,1338,1185,1184,1183],"class_list":["post-40595","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-computer-vision","category-generative-ai","category-hugging-face","category-hugging-face-transformers","category-vlms","tag-hugging-face-moondream","tag-moondream","tag-moondream-captioning","tag-moondream-object-detection","tag-moondream-pointing","tag-moondream-transformers","tag-moondream-visual-querying","tag-moondream1","tag-moondream2"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.9 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Moondream - One Model for Captioning, Pointing, and Detection<\/title>\n<meta name=\"description\" content=\"Moondream is a small Vision Language Model that we can use for image captioning, visual querying, object pointing, and object detection.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/debuggercafe.com\/moondream\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Moondream - One Model for Captioning, Pointing, and Detection\" \/>\n<meta property=\"og:description\" content=\"Moondream is a small Vision 
Language Model that we can use for image captioning, visual querying, object pointing, and object detection.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/debuggercafe.com\/moondream\/\" \/>\n<meta property=\"og:site_name\" content=\"DebuggerCafe\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/profile.php?id=100013731104496\" \/>\n<meta property=\"article:published_time\" content=\"2025-03-17T00:30:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-06-02T14:11:23+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/Moondream-\u2013-One-Model-for-Captioning-Pointing-and-Detection-e1738114663944.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1000\" \/>\n\t<meta property=\"og:image:height\" content=\"563\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Sovit Ranjan Rath\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@SovitRath5\" \/>\n<meta name=\"twitter:site\" content=\"@SovitRath5\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Sovit Ranjan Rath\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"17 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/debuggercafe.com\/moondream\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/debuggercafe.com\/moondream\/\"},\"author\":{\"name\":\"Sovit Ranjan Rath\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\"},\"headline\":\"Moondream &#8211; One Model for Captioning, Pointing, and Detection\",\"datePublished\":\"2025-03-17T00:30:00+00:00\",\"dateModified\":\"2025-06-02T14:11:23+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/debuggercafe.com\/moondream\/\"},\"wordCount\":1606,\"commentCount\":1,\"image\":{\"@id\":\"https:\/\/debuggercafe.com\/moondream\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/Moondream-\u2013-One-Model-for-Captioning-Pointing-and-Detection-e1738114663944.png\",\"keywords\":[\"Hugging Face Moondream\",\"Moondream\",\"Moondream Captioning\",\"Moondream Object Detection\",\"Moondream Pointing\",\"Moondream Transformers\",\"Moondream Visual Querying\",\"Moondream1\",\"Moondream2\"],\"articleSection\":[\"Computer Vision\",\"Generative AI\",\"Hugging Face\",\"Hugging Face Transformers\",\"VLMs\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/debuggercafe.com\/moondream\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/debuggercafe.com\/moondream\/\",\"url\":\"https:\/\/debuggercafe.com\/moondream\/\",\"name\":\"Moondream - One Model for Captioning, Pointing, and 
Detection\",\"isPartOf\":{\"@id\":\"https:\/\/debuggercafe.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/debuggercafe.com\/moondream\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/debuggercafe.com\/moondream\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/Moondream-\u2013-One-Model-for-Captioning-Pointing-and-Detection-e1738114663944.png\",\"datePublished\":\"2025-03-17T00:30:00+00:00\",\"dateModified\":\"2025-06-02T14:11:23+00:00\",\"author\":{\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\"},\"description\":\"Moondream is a small Vision Language Model that we can use for image captioning, visual querying, object pointing, and object detection.\",\"breadcrumb\":{\"@id\":\"https:\/\/debuggercafe.com\/moondream\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/debuggercafe.com\/moondream\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/debuggercafe.com\/moondream\/#primaryimage\",\"url\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/Moondream-\u2013-One-Model-for-Captioning-Pointing-and-Detection-e1738114663944.png\",\"contentUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/Moondream-\u2013-One-Model-for-Captioning-Pointing-and-Detection-e1738114663944.png\",\"width\":1000,\"height\":563,\"caption\":\"Moondream \u2013 One Model for Captioning, Pointing, and Detection\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/debuggercafe.com\/moondream\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/debuggercafe.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Moondream &#8211; One Model for Captioning, Pointing, and Detection\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/debuggercafe.com\/#website\",\"url\":\"https:\/\/debuggercafe.com\/\",\"name\":\"DebuggerCafe\",\"description\":\"Machine Learning and Deep Learning\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/debuggercafe.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\",\"name\":\"Sovit Ranjan Rath\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g\",\"caption\":\"Sovit Ranjan Rath\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Moondream - One Model for Captioning, Pointing, and Detection","description":"Moondream is a small Vision Language Model that we can use for image captioning, visual querying, object pointing, and object detection.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/debuggercafe.com\/moondream\/","og_locale":"en_US","og_type":"article","og_title":"Moondream - One Model for Captioning, Pointing, and Detection","og_description":"Moondream is a small Vision Language Model that we can use for image captioning, visual querying, object pointing, and object detection.","og_url":"https:\/\/debuggercafe.com\/moondream\/","og_site_name":"DebuggerCafe","article_publisher":"https:\/\/www.facebook.com\/profile.php?id=100013731104496","article_published_time":"2025-03-17T00:30:00+00:00","article_modified_time":"2025-06-02T14:11:23+00:00","og_image":[{"width":1000,"height":563,"url":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/Moondream-\u2013-One-Model-for-Captioning-Pointing-and-Detection-e1738114663944.png","type":"image\/png"}],"author":"Sovit Ranjan Rath","twitter_card":"summary_large_image","twitter_creator":"@SovitRath5","twitter_site":"@SovitRath5","twitter_misc":{"Written by":"Sovit Ranjan Rath","Est. reading time":"17 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/debuggercafe.com\/moondream\/#article","isPartOf":{"@id":"https:\/\/debuggercafe.com\/moondream\/"},"author":{"name":"Sovit Ranjan Rath","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752"},"headline":"Moondream &#8211; One Model for Captioning, Pointing, and Detection","datePublished":"2025-03-17T00:30:00+00:00","dateModified":"2025-06-02T14:11:23+00:00","mainEntityOfPage":{"@id":"https:\/\/debuggercafe.com\/moondream\/"},"wordCount":1606,"commentCount":1,"image":{"@id":"https:\/\/debuggercafe.com\/moondream\/#primaryimage"},"thumbnailUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/Moondream-\u2013-One-Model-for-Captioning-Pointing-and-Detection-e1738114663944.png","keywords":["Hugging Face Moondream","Moondream","Moondream Captioning","Moondream Object Detection","Moondream Pointing","Moondream Transformers","Moondream Visual Querying","Moondream1","Moondream2"],"articleSection":["Computer Vision","Generative AI","Hugging Face","Hugging Face Transformers","VLMs"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/debuggercafe.com\/moondream\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/debuggercafe.com\/moondream\/","url":"https:\/\/debuggercafe.com\/moondream\/","name":"Moondream - One Model for Captioning, Pointing, and Detection","isPartOf":{"@id":"https:\/\/debuggercafe.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/debuggercafe.com\/moondream\/#primaryimage"},"image":{"@id":"https:\/\/debuggercafe.com\/moondream\/#primaryimage"},"thumbnailUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/Moondream-\u2013-One-Model-for-Captioning-Pointing-and-Detection-e1738114663944.png","datePublished":"2025-03-17T00:30:00+00:00","dateModified":"2025-06-02T14:11:23+00:00","author":{"@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752"},"description":"Moondream is a small Vision Language Model that we can use for image captioning, visual querying, 
object pointing, and object detection.","breadcrumb":{"@id":"https:\/\/debuggercafe.com\/moondream\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/debuggercafe.com\/moondream\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/debuggercafe.com\/moondream\/#primaryimage","url":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/Moondream-\u2013-One-Model-for-Captioning-Pointing-and-Detection-e1738114663944.png","contentUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2025\/01\/Moondream-\u2013-One-Model-for-Captioning-Pointing-and-Detection-e1738114663944.png","width":1000,"height":563,"caption":"Moondream \u2013 One Model for Captioning, Pointing, and Detection"},{"@type":"BreadcrumbList","@id":"https:\/\/debuggercafe.com\/moondream\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/debuggercafe.com\/"},{"@type":"ListItem","position":2,"name":"Moondream &#8211; One Model for Captioning, Pointing, and Detection"}]},{"@type":"WebSite","@id":"https:\/\/debuggercafe.com\/#website","url":"https:\/\/debuggercafe.com\/","name":"DebuggerCafe","description":"Machine Learning and Deep Learning","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/debuggercafe.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752","name":"Sovit Ranjan Rath","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g","caption":"Sovit Ranjan Rath"}}]}},"_links":{"self":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/40595","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/comments?post=40595"}],"version-history":[{"count":94,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/40595\/revisions"}],"predecessor-version":[{"id":40700,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/40595\/revisions\/40700"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/media\/40686"}],"wp:attachment":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/media?parent=40595"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/categories?post=40595"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/tags?post=40595"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}