In this article, we cover inference code for SmolVLM2. We carry out image and video inference experiments using SmolVLM2-2.2B-Instruct and SmolVLM2-256M-Instruct. ...
In this article, we explore Qwen2.5-Omni, a multimodal generative AI model that can accept text, image, video, and audio as inputs while outputting both text and audio. ...
In this article, we fine-tune the SmolVLM-256M model for receipt OCR on the SROIE v2 dataset after generating the ground truth data using the QwenVL-2B model. ...
In this article, we explore Gemma 3. We start with the need for Gemma 3, its architecture and multimodal capabilities, and carry out inference using Hugging Face. ...