DebuggerCafe - Deep Learning, Machine Learning, Artificial Intelligence

SmolVLM: Accessible Image Captioning with Small Vision Language Model

In this article, we cover the SmolVLM model by Hugging Face. It is a compact 2.2B parameter model for vision understanding. ...

Gradio Application using Qwen2.5-VL

Sovit Ranjan Rath May 5, 2025 4 Comments

In this article, we build a simple Gradio application with Qwen2.5-VL for image captioning, video captioning, and object detection. ...

Qwen2.5-VL: Architecture, Benchmarks and Inference

Sovit Ranjan Rath April 28, 2025 3 Comments

In this article, we explore Qwen2.5-VL using Hugging Face Transformers. We cover the Qwen2.5-VL architecture, data preparation, benchmarks, and inference. ...

Phi-4 Mini and Phi-4 Multimodal

Sovit Ranjan Rath April 21, 2025 0 Comments

In this article, we cover the Phi-4 Mini model. We start with a discussion of the architecture and then create a simple Gradio application for the Phi-4 Mini Instruct and Phi-4 Multimodal models. ...

ViTPose – Human Pose Estimation with Vision Transformer

Sovit Ranjan Rath April 14, 2025 0 Comments

In this article, we cover the architecture of ViTPose and ViTPose++ and run inference on images and videos using ViTPose. ...