{"id":38766,"date":"2024-12-09T06:00:00","date_gmt":"2024-12-09T00:30:00","guid":{"rendered":"https:\/\/debuggercafe.com\/?p=38766"},"modified":"2025-06-02T19:59:58","modified_gmt":"2025-06-02T14:29:58","slug":"fastervit-detection","status":"publish","type":"post","link":"https:\/\/debuggercafe.com\/fastervit-detection\/","title":{"rendered":"FasterViT Detection"},"content":{"rendered":"\n<p>In this article, we will build the <strong><em>FasterViT Detection model<\/em><\/strong>. The primary aim is to create a single stage object detection model from a Vision Transformer backbone. We will use the pretrained FasterViT backbone from NVIDIA, add an SSD head from Torchvision, and train the model on the Pascal VOC object detection dataset.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-horizontal is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-499968f5 wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-outline is-style-outline--1\"><a class=\"wp-block-button__link has-black-color has-luminous-vivid-orange-background-color has-text-color has-background wp-element-button\" href=\"#download-code\"><strong>Jump to Download Code<\/strong><\/a><\/div>\n<\/div>\n\n\n\n<p><\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/fastervit-detection-output-demo.gif\" target=\"_blank\" rel=\"noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"338\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/fastervit-detection-output-demo.gif\" alt=\"FasterViT Detection model - demo output.\" class=\"wp-image-38859\"\/><\/a><figcaption class=\"wp-element-caption\">Figure 1. FasterViT Detection model &#8211; demo output.<\/figcaption><\/figure>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\"><em>Primarily, the article covers the following topics<\/em><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>A brief background on Vision Transformer object detection models.<\/em><\/li>\n\n\n\n<li><em>Modifications that we need to make to the FasterViT backbone to create a Transformer based object detection model.<\/em><\/li>\n\n\n\n<li><em>Data loading and augmentation pipeline.<\/em><\/li>\n\n\n\n<li><em>Training the FasterViT Detection model, running evaluation, and inference on unseen data. <\/em><\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Background on Vision Transformer Based Object Detection Models<\/h2>\n\n\n\n<p>Since the advent of Vision Transformer models, we have seen their applications in several tasks. Image classification, semantic segmentation, object detection,  and many industrial applications as well. Often, libraries like MMDetection and Detectron2 provide Transformer based object detection models. <\/p>\n\n\n\n<p>Detectron2 has the famous <a href=\"https:\/\/debuggercafe.com\/pretraining-faster-rcnn-vit-detection-model-on-pascal-voc\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>ViTDet model<\/strong><\/a> and MMDetection has Transformer based detection models as well. 
But most of these are based on Mask R-CNN heads (instance segmentation) and do not run in real time on commodity hardware.<\/p>\n\n\n\n<figure class=\"wp-block-video aligncenter\"><video height=\"480\" style=\"aspect-ratio: 854 \/ 480;\" width=\"854\" autoplay controls loop muted src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/vitdet_low_fps.mp4\"><\/video><figcaption class=\"wp-element-caption\">Video 1. ViTDet model running at less than 24 FPS (non real-time) on an RTX 3080 GPU.<\/figcaption><\/figure>\n\n\n\n<p>Although libraries like Ultralytics have RT-DETR integration, their codebase is not easy to explore.<\/p>\n\n\n\n<p><em>What if we want to create a real-time object detection model with a Single Stage Object Detection head?<\/em> That&#8217;s where pretrained <strong><a href=\"https:\/\/debuggercafe.com\/vision-transformer-from-scratch\/\" target=\"_blank\" rel=\"noreferrer noopener\">Vision Transformer<\/a><\/strong> backbones and Torchvision detection utilities come into the picture. <\/p>\n\n\n\n<p>We will modify the pretrained FasterViT backbone along with the <strong><a href=\"https:\/\/debuggercafe.com\/ssd300-vgg16-backbone-object-detection-with-pytorch-and-torchvision\/\" target=\"_blank\" rel=\"noreferrer noopener\">Torchvision SSD<\/a><\/strong> head to create a real-time Vision Transformer object detection model.<\/p>\n\n\n\n<p>The codebase will be easy to explore and modify. Although we are not building everything from scratch, creating such an object detection model ourselves, then training and evaluating it, will lead to a lot of learning.<\/p>\n\n\n\n<p><strong><em>It is worthwhile to note that although we get decent results, they are not state-of-the-art. We will primarily aim to create a Vision Transformer object detection model partially from scratch and work with the code.<\/em><\/strong><\/p>\n\n\n\n<p>Before moving further, I highly recommend reviewing the <strong><a href=\"https:\/\/debuggercafe.com\/fastervit-for-image-classification\/\" target=\"_blank\" rel=\"noreferrer noopener\">FasterViT image classification article<\/a><\/strong>. In that article, we cover the FasterViT model from NVIDIA, its variants, the results, and carry out image classification.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Pascal VOC Object Detection Dataset<\/h2>\n\n\n\n<p>We will train the FasterViT detection model on the Pascal VOC dataset. The dataset contains <strong>16551 images for training<\/strong> and <strong>4952 images for validation<\/strong> across 20 object classes.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"pascal_voc_detection_classes\" data-enlighter-group=\"pascal_voc_detection_classes_1\">[\n'background',\n\"aeroplane\", \"bicycle\", \"bird\", \"boat\", \"bottle\", \"bus\", \"car\", \"cat\",\n\"chair\", \"cow\", \"diningtable\", \"dog\", \"horse\", \"motorbike\", \"person\",\n\"pottedplant\", \"sheep\", \"sofa\", \"train\", \"tvmonitor\"\n]<\/pre>\n\n\n\n<p>You can download the dataset <strong><a href=\"https:\/\/www.kaggle.com\/datasets\/sovitrath\/pascal-voc-07-12\" target=\"_blank\" rel=\"noreferrer noopener\">from Kaggle<\/a><\/strong>. 
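<\/p>\n\n\n\n<p>The labels are standard Pascal VOC XML files. Once the dataset is extracted, a minimal sketch like the following reads the boxes and class names from one annotation using Python&#8217;s built-in <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">xml.etree.ElementTree<\/code>. The tag names are part of the VOC schema, while the file name below is only an example.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import xml.etree.ElementTree as ET\n\n# Hypothetical annotation file; replace with any XML file from the labels directory.\ntree = ET.parse('data\/voc_07_12\/final_xml_dataset\/train\/labels\/000005.xml')\nroot = tree.getroot()\n\n# Each object element holds a class name and a bounding box in\n# absolute pixel coordinates (xmin, ymin, xmax, ymax).\nfor obj in root.findall('object'):\n    name = obj.find('name').text\n    bndbox = obj.find('bndbox')\n    xmin = int(float(bndbox.find('xmin').text))\n    ymin = int(float(bndbox.find('ymin').text))\n    xmax = int(float(bndbox.find('xmax').text))\n    ymax = int(float(bndbox.find('ymax').text))\n    print(name, (xmin, ymin, xmax, ymax))<\/pre>\n\n\n\n<p>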
After extracting, you will find the following directory structure.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">voc_07_12\/\n\u2514\u2500\u2500 final_xml_dataset\n    \u251c\u2500\u2500 train\n    \u2502\u00a0\u00a0 \u251c\u2500\u2500 images [16551 entries exceeds filelimit, not opening dir]\n    \u2502\u00a0\u00a0 \u2514\u2500\u2500 labels [16551 entries exceeds filelimit, not opening dir]\n    \u251c\u2500\u2500 valid\n    \u2502\u00a0\u00a0 \u251c\u2500\u2500 images [4952 entries exceeds filelimit, not opening dir]\n    \u2502\u00a0\u00a0 \u2514\u2500\u2500 labels [4952 entries exceeds filelimit, not opening dir]\n    \u2514\u2500\u2500 README.txt<\/pre>\n\n\n\n<p>We have a <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">train<\/code> and a <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">valid<\/code> directory with respective subdirectories for images and labels in XML format.<\/p>\n\n\n\n<p>Here are some samples from the dataset.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/pascal-voc-ground-truth-data.png\" target=\"_blank\" rel=\"noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"535\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/pascal-voc-ground-truth-data.png\" alt=\"Figure 2. Ground truth images and labels from the Pascal VOC dataset.\" class=\"wp-image-38871\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/pascal-voc-ground-truth-data.png 800w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/pascal-voc-ground-truth-data-300x201.png 300w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/pascal-voc-ground-truth-data-768x514.png 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 2. 
Ground truth images and labels from the Pascal VOC dataset.<\/figcaption><\/figure>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\">Project Directory Structure<\/h2>\n\n\n\n<p>Let&#8217;s take a look at the entire project&#8217;s directory structure.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\u251c\u2500\u2500 data\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 inference_data\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 voc_07_12\n\u251c\u2500\u2500 inference_outputs\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 images\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 videos\n\u251c\u2500\u2500 outputs\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 best_model.pth\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 last_model.pth\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 map.png\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 train_loss.png\n\u251c\u2500\u2500 weights\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 faster_vit_0.pth.tar\n\u251c\u2500\u2500 config.py\n\u251c\u2500\u2500 custom_utils.py\n\u251c\u2500\u2500 datasets.py\n\u251c\u2500\u2500 eval.py\n\u251c\u2500\u2500 inference.py\n\u251c\u2500\u2500 inference_video.py\n\u251c\u2500\u2500 model.py\n\u251c\u2500\u2500 README.md\n\u251c\u2500\u2500 requirements.txt\n\u2514\u2500\u2500 train.py<\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">data<\/code> directory contains the Pascal VOC dataset that we downloaded earlier, along with the inference data.<\/li>\n\n\n\n<li>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">inference_outputs<\/code> directory contains the results from carrying out inference after training the model.<\/li>\n\n\n\n<li>In the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">outputs<\/code> directory, we have the trained model weights and the plots for Mean Average Precision and loss.<\/li>\n\n\n\n<li>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">weights<\/code> directory contains the ImageNet pretrained weights of FasterViT-0 that we downloaded from the <strong><a href=\"https:\/\/github.com\/NVlabs\/FasterViT\" target=\"_blank\" rel=\"noreferrer noopener\">official repository<\/a><\/strong>.<\/li>\n\n\n\n<li>The parent project directory contains all the code files, along with a README and a requirements file.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"has-background\" style=\"background-color:#ffb76a\"><strong><em>The download section allows you to download the Python code files, the FasterViT-0 pretrained weights, the inference data, and the best weights from the Pascal VOC training. In case you follow along with the training, please download the Pascal VOC dataset and arrange it in the directory structure shown above.<\/em><\/strong><\/p>\n\n\n\n<h3 class=\"wp-block-heading has-text-align-center\" id=\"download-code\">Download Code<\/h3>\n\n\n\n<div class=\"wp-block-button is-style-outline center\"><a data-sumome-listbuilder-id=\"dc14a516-c4f6-4672-ac43-3171fc6cb2c7\" class=\"wp-block-button__link has-black-color has-luminous-vivid-orange-background-color has-text-color has-background\"><b>Download the Source Code for this Tutorial<\/b><\/a><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Installing Dependencies<\/h3>\n\n\n\n<p>We are using PyTorch as the deep learning framework here. 
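<\/p>\n\n\n\n<p>If you prefer to pin just the two core libraries manually, the versions used for this post were the following (assuming a CUDA-enabled build is available for your platform):<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">pip install torch==2.1.2 torchvision==0.16.2<\/pre>\n\n\n\n<p>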
You can install all the requirements using the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">requirements.txt<\/code> file.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">pip install -r requirements.txt<\/pre>\n\n\n\n<p>Although it installs PyTorch 2.1.2 and Torchvision 0.16.2, the code should work with the latest versions as well.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">FasterViT Detection for a Custom Transformer Based Object Detection Model<\/h2>\n\n\n\n<p>We will explore some of the important components of the codebase here. We will start with the most crucial component, the FasterViT detection model.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">FasterViT Detection Model<\/h3>\n\n\n\n<p>Here, we will explore the changes and additions that we need to make to the base FasterViT-0 model to make it object detection compatible.<\/p>\n\n\n\n<p>In one of our previous articles, we created a <strong><a href=\"https:\/\/debuggercafe.com\/fastervit-for-semantic-segmentation\/\" target=\"_blank\" rel=\"noreferrer noopener\">FasterViT semantic segmentation<\/a><\/strong> model. Further, we have also explored training the <strong><a href=\"https:\/\/debuggercafe.com\/training-fastervit-on-voc-segmentation-dataset\/\" target=\"_blank\" rel=\"noreferrer noopener\">FasterViT on Pascal VOC segmentation dataset<\/a><\/strong>. I am sure that both of these previous articles will help you get a better understanding of the backbone.<\/p>\n\n\n\n<p>The code for the detection model resides in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">model.py<\/code> file. It is almost entirely borrowed from the official repository so that we can make any changes that we want. 
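<\/p>\n\n\n\n<p>Before diving in, it helps to know the contract that Torchvision&#8217;s SSD class expects from a backbone: any <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">nn.Module<\/code> that maps an image batch to a feature map will do, as long as we also pass a matching head. The following is a minimal, self-contained sketch with a toy convolutional backbone standing in for FasterViT (purely for illustration):<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import torch\nfrom torch import nn\nfrom torchvision.models.detection.ssd import SSD, SSDHead\nfrom torchvision.models.detection.anchor_utils import DefaultBoxGenerator\n\n# Toy backbone: one strided conv mapping 3x512x512 images to a 512x16x16 map.\ntoy_backbone = nn.Sequential(\n    nn.Conv2d(3, 512, kernel_size=3, stride=32, padding=1),\n    nn.ReLU(),\n)\nanchor_generator = DefaultBoxGenerator([[2]])\nhead = SSDHead(\n    [512], anchor_generator.num_anchors_per_location(), num_classes=21\n)\nmodel = SSD(\n    backbone=toy_backbone,\n    anchor_generator=anchor_generator,\n    size=(512, 512),\n    num_classes=21,\n    head=head,\n)\nmodel.eval()\nwith torch.no_grad():\n    outputs = model(torch.randn(1, 3, 512, 512))\nprint(outputs[0]['boxes'].shape)<\/pre>\n\n\n\n<p>Our <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">model.py<\/code> follows this same pattern, just with the real FasterViT backbone in place of the toy one. <\/p>\n\n\n\n<p>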
The file is <strong><em>more than 1000 lines of code<\/em><\/strong>, so we will cover only the most important components.<\/p>\n\n\n\n<p>We start with some minor changes to the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">__init__<\/code> method of the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">FasterViT<\/code> class.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">class FasterViT(nn.Module):\n    \"\"\"\n    FasterViT based on: \"Hatamizadeh et al.,\n    FasterViT: Fast Vision Transformers with Hierarchical Attention\n    \"\"\"\n\n    def __init__(self,\n                 dim,\n                 in_dim,\n                 depths,\n                 window_size,\n                 ct_size,\n                 mlp_ratio,\n                 num_heads,\n                 resolution=[224, 224],\n                 drop_path_rate=0.2,\n                 in_chans=3,\n                 num_classes=1000,\n                 qkv_bias=True,\n                 qk_scale=None,\n                 drop_rate=0.,\n                 attn_drop_rate=0.,\n                 layer_scale=None,\n                 layer_scale_conv=None,\n                 layer_norm_last=False,\n                 hat=[False, False, True, False],\n                 do_propagation=False,\n                 **kwargs):\n        \"\"\"\n        Args:\n            dim: feature size dimension.\n            in_dim: inner-plane feature size dimension.\n            depths: layer depth.\n            window_size: window size.\n            ct_size: spatial dimension of carrier token local window.\n            mlp_ratio: MLP ratio.\n            num_heads: number of attention heads.\n            resolution: image resolution.\n            drop_path_rate: drop path rate.\n            in_chans: input channel dimension.\n            num_classes: number of classes.\n            qkv_bias: bool argument for query, key, value learnable bias.\n            qk_scale: bool argument to scaling query, key.\n            drop_rate: dropout rate.\n            attn_drop_rate: attention dropout rate.\n            layer_scale: layer scale coefficient.\n            layer_scale_conv: conv layer scale coefficient.\n            layer_norm_last: last stage layer norm flag.\n            hat: hierarchical attention flag.\n            do_propagation: enable carrier token propagation.\n        \"\"\"\n        super().__init__()\n        if type(resolution)!=tuple and type(resolution)!=list:\n            resolution = [resolution, resolution]\n        num_features = int(dim * 2 ** (len(depths) - 1))\n        self.num_classes = num_classes\n        self.patch_embed = PatchEmbed(in_chans=in_chans, in_dim=in_dim, dim=dim)\n        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]\n        self.levels = nn.ModuleList()\n        if hat is None: hat = [True, ]*len(depths)\n        for i in range(len(depths)):\n            conv = True if (i == 0 or i == 1) else False\n            level = FasterViTLayer(dim=int(dim * 2 ** i),\n                                   depth=depths[i],\n                                   num_heads=num_heads[i],\n                                   window_size=window_size[i],\n                                   ct_size=ct_size,\n                                   mlp_ratio=mlp_ratio,\n                                  
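# Each stage i operates on features downsampled by 4 * 2**i relative to the input;\n                                   # stages 0 and 1 (conv=True above) are convolutional, later stages use attention.\n                                  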
 qkv_bias=qkv_bias,\n                                   qk_scale=qk_scale,\n                                   conv=conv,\n                                   drop=drop_rate,\n                                   attn_drop=attn_drop_rate,\n                                   drop_path=dpr[sum(depths[:i]):sum(depths[:i + 1])],\n                                   downsample=(i &lt; 3),\n                                   layer_scale=layer_scale,\n                                   layer_scale_conv=layer_scale_conv,\n                                   input_resolution=[int(2 ** (-2 - i) * resolution[0]), \n                                                     int(2 ** (-2 - i) * resolution[1])],\n                                   only_local=not hat[i],\n                                   do_propagation=do_propagation)\n            self.levels.append(level)\n            \n        self.norm = LayerNorm2d(num_features) if layer_norm_last else nn.BatchNorm2d(num_features)\n        self.avgpool = nn.AdaptiveAvgPool2d(1)\n        self.head = nn.Linear(num_features, num_classes) if num_classes > 0 else nn.Identity()\n        self.apply(self._init_weights)\n\n    def _init_weights(self, m):\n        if isinstance(m, nn.Linear):\n            trunc_normal_(m.weight, std=.02)\n            if isinstance(m, nn.Linear) and m.bias is not None:\n                nn.init.constant_(m.bias, 0)\n        elif isinstance(m, nn.LayerNorm):\n            nn.init.constant_(m.bias, 0)\n            nn.init.constant_(m.weight, 1.0)\n        elif isinstance(m, LayerNorm2d):\n            nn.init.constant_(m.bias, 0)\n            nn.init.constant_(m.weight, 1.0)\n        elif isinstance(m, nn.BatchNorm2d):\n            nn.init.ones_(m.weight)\n            nn.init.zeros_(m.bias)\n\n    @torch.jit.ignore\n    def no_weight_decay_keywords(self):\n        return {'rpb'}\n\n    def forward_features(self, x):\n        x = self.patch_embed(x)\n        for level in self.levels:\n            x = level(x)\n\n        x = self.norm(x)\n        # Return the final backbone feature map.\n        return x\n    \n    def forward_head(self, x):\n        x = self.avgpool(x)\n        x = torch.flatten(x, 1)\n        x = self.head(x)\n        return x\n\n    def forward(self, x):\n        # We need only the forward features and not the head part\n        # that is meant for classification.\n        x = self.forward_features(x)\n        return x\n    \n    def _load_state_dict(self, \n                         pretrained, \n                         strict: bool = False):\n        _load_checkpoint(self, \n                         pretrained, \n                         strict=strict)<\/pre>\n\n\n\n<p>The FasterViT class remains almost the same as in the original model. Of course, all the layers that create the FasterViT backbone are defined before this in the file. It is highly recommended to take a thorough look through the code at least once. All in all, initializing the above class provides us with the backbone features. 
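<\/p>\n\n\n\n<p>For intuition, the patch embedding downsamples the input by a factor of 4, and each of the three downsampling stages by a further factor of 2, for a total stride of 32. So a 512&#215;512 input yields a 512-channel 16&#215;16 feature map. The following is a minimal sketch that verifies this, assuming the FasterViT-0 hyperparameters that we pass in the next section:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import torch\n\n# A minimal sketch, assuming the FasterViT-0 settings used later in this file\n# (dim=64, depths=[2, 3, 6, 5], and so on).\nbackbone = FasterViT(\n    dim=64,\n    in_dim=64,\n    depths=[2, 3, 6, 5],\n    window_size=[7, 7, 7, 7],\n    ct_size=2,\n    mlp_ratio=4,\n    num_heads=[2, 4, 8, 16],\n    resolution=[512, 512],\n    hat=[False, False, True, False],\n)\nx = torch.randn(1, 3, 512, 512)\nwith torch.no_grad():\n    features = backbone(x)\n# Total stride is 4 * 2**3 = 32, so 512x512 inputs give 16x16 features.\nprint(features.shape)  # torch.Size([1, 512, 16, 16])<\/pre>\n\n\n\n<p>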
<\/p>\n\n\n\n<p>That brings us to the next custom function.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def faster_vit_0_any_res(pretrained=False, **kwargs):\n    depths = kwargs.pop(\"depths\", [2, 3, 6, 5])\n    num_heads = kwargs.pop(\"num_heads\", [2, 4, 8, 16])\n    window_size = kwargs.pop(\"window_size\", [7, 7, 7, 7])\n    ct_size = kwargs.pop(\"ct_size\", 2)\n    dim = kwargs.pop(\"dim\", 64)\n    in_dim = kwargs.pop(\"in_dim\", 64)\n    mlp_ratio = kwargs.pop(\"mlp_ratio\", 4)\n    resolution = kwargs.pop(\"resolution\", [512, 512])\n    drop_path_rate = kwargs.pop(\"drop_path_rate\", 0.2)\n    model_path = kwargs.pop(\"model_path\", \"weights\/faster_vit_0.pth.tar\")\n    hat = kwargs.pop(\"hat\", [False, False, True, False])\n    num_classes = kwargs.pop('num_classes', 2)\n    nms = kwargs.pop('nms', 0.45)\n\n    pretrained_cfg = resolve_pretrained_cfg('faster_vit_0_any_res').to_dict()\n    _update_default_model_kwargs(pretrained_cfg, kwargs, kwargs_filter=None)\n\n    backbone_model = FasterViT(depths=depths,\n                      num_heads=num_heads,\n                      window_size=window_size,\n                      ct_size=ct_size,\n                      dim=dim,\n                      in_dim=in_dim,\n                      mlp_ratio=mlp_ratio,\n                      resolution=resolution,\n                      drop_path_rate=drop_path_rate,\n                      hat=hat,\n                      **kwargs)\n    \n    backbone_model.pretrained_cfg = pretrained_cfg\n    backbone_model.default_cfg = backbone_model.pretrained_cfg\n    if pretrained:\n        if not Path(model_path).is_file():\n            url = backbone_model.default_cfg['url']\n            torch.hub.download_url_to_file(url=url, dst=model_path)\n        backbone_model._load_state_dict(model_path)\n\n    backbone = nn.Sequential(backbone_model)\n\n    out_channels = [512, 512, 512, 512, 512, 512]\n    anchor_generator = DefaultBoxGenerator(\n        [[2], [2, 3], [2, 3], [2, 3], [2], [2]],\n    )\n    num_anchors = anchor_generator.num_anchors_per_location()\n    head = SSDHead(out_channels, num_anchors, num_classes=num_classes)\n    model = SSD(\n        backbone=backbone,\n        num_classes=num_classes,\n        anchor_generator=anchor_generator,\n        size=resolution,\n        head=head,\n        nms_thresh=nms\n    )\n    return model<\/pre>\n\n\n\n<p>We combine everything in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">faster_vit_0_any_res<\/code> function.<\/p>\n\n\n\n<p>It accepts a <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">pretrained<\/code> parameter and several keyword parameters. 
The keyword parameters define the model hyperparameters, such as the model depth, number of heads, window size, embedding dimension, MLP expansion ratio, and input resolution, among others.<\/p>\n\n\n\n<p>We build the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">backbone_model<\/code> by initializing the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">FasterViT<\/code> class, resolve the pretrained configuration, and load the pretrained state dictionary from the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">weights<\/code> directory.<\/p>\n\n\n\n<p>Next, we create a <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">Sequential<\/code> model from the backbone.<\/p>\n\n\n\n<p>As we are building an <strong><a href=\"https:\/\/debuggercafe.com\/object-detection-using-ssd300-resnet50-and-pytorch\/\" target=\"_blank\" rel=\"noreferrer noopener\">SSD model<\/a><\/strong>, we need to define the output channels for each of the SSD heads. This is followed by creating the anchor generator and initializing the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">SSDHead<\/code> itself. We feed the 512-channel output of the backbone&#8217;s final batch normalization layer to the SSD head.<\/p>\n\n\n\n<p>This completes the process of creating the final model.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Sanity Check for Our FasterViT Detection Model<\/h4>\n\n\n\n<p>Let&#8217;s create a main block, initialize our model, and do a dummy forward pass through the model. Note that we pass <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">num_classes=21<\/code> (20 Pascal VOC classes plus background), which matches the head shapes and parameter counts in the summary below.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">if __name__ == '__main__':\n    resolution = [512, 512]\n    model = faster_vit_0_any_res(\n        pretrained=True,\n        num_classes=21, # 20 Pascal VOC classes + 1 background class.\n        resolution=resolution,\n        nms=0.45\n    )\n\n    torchinfo.summary(\n        model, \n        device='cpu', \n        input_size=[1, 3, resolution[0], resolution[1]],\n        row_settings=[\"var_names\"],\n        col_names=(\"input_size\", \"output_size\", \"num_params\") \n    )\n    \n    # Total parameters and trainable parameters.\n    total_params = sum(p.numel() for p in model.parameters())\n    print(f\"{total_params:,} total parameters.\")\n    total_trainable_params = sum(\n        p.numel() for p in model.parameters() if p.requires_grad)\n    print(f\"{total_trainable_params:,} training parameters.\")\n    \n    random_input = torch.randn((2, 3, *resolution))\n\n    model.eval()\n    with torch.no_grad():\n        outputs = model(random_input)\n\n    print(outputs[0]['boxes'].shape)<\/pre>\n\n\n\n<p>We can execute the model file using:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python model.py<\/pre>\n\n\n\n<p>This is the output that we get on the terminal (truncated for brevity).<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">size mismatch for levels.2.blocks.0.hat_attn.pos_emb_funct.relative_coords_table: copying a param with shape torch.Size([1, 7, 7, 2]) 
from checkpoint, the shape in current model is torch.Size([1, 19, 19, 2]).\nsize mismatch for levels.2.blocks.0.hat_attn.pos_emb_funct.relative_position_index: copying a param with shape torch.Size([16, 16]) from checkpoint, the shape in current model is torch.Size([100, 100]).\nsize mismatch for levels.2.blocks.0.hat_attn.pos_emb_funct.relative_bias: copying a param with shape torch.Size([1, 8, 16, 16]) from checkpoint, the shape in current model is torch.Size([1, 8, 100, 100]).\nsize mismatch for levels.2.blocks.0.hat_pos_embed.relative_bias: copying a param with shape torch.Size([1, 16, 256]) from checkpoint, the shape in current model is torch.Size([1, 100, 256]).\nsize mismatch for levels.2.blocks.1.hat_attn.pos_emb_funct.relative_coords_table: copying a param with shape torch.Size([1, 7, 7, 2]) from checkpoint, the shape in current model is torch.Size([1, 19, 19, 2]).\nsize mismatch for levels.2.blocks.1.hat_attn.pos_emb_funct.relative_position_index: copying a param with shape torch.Size([16, 16]) from checkpoint, the shape in current model is torch.Size([100, 100]).\nsize mismatch for levels.2.blocks.1.hat_attn.pos_emb_funct.relative_bias: copying a param with shape torch.Size([1, 8, 16, 16]) from checkpoint, the shape in current model is torch.Size([1, 8, 100, 100]).\nsize mismatch for levels.2.blocks.1.hat_pos_embed.relative_bias: copying a param with shape torch.Size([1, 16, 256]) from checkpoint, the shape in current model is torch.Size([1, 100, 256]).\nsize mismatch for levels.2.blocks.2.hat_attn.pos_emb_funct.relative_coords_table: copying a param with shape torch.Size([1, 7, 7, 2]) from checkpoint, the shape in current model is torch.Size([1, 19, 19, 2]).\nsize mismatch for levels.2.blocks.2.hat_attn.pos_emb_funct.relative_position_index: copying a param with shape torch.Size([16, 16]) from checkpoint, the shape in current model is torch.Size([100, 100]).\nsize mismatch for levels.2.blocks.2.hat_attn.pos_emb_funct.relative_bias: copying a param with shape torch.Size([1, 8, 16, 16]) from checkpoint, the shape in current model is torch.Size([1, 8, 100, 100]).\nsize mismatch for levels.2.blocks.2.hat_pos_embed.relative_bias: copying a param with shape torch.Size([1, 16, 256]) from checkpoint, the shape in current model is torch.Size([1, 100, 256]).\nsize mismatch for levels.2.blocks.3.hat_attn.pos_emb_funct.relative_coords_table: copying a param with shape torch.Size([1, 7, 7, 2]) from checkpoint, the shape in current model is torch.Size([1, 19, 19, 2]).\nsize mismatch for levels.2.blocks.3.hat_attn.pos_emb_funct.relative_position_index: copying a param with shape torch.Size([16, 16]) from checkpoint, the shape in current model is torch.Size([100, 100]).\nsize mismatch for levels.2.blocks.3.hat_attn.pos_emb_funct.relative_bias: copying a param with shape torch.Size([1, 8, 16, 16]) from checkpoint, the shape in current model is torch.Size([1, 8, 100, 100]).\nsize mismatch for levels.2.blocks.3.hat_pos_embed.relative_bias: copying a param with shape torch.Size([1, 16, 256]) from checkpoint, the shape in current model is torch.Size([1, 100, 256]).\nsize mismatch for levels.2.blocks.4.hat_attn.pos_emb_funct.relative_coords_table: copying a param with shape torch.Size([1, 7, 7, 2]) from checkpoint, the shape in current model is torch.Size([1, 19, 19, 2]).\nsize mismatch for levels.2.blocks.4.hat_attn.pos_emb_funct.relative_position_index: copying a param with shape torch.Size([16, 16]) from checkpoint, the shape in current model is torch.Size([100, 100]).\nsize mismatch for 
levels.2.blocks.4.hat_attn.pos_emb_funct.relative_bias: copying a param with shape torch.Size([1, 8, 16, 16]) from checkpoint, the shape in current model is torch.Size([1, 8, 100, 100]).\nsize mismatch for levels.2.blocks.4.hat_pos_embed.relative_bias: copying a param with shape torch.Size([1, 16, 256]) from checkpoint, the shape in current model is torch.Size([1, 100, 256]).\nsize mismatch for levels.2.blocks.5.hat_attn.pos_emb_funct.relative_coords_table: copying a param with shape torch.Size([1, 7, 7, 2]) from checkpoint, the shape in current model is torch.Size([1, 19, 19, 2]).\nsize mismatch for levels.2.blocks.5.hat_attn.pos_emb_funct.relative_position_index: copying a param with shape torch.Size([16, 16]) from checkpoint, the shape in current model is torch.Size([100, 100]).\nsize mismatch for levels.2.blocks.5.hat_attn.pos_emb_funct.relative_bias: copying a param with shape torch.Size([1, 8, 16, 16]) from checkpoint, the shape in current model is torch.Size([1, 8, 100, 100]).\nsize mismatch for levels.2.blocks.5.hat_pos_embed.relative_bias: copying a param with shape torch.Size([1, 16, 256]) from checkpoint, the shape in current model is torch.Size([1, 100, 256]).\n======================================================================================================================================================\nLayer (type (var_name))                                                     Input Shape               Output Shape              Param #\n======================================================================================================================================================\nSSD (SSD)                                                                   [1, 3, 512, 512]          [200, 4]                  --\n\u251c\u2500GeneralizedRCNNTransform (transform)                                      [1, 3, 512, 512]          [1, 3, 512, 512]          --\n\u251c\u2500Sequential (backbone)                                                     [1, 3, 512, 512]          [1, 512, 16, 16]          --\n\u2502    \u2514\u2500FasterViT (0)                                                        [1, 3, 512, 512]          [1, 512, 16, 16]          513,000\n\u2502    \u2502    \u2514\u2500PatchEmbed (patch_embed)                                        [1, 3, 512, 512]          [1, 64, 128, 128]         38,848\n\u2502    \u2502    \u2514\u2500ModuleList (levels)                                             --                        --                        30,851,968\n\u2502    \u2502    \u2514\u2500BatchNorm2d (norm)                                              [1, 512, 16, 16]          [1, 512, 16, 16]          1,024\n\u251c\u2500SSDHead (head)                                                            [1, 512, 16, 16]          [1, 1024, 21]             --\n\u2502    \u2514\u2500SSDRegressionHead (regression_head)                                  [1, 512, 16, 16]          [1, 1024, 4]              --\n\u2502    \u2502    \u2514\u2500ModuleList (module_list)                                        --                        --                        553,080\n\u2502    \u2514\u2500SSDClassificationHead (classification_head)                          [1, 512, 16, 16]          [1, 1024, 21]             --\n\u2502    \u2502    \u2514\u2500ModuleList (module_list)                                        --                        --                        2,903,670\n\u251c\u2500DefaultBoxGenerator (anchor_generator)                                    [1, 3, 512, 512]          [1024, 4]          
       --\n======================================================================================================================================================\nTotal params: 34,861,590\nTrainable params: 34,861,590\nNon-trainable params: 0\nTotal mult-adds (G): 17.01\n======================================================================================================================================================\nInput size (MB): 3.15\nForward\/backward pass size (MB): 543.37\nParams size (MB): 125.41\nEstimated Total Size (MB): 671.93\n======================================================================================================================================================\n34,861,590 total parameters.\n34,861,590 training parameters.\ntorch.Size([200, 4])<\/pre>\n\n\n\n<p>The size mismatch for all the embedding layers happens because of the difference in input resolution. The model was pretrained with 224&#215;224 images and our input for the forward pass has 512&#215;512 tensors. However, the pretrained weights for the rest of the matching layers have been loaded.<\/p>\n\n\n\n<p>The final FasterViT detection model contains <strong>34.8 million parameters<\/strong> for 21 classes (as in Pascal VOC).<\/p>\n\n\n\n<p>Also note from the summary that although we define six entries for the output channels and anchor ratios, the backbone yields a single 16&#215;16 feature map, so only the first regression and classification convolutions operate on it: 16&#215;16 locations with 4 default boxes each give the 1024 boxes seen in the output shapes.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Data Augmentation Pipeline<\/h3>\n\n\n\n<p>The data augmentation and image transformation code is present in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">custom_utils.py<\/code> file. To make the training robust, we employ several augmentation techniques for the training data using Albumentations. Here are all the training transforms:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def get_train_transform():\n    return A.Compose([\n        A.HorizontalFlip(p=0.5),\n        A.Blur(blur_limit=3, p=0.1),\n        A.MotionBlur(blur_limit=3, p=0.1),\n        A.MedianBlur(blur_limit=3, p=0.1),\n        A.ToGray(p=0.3),\n        A.RandomBrightnessContrast(p=0.3),\n        A.ColorJitter(p=0.3),\n        A.RandomGamma(p=0.3),\n        ToTensorV2(p=1.0),\n    ], bbox_params={\n        'format': 'pascal_voc',\n        'label_fields': ['labels']\n    })<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Training Configuration<\/h3>\n\n\n\n<p>We define all the training configurations in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">config.py<\/code> file.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import torch\n\nBATCH_SIZE = 16 # Increase \/ decrease according to GPU memory.\nRESIZE_TO = 640 # Resize the image for training and transforms.\nNUM_EPOCHS = 75 # Number of epochs to train for.\nNUM_WORKERS = 8 # Number of parallel workers for data loading.\n\nDEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')\n\n# Training images and XML files directory.\nTRAIN_IMG = 'data\/voc_07_12\/final_xml_dataset\/train\/images'\nTRAIN_ANNOT = 'data\/voc_07_12\/final_xml_dataset\/train\/labels'\n# Validation images and XML files directory.\nVALID_IMG = 'data\/voc_07_12\/final_xml_dataset\/valid\/images'\nVALID_ANNOT = 'data\/voc_07_12\/final_xml_dataset\/valid\/labels'\n\n# 
Classes: 0 index is reserved for background.\nCLASSES = [\n    '__background__',\n    \"aeroplane\", \"bicycle\", \"bird\", \"boat\", \"bottle\", \"bus\", \"car\", \"cat\",\n    \"chair\", \"cow\", \"diningtable\", \"dog\", \"horse\", \"motorbike\", \"person\",\n    \"pottedplant\", \"sheep\", \"sofa\", \"train\", \"tvmonitor\"\n]\n\nNUM_CLASSES = len(CLASSES)\n\n# Whether to visualize images after creating the data loaders.\nVISUALIZE_TRANSFORMED_IMAGES = False\n\n# Location to save model and plots.\nOUT_DIR = 'outputs'<\/pre>\n\n\n\n<p>You can adjust the batch size and number of workers based on the hardware that you are training on.<\/p>\n\n\n\n<p>This brings us to the end of the coding part, and we can begin training now.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Training the FasterViT Detection Model<\/h3>\n\n\n\n<p>The training run shown here was carried out on a machine with a <strong>10GB virtualized A100 GPU<\/strong>. <em>It took around 9 hours to train for 75 epochs<\/em>.<\/p>\n\n\n\n<p>We can begin the training by simply executing the following command:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python train.py <\/pre>\n\n\n\n<p>Following are the loss and <strong><a href=\"https:\/\/debuggercafe.com\/evaluation-metrics-for-object-detection\/\" target=\"_blank\" rel=\"noreferrer noopener\">Mean Average Precision<\/a><\/strong> metrics graphs after training.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/fastervit-detection-map.png\" target=\"_blank\" rel=\"noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"700\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/fastervit-detection-map.png\" alt=\"mAP graph after training the FasterViT Detection model.\" class=\"wp-image-38875\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/fastervit-detection-map.png 1000w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/fastervit-detection-map-300x210.png 300w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/fastervit-detection-map-768x538.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 3. mAP graph after training the FasterViT Detection model.<\/figcaption><\/figure>\n<\/div>\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/fastervit-detection-train_loss.png\" target=\"_blank\" rel=\"noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"700\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/fastervit-detection-train_loss.png\" alt=\"Training loss graph from the FasterViT Detection experiment.\" class=\"wp-image-38877\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/fastervit-detection-train_loss.png 1000w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/fastervit-detection-train_loss-300x210.png 300w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/fastervit-detection-train_loss-768x538.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 4. 
Training loss graph from the FasterViT Detection experiment.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>We can clearly see that the mAP starts deteriorating after around 25 epochs. We are already employing a good amount of augmentation techniques. So, in the next phase of training, a learning rate scheduler will surely help.<\/p>\n\n\n\n<p>The <strong>primary mAP is above 27%<\/strong> in our case. And the <strong>mAP at 50% IoU is above 60%<\/strong>.<\/p>\n\n\n\n<p>To get the exact numbers, we can run the evaluation script using the best model weights.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python eval.py<\/pre>\n\n\n\n<p>Following are the results.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">mAP_50: 61.226\nmAP_50_95: 27.771<\/pre>\n\n\n\n<p>We achieve a <strong>primary mAP of 27.7%<\/strong> using the best model. This is not extremely good, but it is a decent starting point.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Running Inference on Unseen Data<\/h3>\n\n\n\n<p>We can use the best model weights to run inference on videos with the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">inference_video.py<\/code> script. It accepts an input video, a confidence threshold, and an optional image size.<\/p>\n\n\n\n<p>The following inference experiments were run on a <strong>10GB RTX 3080 GPU<\/strong>.<\/p>\n\n\n\n<p>Let&#8217;s start with a simple experiment to detect humans.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python inference_video.py --input data\/inference_data\/videos\/video_3.mp4 --threshold 0.7<\/pre>\n\n\n\n<p>Following is the result stored in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">inference_outputs<\/code> directory.<\/p>\n\n\n\n<figure class=\"wp-block-video aligncenter\"><video height=\"360\" style=\"aspect-ratio: 640 \/ 360;\" width=\"640\" autoplay controls loop muted src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/fastervit_detection_video_3-1.mp4\"><\/video><figcaption class=\"wp-element-caption\">Video 2. Person detection using the trained FasterViT object detection model.<\/figcaption><\/figure>\n\n\n\n<p>The results are quite good here, with a bit of flickering. 
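<\/p>\n\n\n\n<p>For reference, the FPS numbers reported here are averaged per frame around the model&#8217;s forward pass. The exact internals of <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">inference_video.py<\/code> are not shown in this post, so treat the following as a minimal sketch of the general measurement pattern, assuming a trained <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">model<\/code> already in eval mode on the GPU:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import time\n\nimport cv2\nimport torch\n\n# Assumes `model` is the trained SSD model in eval mode on the GPU.\ncap = cv2.VideoCapture('data\/inference_data\/videos\/video_3.mp4')\ntotal_fps, frame_count = 0.0, 0\nwhile cap.isOpened():\n    ret, frame = cap.read()\n    if not ret:\n        break\n    # BGR uint8 frame to a normalized float tensor (a full pipeline would\n    # also convert BGR to RGB and resize the frame).\n    tensor = torch.from_numpy(frame).permute(2, 0, 1).float() \/ 255.0\n    start = time.time()\n    with torch.no_grad():\n        outputs = model([tensor.to('cuda')])\n    total_fps += 1 \/ (time.time() - start)\n    frame_count += 1\ncap.release()\nprint(f'Average FPS: {total_fps \/ frame_count:.1f}')<\/pre>\n\n\n\n<p>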
We are getting an average of <strong>46 FPS<\/strong>.<\/p>\n\n\n\n<p>Now, let&#8217;s try a slightly more complex scene.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python inference_video.py --input data\/inference_data\/videos\/video_1.mp4 --threshold 0.7<\/pre>\n\n\n\n<figure class=\"wp-block-video aligncenter\"><video height=\"360\" style=\"aspect-ratio: 640 \/ 360;\" width=\"640\" autoplay controls loop muted src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/fastervit_detection_video_1-1.mp4\"><\/video><figcaption class=\"wp-element-caption\">Video 3. Object detection in a traffic scenario using the FasterViT Detection model.<\/figcaption><\/figure>\n\n\n\n<p>The results are decent; however, there is a lot of flickering, and the model also fails to detect faraway objects.<\/p>\n\n\n\n<p>Let&#8217;s run another experiment on a much more difficult scene.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python inference_video.py --input data\/inference_data\/videos\/video_2.mp4 --threshold 0.5<\/pre>\n\n\n\n<figure class=\"wp-block-video aligncenter\"><video height=\"540\" style=\"aspect-ratio: 960 \/ 540;\" width=\"960\" autoplay controls loop muted src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/fastervit_detection_video_2.mp4\"><\/video><figcaption class=\"wp-element-caption\">Video 4. Here, we are attempting person detection in a crowded scenario using the trained model. However, the FasterViT Detection model is unable to do so.<\/figcaption><\/figure>\n\n\n\n<p>No doubt, the model fails here. The training dataset does not contain such crowded scenes, and the lighting is challenging as well.<\/p>\n\n\n\n<p>The above results show that there is room for substantial improvement.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Summary and Conclusion<\/h2>\n\n\n\n<p>In this article, we created a custom FasterViT Detection model using the FasterViT-0 backbone. We went through the code for preparing the backbone and attaching an SSD head. The results are decent at best and not compelling enough for a 34-million-parameter model. However, we can improve the architecture and the training pipeline, which we may explore in a future article. I hope this article was worth your time.<\/p>\n\n\n\n<p>If you have any doubts, thoughts, or suggestions, please leave them in the comment section. I will surely address them.<\/p>\n\n\n\n<p>You can contact me using the <strong><a aria-label=\"Contact (opens in a new tab)\" href=\"https:\/\/debuggercafe.com\/contact-us\/\" target=\"_blank\" rel=\"noreferrer noopener\">Contact<\/a><\/strong> section. 
You can also find me on <strong><a aria-label=\"LinkedIn (opens in a new tab)\" href=\"https:\/\/www.linkedin.com\/in\/sovit-rath\/\" target=\"_blank\" rel=\"noreferrer noopener\">LinkedIn<\/a><\/strong> and <strong><a href=\"https:\/\/x.com\/SovitRath5\" target=\"_blank\" rel=\"noreferrer noopener\">X<\/a><\/strong>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong><a href=\"https:\/\/arxiv.org\/abs\/2306.06189\" target=\"_blank\" rel=\"noreferrer noopener\">FasterViT: Fast Vision Transformers with Hierarchical Attention<\/a><\/strong><\/li>\n\n\n\n<li><strong><a href=\"https:\/\/github.com\/NVlabs\/FasterViT\" target=\"_blank\" rel=\"noreferrer noopener\">NVlabs FasterViT<\/a><\/strong><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>In this article, we create a custom Vision Transformer based object detection model using NVIDIA&#8217;s FasterViT backbone and the Single Shot Detection head.<\/p>\n","protected":false},"author":1,"featured_media":38895,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[120,90,336,529],"tags":[1083,1085,1082,1084,1086,1346,1081,1080],"class_list":["post-38766","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-object-detection","category-pytorch","category-single-shot-detection","category-vision-transformer","tag-custom-transformer-object-detection","tag-fastervit-backbone-ssd-head","tag-fastervit-object-detection","tag-fastervit-pascal-voc","tag-fastervit-single-shot-detection","tag-pytorch-fastervit-detection","tag-transformer-detection","tag-vision-transformer-detection"]}
content=\"@SovitRath5\" \/>\n<meta name=\"twitter:site\" content=\"@SovitRath5\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Sovit Ranjan Rath\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"17 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/debuggercafe.com\/fastervit-detection\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/debuggercafe.com\/fastervit-detection\/\"},\"author\":{\"name\":\"Sovit Ranjan Rath\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\"},\"headline\":\"FasterViT Detection\",\"datePublished\":\"2024-12-09T00:30:00+00:00\",\"dateModified\":\"2025-06-02T14:29:58+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/debuggercafe.com\/fastervit-detection\/\"},\"wordCount\":1670,\"commentCount\":0,\"image\":{\"@id\":\"https:\/\/debuggercafe.com\/fastervit-detection\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/FasterViT-Detection-e1729613084728.png\",\"keywords\":[\"Custom Transformer Object Detection\",\"FasterViT Backbone SSD Head\",\"FasterViT Object Detection\",\"FasterViT Pascal VOC\",\"FasterViT Single Shot Detection\",\"PyTorch FasterViT Detection\",\"Transformer Detection\",\"Vision Transformer Detection\"],\"articleSection\":[\"Object Detection\",\"PyTorch\",\"Single Shot Detection\",\"Vision Transformer\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/debuggercafe.com\/fastervit-detection\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/debuggercafe.com\/fastervit-detection\/\",\"url\":\"https:\/\/debuggercafe.com\/fastervit-detection\/\",\"name\":\"FasterViT Detection Training on Pascal VOC\",\"isPartOf\":{\"@id\":\"https:\/\/debuggercafe.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/debuggercafe.com\/fastervit-detection\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/debuggercafe.com\/fastervit-detection\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/FasterViT-Detection-e1729613084728.png\",\"datePublished\":\"2024-12-09T00:30:00+00:00\",\"dateModified\":\"2025-06-02T14:29:58+00:00\",\"author\":{\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\"},\"description\":\"FasterViT Detection model using NVIDIA's FasterViT backbone and SSD head trained on the Pascal VOC dataset.\",\"breadcrumb\":{\"@id\":\"https:\/\/debuggercafe.com\/fastervit-detection\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/debuggercafe.com\/fastervit-detection\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/debuggercafe.com\/fastervit-detection\/#primaryimage\",\"url\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/FasterViT-Detection-e1729613084728.png\",\"contentUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/FasterViT-Detection-e1729613084728.png\",\"width\":1000,\"height\":563,\"caption\":\"FasterViT 
Detection\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/debuggercafe.com\/fastervit-detection\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/debuggercafe.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"FasterViT Detection\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/debuggercafe.com\/#website\",\"url\":\"https:\/\/debuggercafe.com\/\",\"name\":\"DebuggerCafe\",\"description\":\"Machine Learning and Deep Learning\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/debuggercafe.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\",\"name\":\"Sovit Ranjan Rath\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g\",\"caption\":\"Sovit Ranjan Rath\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"FasterViT Detection Training on Pascal VOC","description":"FasterViT Detection model using NVIDIA's FasterViT backbone and SSD head trained on the Pascal VOC dataset.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/debuggercafe.com\/fastervit-detection\/","og_locale":"en_US","og_type":"article","og_title":"FasterViT Detection Training on Pascal VOC","og_description":"FasterViT Detection model using NVIDIA's FasterViT backbone and SSD head trained on the Pascal VOC dataset.","og_url":"https:\/\/debuggercafe.com\/fastervit-detection\/","og_site_name":"DebuggerCafe","article_publisher":"https:\/\/www.facebook.com\/profile.php?id=100013731104496","article_published_time":"2024-12-09T00:30:00+00:00","article_modified_time":"2025-06-02T14:29:58+00:00","og_image":[{"url":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/FasterViT-Detection-e1729613084728.png","width":640,"height":360,"type":"image\/png"}],"author":"Sovit Ranjan Rath","twitter_card":"summary_large_image","twitter_creator":"@SovitRath5","twitter_site":"@SovitRath5","twitter_misc":{"Written by":"Sovit Ranjan Rath","Est. 
reading time":"17 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/debuggercafe.com\/fastervit-detection\/#article","isPartOf":{"@id":"https:\/\/debuggercafe.com\/fastervit-detection\/"},"author":{"name":"Sovit Ranjan Rath","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752"},"headline":"FasterViT Detection","datePublished":"2024-12-09T00:30:00+00:00","dateModified":"2025-06-02T14:29:58+00:00","mainEntityOfPage":{"@id":"https:\/\/debuggercafe.com\/fastervit-detection\/"},"wordCount":1670,"commentCount":0,"image":{"@id":"https:\/\/debuggercafe.com\/fastervit-detection\/#primaryimage"},"thumbnailUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/FasterViT-Detection-e1729613084728.png","keywords":["Custom Transformer Object Detection","FasterViT Backbone SSD Head","FasterViT Object Detection","FasterViT Pascal VOC","FasterViT Single Shot Detection","PyTorch FasterViT Detection","Transformer Detection","Vision Transformer Detection"],"articleSection":["Object Detection","PyTorch","Single Shot Detection","Vision Transformer"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/debuggercafe.com\/fastervit-detection\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/debuggercafe.com\/fastervit-detection\/","url":"https:\/\/debuggercafe.com\/fastervit-detection\/","name":"FasterViT Detection Training on Pascal VOC","isPartOf":{"@id":"https:\/\/debuggercafe.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/debuggercafe.com\/fastervit-detection\/#primaryimage"},"image":{"@id":"https:\/\/debuggercafe.com\/fastervit-detection\/#primaryimage"},"thumbnailUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/FasterViT-Detection-e1729613084728.png","datePublished":"2024-12-09T00:30:00+00:00","dateModified":"2025-06-02T14:29:58+00:00","author":{"@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752"},"description":"FasterViT Detection model using NVIDIA's FasterViT backbone and SSD head trained on the Pascal VOC dataset.","breadcrumb":{"@id":"https:\/\/debuggercafe.com\/fastervit-detection\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/debuggercafe.com\/fastervit-detection\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/debuggercafe.com\/fastervit-detection\/#primaryimage","url":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/FasterViT-Detection-e1729613084728.png","contentUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/10\/FasterViT-Detection-e1729613084728.png","width":1000,"height":563,"caption":"FasterViT Detection"},{"@type":"BreadcrumbList","@id":"https:\/\/debuggercafe.com\/fastervit-detection\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/debuggercafe.com\/"},{"@type":"ListItem","position":2,"name":"FasterViT Detection"}]},{"@type":"WebSite","@id":"https:\/\/debuggercafe.com\/#website","url":"https:\/\/debuggercafe.com\/","name":"DebuggerCafe","description":"Machine Learning and Deep 
Learning","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/debuggercafe.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752","name":"Sovit Ranjan Rath","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g","caption":"Sovit Ranjan Rath"}}]}},"_links":{"self":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/38766","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/comments?post=38766"}],"version-history":[{"count":155,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/38766\/revisions"}],"predecessor-version":[{"id":39802,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/38766\/revisions\/39802"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/media\/38895"}],"wp:attachment":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/media?parent=38766"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/categories?post=38766"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/tags?post=38766"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}