{"id":35083,"date":"2024-04-01T06:00:00","date_gmt":"2024-04-01T00:30:00","guid":{"rendered":"https:\/\/debuggercafe.com\/?p=35083"},"modified":"2025-06-16T06:55:23","modified_gmt":"2025-06-16T01:25:23","slug":"multiscale-vision-transformer","status":"publish","type":"post","link":"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/","title":{"rendered":"Multiscale Vision Transformer for Video Recognition"},"content":{"rendered":"\n<p>Vision transformers are already good at multiple tasks like image recognition, object detection, and semantic segmentation. However, we can also apply them to data with temporal information like videos. One such use case is using Vision Transformers for video classification. To this end, in this article, we will go over the important parts of the <strong>Multiscale Vision Transformer<\/strong> (MViT) paper and also carry out inference using the pretraining model.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-horizontal is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-499968f5 wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-outline is-style-outline--1\"><a class=\"wp-block-button__link has-black-color has-luminous-vivid-orange-background-color has-text-color has-background wp-element-button\" href=\"#download-code\"><strong>Jump to Download Code<\/strong><\/a><\/div>\n<\/div>\n\n\n\n<p><\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscale-vision-transformer-bowling-output.gif\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"540\" height=\"304\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscale-vision-transformer-bowling-output.gif\" alt=\"An example output after passing a bowling video through the Multiscale Vision Transformer model.\" class=\"wp-image-35141\"\/><\/a><figcaption 
class=\"wp-element-caption\">Figure 1. An example output after passing a bowling video through the Multiscale Vision Transformer model.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>Although there are several models for this, the Multiscale Vision Transformer model stands out for video recognition. Along with dealing with temporal data, it also uses multiscale features for video recognition.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><em>We will cover the following topics in this <\/em>article<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>In the first part, we will cover the important sections of the Multiscale Vision Transformer paper.<\/em>\n<ul class=\"wp-block-list\">\n<li><em>First, the drawbacks of other models and approaches.<\/em><\/li>\n\n\n\n<li><em>Second, the contributions and unique approach of the MViT model.<\/em><\/li>\n\n\n\n<li><em>Third, the architecture and implementation details of the MViT model.<\/em><\/li>\n\n\n\n<li><em>Finally, the results on the Kinetics-400 dataset and comparison with other models.<\/em><\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><em>In the second part, we will code our way through using the MViT model pretrained on the Kinetics-400 dataset for video action recognition.<\/em><\/li>\n\n\n\n<li><em> Finally, we will discuss some further projects for fine-tuning the Multiscale Vision Transformer model for real-life use cases.<\/em><\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Multiscale Vision Transformer (MViT) Model<\/h2>\n\n\n\n<p>The MViT model was introduced in the paper <em>Multiscale Vision Transformers<\/em> by researchers from Facebook AI and UC Berkley.<\/p>\n\n\n\n<p>Although simple, the idea is powerful &#8211; <strong>use multiscale features to train a good video recognition model<\/strong>.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscale-vision-transformer-different-scales.png\" 
target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"661\" height=\"285\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscale-vision-transformer-different-scales.png\" alt=\"MViT learning from high and low resolution features of the image.\" class=\"wp-image-35143\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscale-vision-transformer-different-scales.png 661w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscale-vision-transformer-different-scales-300x129.png 300w\" sizes=\"auto, (max-width: 661px) 100vw, 661px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 2. MViT learning from high and low resolution features of the image.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>The concept is to build a pyramid of feature hierarchies such that:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The earlier layers of the MViT model can work on the high resolution spatial resolution to extract the finer features.<\/li>\n\n\n\n<li>And the lower layers can work on the smaller spatial resolution to extract the course yet complex features.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>There is one <strong>major drawback<\/strong> of previous models dealing with temporal data and video recognition. In general, transformer neural networks define a <strong>constant channel capacity<\/strong> (<strong>hidden dimension<\/strong>) throughout the network. This can affect learning features of an image at various levels. 
MViT tackles the issue at the architecture level.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The MViT Architecture<\/h3>\n\n\n\n<p>Following the above idea, the MViT architecture aligns its layers to create a hierarchical pyramid of features.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>It starts with the input image resolution while keeping the channel dimension low.<\/li>\n\n\n\n<li>Eventually, the layers expand the channel dimension and reduce the spatial resolution, hierarchically.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>This provides the model with dense visual concepts along with fine-grained and coarse features. It also lets the model use the temporal information of the video effectively during inference. <\/p>\n\n\n\n<p>As the Multiscale Vision Transformer deals with various resolutions at various stages, the spatial resolution of the output will vary at each stage.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscalev-vision-transformer-stages-output-sizes.png\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"538\" height=\"350\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscalev-vision-transformer-stages-output-sizes.png\" alt=\"Different stages and output sizes of the Multiscale Vision Transformer model.\" class=\"wp-image-35144\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscalev-vision-transformer-stages-output-sizes.png 538w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscalev-vision-transformer-stages-output-sizes-300x195.png 300w\" sizes=\"auto, (max-width: 538px) 100vw, 538px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 3. 
Different stages and output sizes of the Multiscale Vision Transformer model.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>The above figure shows the dimensions (\\(D\\)) of the Multi-Head Attention and MLP layers at various scales. It also shows the output sizes of the features in the corresponding stages. <\/p>\n\n\n\n<p>We can see that each stage progressively increases the dimension while downsampling the spatial resolution.<\/p>\n\n\n\n<p>Along with architectural changes compared to the original Vision Transformer model, the MViT model also employs various new techniques:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Pooling operator and pooling attention<\/li>\n\n\n\n<li>Channel expansion<\/li>\n\n\n\n<li>Query pooling<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>I highly recommend going through <strong>section 3<\/strong> of the paper to learn about the above in detail.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Experiments and Results<\/h3>\n\n\n\n<p>The authors conduct experiments on various datasets including:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Kinetics-400 and Kinetics-600<\/li>\n\n\n\n<li>Something Something v2<\/li>\n\n\n\n<li>Charades<\/li>\n\n\n\n<li>AVA<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>However, we are most interested in the performance on the Kinetics dataset as we will be using a model pretrained on that.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscale-vision-transformer-kinetics-400-kinetics-600-results.png\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"600\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscale-vision-transformer-kinetics-400-kinetics-600-results.png\" alt=\"Result of the MViT model on the Kinetics-400 and Kinetics-600 datasets.\" class=\"wp-image-35145\" 
srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscale-vision-transformer-kinetics-400-kinetics-600-results.png 600w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscale-vision-transformer-kinetics-400-kinetics-600-results-300x300.png 300w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscale-vision-transformer-kinetics-400-kinetics-600-results-150x150.png 150w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 4. Result of the MViT model on the Kinetics-400 and Kinetics-600 dataset.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>The above figure shows the comparison of Multiscale Vision Transformer with various other models. On the Kinetics-400 dataset, it is easily nearing the performance and even beating some of the ViT based models which have ImageNet pretrained backbones. We can see a similar trend on the Kinetics-600 dataset as well. It is noteworthy because the MViT backbones were not pretrained on the ImageNet dataset.<\/p>\n\n\n\n<p>Furthermore, the parameters of the MViT models are much less compared to the ViT based models. Although it seems that the X3D models are competitive with MViT even with less number of parameters and without pretraining.<\/p>\n\n\n\n<p>In the next section, we will start with the coding part where we will use a pretrained MViT model from Torchvision to run inference on various videos.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Inference using Multiscale Vision Transformer<\/h2>\n\n\n\n<p>PyTorch provides a pretrained version of the MViT model. It contains two models, a base model and a small model. We will use the base model which has been trained on the Kinetics-400 dataset. So, it can detect 400 different classes of actions, tasks, and situations from the Kinetics dataset. The dataset is primarily used to train models for action recognition. 
You may find <strong><a href=\"https:\/\/gist.github.com\/willprice\/f19da185c9c5f32847134b87c1960769\" target=\"_blank\" rel=\"noreferrer noopener\">this GitHub gist<\/a><\/strong> helpful to know more about the classes in the dataset.<\/p>\n\n\n\n<p>Let&#8217;s take a look at the directory structure before moving forward.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\u251c\u2500\u2500 input\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 bowling.mp4\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 push_ups.mp4\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 welding.mp4\n\u251c\u2500\u2500 outputs\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 barbell_biceps_curl.mp4\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 bowling.mp4\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 chest_fly_machine.mp4\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 push_ups.mp4\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 welding.mp4\n\u251c\u2500\u2500 inference_video.py\n\u2514\u2500\u2500 labels.txt<\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">input<\/code> directory contains the videos that we will use for inference.<\/li>\n\n\n\n<li>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">outputs<\/code> directory contains the inference outputs.<\/li>\n\n\n\n<li>And the parent project directory contains the inference script and a text file containing the class names separated by new lines.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"has-background\" style=\"background-color:#ffb76a\"><strong><em>The input files, Python script, and the label text file are downloadable via the download section.<\/em><\/strong><\/p>\n\n\n\n<h3 class=\"wp-block-heading has-text-align-center\" id=\"download-code\">Download Code<\/h3>\n\n\n\n<div class=\"wp-block-button 
is-style-outline center\"><a data-sumome-listbuilder-id=\"34841eb0-4e70-40cc-b2f0-d9efe563c0a5\" class=\"wp-block-button__link has-black-color has-luminous-vivid-orange-background-color has-text-color has-background\"><b>Download the Source Code for this Tutorial<\/b><\/a><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Video Inference using MViT<\/h3>\n\n\n\n<p>Let&#8217;s jump into the code now.<\/p>\n\n\n\n<p>Starting with the import statements and constructing the argument parser.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"inference_video.py\" data-enlighter-group=\"inference_video_1\">import torch\nimport cv2\nimport argparse\nimport time\nimport numpy as np\nimport os\nimport albumentations as A\nimport time\n\nfrom torchvision.models.video import mvit_v1_b\n\n# Construct the argument parser.\nparser = argparse.ArgumentParser()\nparser.add_argument(\n    '-i', '--input', \n    help='path to input video'\n)\nparser.add_argument(\n    '-c', '--clip-len', \n    dest='clip_len', \n    default=16, \n    help='number of frames to consider for each prediction',\n    type=int,\n)\nparser.add_argument(\n    '--show',\n    action='store_true',\n    help='pass to show the video while execution is going on, \\\n          but requires to uninstall PyAV (`pip uninstall av`)'\n)\nparser.add_argument(\n    '--imgsz',\n    default=(256, 256),\n    nargs='+',\n    type=int,\n    help='image resize resolution'\n)\nparser.add_argument(\n    '--crop-size',\n    dest='crop_size',\n    default=(224, 224),\n    nargs='+',\n    type=int,\n    help='image cropping resolution'\n)\nargs = parser.parse_args()<\/pre>\n\n\n\n<p>We need the <strong><a href=\"https:\/\/debuggercafe.com\/image-augmentation-using-pytorch-and-albumentations\/\" target=\"_blank\" rel=\"noreferrer noopener\">Albumentations<\/a><\/strong> 
library for carrying out the validation transforms. We also import the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">mvit_v1_b<\/code> that we will use to initialize the model.<\/p>\n\n\n\n<p>Let&#8217;s go through the different command line flags:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">--input<\/code>: The path to the input video.<\/li>\n\n\n\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">--clip-len<\/code>: This is an integer defining the number of video frames that we will feed at a time to the model. It has been trained with a minimum temporal length of 16 frames, so we use that as the default value. The shape of the tensor going into the model during inference will be <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">[batch_size, num_channels, clip_len, height, width]<\/code>. As we are using RGB video, the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">num_channels<\/code> will be 3. <\/li>\n\n\n\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">--show<\/code>: This is a boolean value indicating that we want to visualize the results on screen during inference.<\/li>\n\n\n\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">--imgsz<\/code> and <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">--crop-size<\/code>: The image size for resizing and final crop size. 
This also follows the training hyperparameters where the input image was first resized to 256&#215;256 resolution and then center cropped to 224&#215;224.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>Considering the default values, the shape of the tensor going into the model will be <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">[1, 3, 16, 224, 224]<\/code>.<\/p>\n\n\n\n<p>Next, let&#8217;s create the output directory, define the transforms, and load the MViT model.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"47\" data-enlighter-title=\"inference_video.py\" data-enlighter-group=\"inference_video_2\">OUT_DIR = os.path.join('outputs')\nos.makedirs(OUT_DIR, exist_ok=True)\n\n# Define the transforms.\ncrop_size = tuple(args.crop_size)\nresize_size = tuple(args.imgsz)\ntransform = A.Compose([\n    A.Resize(resize_size[1], resize_size[0], always_apply=True),\n    A.CenterCrop(crop_size[1], crop_size[0], always_apply=True),\n    A.Normalize(\n        mean=[0.45, 0.45, 0.45],\n        std=[0.225, 0.225, 0.225], \n        always_apply=True\n    )\n])\n\n#### PRINT INFO #####\nprint(f\"Number of frames to consider for each prediction: {args.clip_len}\")\nprint('Press q to quit...')\n\ndevice = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n\n# Load the model.\nmodel = mvit_v1_b(weights='DEFAULT').to(device).eval()\n\n# Load the labels file.\nwith open('labels.txt', 'r') as f:\n    class_names = f.readlines()\n    f.close()<\/pre>\n\n\n\n<p>Do note that we use the mean and standard deviation values given in the <strong><a href=\"https:\/\/pytorch.org\/vision\/stable\/models\/generated\/torchvision.models.video.mvit_v1_b.html#torchvision.models.video.mvit_v1_b\" target=\"_blank\" rel=\"noreferrer noopener\">official PyTorch documentation<\/a><\/strong>.<\/p>\n\n\n\n<p>We are loading the Multiscale Vision Transformer 
model with the default weights, which will automatically choose the best pretrained weights. <\/p>\n\n\n\n<p>We read the labels file and store the class names in <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">class_names<\/code>.<\/p>\n\n\n\n<p>Now, we need to read the video file and initialize the rest of the variables.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"74\" data-enlighter-title=\"inference_video.py\" data-enlighter-group=\"inference_video_3\">cap = cv2.VideoCapture(args.input)  # Read the video file.\n\n# Get the frame width and height.\nframe_width = int(cap.get(3))\nframe_height = int(cap.get(4))\nfps = int(cap.get(5))\n\nsave_name = f\"{args.input.split('\/')[-1].split('.')[0]}\"\n# Define codec and create VideoWriter object.\nout = cv2.VideoWriter(\n    f\"{OUT_DIR}\/{save_name}.mp4\", \n    cv2.VideoWriter_fourcc(*'mp4v'), \n    fps, \n    (frame_width, frame_height)\n)\n\nframe_count = 0 # To count total frames.\ntotal_fps = 0 # To get the final frames per second.\n\n# A clips list to append and store the individual frames.\nclips = []<\/pre>\n\n\n\n<p>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">clips<\/code> list will be used to store the temporal frames, 16 in our case.<\/p>\n\n\n\n<p>Finally, we loop over the video frames and carry out the inference.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"95\" data-enlighter-title=\"inference_video.py\" data-enlighter-group=\"inference_video_4\"># Read until end of video.\nwhile(cap.isOpened()):\n    # Capture each frame of the video.\n    ret, frame = cap.read()\n    if ret == True:\n        # Get the start time.\n        start_time = time.time()\n\n        image = frame.copy()\n        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)\n   
     frame = transform(image=frame)['image']\n        clips.append(frame)\n\n        if len(clips) == args.clip_len:\n            input_frames = np.array(clips)\n            # Add an extra batch dimension.\n            input_frames = np.expand_dims(input_frames, axis=0)\n            # Transpose to get [1, 3, clip_len, height, width].\n            input_frames = np.transpose(input_frames, (0, 4, 1, 2, 3))\n            # Convert the frames to tensor.\n            input_frames = torch.tensor(input_frames, dtype=torch.float32)\n            input_frames = input_frames.to(device)\n\n            with torch.no_grad():\n                outputs = model(input_frames)\n\n            # Get the prediction index.\n            _, preds = torch.max(outputs.data, 1)\n            \n            # Map predictions to the respective class names.\n            label = class_names[preds].strip()\n\n            # Get the end time.\n            end_time = time.time()\n            # Get the fps.\n            fps = 1 \/ (end_time - start_time)\n            # Add fps to total fps.\n            total_fps += fps\n            # Increment frame count.\n            frame_count += 1\n            print(f\"Frame: {frame_count}, FPS: {fps:.1f}\")\n\n            cv2.putText(\n                image, \n                label, \n                (15, 25),\n                cv2.FONT_HERSHEY_SIMPLEX, \n                0.8, \n                (0, 0, 255), \n                2, \n                lineType=cv2.LINE_AA\n            )\n            cv2.putText(\n                image, \n                f\"{fps:.1f} FPS\", \n                (15, 55),\n                cv2.FONT_HERSHEY_SIMPLEX, \n                0.8, \n                (0, 0, 255), \n                2, \n                lineType=cv2.LINE_AA\n            )\n\n            clips.pop(0)\n\n            if args.show:\n                cv2.imshow('image', image)\n                # Press `q` to exit.\n                if cv2.waitKey(1) &amp; 0xFF == ord('q'):\n      
               break\n            out.write(image)\n    else:\n        break\n\n# Release VideoCapture().\ncap.release()\n# Close all frames and video windows.\ncv2.destroyAllWindows()\n# Calculate and print the average FPS.\navg_fps = total_fps \/ frame_count\nprint(f\"Average FPS: {avg_fps:.3f}\")<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Forward Pass Note<\/h4>\n\n\n\n<p>The only major part to note here starts from <strong>line 108<\/strong>. We do not start the forward pass until we have 16 frames in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">clips<\/code> list. We convert the entire list into a NumPy array, then transpose it and convert it to a PyTorch tensor as further preprocessing.<\/p>\n\n\n\n<p>Then we forward an entire batch containing 16 frames through the model, predict the class name, calculate the FPS, and annotate the current frame with the FPS and the class name.<\/p>\n\n\n\n<p>Finally, we visualize the frame on the screen and store the result to disk.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Executing the Video Inference Script<\/h3>\n\n\n\n<p>While executing, we need to provide the path to the input file as a mandatory argument.<\/p>\n\n\n\n<p>Let&#8217;s start with a bowling video.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python inference_video.py --input input\/bowling.mp4 --show<\/pre>\n\n\n\n<figure class=\"wp-block-video aligncenter\"><video height=\"360\" style=\"aspect-ratio: 640 \/ 360;\" width=\"640\" autoplay controls loop muted src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscale-vision-transformer-bowling.mp4\"><\/video><figcaption class=\"wp-element-caption\">Clip 1. 
Video inference result using Multiscale Vision Transformer on a bowling video.<\/figcaption><\/figure>\n\n\n\n<p>The results are quite good. The model can predict the action correctly in all the frames.<\/p>\n\n\n\n<p>Next, let&#8217;s try another activity.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python inference_video.py --input input\/welding.mp4 --show<\/pre>\n\n\n\n<figure class=\"wp-block-video aligncenter\"><video height=\"360\" style=\"aspect-ratio: 640 \/ 360;\" width=\"640\" autoplay controls loop muted src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscale-vision-transformer-welding.mp4\"><\/video><figcaption class=\"wp-element-caption\">Clip 2. The MViT model is predicting the welding action correctly in all the frames.<\/figcaption><\/figure>\n\n\n\n<p>Here also, the model can detect the welding action correctly.<\/p>\n\n\n\n<p>Finally, let&#8217;s try a video where the model performs poorly.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python inference_video.py --input input\/push_ups.mp4 --show<\/pre>\n\n\n\n<figure class=\"wp-block-video aligncenter\"><video height=\"360\" style=\"aspect-ratio: 640 \/ 360;\" width=\"640\" autoplay controls loop muted src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/multiscale-vision-transformer-push_ups.mp4\"><\/video><figcaption class=\"wp-element-caption\">Clip 3. 
The Multiscale Vision Transformer model is predicting some of the frames wrongly in this push-up video.<\/figcaption><\/figure>\n\n\n\n<p>In this case, the model fails whenever the person is at the extreme top or bottom of the push-up. It&#8217;s rather difficult to pinpoint why that might be. It may be that the model is not getting enough motion information during the brief pause at those positions.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Some Real Life Use Cases of Fine-Tuning Video Action Recognition Models<\/h2>\n\n\n\n<p>There are several cases where we may want to fine-tune a <strong><a href=\"https:\/\/debuggercafe.com\/human-action-recognition-in-videos-using-pytorch\/\" target=\"_blank\" rel=\"noreferrer noopener\">video action recognition<\/a><\/strong> model.<\/p>\n\n\n\n<p>These include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Sports analytics<\/li>\n\n\n\n<li>Surveillance and monitoring<\/li>\n\n\n\n<li>Healthcare monitoring<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>We will try to tackle these use cases in future articles.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Further Reading<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong><a href=\"https:\/\/debuggercafe.com\/action-recognition-in-videos-using-deep-learning-and-pytorch\/\" target=\"_blank\" rel=\"noreferrer noopener\">Action Recognition in Videos using Deep Learning and PyTorch<\/a><\/strong><\/li>\n\n\n\n<li><strong><a href=\"https:\/\/debuggercafe.com\/train-s3d-video-classification-model\/\" target=\"_blank\" rel=\"noreferrer noopener\">Train S3D Video Classification Model using PyTorch<\/a><\/strong><\/li>\n\n\n\n<li><strong><a href=\"https:\/\/debuggercafe.com\/training-a-video-classification-model\/\" target=\"_blank\" rel=\"noreferrer noopener\">Training a Video Classification Model from Torchvision<\/a><\/strong><\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Summary and Conclusion<\/h2>\n\n\n\n<p>In this article, we discussed the Multiscale Vision Transformer model including 
its contributions and architecture, and we also ran inference on videos. We analyzed the results and found that the model may fall short on some action recognition tasks. We also discussed some use cases where we can fine-tune such action recognition models. I hope that this article was worth your time.<\/p>\n\n\n\n<p>If you have any doubts, thoughts, or suggestions, please leave them in the comment section. I will surely address them.<\/p>\n\n\n\n<p>You can contact me using the <strong><a aria-label=\"Contact (opens in a new tab)\" href=\"https:\/\/debuggercafe.com\/contact-us\/\" target=\"_blank\" rel=\"noreferrer noopener\">Contact<\/a><\/strong> section. You can also find me on <strong><a aria-label=\"LinkedIn (opens in a new tab)\" href=\"https:\/\/www.linkedin.com\/in\/sovit-rath\/\" target=\"_blank\" rel=\"noreferrer noopener\">LinkedIn<\/a><\/strong>, and <strong><a href=\"https:\/\/x.com\/SovitRath5\" target=\"_blank\" rel=\"noreferrer noopener\">X<\/a><\/strong>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Acknowledgment<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong><a href=\"https:\/\/arxiv.org\/abs\/2104.11227\" target=\"_blank\" rel=\"noreferrer noopener\">Multiscale Vision Transformers<\/a><\/strong><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>In this article, we cover the Multiscale Vision Transformer model for video action 
recognition.<\/p>\n","protected":false},"author":1,"featured_media":35157,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[76,90,479,529],"tags":[787,788,789,793,1383,1384,790,791,794,792],"class_list":["post-35083","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-computer-vision","category-pytorch","category-video-classification","category-vision-transformer","tag-multiscale-vision-transformer","tag-mvit","tag-mvit-pytorch","tag-mvit-video-action-recognition","tag-mvit-video-classification","tag-mvit-video-recognition","tag-pytorch-multiscale-vision-transformer","tag-torchvision-mvit","tag-video-classification-mvit","tag-video-recognition-using-multiscale-vision-transformer"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.9 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Multiscale Vision Transformer for Video Recognition<\/title>\n<meta name=\"description\" content=\"Multiscale Vision Transformer is a Transformer based video recognition model which learns from high and low resolution spatial inputs.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Multiscale Vision Transformer for Video Recognition\" \/>\n<meta property=\"og:description\" content=\"Multiscale Vision Transformer is a Transformer based video recognition model which learns from high and low resolution spatial inputs.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/\" \/>\n<meta property=\"og:site_name\" content=\"DebuggerCafe\" \/>\n<meta property=\"article:publisher\" 
content=\"https:\/\/www.facebook.com\/profile.php?id=100013731104496\" \/>\n<meta property=\"article:published_time\" content=\"2024-04-01T00:30:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-06-16T01:25:23+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/Multiscale-Vision-Transformer-for-Video-Recognition-e1707615127542.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1000\" \/>\n\t<meta property=\"og:image:height\" content=\"563\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Sovit Ranjan Rath\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@SovitRath5\" \/>\n<meta name=\"twitter:site\" content=\"@SovitRath5\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Sovit Ranjan Rath\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"16 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/\"},\"author\":{\"name\":\"Sovit Ranjan Rath\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\"},\"headline\":\"Multiscale Vision Transformer for Video Recognition\",\"datePublished\":\"2024-04-01T00:30:00+00:00\",\"dateModified\":\"2025-06-16T01:25:23+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/\"},\"wordCount\":1615,\"commentCount\":0,\"image\":{\"@id\":\"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/Multiscale-Vision-Transformer-for-Video-Recognition-e1707615127542.png\",\"keywords\":[\"Multiscale Vision Transformer\",\"MViT\",\"MViT PyTorch\",\"MViT Video Action Recognition\",\"MViT Video Classification\",\"MViT Video Recognition\",\"PyTorch Multiscale Vision Transformer\",\"Torchvision MViT\",\"Video Classification MViT\",\"Video Recognition using Multiscale Vision Transformer\"],\"articleSection\":[\"Computer Vision\",\"PyTorch\",\"Video Classification\",\"Vision Transformer\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/\",\"url\":\"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/\",\"name\":\"Multiscale Vision Transformer for Video 
Recognition\",\"isPartOf\":{\"@id\":\"https:\/\/debuggercafe.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/Multiscale-Vision-Transformer-for-Video-Recognition-e1707615127542.png\",\"datePublished\":\"2024-04-01T00:30:00+00:00\",\"dateModified\":\"2025-06-16T01:25:23+00:00\",\"author\":{\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\"},\"description\":\"Multiscale Vision Transformer is a Transformer based video recognition model which learns from high and low resolution spatial inputs.\",\"breadcrumb\":{\"@id\":\"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/#primaryimage\",\"url\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/Multiscale-Vision-Transformer-for-Video-Recognition-e1707615127542.png\",\"contentUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/Multiscale-Vision-Transformer-for-Video-Recognition-e1707615127542.png\",\"width\":1000,\"height\":563,\"caption\":\"Multiscale Vision Transformer for Video Recognition\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/debuggercafe.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Multiscale Vision Transformer for Video 
Recognition\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/debuggercafe.com\/#website\",\"url\":\"https:\/\/debuggercafe.com\/\",\"name\":\"DebuggerCafe\",\"description\":\"Machine Learning and Deep Learning\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/debuggercafe.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\",\"name\":\"Sovit Ranjan Rath\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g\",\"caption\":\"Sovit Ranjan Rath\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Multiscale Vision Transformer for Video Recognition","description":"Multiscale Vision Transformer is a Transformer based video recognition model which learns from high and low resolution spatial inputs.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/","og_locale":"en_US","og_type":"article","og_title":"Multiscale Vision Transformer for Video Recognition","og_description":"Multiscale Vision Transformer is a Transformer based video recognition model which learns from high and low resolution spatial inputs.","og_url":"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/","og_site_name":"DebuggerCafe","article_publisher":"https:\/\/www.facebook.com\/profile.php?id=100013731104496","article_published_time":"2024-04-01T00:30:00+00:00","article_modified_time":"2025-06-16T01:25:23+00:00","og_image":[{"width":1000,"height":563,"url":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/Multiscale-Vision-Transformer-for-Video-Recognition-e1707615127542.png","type":"image\/png"}],"author":"Sovit Ranjan Rath","twitter_card":"summary_large_image","twitter_creator":"@SovitRath5","twitter_site":"@SovitRath5","twitter_misc":{"Written by":"Sovit Ranjan Rath","Est. 
reading time":"16 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/#article","isPartOf":{"@id":"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/"},"author":{"name":"Sovit Ranjan Rath","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752"},"headline":"Multiscale Vision Transformer for Video Recognition","datePublished":"2024-04-01T00:30:00+00:00","dateModified":"2025-06-16T01:25:23+00:00","mainEntityOfPage":{"@id":"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/"},"wordCount":1615,"commentCount":0,"image":{"@id":"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/#primaryimage"},"thumbnailUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/Multiscale-Vision-Transformer-for-Video-Recognition-e1707615127542.png","keywords":["Multiscale Vision Transformer","MViT","MViT PyTorch","MViT Video Action Recognition","MViT Video Classification","MViT Video Recognition","PyTorch Multiscale Vision Transformer","Torchvision MViT","Video Classification MViT","Video Recognition using Multiscale Vision Transformer"],"articleSection":["Computer Vision","PyTorch","Video Classification","Vision Transformer"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/debuggercafe.com\/multiscale-vision-transformer\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/","url":"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/","name":"Multiscale Vision Transformer for Video 
Recognition","isPartOf":{"@id":"https:\/\/debuggercafe.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/#primaryimage"},"image":{"@id":"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/#primaryimage"},"thumbnailUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/Multiscale-Vision-Transformer-for-Video-Recognition-e1707615127542.png","datePublished":"2024-04-01T00:30:00+00:00","dateModified":"2025-06-16T01:25:23+00:00","author":{"@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752"},"description":"Multiscale Vision Transformer is a Transformer based video recognition model which learns from high and low resolution spatial inputs.","breadcrumb":{"@id":"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/debuggercafe.com\/multiscale-vision-transformer\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/#primaryimage","url":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/Multiscale-Vision-Transformer-for-Video-Recognition-e1707615127542.png","contentUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2024\/02\/Multiscale-Vision-Transformer-for-Video-Recognition-e1707615127542.png","width":1000,"height":563,"caption":"Multiscale Vision Transformer for Video Recognition"},{"@type":"BreadcrumbList","@id":"https:\/\/debuggercafe.com\/multiscale-vision-transformer\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/debuggercafe.com\/"},{"@type":"ListItem","position":2,"name":"Multiscale Vision Transformer for Video Recognition"}]},{"@type":"WebSite","@id":"https:\/\/debuggercafe.com\/#website","url":"https:\/\/debuggercafe.com\/","name":"DebuggerCafe","description":"Machine Learning and Deep 
Learning","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/debuggercafe.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752","name":"Sovit Ranjan Rath","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g","caption":"Sovit Ranjan Rath"}}]}},"_links":{"self":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/35083","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/comments?post=35083"}],"version-history":[{"count":84,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/35083\/revisions"}],"predecessor-version":[{"id":38144,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/35083\/revisions\/38144"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/media\/35157"}],"wp:attachment":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/media?parent=35083"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/categories?post=35083"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/tags?post=35083"}],"curies":[{"name":"wp",
"href":"https:\/\/api.w.org\/{rel}","templated":true}]}}