Title

The aim of this paper is to use both CNNs and transformers to detect the objects present in images.

People

  • Titas Grusnius tgrusniu
  • Zainab Iftikhar ziftikha
  • Tyler Jacobson tjacobs1
  • Manar Abdelatty mabdelat

Introduction

The objective of this paper is to produce a set of predictions of the objects contained in images using a framework called DEtection TRansformer (DETR). DETR produces bounding boxes around the objects it detects in an image and outputs a category label for each of them. We chose this paper because it combines CNNs with a transformer encoder-decoder structure and is written in PyTorch, which will allow us to apply the concepts learned in class and re-implement the paper in TensorFlow. Object detection is framed as a direct set prediction problem: the model outputs a fixed-size set of N predictions for the bounding boxes it has detected in an image. The model works by first using a CNN to produce a set of image features. It then uses the transformer encoder-decoder structure to produce the set of box predictions, and it is trained with a bipartite matching loss.

Related Work

Stewart et al. link proposed a system for detecting objects in images. They developed a model that decodes an image into a set of people detections in crowded scenes. The model uses a recurrent LSTM to predict a set of bounding boxes from an image, and its encoder-decoder architecture is trained with a bipartite matching loss. Since their model relies on RNNs, however, it does not benefit from the parallel decoding that the transformers in the paper we are implementing provide.

Code for related work: link

Code for the paper we’re trying to implement: link

Data

We will be using the COCO (Common Objects in Context) dataset, a standard dataset for object detection, segmentation, image captioning, and keypoint detection. Specifically, we will use the COCO 2017 release, which contains 118K training images and 5K validation images. Each image is annotated with attributes such as category, bounding box, area, and segmentation. For the object detection task, we are mainly interested in the bounding box and object category attributes. Each image has one or more bounding box annotations, with an average of 6 objects per image and a maximum of 63 objects in a single image. In total, the dataset has 80 different object categories.

Since COCO is a standard dataset, we don't need to perform any intensive pre-processing. Accessing and downloading the dataset is also made easy through the COCO Python APIs, which allow us to load the dataset directly without having to parse the COCO JSON format. If we face difficulties with regard to computing resources, we may end up using a subset of the dataset by limiting the number of object categories to 40 or fewer.
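As a rough sketch of how the COCO Python API could drive this subsetting (the 40-category cutoff and the rule of dropping any image that contains an excluded object are our assumptions, not settled decisions):

```python
def all_in_allowed(category_ids, allowed):
    """True only if every annotated object belongs to an allowed category."""
    return all(c in allowed for c in category_ids)


def load_subset_image_ids(ann_file, num_categories=40):
    """Keep an image only when all of its objects fall in the chosen
    categories. Needs the real annotation file, so the COCO import is
    deferred until this function is actually called."""
    from pycocotools.coco import COCO  # pip install pycocotools
    coco = COCO(ann_file)
    allowed = set(coco.getCatIds()[:num_categories])  # arbitrary choice of categories
    keep = []
    for img_id in coco.getImgIds():
        anns = coco.loadAnns(coco.getAnnIds(imgIds=[img_id]))
        if all_in_allowed([a["category_id"] for a in anns], allowed):
            keep.append(img_id)
    return keep
```

A looser alternative would be to keep images that contain at least one allowed object and simply drop the other annotations; the stricter rule above shrinks the dataset much faster.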

Methodology

The model is composed of three main blocks: (1) a CNN backbone, (2) a transformer with an encoder-decoder architecture, and (3) a fully connected network. The CNN backbone is responsible for extracting compact features from the input image: it takes an RGB input image of size 3 x H x W and outputs a lower-resolution feature map of size 2048 x H/32 x W/32.

The transformer follows the standard encoder-decoder architecture. The encoder block is composed of a multi-head attention module and a fully connected network. It takes as input the sum of the feature map extracted by the CNN backbone and a positional encoding vector; since the encoder expects a sequence as input, the feature map with positional information is flattened to size d x HW. The decoder block also follows the standard decoder architecture and is composed of multi-head self-attention and encoder-decoder attention modules. It takes as input N learned embeddings referred to as object queries, and the output of the encoder is also fed to the decoder to provide context for the whole image.

The decoder outputs N embeddings, which are independently fed through fully connected layers to make N output predictions. The output of the fully connected layers is a tuple of box coordinates and a class label for each prediction.
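The three blocks above can be sketched in TensorFlow roughly as follows. This is a deliberately shrunken single-layer sketch: the dimensions, the single encoder/decoder layer, the learned positional encodings, the fixed 224x224 input (hence the hardcoded 7x7 = 49 positions), and the class/query counts are our simplifications, not the paper's full configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers


class DETRSketch(tf.keras.Model):
    """Minimal backbone -> transformer -> FFN-heads sketch of the pipeline."""

    def __init__(self, num_queries=10, d_model=64, num_heads=4, num_classes=92):
        super().__init__()
        # (1) CNN backbone: ResNet-50 without its classifier; 224x224 -> 7x7x2048.
        self.backbone = tf.keras.applications.ResNet50(include_top=False, weights=None)
        self.proj = layers.Conv2D(d_model, 1)           # 1x1 conv: 2048 -> d_model
        self.pos_embed = layers.Embedding(49, d_model)  # learned positional encoding (7*7 positions)
        self.query_embed = layers.Embedding(num_queries, d_model)  # N object queries
        # (2) One encoder layer and one decoder layer (the paper stacks several).
        self.enc_attn = layers.MultiHeadAttention(num_heads, d_model)
        self.dec_self_attn = layers.MultiHeadAttention(num_heads, d_model)
        self.dec_cross_attn = layers.MultiHeadAttention(num_heads, d_model)
        self.ffn = tf.keras.Sequential(
            [layers.Dense(256, activation="relu"), layers.Dense(d_model)])
        # (3) FFN heads: class logits (including "no object") and box coordinates.
        self.class_head = layers.Dense(num_classes)
        self.box_head = layers.Dense(4, activation="sigmoid")  # normalized (cx, cy, w, h)
        self.num_queries = num_queries

    def call(self, images):
        f = self.proj(self.backbone(images))                    # (B, 7, 7, d)
        seq = tf.reshape(f, (tf.shape(f)[0], -1, f.shape[-1]))  # flatten to (B, HW, d)
        seq = seq + self.pos_embed(tf.range(tf.shape(seq)[1]))  # add positional encoding
        memory = self.enc_attn(seq, seq) + seq                  # simplified encoder layer
        q = self.query_embed(tf.range(self.num_queries))[None]  # (1, N, d)
        q = tf.tile(q, [tf.shape(images)[0], 1, 1])             # broadcast queries per image
        q = self.dec_self_attn(q, q) + q                        # decoder self-attention
        q = self.dec_cross_attn(q, memory) + q                  # encoder-decoder attention
        q = self.ffn(q) + q
        return self.class_head(q), self.box_head(q)             # N predictions per image
```

Layer normalization and dropout, present in the standard transformer layer, are also omitted here for brevity.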

Since object detection is a supervised learning problem, we will use gradient descent for training. Like the paper's implementation, we will use the Adam optimizer to update the weight parameters. For hyperparameters such as the learning rate, we will use 10^(-5) for the CNN backbone block and 10^(-4) for the transformer block. Additionally, we will use a pre-trained ResNet-50 model for the CNN backbone. The whole model, including the pre-trained backbone, will be trained with the bipartite matching loss.
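One simple way to realize the two learning rates in TensorFlow is to keep one Adam optimizer per variable group. This is a sketch under the assumption that our model exposes its backbone as a `model.backbone` attribute; the loss function is passed in as a callable.

```python
import tensorflow as tf

# One Adam optimizer per block, mirroring the paper's learning rates.
backbone_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
transformer_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)


def train_step(model, images, targets, loss_fn):
    """Single gradient step with separate learning rates for the backbone
    (assumed to live at `model.backbone`) and everything else."""
    with tf.GradientTape() as tape:
        outputs = model(images, training=True)
        loss = loss_fn(outputs, targets)
    backbone_vars = model.backbone.trainable_variables
    backbone_refs = {v.ref() for v in backbone_vars}
    other_vars = [v for v in model.trainable_variables if v.ref() not in backbone_refs]
    grads = tape.gradient(loss, backbone_vars + other_vars)
    n = len(backbone_vars)
    backbone_optimizer.apply_gradients(zip(grads[:n], backbone_vars))
    transformer_optimizer.apply_gradients(zip(grads[n:], other_vars))
    return loss
```

The step could later be wrapped in `tf.function` for speed once the model is stable.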

The architecture of the model is quite complicated. We expect the hardest part of the implementation to be the transformer block, as it is more complicated than the CNN and fully connected blocks. Another challenge we expect to face is the large computational demand of the dataset and the model; we plan to handle that by using a smaller subset of the dataset and reducing the number of trainable parameters in the model.

Metrics

Our base, target, and reach goals are defined as follows:

Base goal: be able to train a simplified version of the model that is able to distinguish between 20 different object categories.

Target goal: be able to train a model that is able to distinguish between 40 different object categories.

Reach goal: be able to reimplement the model described in the paper to its fullest extent.

The purpose of the base goal is to prove that we can in fact reproduce a simplified model with the architecture outlined in the paper. Simplifying the model would mean reducing, for example, the number of transformer blocks and feature mappings, while leaving the underlying CNN-transformer-dense-layer architecture intact. The challenge in reaching our target and reach goals will be scaling the model within a reasonable time frame; it took the authors of the paper 3 days to fully train the model on 16 GPUs. Thankfully, the size of our dataset shrinks rapidly as we reduce the number of permissible object categories, since we would drop any training image that contains even one instance of an excluded category.

Accuracy is not an especially relevant metric for this model, since for each input image it outputs N predictions of either (class, bounding box) or "no object". Rather, the researchers use two loss functions: a bipartite-matching loss, which is novel, and a box loss, which is standard practice in object detection. The bipartite-matching loss matches each of the N predictions generated by the model to the y objects in the image, where the ground truth is padded with enough "no object" entries that N is much larger than y for most images. It severely penalizes extra or missing object classifications and less severely penalizes incorrect box predictions. The box loss the researchers use is a linear combination of an L1 loss and a generalized IoU loss, which is scale-invariant. This box loss maximizes the overlap between a ground-truth box and a predicted bounding box while not unfairly penalizing large boxes.
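To make the matching step concrete, here is a NumPy sketch of the assignment on the box terms alone. It is not the full matching cost: the class-probability term is omitted, boxes are written as (x1, y1, x2, y2) corners rather than the normalized center format, and the weights are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def pairwise_giou(a, b):
    """Generalized IoU between every box in a (N,4) and b (M,4),
    boxes given as (x1, y1, x2, y2). Returns an (N, M) matrix in [-1, 1]."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = np.maximum(a[:, None, :2], b[None, :, :2])   # intersection top-left
    rb = np.minimum(a[:, None, 2:], b[None, :, 2:])   # intersection bottom-right
    wh = np.clip(rb - lt, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]
    union = area_a[:, None] + area_b[None, :] - inter
    iou = inter / union
    lt_c = np.minimum(a[:, None, :2], b[None, :, :2])  # smallest enclosing box
    rb_c = np.maximum(a[:, None, 2:], b[None, :, 2:])
    area_c = (rb_c - lt_c)[..., 0] * (rb_c - lt_c)[..., 1]
    return iou - (area_c - union) / area_c             # penalize loose enclosures


def hungarian_match(pred_boxes, gt_boxes, l1_weight=5.0, giou_weight=2.0):
    """Assign predictions to ground-truth objects by minimizing a cost
    mixing L1 distance and negative GIoU (class term omitted)."""
    l1 = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = l1_weight * l1 - giou_weight * pairwise_giou(pred_boxes, gt_boxes)
    return linear_sum_assignment(cost)  # (pred indices, matched gt indices)
```

Once matched, the box loss would be computed on the matched pairs only, again as a weighted sum of L1 and (1 - GIoU).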

Ethics

This problem is a good application of deep learning: classifying the contents of images efficiently and accurately matters for society, especially as self-driving cars, sensors, and other IoT devices become more ubiquitous. Object detection and recognition is a long-studied problem for deep learning researchers, and designing novel paradigms for it is critical to continuously improving performance, which in turn improves the safety of these devices. The data is readily available and labeled within the COCO dataset, making the problem accessible to deep learning researchers.

However, we should concern ourselves with the tradeoff between the added benefit of developing a more accurate image recognition model and the energy required to train such a model. This is especially relevant considering that many electric grids worldwide have yet to fully decarbonize and training deep learning models has historically contributed to considerable Scope 2 carbon emissions.

Division of labor

We will divide the project into the following tasks:

  1. Loading the dataset and taking a smaller subset by limiting the number of object categories.
  2. Implementing the CNN backbone.
  3. Implementing the encoder in the transformer.
  4. Implementing the decoder in the transformer.
  5. Implementing the fully connected layers.
  6. Implementing the loss function.
  7. Implementing the train and the test functions.
  8. Fine-tuning the architecture and hyper-parameters.

The division of labor will be as follows:

  1. Titas : [1, 2]
  2. Zainab: [3, 8]
  3. Tyler : [6, 7]
  4. Manar: [8, 5]

Project Check-in #2

Project Outline

Project Check-in #3

Introduction

The objective of this paper is to produce a set of predictions of the objects contained in images using a framework called DEtection TRansformer (DETR). DETR produces bounding boxes around the objects it detects in an image and outputs a category label for each of them. We chose this paper because it combines CNNs with a transformer encoder-decoder structure and is written in PyTorch, which will allow us to apply the concepts learned in class and re-implement the paper in TensorFlow. Object detection is framed as a direct set prediction problem: the model outputs a fixed-size set of N predictions for the bounding boxes it has detected in an image. The model works by first using a CNN to produce a set of image features. It then uses the transformer encoder-decoder structure to produce the set of box predictions, and it is trained with a bipartite matching loss function.

Challenges

The hardest part of the project so far has been putting the different components of the model together. We started by dividing the project into four main components: dataset and backbone, transformer encoder, transformer decoder, and loss function, where each group member took responsibility for translating the PyTorch code of one component to TensorFlow. Gluing everything together has been a challenge, since we are testing our implementation and fixing issues along the way. We have also run into implementation differences between the TensorFlow and PyTorch APIs. For example, the MultiHeadAttention unit in TensorFlow lacks one of the arguments that the PyTorch implementation uses. Reconciling such differences has been another challenge.

Insights

Our progress so far is as follows: the four components of the project (dataset processing and CNN backbone, transformer encoder, transformer decoder, and loss function) have been implemented in TensorFlow. We have also finished the data pre-processing pipeline, which includes parsing the training and validation datasets, applying random transformations (resizing, cropping, horizontal flipping), and batching the data. We are currently integrating the CNN backbone and the transformer model so that we can do a full forward pass through the model and build our training loop.
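The random transformations in the pipeline can be sketched as below. The sizes are placeholders, and the matching bounding-box coordinate updates (crop offsets, flipped x coordinates) are omitted for brevity, though the real pipeline must apply them.

```python
import tensorflow as tf


def augment(image, out_size=(512, 512)):
    """Resize, random crop, and random horizontal flip (image only)."""
    image = tf.image.resize(image, (576, 576))           # resize before cropping
    image = tf.image.random_crop(image, (*out_size, 3))  # random spatial crop
    image = tf.image.random_flip_left_right(image)       # horizontal flip
    return image

# In the tf.data pipeline (sketch):
# train_ds = train_ds.map(augment).batch(8)
```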

Plan

Our main remaining coding tasks are to 1) connect all the portions of the model that we translated from PyTorch to TensorFlow separately and 2) implement the training and testing loops. In addition, we are considering filtering the dataset down to a limited number of object classes to limit the time required for training and testing the model. Training this model took the researchers three days on 16 GPUs running in parallel; to achieve comparable results on far fewer computing resources, we need to limit the complexity of our data by reducing the number of object classes. Additionally, we are thinking of sticking to fixed-size images. The original implementation of the paper dealt with variable-sized images and used padding to unify image sizes within each batch; this has been challenging to implement in TensorFlow, as the batch function requires all images to have the same size. Overall, we are on track with our project progress.
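The fixed-size approach we are leaning toward amounts to a one-line map before batching; the size and names below are illustrative. (`tf.data`'s `padded_batch` would be the variable-size alternative if we revisit padding later.)

```python
import tensorflow as tf


def to_fixed_size(image, boxes, size=512):
    """Resize every image to one fixed square size so Dataset.batch can
    stack them; boxes normalized to [0, 1] are unchanged by a plain resize."""
    return tf.image.resize(image, (size, size)), boxes

# ds = ds.map(to_fixed_size).batch(8)  # uniform shapes, so batching succeeds
```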

Final Check-in


Poster

