First things first: we know there is no demo video at the top. The deadline was close and we could not record one; running the model took a very long time, and the laptop gave out after serving for more than 30 hours continuously. Please take the accuracy reported here on trust and, if you can, cross-verify it by loading best_model.pth and testing it once.

Inspiration

This project is a basic neural network setup for classifying images into animal classes. We are still learning AI, and with a basic knowledge of deep learning and neural networks we tried to solve a simple classification problem as a first step toward building a full-fledged model. Though it attempts to mimic how a human observes and infers, the simple rule behind it is to decipher images into mathematical patterns and connect them to known categories and traits. The project focuses on building the model from scratch, without pretrained architectures, to reinforce a foundational understanding of convolutional neural networks (CNNs). The major focus was on helping the model extract meaningful features from images and on training it to interpret semantic features, not just raw image patterns.

What it does

This project learns patterns from the image dataset and a binary predicate matrix using a simple neural network and tests its accuracy over different classes of animals. For each image, it predicts an animal class (from the labels it was given) and stores the prediction in a CSV file.

Much as a human would identify animals, the network understands images in stages:

Edges and textures: The first layer identifies simple patterns like edges or corners, similar to how humans see outlines.

Complex structures: As the image passes through deeper layers, the model combines simple patterns to detect more complex features like a beak, claws, or fur patterns.

After extracting features, the model uses pooling layers to condense the information into the most noticeable features (e.g., a lion's mane or an elephant’s trunk) while ignoring unnecessary details.
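
To make the pooling step concrete, here is a minimal sketch of 2x2 max pooling on a small made-up feature map, written in plain Python. The helper name `max_pool_2x2` and the numbers are illustrative assumptions and are not taken from the project code, which presumably relies on a library pooling layer.

```python
def max_pool_2x2(feature_map):
    """Downsample a 2D feature map by keeping the maximum of each 2x2 block."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            block = (
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            pooled_row.append(max(block))  # keep only the strongest response
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4x4 activation map (e.g. the response of one "stripe" filter).
feature_map = [
    [0.1, 0.9, 0.2, 0.0],
    [0.3, 0.4, 0.8, 0.1],
    [0.0, 0.2, 0.1, 0.1],
    [0.7, 0.1, 0.3, 0.5],
]
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[0.9, 0.8], [0.7, 0.5]]
```

Each 2x2 block collapses to its strongest activation, so the map shrinks to a quarter of its size while the most prominent responses survive.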

The performance of the model indicates that it does extract meaningful features for seen classes. These features likely include:

  • Distinct patterns like stripes for zebras.
  • Shape-based traits like trunks for elephants.
  • Texture-related features like fur or scales.

The model achieves reasonable performance on unseen classes because the semantic predicates (binary matrix) guide the predictions.

The predicate “stripes” helps identify unseen animals like okapis or other striped species.

The predicate “flys” aids in distinguishing birds from terrestrial animals.

The gap between training accuracy and validation/test accuracy suggests that the predicates and image features need to be better aligned for optimal performance, leaving scope for future development.

How we built it

Raw data:

  • Dataset: Contains over 10,000 images spanning 50 animal classes, divided into train and test folders.
  • predicate-matrix-binary.txt: Binary matrix representing 85 semantic features for each class.
  • classes.txt: A file listing all class labels.

Data Preprocessing:

  • Addressed discrepancies in the dataset:

    • Training folder contains only 40 classes, while the test folder includes 50 (unlabeled) classes.
    • Updated classes.txt and predicate-matrix-binary.txt to align known classes and account for unseen ones.
  • Split the training data in an 80-20 ratio for training and validation.

  • Randomly grouped the samples into batches of 32 so that training can use mini-batch gradient descent.

  • Image Preprocessing:

    • Resize(): resizes images to a standard size of 224x224 pixels using bilinear interpolation.
    • ToTensor(): converts a PIL image into a tensor with pixel values scaled to [0, 1].
    • Normalize(): normalizes the images using channel-wise means and standard deviations (RGB channels), consistent with ImageNet standards
      • Mean: [0.485, 0.456, 0.406]
      • Standard Deviation: [0.229, 0.224, 0.225]
      • Normalized Pixel Value $= \frac{\text{Pixel Value} - \text{Mean}}{\text{Standard Deviation}}$
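
As a worked example of the normalization formula, the sketch below applies the ToTensor-style scaling and the channel-wise normalization to a single RGB pixel in plain Python. The function name `normalize_pixel` and the sample pixel are illustrative assumptions; only the mean and standard deviation values come from the list above.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb_uint8):
    """Scale one 8-bit RGB pixel to [0, 1], then normalize each channel."""
    scaled = [value / 255.0 for value in rgb_uint8]  # what ToTensor() does per pixel
    return [
        (s - mean) / std
        for s, mean, std in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)
    ]


# Hypothetical pixel: full red, mid green, no blue.
normalized = normalize_pixel([255, 128, 0])
print([round(v, 3) for v in normalized])  # roughly [2.249, 0.205, -1.804]
```

After normalization the values are no longer confined to [0, 1]; they are centered around zero, which is what the convolutional layers receive.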

Data Augmentation

  • RandomHorizontalFlip(): randomly flips some of the images in the batches
  • ColorJitter(): randomly changes image brightness, contrast, saturation, and hue.
  • RandomAffine(): applies random affine transformations such as rotation, scaling, and translation

Model Architecture

  • A multi-task learning approach (or auxiliary-task learning) to the classification problem that learns both

    • the image features (PyTorch tensor of the image's pixel values): single-label, and
    • the predicate features (semantic attributes of the picture): multi-label
  • Five convolutional blocks (applied to the image tensor), each consisting of:

    • Convolution
    • Batch Normalization: to stabilize learning
    • ReLU Activation: to introduce non-linearity to the model
    • Maximum Pooling: to reduce the dimensionality of the images
  • Fully connected layer for classification by image features

    • With ReLU activation and Dropout for generalization
  • Fully connected layers for the predicates (applied to the same image features after convolution)

    • 2 x (ReLU and Dropout), followed by a linear layer with num_predicates outputs to compare against the predicates
  • Returns both the predicate predictions and the class prediction for training and other computations
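
A two-headed network matching the outline above might look like the following PyTorch sketch. The class name `AnimalNet`, the channel widths, hidden size, dropout rate, kernel sizes, and the choice of 50 class outputs are all assumptions made for illustration; the write-up only fixes the overall structure (five conv blocks, a class head, a predicate head with 85 outputs). The initialization loop follows the Kaiming-normal weights and zero biases described under Training Setup.

```python
import torch
from torch import nn


def conv_block(in_channels, out_channels):
    """Convolution -> BatchNorm -> ReLU -> MaxPool, as listed above."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2),  # halves height and width
    )


class AnimalNet(nn.Module):
    """Hypothetical two-headed CNN: class logits plus predicate logits."""

    def __init__(self, num_classes=50, num_predicates=85,
                 widths=(16, 32, 64, 128, 256), hidden=256, p_drop=0.5):
        super().__init__()
        channels = (3,) + tuple(widths)
        self.features = nn.Sequential(
            *[conv_block(channels[i], channels[i + 1]) for i in range(5)]
        )
        flat = widths[-1] * 7 * 7  # 224 / 2**5 = 7 after five poolings
        self.class_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, hidden), nn.ReLU(inplace=True), nn.Dropout(p_drop),
            nn.Linear(hidden, num_classes),
        )
        self.predicate_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, hidden), nn.ReLU(inplace=True), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True), nn.Dropout(p_drop),
            nn.Linear(hidden, num_predicates),
        )
        # Kaiming-normal weights and zero biases for conv and linear layers.
        for module in self.modules():
            if isinstance(module, (nn.Conv2d, nn.Linear)):
                nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
                nn.init.zeros_(module.bias)

    def forward(self, images):
        feats = self.features(images)
        return self.class_head(feats), self.predicate_head(feats)


model = AnimalNet()
model.eval()
with torch.no_grad():
    class_logits, predicate_logits = model(torch.randn(2, 3, 224, 224))
print(class_logits.shape, predicate_logits.shape)
```

Both heads read the same convolutional features, so the predicate task acts as an auxiliary signal that shapes what the shared layers learn.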

Training Setup

  • Weight and Bias Initialization

    • weights are initialized with kaiming_normal, which helps keep the variance of activations stable across layers and reduces the risk of vanishing or exploding gradients during training
    • all biases are set to zero
  • Optimizer

    • Adam Optimizer: Adaptive Moment Estimation, a combination of the ‘gradient descent with momentum’ algorithm and the ‘RMSProp’ algorithm.
    • Learning Rate = 1e-3
    • Weight Decay = 1e-4
  • Learning Rate Scheduler

    • OneCycleLR(): anneals the learning rate from an initial learning rate to some maximum learning rate and then from that maximum learning rate to some minimum learning rate much lower than the initial learning rate.
  • Loss Functions

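    • CrossEntropyLoss(): for the class labels
      • CrossEntropyLoss(reduction = 'mean') $= \frac{-\sum_{i=1}^N w_{y_i}\log\left(\frac{\exp(x_{i, y_i})}{\sum_{c=1}^C \exp(x_{i, c})}\right)}{\sum_{i=1}^N w_{y_i}}$, where $x_{i, c}$ is the logit for class $c$ on sample $i$, $y_i$ is its true class, and $w_{y_i}$ is the weight of that class
      • (Equivalent to applying LogSoftmax and then Negative Log Likelihood Loss)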
    • BinaryCrossEntropyLoss(): for the predicates predicted by the model
      • $BCE = -\frac{1}{N}\sum_{i=1}^N w_i\left(y_i\log(p(y_i))+(1-y_i)\log(1-p(y_i))\right)$
    • lossfn_alignment() (custom loss function): for aligning the predicted image features with the predicates of the class.
      • takes the dot product of the model's predicted class tensor with the predicate matrix, and
      • computes the MSE loss between this projection and the true predicates from the matrix
  • Number of Epochs

    • 30
  • Hardware

    • Kaggle GPU P100
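
To show how the three loss terms fit together, here is a plain-Python sketch for a single sample with unit weights. It rests on several assumptions: the "predicted class tensor" is taken to be the softmax probabilities, the three terms are summed with equal weight, and the tiny 3-class, 4-predicate matrix is made up. The function names are illustrative and are not the project's lossfn_alignment().

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(class_logits, target):
    """Negative log of the softmax probability of the true class."""
    return -math.log(softmax(class_logits)[target])


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean BCE over all predicates, with probabilities clamped for safety."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)


def alignment_loss(class_logits, predicate_matrix, target):
    """MSE between (class probabilities . predicate matrix) and the true row."""
    probs = softmax(class_logits)  # assumption: probabilities, not raw logits
    num_predicates = len(predicate_matrix[0])
    projected = [
        sum(probs[c] * predicate_matrix[c][k] for c in range(len(probs)))
        for k in range(num_predicates)
    ]
    true_row = predicate_matrix[target]
    return sum((a - b) ** 2 for a, b in zip(projected, true_row)) / num_predicates


# Made-up toy setup: 3 classes, 4 binary predicates per class.
predicate_matrix = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]
target = 0
class_logits = [2.0, 0.5, 0.1]
predicate_probs = [0.9, 0.2, 0.7, 0.1]  # sigmoid outputs of the predicate head

total_loss = (
    cross_entropy(class_logits, target)
    + binary_cross_entropy(predicate_probs, predicate_matrix[target])
    + alignment_loss(class_logits, predicate_matrix, target)
)
print(round(total_loss, 4))
```

The alignment term shrinks toward zero as the class probabilities concentrate on the true class, because the projection then reproduces that class's row of the predicate matrix.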

Testing Setup

  • uses a simple zero-shot learning approach

    • compute class probabilities by applying softmax to the class outputs of the model; the most probable class = predicted_class1
    • compute a dynamically weighted similarity (Intersection over Union and Cosine Similarity) between the predicted predicates and each class's predicates
    • the class with the highest similarity = predicted_class2
    • if predicted_class2 is among the unseen classes, actual predicted class = predicted_class2
    • if not,
      • if the probability of that class is greater than a confidence threshold, actual predicted class = predicted_class2
      • if not, actual predicted class = predicted_class1
    • return the actual predicted class
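
The decision rule above can be sketched in plain Python as follows. Several details are assumptions: the similarity uses a fixed `iou_weight` in place of the dynamic weighting (whose scheme is not described), predicate probabilities are binarized at 0.5 for the IoU, "the probability of that class" is read as the softmax probability of predicted_class2, and the threshold value and toy data are made up.

```python
import math


def iou(a, b):
    """Intersection over Union of two binary vectors."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0


def cosine(a, b):
    """Cosine similarity of two real-valued vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(class_probs, predicate_probs, predicate_matrix, unseen,
            threshold=0.3, iou_weight=0.5):
    """Combine the class head and the predicate head into one prediction."""
    # predicted_class1: most probable class according to the softmax output.
    predicted_class1 = max(range(len(class_probs)), key=lambda c: class_probs[c])

    # predicted_class2: class whose predicate row best matches the predicates.
    binary = [1 if p > 0.5 else 0 for p in predicate_probs]

    def similarity(c):
        row = predicate_matrix[c]
        return (iou_weight * iou(binary, row)
                + (1 - iou_weight) * cosine(predicate_probs, row))

    predicted_class2 = max(range(len(predicate_matrix)), key=similarity)

    if predicted_class2 in unseen:
        return predicted_class2
    if class_probs[predicted_class2] > threshold:
        return predicted_class2
    return predicted_class1


# Made-up toy setup: classes 0 and 1 were seen in training, class 2 was not.
predicate_matrix = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]
unseen = {2}

# Predicates match the unseen class 2, so it overrides the class head.
result_unseen = predict([0.7, 0.2, 0.1], [0.9, 0.8, 0.1, 0.2], predicate_matrix, unseen)
# Predicates match seen class 0, but its probability is low: fall back to class 1.
result_fallback = predict([0.1, 0.7, 0.2], [0.9, 0.1, 0.8, 0.2], predicate_matrix, unseen)
print(result_unseen, result_fallback)  # 2 1
```

The predicate head is the only route to an unseen class here, since the class head received labels for seen classes only during training.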

Results

  • Training Accuracy: ~70%

  • Validation and Test Accuracy: ~30-40%

Challenges we ran into

  • Initial implementation resulted in low accuracy (~10% on the test set). Subsequent iterations improved results through:

    • Data augmentation techniques.
    • Optimizer, scheduler, and loss function refinements.
  • A pivotal change involved predicting predicates separately and introducing alignment loss, which improved validation accuracy to ~40%.

  • Challenges faced:

    • Proper integration of the predicate matrix.
    • Computational overhead due to added complexity.
    • Limited understanding of confusion matrices and class-specific optimizations.
  • Future directions include refining hyperparameters, exploring pretrained models, and explicitly implementing zero-shot learning.

Accomplishments that we're proud of

As our very first implementation of a data model, this project was a full dose of learning for us. It made us realize how minute details in an image, once translated into numbers for a computer, can affect the definitions we have given it. We are proud that we reached a measurable accuracy, despite the six long hours it took us to train the model and test it over multiple images. We are proud of the fact that although this field is continuously emerging, the basic solutions lie in fundamental mathematical and statistical models. We are proud that we could be a part of the journey and learn different techniques to teach a computer model.

What we learned

  • Basic implementation of a neural network

  • Mathematics behind a deep learning model and the training algorithm: backpropagation

  • Tradeoff between accuracy, complexity, and computational overhead: fine-tuning the parameters, iterative problem solving techniques

  • Zero-shot learning concepts

What's next for Animal Classifier

While the model demonstrates low to moderate accuracy, it underscores the importance of experimentation and theoretical grounding in building custom architectures. Model performance is limited by hyperparameter choices and the absence of pretrained architectures. Future work could involve hyperparameter tuning, incorporating PCA, leveraging embeddings for semantic relationships, and explicitly implementing zero-shot learning.
