Animal-Classifier

Introduction

This project is a basic neural network for classifying images into animal classes. Although it loosely mimics how a human observes and infers, the underlying idea is simple: reduce images to mathematical patterns and connect those patterns to known categories and traits. The model is built from scratch, without pretrained architectures, to reinforce a foundational understanding of convolutional neural networks (CNNs). The main focus was on improving how well it extracts meaningful features from images, and on training it to interpret semantic features rather than image patterns alone.

Python Utilities

PyTorch: to load datasets, define the model architecture, and train and test the model.
matplotlib: to plot model loss and accuracy
pandas: to save predictions of model to a csv file
PIL (Python Imaging Library): to load images
tqdm: to log the epoch progress with accuracy and loss updates
os: to handle directory structures

Data

Raw data

Dataset: Contains over 10,000 images spanning 50 animal classes, divided into train and test folders.
predicate-matrix-binary.txt: Binary matrix representing 85 semantic features for each class.
classes.txt: A file listing all class labels.

Data Preprocessing

Addressed discrepancies in the dataset:
- Training folder contains only 40 classes, while the test folder includes 50 (unlabeled) classes.
- Updated classes.txt and predicate-matrix-binary.txt to align known classes and account for unseen ones.
Split the training data 80-20 into training and validation sets.
Shuffled the data and grouped it into batches of 32 for mini-batch gradient descent during training.
Image Preprocessing:
- Resize(): resizes images to a standard size of 224x224 pixels using bilinear interpolation.
- ToTensor(): converts PIL image into tensor with pixel values scaled to [0, 1]
- Normalize(): normalizes the images using channel-wise mean and standard deviations (RGB channels), consistent with ImageNet standards
  - Mean: [0.485, 0.456, 0.406]
  - Standard Deviation: [0.229, 0.224, 0.225]
  - Normalized Pixel Value $= \frac{\text{Pixel Value} - \text{Mean}}{\text{Standard Deviation}}$
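
As a worked example of the normalization formula, the following minimal sketch applies the ImageNet statistics listed above to a single 8-bit RGB pixel. The helper name `normalize_pixel` is hypothetical and only illustrates the arithmetic; in the project this step is done by ToTensor() and Normalize().

```python
# ImageNet channel statistics quoted above (RGB order).
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb_uint8):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    scaled = [value / 255.0 for value in rgb_uint8]  # same scaling ToTensor() applies
    return [(s - m) / sd for s, m, sd in zip(scaled, MEAN, STD)]


white = normalize_pixel([255, 255, 255])  # every channel at its maximum
black = normalize_pixel([0, 0, 0])        # every channel at its minimum
print(white, black)
```

After normalization the values are no longer confined to [0, 1]: a white pixel maps to roughly 2.2 to 2.6 per channel and a black pixel to roughly -2.1 to -1.8.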

Data Augmentation

RandomHorizontalFlip(): randomly flips some of the images in the batches
ColorJitter(): randomly changes image brightness, contrast, saturation, and hue.
RandomAffine(): applies random affine transformations such as rotation, scaling, and translation

Model Architecture

A multi-task (auxiliary-task) learning approach to the classification problem, in which the network learns both
- the class label from the image features (a PyTorch tensor of pixel values): single-label, and
- the predicate features (semantic attributes of the pictured animal): multi-label
Five convolutional layers (applied to image tensor):
- Convolution
- Batch Normalization: to stabilize learning
- ReLU Activation: to introduce non-linearity to the model
- Maximum Pooling: to reduce the dimensionality of the images
Fully connected layer for classification by image features
- With ReLU activation and Dropout for generalization
Fully connected layers for predicates (applied to the same feature tensor after the convolutions)
- 2 x (ReLU and Dropout), followed by a final linear layer with num_predicates outputs to compare against the predicates
Returns both the predicate outputs and the class outputs for use in training and other computations
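
The architecture described above might look roughly like the sketch below. The class name `AnimalNet`, the channel widths, kernel size, hidden size, dropout rate, and the choice of 50 output classes are all assumptions made for illustration; the actual definition lives in model_definition.py and may differ.

```python
import torch
import torch.nn as nn


class AnimalNet(nn.Module):
    """Five conv blocks followed by a class head and a predicate head (sketch)."""

    def __init__(self, num_classes=50, num_predicates=85,
                 channels=(16, 32, 64, 128, 256), hidden=256, dropout=0.5):
        super().__init__()
        blocks, in_ch = [], 3
        for out_ch in channels:
            blocks += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),       # stabilizes learning
                nn.ReLU(inplace=True),        # non-linearity
                nn.MaxPool2d(kernel_size=2),  # halves height and width
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        flat_dim = channels[-1] * 7 * 7  # 224 / 2**5 = 7 after five poolings
        # Single-label head: class logits.
        self.class_head = nn.Sequential(
            nn.Linear(flat_dim, hidden), nn.ReLU(inplace=True), nn.Dropout(dropout),
            nn.Linear(hidden, num_classes),
        )
        # Multi-label head: one logit per predicate.
        self.predicate_head = nn.Sequential(
            nn.Linear(flat_dim, hidden), nn.ReLU(inplace=True), nn.Dropout(dropout),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True), nn.Dropout(dropout),
            nn.Linear(hidden, num_predicates),
        )

    def forward(self, images):
        x = torch.flatten(self.features(images), start_dim=1)
        return self.class_head(x), self.predicate_head(x)


model = AnimalNet()
class_logits, predicate_logits = model(torch.randn(2, 3, 224, 224))
```

Both heads read the same flattened convolutional features, so the predicate task acts as an auxiliary signal that shapes the shared feature extractor.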

Training Setup

Weight and Bias Initialization
- Weights are initialized with kaiming_normal, which aims to keep the variance of activations roughly constant across layers and so reduces the risk of vanishing or exploding gradients during training
- All biases are set to zero
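
A minimal sketch of this initialization is shown below. The function name `init_weights` and the small stand-in network are hypothetical, and the default `fan_in` mode with `nonlinearity="relu"` is an assumption about the settings.

```python
import torch.nn as nn


def init_weights(module):
    """Kaiming-normal weights and zero biases for conv and linear layers."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)


# Stand-in network; in practice apply() would be called on the full model.
layers = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.ReLU(),
                       nn.Conv2d(8, 16, kernel_size=3))
layers.apply(init_weights)  # apply() visits every submodule recursively
```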
Optimizer
- Adam Optimizer: Adaptive Moment Estimation, which combines ideas from gradient descent with momentum and RMSProp.
- Learning Rate = 1e-3
- Weight Decay = 1e-4
Learning Rate Scheduler
- OneCycleLR(): anneals the learning rate from an initial value up to a maximum, and then from that maximum down to a minimum much lower than the initial value.
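
The optimizer and scheduler can be wired together as in the sketch below, which also records the learning rate at each step to show the one-cycle shape. Treating 1e-3 as the scheduler's `max_lr`, the stand-in model, and the value of `STEPS_PER_EPOCH` are assumptions; the other OneCycleLR arguments are left at PyTorch defaults.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the real network
EPOCHS = 30
STEPS_PER_EPOCH = 5       # hypothetical; normally len(train_loader)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, epochs=EPOCHS, steps_per_epoch=STEPS_PER_EPOCH
)

lrs = []
for _ in range(EPOCHS * STEPS_PER_EPOCH):
    lrs.append(optimizer.param_groups[0]["lr"])
    # A real loop would compute the loss and call loss.backward() here.
    optimizer.step()
    scheduler.step()  # OneCycleLR is stepped once per batch, not once per epoch
```

With the default settings the schedule starts at `max_lr / 25`, peaks at `max_lr` about 30% of the way through, and ends far below the starting value.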
Loss Functions
- CrossEntropyLoss(): for the class labels
  - CrossEntropyLoss(reduction = 'mean') $= -\frac{\sum_{i=1}^N w_{y_i}\log\left(\frac{\exp(x_{i, y_i})}{\sum_{c=1}^C \exp(x_{i, c})}\right)}{\sum_{i=1}^N w_{y_i}}$
  - (Equivalent to applying LogSoftmax and then Negative Log Likelihood Loss)
- BinaryCrossEntropyLoss(): for the predicates predicted by the model
  - $BCE = -\frac{1}{N}\sum_{i=1}^N w_i\left(y_i\log(p(y_i))+(1-y_i)\log(1-p(y_i))\right)$
- lossfn_alignment() (custom loss function): aligns the predicted class output with the predicates of the true class.
  - takes the dot product of the model's predicted class tensor with the predicate matrix, and
  - computes the MSE loss between this projection and the true class's predicates from the matrix
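
The three loss terms could be combined as in the sketch below. Several details are assumptions: softmax is applied to the class output before the projection, the predicate head emits logits (so the logits form of binary cross-entropy is used), and the terms are summed with hypothetical weights `w_pred` and `w_align`. The function names are illustrative and are not the project's lossfn_alignment().

```python
import torch
import torch.nn.functional as F


def alignment_loss(class_logits, labels, predicate_matrix):
    """MSE between the class-probability projection and the true class predicates."""
    class_probs = F.softmax(class_logits, dim=1)  # (B, C); softmax is an assumption
    projected = class_probs @ predicate_matrix    # (B, P) expected predicate vector
    target = predicate_matrix[labels]             # (B, P) predicates of the true class
    return F.mse_loss(projected, target)


def total_loss(class_logits, predicate_logits, labels, predicate_matrix,
               w_pred=1.0, w_align=1.0):
    """Cross-entropy + weighted BCE on predicates + weighted alignment term."""
    ce = F.cross_entropy(class_logits, labels)
    bce = F.binary_cross_entropy_with_logits(predicate_logits, predicate_matrix[labels])
    align = alignment_loss(class_logits, labels, predicate_matrix)
    return ce + w_pred * bce + w_align * align


# Toy example: 5 classes, 8 binary predicates, batch of 2.
torch.manual_seed(0)
predicate_matrix = (torch.rand(5, 8) > 0.5).float()
labels = torch.tensor([0, 3])
loss = total_loss(torch.randn(2, 5), torch.randn(2, 8), labels, predicate_matrix)
```

When the class probabilities are concentrated on the true class, the projection reproduces that class's predicate row and the alignment term approaches zero.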
Number of Epochs
- 30
Hardware
- Kaggle GPU P100

Testing Setup

Uses a simplified zero-shot learning approach:
- compute class probabilities by applying softmax to the class outputs of the model; the most probable class = predicted_class1
- compare the predicted predicates with each class's row of the predicate matrix using a dynamically weighted combination of Intersection over Union and Cosine Similarity
- the class with the highest similarity = predicted_class2
- if predicted_class2 is one of the unseen classes, actual predicted class = predicted_class2
- if not,
  - if the probability of that class is greater than a confidence threshold, actual predicted class = predicted_class2
  - if not, actual predicted class = predicted_class1
- return the actual predicted class
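
The decision rule above can be sketched in plain Python as follows. This is a simplification under stated assumptions: the similarity weight `iou_weight` is fixed rather than dynamic, predicates are binarized at 0.5 for the IoU term, "that class" is read as predicted_class2, and the threshold value and function names are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over Union of two binary vectors."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0


def cosine(a, b):
    """Cosine similarity of two real-valued vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(class_logits, predicate_probs, predicate_matrix, unseen,
            iou_weight=0.5, threshold=0.5):
    probs = softmax(class_logits)
    class1 = max(range(len(probs)), key=probs.__getitem__)
    binary = [1 if p >= 0.5 else 0 for p in predicate_probs]
    scores = [iou_weight * iou(binary, row)
              + (1 - iou_weight) * cosine(predicate_probs, row)
              for row in predicate_matrix]
    class2 = max(range(len(scores)), key=scores.__getitem__)
    if class2 in unseen:           # predicates point to an unseen class
        return class2
    if probs[class2] > threshold:  # classifier is confident in class2
        return class2
    return class1                  # otherwise trust the classifier's top class


# Toy setup: 3 classes, 4 predicates, class 2 never seen in training.
matrix = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]
logits = [2.0, 0.1, -1.0]
pick_unseen = predict(logits, [0.9, 0.8, 0.1, 0.2], matrix, unseen={2})
pick_agree = predict(logits, [0.9, 0.1, 0.8, 0.2], matrix, unseen={2})
pick_fallback = predict(logits, [0.1, 0.9, 0.2, 0.8], matrix, unseen={2})
```

In the toy setup the first call returns the unseen class 2, the second returns class 0 because both signals agree, and the third falls back to class 0 because the predicate match (class 1) has low classifier probability.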

Results

Training Accuracy: ~70%
Validation and Test Accuracy: ~30-40%

Process and Challenges Encountered

Initial implementation resulted in low accuracy (~10% on the test set). Subsequent iterations improved results through:
- Data augmentation techniques.
- Optimizer, scheduler, and loss function refinements.
A pivotal change involved predicting predicates separately and introducing alignment loss, which improved validation accuracy to ~40%.
Challenges faced:
- Proper integration of the predicate matrix.
- Computational overhead due to added complexity.
- Limited understanding of confusion matrices and class-specific optimizations.
Future directions include refining hyperparameters, exploring pretrained models, and explicitly implementing zero-shot learning.

Explainability Report

Much as a human would identify animals, the network processes images in stages:

Edges and textures: The first layer identifies simple patterns like edges or corners, similar to how humans see outlines.
Complex structures: As the image passes through deeper layers, the model combines simple patterns to detect more complex features like a beak, claws, or fur patterns.
Pooling: After extracting features, pooling layers condense the information into the most noticeable features (e.g., a lion's mane or an elephant’s trunk) while discarding unnecessary detail.

The performance of the model indicates that it does extract meaningful features for seen classes. These features likely include:

Distinct patterns like stripes for zebras.
Shape-based traits like trunks for elephants.
Texture-related features like fur or scales.

The model achieves reasonable performance on unseen classes due to the semantic predicates (binary matrix) guiding the predictions.

The predicate “stripes” helps identify unseen animals like okapis or other striped species.
The predicate “flys” aids in distinguishing birds from terrestrial animals.

The gap between training accuracy and validation/test accuracy suggests that the predicates and image features need to be better aligned for optimal performance, which leaves scope for future development.

Learning Outcomes

Basic implementation of a neural network
Mathematics behind a deep learning model and the training algorithm: backpropagation
Tradeoff between accuracy, complexity, and computational overhead: fine-tuning the parameters, iterative problem solving techniques
Zero-shot learning concepts

Conclusion

While the model demonstrates low-to-moderate accuracy, it underscores the importance of experimentation and theoretical grounding in building custom architectures. Model performance is likely limited by hyperparameter choices and the absence of pretrained architectures. Future work could involve hyperparameter tuning, incorporating PCA, leveraging embeddings for semantic relationships, and explicitly implementing zero-shot learning.

Reference

This is a submission to the Pixel Play Challenge by Vision and Language Group, IIT Roorkee. The problem statement and referenced data may be found on Kaggle.

Info

The drafts folder contains earlier versions of the code that were iterated on to reach the final one. Feel free to ignore them.
