Deep Learning Architecture Analysis: ResNet-18 vs. Vision Transformers (ViT)

A systematic benchmarking study comparing CNNs and Vision Transformers on CIFAR-10 with implications for architecture selection in medical imaging.


Overview

Choosing between CNNs and Vision Transformers is one of the more consequential design decisions in medical imaging AI. This project compares ResNet-18 and ViT empirically on CIFAR-10 under three training conditions (ResNet-18 from scratch, ViT from scratch, and ViT fine-tuned from ImageNet weights) and derives practical insights for small-dataset medical imaging contexts.


Results

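| Model                        | Training Approach   | Best Test Accuracy | Epochs |
|------------------------------|---------------------|--------------------|--------|
| ResNet-18                    | From scratch        | 90.48%             | 100    |
| ViT (from scratch)           | From scratch        | 83.39%             | 40     |
| ViT (from scratch, extended) | Extended training   | 83.69%             | 60     |
| ViT-S/16 (transfer learning) | ImageNet → CIFAR-10 | 98.75%             | 30     |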

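Key Finding: When trained from scratch on a small dataset, the CNN outperforms the ViT (90.48% vs. 83.39%), a gap usually attributed to the CNN's built-in inductive biases (locality, translation equivariance). With ImageNet transfer learning, however, ViT surpasses the CNN by 8.27 percentage points (98.75% vs. 90.48%) and needs fewer epochs to get there (30 vs. 100), although at a higher wall-clock cost (~262 vs. ~73 minutes of training).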
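
To make "translation equivariance" concrete, the following is a minimal pure-Python sketch, not code from the notebooks. It applies a small 1D filter with wrap-around padding and checks that shifting the input and then filtering gives the same result as filtering and then shifting. The helper names `circular_conv1d` and `shift`, the toy signal, and the kernel are all illustrative assumptions.

```python
def circular_conv1d(signal, kernel):
    """Slide `kernel` over `signal` with wrap-around (circular) padding."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def shift(signal, k):
    """Cyclically shift `signal` to the right by k positions."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


# Toy input and a small local filter (each output sees only 3 neighbours).
signal = [0, 1, 3, 0, 0, 2, 0, 0]
kernel = [1, -1, 2]

shifted_then_filtered = circular_conv1d(shift(signal, 2), kernel)
filtered_then_shifted = shift(circular_conv1d(signal, kernel), 2)

# Equivariance: both orders of operation agree.
print(shifted_then_filtered == filtered_then_shifted)
```

A convolutional layer gets this property from weight sharing alone, whereas a ViT with learned position embeddings has to learn it from data, which is one common explanation for the from-scratch gap on small datasets.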


Per-Class Accuracy Comparison

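| Class | ResNet-18 | ViT (scratch) | ViT (transfer) |
|-------|-----------|---------------|----------------|
| Plane | 86.40%    | 86.00%        | 99.40%         |
| Car   | 96.70%    | 91.60%        | 99.30%         |
| Bird  | 88.30%    | 78.10%        | 98.70%         |
| Cat   | 84.40%    | 67.20%        | 96.40%         |
| Deer  | 94.70%    | 81.40%        | 99.20%         |
| Dog   | 80.90%    | 76.40%        | 98.10%         |
| Frog  | 91.80%    | 87.90%        | 99.60%         |
| Horse | 94.40%    | 87.20%        | 99.00%         |
| Ship  | 95.00%    | 89.40%        | 99.80%         |
| Truck | 92.20%    | 88.70%        | 97.90%         |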

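Notable pattern: Cat is among the two hardest classes for every model, and Cat and Dog are the two hardest classes for both from-scratch models. With transfer learning, Cat remains the hardest class (96.40%), followed by Truck (97.90%) and Dog (98.10%), but the gap to the other classes closes dramatically (67.2% → 96.4% for Cat relative to ViT from scratch). This mirrors the challenge of fine-grained visual distinctions in medical imaging (e.g., tumor grading).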
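
The per-class table can be checked against the headline numbers with a few lines of Python. This is a sketch over the values transcribed from the table above, not code from the notebooks. It relies on the fact that the CIFAR-10 test set has 1,000 images per class, so the unweighted mean of the per-class accuracies equals the overall test accuracy of the same checkpoint.

```python
# Per-class accuracy (%), transcribed from the table above:
# (ResNet-18, ViT scratch, ViT transfer)
per_class = {
    "Plane": (86.40, 86.00, 99.40),
    "Car":   (96.70, 91.60, 99.30),
    "Bird":  (88.30, 78.10, 98.70),
    "Cat":   (84.40, 67.20, 96.40),
    "Deer":  (94.70, 81.40, 99.20),
    "Dog":   (80.90, 76.40, 98.10),
    "Frog":  (91.80, 87.90, 99.60),
    "Horse": (94.40, 87.20, 99.00),
    "Ship":  (95.00, 89.40, 99.80),
    "Truck": (92.20, 88.70, 97.90),
}
models = ("ResNet-18", "ViT (scratch)", "ViT (transfer)")

summary = {}
for idx, model in enumerate(models):
    scores = {cls: acc[idx] for cls, acc in per_class.items()}
    # Balanced test set, so the plain mean is the overall accuracy.
    mean_acc = round(sum(scores.values()) / len(scores), 2)
    # Two classes with the lowest accuracy for this model.
    hardest = sorted(scores, key=scores.get)[:2]
    summary[model] = (mean_acc, hardest)
    print(f"{model}: mean {mean_acc:.2f}%, two hardest classes: {hardest}")
```

The means for the two from-scratch columns reproduce the reported 90.48% and 83.39%. The transfer-learning column averages to within 0.01 percentage points of the reported best accuracy, which may simply mean the per-class figures were taken at a slightly different checkpoint.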


Implications for Medical Imaging

Medical datasets are typically small and expensive to label. This study suggests:

  • Limited labeled data and no pretrained weights available: CNNs such as ResNet are the more reliable choice when training from scratch, because ViT's lack of built-in inductive biases becomes a liability.
  • Pretrained weights available: ViT with transfer learning is the stronger choice, even when fine-tuning on a much smaller target dataset.
  • Transfer learning is essential, not optional, for Transformers on small datasets: extending ViT training from 40 to 60 epochs improved accuracy by only 0.30 percentage points, whereas transfer learning improved it by 15.36 percentage points (83.39% → 98.75%).
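
The gains quoted in the last bullet follow directly from the Results table. The sketch below recomputes them as differences in percentage points; the dictionary keys and the helper `gain_pp` are names chosen here for illustration.

```python
# Best test accuracy (%) per run, from the Results table.
best_acc = {
    "resnet18_scratch": 90.48,
    "vit_scratch_40ep": 83.39,
    "vit_scratch_60ep": 83.69,
    "vit_transfer": 98.75,
}


def gain_pp(new, baseline):
    """Accuracy difference in percentage points (not relative percent)."""
    return round(best_acc[new] - best_acc[baseline], 2)


longer_training = gain_pp("vit_scratch_60ep", "vit_scratch_40ep")
transfer_vs_vit = gain_pp("vit_transfer", "vit_scratch_40ep")
transfer_vs_cnn = gain_pp("vit_transfer", "resnet18_scratch")

print(f"20 extra epochs:        +{longer_training:.2f} pp")
print(f"transfer vs. ViT:       +{transfer_vs_vit:.2f} pp")
print(f"transfer vs. ResNet-18: +{transfer_vs_cnn:.2f} pp")
```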

Experiment Design

Experiment 1: ResNet-18 from Scratch

  • Architecture: Custom ResNet-18 with BasicBlock residual connections (11.17M parameters)
  • Optimizer: SGD (lr=0.1, momentum=0.9, weight_decay=5e-4)
  • LR Schedule: Cosine annealing over 100 epochs
  • Augmentation: Random crop (32×32, padding=4) + random horizontal flip
  • Training time: ~73 minutes (GPU)
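
As a sanity check on the 11.17M figure, the parameter count can be derived by hand. The sketch below assumes the widely used CIFAR adaptation of ResNet-18: a 3×3 stem with 64 channels, no max-pool, bias-free convolutions, BatchNorm with a learnable scale and shift, 1×1 projection shortcuts wherever the channel count changes, and a 10-way linear classifier. The README only says "custom", so this layout is an assumption rather than a description of the notebook's model.

```python
def conv_params(c_in, c_out, k):
    """Weights of a bias-free k x k convolution."""
    return c_in * c_out * k * k


def bn_params(channels):
    """Learnable scale and shift of a BatchNorm layer."""
    return 2 * channels


def basic_block_params(c_in, c_out, projection):
    """Two 3x3 conv+BN pairs, plus an optional 1x1 projection shortcut."""
    total = conv_params(c_in, c_out, 3) + bn_params(c_out)
    total += conv_params(c_out, c_out, 3) + bn_params(c_out)
    if projection:
        total += conv_params(c_in, c_out, 1) + bn_params(c_out)
    return total


def resnet18_cifar_params(num_classes=10):
    total = conv_params(3, 64, 3) + bn_params(64)  # assumed 3x3 stem
    c_in = 64
    for c_out in (64, 128, 256, 512):  # four stages, two BasicBlocks each
        total += basic_block_params(c_in, c_out, projection=(c_in != c_out))
        total += basic_block_params(c_out, c_out, projection=False)
        c_in = c_out
    total += 512 * num_classes + num_classes  # linear classifier with bias
    return total


total = resnet18_cifar_params()
print(f"{total:,} parameters (~{total / 1e6:.2f}M)")
```

Under these assumptions the count rounds to the 11.17M quoted above, which suggests (but does not confirm) that the custom model follows this layout.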

Experiment 2: ViT from Scratch

  • Architecture: Custom Vision Transformer with patch embeddings (4×4 patches on 32×32 input)
  • Optimizer: AdamW (lr=0.001, weight_decay=0.05)
  • LR Schedule: Cosine annealing with 5-epoch linear warmup
  • Training: 40 epochs initial → resumed for 20 additional epochs with lower LR
  • Training time: ~35 min (40 epochs) + ~18 min (20 epochs)
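
The learning-rate schedule for the initial 40-epoch run can be sketched in plain Python. The README gives the base LR (0.001), the warmup length (5 epochs), and the cosine shape; the details below are assumptions: the schedule is stepped once per epoch, warmup ramps from `base_lr / 5` up to `base_lr`, and the cosine phase decays to a `min_lr` of 0. The function name `lr_at_epoch` is hypothetical.

```python
import math


def lr_at_epoch(epoch, base_lr=1e-3, warmup_epochs=5, total_epochs=40, min_lr=0.0):
    """Linear warmup followed by cosine annealing, stepped once per epoch."""
    if epoch < warmup_epochs:
        # Ramp linearly so the last warmup epoch reaches base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the cosine phase completed, in [0, 1).
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


schedule = [lr_at_epoch(e) for e in range(40)]
for e in (0, 4, 5, 20, 39):
    print(f"epoch {e:2d}: lr = {schedule[e]:.6f}")
```

Warmup matters more for Transformers than for the SGD-trained ResNet because large early AdamW updates can destabilise attention layers before the normalisation statistics settle.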

Experiment 3: ViT with ImageNet Transfer Learning

  • Architecture: ViT-S/16 (pretrained on ImageNet-1K, via timm)
  • Input: CIFAR-10 images upsampled from 32×32 to 224×224 (the input resolution of the pretrained ViT-S/16)
  • Optimizer: AdamW (lr=0.0001, weight_decay=0.01)
  • Augmentation: AutoAugment (CIFAR-10 policy) + RandomErasing
  • Training time: ~262 minutes (30 epochs)
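
The two ViT configurations see very different sequence lengths, which the arithmetic below makes explicit. It is a sketch assuming RGB input and one class token per sequence; the class token is standard for ViT-S/16 but is an assumption for the custom from-scratch model. The helper `vit_tokens` is a name chosen here.

```python
def vit_tokens(image_size, patch_size, channels=3, cls_token=True):
    """Sequence length and flattened patch size for a square-input ViT."""
    assert image_size % patch_size == 0, "patches must tile the image exactly"
    per_side = image_size // patch_size
    num_patches = per_side ** 2
    return {
        "patches": num_patches,
        "tokens": num_patches + (1 if cls_token else 0),
        "patch_dim": patch_size * patch_size * channels,
    }


scratch = vit_tokens(32, 4)      # Experiment 2: 4x4 patches on 32x32 input
transfer = vit_tokens(224, 16)   # Experiment 3: 16x16 patches on 224x224 input

# Self-attention builds a tokens x tokens score matrix per head.
attn_ratio = (transfer["tokens"] / scratch["tokens"]) ** 2

print(scratch)
print(transfer)
print(f"attention matrix is ~{attn_ratio:.1f}x larger per head")
```

The longer sequence, together with the 49× larger input image, is likely part of the reason the 30-epoch fine-tuning run takes longer than either from-scratch run, although model size differences also play a role.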

Repository Structure

resnet-vit-analysis/
│
├── notebooks/
│   ├── 01_ResNet18_from_scratch.ipynb        # ResNet-18 training + evaluation
│   ├── 02_ViT_from_scratch.ipynb             # ViT training (40 + 20 epochs)
│   └── 03_ViT_transfer_learning.ipynb        # ViT-S/16 ImageNet fine-tuning
├── requirements.txt
└── README.md

Getting Started

git clone https://github.com/JananiV3010/resnet-vit-analysis.git
cd resnet-vit-analysis
pip install -r requirements.txt

Open any notebook in Google Colab (GPU recommended) or Jupyter:

jupyter notebook notebooks/

Tech Stack

Python · PyTorch · timm · CIFAR-10 · ResNet-18 · Vision Transformer (ViT) · ImageNet · scikit-learn · Google Colab


Author

Janani Vaiyapuriappan, MSE Biomedical Engineering, Johns Hopkins University
LinkedIn · GitHub


Course project — Machine Perception, Johns Hopkins University (Fall 2025)
