Deep Learning Architecture Analysis: ResNet-18 vs. Vision Transformers (ViT)

A systematic benchmarking study comparing CNNs and Vision Transformers on CIFAR-10 with implications for architecture selection in medical imaging.

Overview

Choosing between CNNs and Vision Transformers is a consequential design decision in medical imaging AI, where labeled data is scarce. This project compares ResNet-18 and ViT on CIFAR-10 across three experiments (ResNet-18 from scratch, ViT from scratch, and ViT with ImageNet transfer learning) and draws practical insights for small-dataset medical imaging contexts.

Results

Model	Training Approach	Best Test Accuracy	Epochs
ResNet-18	From scratch	90.48%	100
ViT (from scratch)	From scratch	83.39%	40
ViT (from scratch, extended)	From scratch, resumed for 20 more epochs	83.69%	60 (total)
ViT-S/16 (transfer learning)	ImageNet → CIFAR-10	98.75%	30

Key Finding: Trained from scratch on CIFAR-10, ResNet-18 outperforms ViT (90.48% vs. 83.69%), which is consistent with the CNN's built-in inductive biases (locality, translation equivariance) helping on small datasets. With ImageNet transfer learning, however, ViT-S/16 surpasses ResNet-18 by a wide margin (98.75%) and needs fewer fine-tuning epochs (30 vs. 100).
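To make the term "translation equivariance" concrete, here is a minimal sketch (not taken from the notebooks) showing that a convolution commutes with a spatial shift: shifting the input and then convolving gives the same result as convolving and then shifting. The layer sizes, the circular padding (chosen so the property holds exactly at the borders), and the function name `shift_equivariance_holds` are illustrative assumptions.

```python
import torch
from torch import nn


def shift_equivariance_holds(shift: int = 2) -> bool:
    """Check conv(shift(x)) == shift(conv(x)) for a single conv layer."""
    torch.manual_seed(0)
    # Circular padding is an assumption made here so that the identity is
    # exact; with zero padding it only holds away from the image border.
    conv = nn.Conv2d(1, 1, kernel_size=3, padding=1,
                     padding_mode="circular", bias=False)
    x = torch.randn(1, 1, 8, 8)

    with torch.no_grad():
        shifted_then_conv = conv(torch.roll(x, shifts=shift, dims=3))
        conv_then_shifted = torch.roll(conv(x), shifts=shift, dims=3)

    return torch.allclose(shifted_then_conv, conv_then_shifted, atol=1e-5)


print(shift_equivariance_holds())
```

A ViT does not have this property built in: its patch tokens are combined with learned position embeddings, so spatial regularities of this kind have to be learned from data.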

Per-Class Accuracy Comparison

Class	ResNet-18	ViT (scratch)	ViT (transfer)
Plane	86.40%	86.00%	99.40%
Car	96.70%	91.60%	99.30%
Bird	88.30%	78.10%	98.70%
Cat	84.40%	67.20%	96.40%
Deer	94.70%	81.40%	99.20%
Dog	80.90%	76.40%	98.10%
Frog	91.80%	87.90%	99.60%
Horse	94.40%	87.20%	99.00%
Ship	95.00%	89.40%	99.80%
Truck	92.20%	88.70%	97.90%

Notable pattern: Cat is the hardest class for every model, and Dog is among the three lowest-scoring classes in each case. Transfer learning closes the gap dramatically (67.2% → 96.4% for Cat). This mirrors the challenge of fine-grained visual distinctions in medical imaging (e.g., tumor grading).
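The per-class table can be summarized with a few lines of Python. The sketch below copies the numbers from the table, computes the unweighted mean for each model, and lists the two lowest-scoring classes. Because the CIFAR-10 test set has 1,000 images per class, the unweighted mean of the per-class accuracies equals the overall test accuracy. The helper name `summarize` is illustrative.

```python
CLASSES = ["Plane", "Car", "Bird", "Cat", "Deer",
           "Dog", "Frog", "Horse", "Ship", "Truck"]

# Per-class test accuracy (%) copied from the table above.
PER_CLASS = {
    "ResNet-18":      [86.40, 96.70, 88.30, 84.40, 94.70,
                       80.90, 91.80, 94.40, 95.00, 92.20],
    "ViT (scratch)":  [86.00, 91.60, 78.10, 67.20, 81.40,
                       76.40, 87.90, 87.20, 89.40, 88.70],
    "ViT (transfer)": [99.40, 99.30, 98.70, 96.40, 99.20,
                       98.10, 99.60, 99.00, 99.80, 97.90],
}


def summarize(accuracies):
    """Return (mean accuracy, names of the two lowest-scoring classes)."""
    mean = round(sum(accuracies) / len(accuracies), 2)
    ranked = sorted(zip(accuracies, CLASSES))  # ascending by accuracy
    hardest = [name for _, name in ranked[:2]]
    return mean, hardest


for model, accuracies in PER_CLASS.items():
    mean, hardest = summarize(accuracies)
    print(f"{model:15s} mean={mean:.2f}%  hardest={hardest}")
```

The means come out at 90.48%, 83.39%, and 98.74%. The first two match the results table exactly. The third is 0.01 points below the reported 98.75%, possibly because the per-class numbers were taken from a different checkpoint than the best epoch (this is a guess, not something the source states).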

Implications for Medical Imaging

Medical datasets are typically small and expensive to label. This study suggests:

For limited labeled data with no pretrained weights available: CNNs (ResNet) are the more reliable choice when training from scratch, because ViT's lack of inductive biases becomes a liability.
When pretrained weights are available: ViT with transfer learning is the stronger choice, even when fine-tuning on a much smaller target dataset.
Transfer learning is not optional for Transformers on small datasets: it is essential. Extending ViT training from 40 to 60 epochs improved accuracy by only 0.30 percentage points, while transfer learning improved it by 15.36 percentage points (83.39% → 98.75%).
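The gains quoted above follow directly from the results table. As a small worked check, the sketch below recomputes them in percentage points; the dictionary keys and the helper `gain` are illustrative names.

```python
# Best test accuracy (%) from the results table.
RESULTS = {
    "resnet18_scratch": 90.48,
    "vit_scratch_40ep": 83.39,
    "vit_scratch_60ep": 83.69,
    "vit_transfer": 98.75,
}


def gain(before: str, after: str) -> float:
    """Accuracy difference in percentage points, rounded to two decimals."""
    return round(RESULTS[after] - RESULTS[before], 2)


print("20 extra epochs:   ", gain("vit_scratch_40ep", "vit_scratch_60ep"))
print("Transfer learning: ", gain("vit_scratch_40ep", "vit_transfer"))
print("ResNet-18 vs. ViT: ", gain("vit_scratch_60ep", "resnet18_scratch"))
```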

Experiment Design

Experiment 1: ResNet-18 from Scratch

Architecture: Custom ResNet-18 with BasicBlock residual connections (11.17M parameters)
Optimizer: SGD (lr=0.1, momentum=0.9, weight_decay=5e-4)
LR Schedule: Cosine annealing over 100 epochs
Augmentation: Random crop (32×32, padding=4) + random horizontal flip
Training time: ~73 minutes (GPU)

Experiment 2: ViT from Scratch

Architecture: Custom Vision Transformer with patch embeddings (4×4 patches on 32×32 input)
Optimizer: AdamW (lr=0.001, weight_decay=0.05)
LR Schedule: Cosine annealing with 5-epoch linear warmup
Training: 40 epochs initial → resumed for 20 additional epochs with lower LR
Training time: ~35 min (40 epochs) + ~18 min (20 epochs)
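Two parts of this setup are easy to illustrate in isolation: the patch embedding and the warmup-plus-cosine schedule. The sketch below is an assumption-laden illustration rather than the notebook's implementation. The embedding width (192), the class name `PatchEmbedding`, the helper names, and the choice to step the schedule per epoch and decay to zero are all hypothetical; only the 4×4 patches, 32×32 input, AdamW settings, and 5-epoch warmup come from the list above.

```python
import math

import torch
from torch import nn


class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and embed each one."""

    def __init__(self, img_size=32, patch_size=4, in_chans=3, embed_dim=192):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # (32 / 4)^2 = 64
        # A convolution with kernel == stride == patch size is equivalent to
        # flattening each patch and applying one shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, embed_dim, 8, 8)
        return x.flatten(2).transpose(1, 2)  # (B, 64, embed_dim)


def warmup_cosine(epoch, warmup_epochs=5, total_epochs=40):
    """LR multiplier: linear warmup, then cosine decay toward zero."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def make_vit_optim(model: nn.Module, total_epochs: int = 40):
    """AdamW with the listed hyperparameters and the schedule above."""
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=1e-3, weight_decay=0.05
    )
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda epoch: warmup_cosine(epoch, 5, total_epochs)
    )
    return optimizer, scheduler
```

With 4×4 patches, each 32×32 image becomes a sequence of 64 tokens. The same patch size applied to a 224×224 input would give 3,136 tokens, which is one reason small patches are practical here but larger patches are common at higher resolutions.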

Experiment 3: ViT with ImageNet Transfer Learning

Architecture: ViT-S/16 (pretrained on ImageNet-1K, via timm)
Input: CIFAR-10 images resized from 32×32 to 224×224 (the input resolution the pretrained ViT-S/16 expects)
Optimizer: AdamW (lr=0.0001, weight_decay=0.01)
Augmentation: AutoAugment (CIFAR-10 policy) + RandomErasing
Training time: ~262 minutes (30 epochs)

Repository Structure

resnet-vit-analysis/
│
├── notebooks/
│   ├── 01_ResNet18_from_scratch.ipynb        # ResNet-18 training + evaluation
│   ├── 02_ViT_from_scratch.ipynb             # ViT training (40 + 20 epochs)
│   └── 03_ViT_transfer_learning.ipynb        # ViT-S/16 ImageNet fine-tuning
├── requirements.txt
└── README.md

Getting Started

git clone https://github.com/JananiV3010/resnet-vit-analysis.git
cd resnet-vit-analysis
pip install -r requirements.txt

Open any notebook in Google Colab (GPU recommended) or Jupyter:

jupyter notebook notebooks/

Tech Stack

Python · PyTorch · timm · CIFAR-10 · ResNet-18 · Vision Transformer (ViT) · ImageNet · scikit-learn · Google Colab

Author

Janani Vaiyapuriappan, MSE Biomedical Engineering, Johns Hopkins University
LinkedIn · GitHub

Course project — Machine Perception, Johns Hopkins University (Fall 2025)
