A systematic benchmarking study comparing CNNs and Vision Transformers on CIFAR-10 with implications for architecture selection in medical imaging.
Choosing between CNNs and Vision Transformers is one of the most consequential decisions in modern medical imaging AI. This project provides a rigorous empirical comparison of ResNet-18 and ViT under three training conditions, deriving practical insights for small-dataset medical imaging contexts.
| Model | Training Approach | Best Test Accuracy | Epochs |
|---|---|---|---|
| ResNet-18 | From scratch | 90.48% | 100 |
| ViT (from scratch) | From scratch | 83.39% | 40 |
| ViT (from scratch, extended) | Extended training | 83.69% | 60 |
| ViT-S/16 (transfer learning) | ImageNet → CIFAR-10 | 98.75% | 30 |
Key Finding: CNNs outperform ViTs on small datasets when trained from scratch due to built-in inductive biases (locality, translation equivariance). However, ViT with ImageNet transfer learning surpasses CNNs by a wide margin and converges in fewer epochs.
| Class | ResNet-18 | ViT (scratch) | ViT (transfer) |
|---|---|---|---|
| Plane | 86.40% | 86.00% | 99.40% |
| Car | 96.70% | 91.60% | 99.30% |
| Bird | 88.30% | 78.10% | 98.70% |
| Cat | 84.40% | 67.20% | 96.40% |
| Deer | 94.70% | 81.40% | 99.20% |
| Dog | 80.90% | 76.40% | 98.10% |
| Frog | 91.80% | 87.90% | 99.60% |
| Horse | 94.40% | 87.20% | 99.00% |
| Ship | 95.00% | 89.40% | 99.80% |
| Truck | 92.20% | 88.70% | 97.90% |
Notable pattern: Cat and Dog are among the hardest classes for every model (with transfer learning, Truck falls slightly below Dog), but transfer learning closes the gap dramatically (67.2% → 96.4% for Cat). This mirrors the challenge of fine-grained visual distinctions in medical imaging (e.g., tumor grading).
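The per-class pattern can be checked directly from the table. The snippet below is a minimal sketch that copies the table values into a dictionary and ranks classes by accuracy for each model; the names `per_class` and `hardest` are illustrative and do not come from the notebooks.

```python
# Per-class test accuracy (%) copied from the table above.
# Tuple order: (ResNet-18, ViT from scratch, ViT transfer learning).
per_class = {
    "Plane": (86.40, 86.00, 99.40),
    "Car":   (96.70, 91.60, 99.30),
    "Bird":  (88.30, 78.10, 98.70),
    "Cat":   (84.40, 67.20, 96.40),
    "Deer":  (94.70, 81.40, 99.20),
    "Dog":   (80.90, 76.40, 98.10),
    "Frog":  (91.80, 87.90, 99.60),
    "Horse": (94.40, 87.20, 99.00),
    "Ship":  (95.00, 89.40, 99.80),
    "Truck": (92.20, 88.70, 97.90),
}
models = ("ResNet-18", "ViT (scratch)", "ViT (transfer)")


def hardest(model_idx, k=2):
    """Return the k classes with the lowest accuracy for one model."""
    ranked = sorted(per_class, key=lambda c: per_class[c][model_idx])
    return ranked[:k]


for i, name in enumerate(models):
    print(name, hardest(i))

# Gain on the Cat class from ImageNet pretraining, in percentage points.
cat_gain = per_class["Cat"][2] - per_class["Cat"][1]
print(f"Cat gain from transfer learning: {cat_gain:.1f} points")
```

Cat ranks lowest or second lowest for all three models, which is what makes it a useful probe for fine-grained confusion.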
Medical datasets are typically small and expensive to label. This study suggests:
- When labeled data is limited and no pretrained weights are available: CNNs such as ResNet are the more reliable choice when trained from scratch, because ViT's lack of built-in inductive biases becomes a liability.
- When pretrained weights are available: ViT with transfer learning is the stronger choice, even when fine-tuning on a target dataset much smaller than the pretraining set.
- Transfer learning is essential, not optional, for Transformers on small datasets: extending ViT training from 40 to 60 epochs gained only 0.30 percentage points, while transfer learning gained 15.36 percentage points.
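The gains quoted in the last bullet follow from the best test accuracies in the results table. As a small worked check (the dictionary keys and the `gain` helper are illustrative names, not taken from the notebooks):

```python
# Best test accuracy (%) per run, copied from the results table.
best_acc = {
    "resnet18_scratch": 90.48,
    "vit_scratch_40ep": 83.39,
    "vit_scratch_60ep": 83.69,
    "vit_s16_transfer": 98.75,
}


def gain(baseline, candidate):
    """Difference in percentage points, rounded to two decimals."""
    return round(best_acc[candidate] - best_acc[baseline], 2)


extended_gain = gain("vit_scratch_40ep", "vit_scratch_60ep")
transfer_gain = gain("vit_scratch_40ep", "vit_s16_transfer")
transfer_vs_cnn = gain("resnet18_scratch", "vit_s16_transfer")

print(f"20 extra epochs:      +{extended_gain:.2f} points")
print(f"ImageNet pretraining: +{transfer_gain:.2f} points")
print(f"Transfer ViT vs CNN:  +{transfer_vs_cnn:.2f} points")
```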
- Architecture: Custom ResNet-18 with BasicBlock residual connections (11.17M parameters)
- Optimizer: SGD (lr=0.1, momentum=0.9, weight_decay=5e-4)
- LR Schedule: Cosine annealing over 100 epochs
- Augmentation: Random crop (32×32, padding=4) + random horizontal flip
- Training time: ~73 minutes (GPU)
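The ResNet-18 optimizer and schedule listed above could be set up in PyTorch roughly as follows. This is a sketch under stated assumptions, not the notebook's code: a small `nn.Linear` stands in for the custom ResNet-18, the scheduler is assumed to step once per epoch, and the minimum learning rate is left at PyTorch's default of 0.

```python
import torch
from torch import nn

EPOCHS = 100

# Stand-in model (assumption); the custom ResNet-18 would go here.
model = nn.Linear(8, 10)

# Hyperparameters as listed above.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4
)
# Cosine annealing over the full run; eta_min defaults to 0.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

lrs = []
for epoch in range(EPOCHS):
    lrs.append(optimizer.param_groups[0]["lr"])
    # ... one epoch of training would run here ...
    optimizer.step()   # no gradients in this sketch, so parameters are unchanged
    scheduler.step()   # assumed: one scheduler step per epoch

print(f"epoch 0: {lrs[0]:.4f}, epoch 50: {lrs[50]:.4f}, epoch 99: {lrs[99]:.6f}")
```

Under these assumptions the learning rate starts at 0.1, passes through half that value at the midpoint, and decays close to zero by the final epoch.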
- Architecture: Custom Vision Transformer with patch embeddings (4×4 patches on 32×32 input)
- Optimizer: AdamW (lr=0.001, weight_decay=0.05)
- LR Schedule: Cosine annealing with 5-epoch linear warmup
- Training: 40 epochs initial → resumed for 20 additional epochs with lower LR
- Training time: ~35 min (40 epochs) + ~18 min (20 epochs)
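One way to read "cosine annealing with 5-epoch linear warmup" is the per-epoch schedule sketched below in plain Python. Several details are assumptions because the settings above do not specify them: the warmup ramps from `base_lr / 5` up to `base_lr`, the cosine phase decays to a minimum of 0, and the rate is updated once per epoch. The function name `lr_at_epoch` is hypothetical.

```python
import math


def lr_at_epoch(epoch, base_lr=1e-3, warmup_epochs=5, total_epochs=40, min_lr=0.0):
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        # Linear ramp that reaches base_lr on the last warmup epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warmup phase already completed, in [0, 1).
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


schedule = [lr_at_epoch(e) for e in range(40)]
for e in (0, 4, 5, 20, 39):
    print(f"epoch {e:2d}: lr = {schedule[e]:.6f}")
```

The warmup keeps early updates small while the attention layers are still poorly conditioned, which is a common reason for pairing AdamW with warmup when training Transformers from scratch.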
- Architecture: ViT-S/16 (pretrained on ImageNet-1K, via timm)
- Input: Resized to 224×224 (ViT's native resolution)
- Optimizer: AdamW (lr=0.0001, weight_decay=0.01)
- Augmentation: AutoAugment (CIFAR-10 policy) + RandomErasing
- Training time: ~262 minutes (30 epochs)
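The two ViT setups see very different sequence lengths, which follows from the patch and input sizes given above. A short worked calculation (the helper name `num_patches` is illustrative; any class token a model adds would increase each count by one):

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    return per_side * per_side


scratch_tokens = num_patches(32, 4)     # custom ViT: 4x4 patches on 32x32 input
transfer_tokens = num_patches(224, 16)  # ViT-S/16: 16x16 patches on 224x224 input

# Self-attention cost grows with the square of the sequence length.
attention_ratio = (transfer_tokens / scratch_tokens) ** 2

print(scratch_tokens, transfer_tokens, round(attention_ratio, 2))
```

The longer sequence, together with the larger input images, may partly account for the longer training time of the transfer-learning run, although model size and augmentation also differ between the two setups.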
resnet-vit-analysis/
│
├── notebooks/
│ ├── 01_ResNet18_from_scratch.ipynb # ResNet-18 training + evaluation
│ ├── 02_ViT_from_scratch.ipynb # ViT training (40 + 20 epochs)
│ └── 03_ViT_transfer_learning.ipynb # ViT-S/16 ImageNet fine-tuning
├── requirements.txt
└── README.md
git clone https://github.com/JananiV3010/resnet-vit-analysis.git
cd resnet-vit-analysis
pip install -r requirements.txt

Open any notebook in Google Colab (GPU recommended) or Jupyter:

jupyter notebook notebooks/

Python · PyTorch · timm · CIFAR-10 · ResNet-18 · Vision Transformer (ViT) · ImageNet · scikit-learn · Google Colab

Janani Vaiyapuriappan, MSE Biomedical Engineering, Johns Hopkins University
LinkedIn · GitHub
Course project — Machine Perception, Johns Hopkins University (Fall 2025)