Alexandr Yermakov & Duc Do - USRA 2025
Supervisor: Dr. Carson Leung
Database and Data Mining Lab - University of Manitoba
The RSNA Breast Cancer Detection dataset contains 54,706 DICOM mammograms from 11,913 patients, along with metadata about each patient and scan. Among these, only 1,158 scans (approximately 2%) were identified as containing cancer, distributed across 486 patients.
The main goal of this project is to identify cases of breast cancer. Our initial objective was to build a machine learning pipeline that achieves strong performance metrics. We focused on experimenting with different deep learning architectures such as ViT, ResNet50, and EfficientNetB0, and on trying several augmentation and regularization techniques. However, we later realized that this should not have been our main focus and that our approach should have been different (more on this later).
The dataset is highly imbalanced for a binary classification task: only about 2% of scans are positive. To address this, we tried several techniques, including oversampling, weighted sampling, and the use of external data. Moreover, cancer tumors are often not visible in many scans (especially at early stages), which makes this dataset particularly difficult. On the competition leaderboard, the top pF1 score was only around 0.55.
Eventually, our results were not good enough. On a test set with a similar distribution to the original RSNA dataset (2.1% positives), the single-scan model achieved a pF1 of around 15%, while the multi-view model (treating a breast as a unit rather than a single scan) achieved a pF1 of around 20%.
The code implementation can be found at https://github.com/alex-and-ye/RSNA-USRA25/. Based on the source code from the paper Multiple Multi-Modal Methods of Malignant Mammogram Classification, we re-implemented the structure to follow object-oriented programming (OOP) principles, separating modules and stages for easier maintenance and modification.
We also implemented many new features such as:
- checkpoint saving
- k-fold cross-validation
- new models, metrics, and loss functions
- threshold tuning
- and other techniques for model improvement.
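As one concrete example of the features above, threshold tuning on a validation set can be sketched as follows. This is a minimal NumPy version; the function name and the threshold grid are illustrative, not taken from the repository:

```python
import numpy as np

def tune_threshold(probs, labels, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the decision threshold that maximizes F1 on a validation set.

    `probs` are predicted positive-class probabilities, `labels` are 0/1
    ground truth. The helper name is illustrative, not from the repo.
    """
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        preds = (probs >= t).astype(int)
        tp = int(np.sum((preds == 1) & (labels == 1)))
        fp = int(np.sum((preds == 1) & (labels == 0)))
        fn = int(np.sum((preds == 0) & (labels == 1)))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

With an imbalanced positive class, the best threshold is often far from the default 0.5, which is why tuning it on held-out data matters.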
We implemented the following metrics: Accuracy, Balanced Accuracy, pF1, F1, Macro F1, AUC-ROC, AUC-PR, Recall, and Precision.
Accuracy
Balanced Accuracy
Precision
Recall (Sensitivity or True Positive Rate)
F1 Score
Macro F1 Score
$$ \text{Macro F1} = \frac{1}{C} \sum_{i=1}^{C} F1_i $$ where $C$ is the number of classes.
Probabilistic F1 (pF1)
$$ \text{pF1} = \frac{2 \cdot p\text{Precision} \cdot p\text{Recall}}{p\text{Precision} + p\text{Recall}}, \qquad p\text{Precision} = \frac{\sum_{i} p_i y_i}{\sum_{i} p_i}, \qquad p\text{Recall} = \frac{\sum_{i} p_i y_i}{\sum_{i} y_i} $$
where
- $p_i \in [0,1]$ is the predicted probability for sample $i$
- $y_i \in \{0,1\}$ is the ground-truth label for sample $i$
AUC-ROC (Area Under the Receiver Operating Characteristic Curve)
AUC-PR (Area Under the Precision-Recall Curve)
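The pF1 metric can be computed directly from raw probabilities, without choosing a threshold. A minimal NumPy sketch (illustrative, not the repository's implementation):

```python
import numpy as np

def probabilistic_f1(probs, labels, eps=1e-9):
    """Threshold-free pF1: precision and recall computed from probabilities."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    ctp = float(np.sum(probs * labels))        # "soft" true positives
    p_precision = ctp / (np.sum(probs) + eps)  # soft TP / total predicted mass
    p_recall = ctp / (np.sum(labels) + eps)    # soft TP / actual positives
    if p_precision + p_recall == 0:
        return 0.0
    return 2 * p_precision * p_recall / (p_precision + p_recall)
```

Because every sample's probability contributes, overconfident wrong predictions are penalized even when a hard threshold would have classified them correctly.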
We usually chose pF1 as the main metric to select checkpoints and update the learning rate scheduler, since it is threshold-independent and represents performance on the positive class, which is crucial for cancer classification.
Because the classes are imbalanced, we relied on balanced accuracy instead of raw accuracy. However, a high balanced accuracy can be achieved by favoring recall over precision, which can lead to poor pF1. For example, one of our models (trained on only 3,000 samples, with positive samples added from external data as an experiment) achieved a balanced accuracy above 85% but an F1 below 20%. This shows that high balanced accuracy alone can be misleading. Therefore, pF1 was used as our main performance metric.
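A synthetic illustration of this effect (randomly generated data, not our actual results): on a ~2%-positive set, a recall-heavy classifier that also flags many negatives scores a strong balanced accuracy while its F1 stays low.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
labels = (rng.random(n) < 0.02).astype(int)  # ~2% positives, RSNA-like ratio
# A recall-heavy classifier: catches 90% of positives but flags 30% of negatives.
preds = np.where(labels == 1,
                 (rng.random(n) < 0.9).astype(int),
                 (rng.random(n) < 0.3).astype(int))

tp = np.sum((preds == 1) & (labels == 1))
fp = np.sum((preds == 1) & (labels == 0))
fn = np.sum((preds == 0) & (labels == 1))
tn = np.sum((preds == 0) & (labels == 0))

recall = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced_acc = (recall + specificity) / 2  # ~0.8, looks strong
f1 = 2 * tp / (2 * tp + fp + fn)           # ~0.1, actually poor
print(f"balanced accuracy: {balanced_acc:.2f}, F1: {f1:.2f}")
```

The false positives barely dent specificity (there are ~50× more negatives), but they dominate the precision term of F1.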
For the single-scan model, we experimented with the following architectures: EfficientNetB0, DenseNet121, ResNet50, ViT, and ConvNeXt-Tiny.
| Model | Architecture Type | Approx. Parameters | Year |
|---|---|---|---|
| EfficientNetB0 | CNN-based | 5.3M | 2019 |
| DenseNet121 | CNN-based | 8.1M | 2017 |
| ResNet50 | CNN-based | 25.6M | 2015 |
| ViT | Transformer-based | 86M | 2021 |
| ConvNeXt-Tiny | CNN-based | 28.6M | 2022 |
Multi-view model: treats each breast (rather than each scan) as a unit, since cancer-related findings may not be visible in every view of a positive case.
Each feature extractor used an EfficientNetV2-Small backbone with the last one or two blocks unfrozen (transfer learning), to reduce the number of trainable parameters while still allowing fine-tuning.
We experimented with the following techniques:
- proposed patching method
- soft labels
- weighted sampler
- adding positive samples from external datasets
- various augmentation methods
- and different loss functions such as BCE and Focal Loss
However, none of these approaches significantly improved the model’s performance.
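For reference, the binary focal loss we experimented with can be sketched as follows (a stand-alone NumPy version for illustration, not the repository's training code):

```python
import numpy as np

def binary_focal_loss(probs, labels, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss (Lin et al., 2017): down-weights easy examples
    so training focuses on hard, often minority-class, samples."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    labels = np.asarray(labels, dtype=float)
    # p_t is the probability the model assigned to the true class.
    p_t = np.where(labels == 1, probs, 1 - probs)
    alpha_t = np.where(labels == 1, alpha, 1 - alpha)
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))
```

With `gamma = 0` and `alpha = 0.5` this reduces to a scaled binary cross-entropy; raising `gamma` shrinks the loss contribution of well-classified samples.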
- Destructive patching with a size of 16×16 performed slightly better than no patching or 32×32, though the difference was not significant (less than 3% test F1).
- Model performance ranking: EfficientNetB0 > DenseNet121 > ResNet50 > ViT = ConvNeXt-Tiny. → Lighter models tended to perform better, while larger models overfitted more easily.
- No significant difference between using DICOM (converted to NumPy arrays) and PNG formats.
- Rebalancing techniques such as class weighting and oversampling did not improve performance on the positive class, likely because the accompanying augmentations (flip, rotation, brightness, noise, etc.) were not strong enough to make the repeated positive samples informative.
- Adding external positive samples to the training set did not help. The model tended to distinguish external data from the RSNA dataset and predict all external samples as positive, while performance on the RSNA positive cases remained poor. This is likely due to a distribution shift between datasets.
- Models trained on balanced data (for example, 21.37% positives in 3,000 samples) achieved a pF1 of about 0.48, but performance dropped sharply when tested on real-world imbalanced data.
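The weighted-sampling idea referenced in these findings can be sketched as follows (a NumPy illustration of inverse-frequency sampling; the function name is ours, not the repository's):

```python
import numpy as np

def balanced_sample_indices(labels, n_samples, seed=0):
    """Draw indices with replacement so each class contributes roughly
    equal probability mass, mimicking a weighted random sampler."""
    labels = np.asarray(labels)
    class_counts = np.bincount(labels)
    # Inverse-frequency weight per sample, normalized to a distribution.
    weights = 1.0 / class_counts[labels]
    weights /= weights.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(len(labels), size=n_samples, replace=True, p=weights)
```

With ~2% positives, batches drawn this way become roughly 50% positive, which is exactly the training/test distribution mismatch the last finding describes.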
The table below shows results from two experiments: (1) a Single Scan model using EfficientNet-B0, with checkpoint selection based on Balanced Accuracy, and (2) a Multiview model with an EfficientNetV2-S backbone trained via transfer learning, with checkpoint selection based on pF1 score.
| Metric | Single Scan (EfficientNet-B0) | Multiview (EffNetV2-S) |
|---|---|---|
| Accuracy | 0.8107 | 0.9735 |
| Balanced Accuracy | 0.7093 | 0.6162 |
| F1 Score | 0.1190 | 0.3614 |
| Macro F1 Score | 0.5065 | 0.6740 |
| Recall | 0.6034 | 0.2344 |
| Precision | 0.0660 | 0.7895 |
| AUC-ROC | 0.7703 | 0.6999 |
Confusion matrix table for the Multiview model:
| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| Actual: Positive | TP = 15 | FN = 49 |
| Actual: Negative | FP = 4 | TN = 1932 |
Overall, the models did not generalize well and tended to overfit the training data.
For a test set with around 2% positives (RSNA distribution), our model achieved a balanced accuracy of 70.9%, which is slightly higher than the 70.2% benchmark reported in the paper (using ResNet50). However, the main metric pF1 (or F1) did not exceed 15%, which is clearly not sufficient for practical use.
As mentioned earlier, our approach should have been different. Instead of focusing on model performance, we should have focused on the knowledge and insights that can be obtained from this dataset. For example:
- examining how each feature contributes to the presence of cancer
- identifying which scan views are more likely to show cancer
- determining which features are most important for detection
This direction would lead to more explainable and interpretable models, which are far more meaningful for clinical applications than simply achieving high accuracy. In short, we rushed into building models without first developing a deep understanding of the dataset itself.
These points should guide our future work - or serve as direction for future research students continuing this project.
