Alexandr Yermakov & Duc Do - USRA 2025
Supervisor: Dr. Carson Leung
Database and Data Mining Lab - University of Manitoba
The RSNA Breast Cancer Detection dataset contains 54,706 DICOM mammograms from 11,913 patients, along with metadata about each patient and scan. Among these, only 1,158 scans (approximately 2%) were identified as containing cancer, distributed across 486 patients.
The main goal of this project is to identify cases of breast cancer. Our initial objective was to build a machine learning pipeline that achieves strong performance metrics. We focused on experimenting with different deep learning architectures such as ViT, ResNet50, and EfficientNetB0, and on trying several augmentation and regularization techniques. However, we later realized that this should not have been our main focus and that our approach should have been different (more on this later).
The dataset is highly imbalanced for a binary classification task: only about 2% of scans are positive. To address this, we tried several techniques, including oversampling, weighted sampling, and the use of external data. Moreover, cancer tumors are often not visible in many scans (especially at early stages), which makes this dataset particularly difficult. On the competition leaderboard, the top pF1 score was only around 0.55.
Eventually, our results were not good enough. On a test set with a similar distribution to the original RSNA dataset (2.1% positives), the single-scan model achieved a pF1 of around 15%, while the multi-view model (treating a breast as a unit rather than a single scan) achieved a pF1 of around 20%.
The code implementation can be found at https://github.com/alex-and-ye/RSNA-USRA25/. Based on the source code from the paper Multiple Multi-Modal Methods of Malignant Mammogram Classification, we re-implemented the structure to follow object-oriented programming (OOP) principles, separating modules and stages for easier maintenance and modification.
We also implemented many new features such as:
- checkpoint saving
- k-fold cross-validation
- new models, metrics, and loss functions
- threshold tuning
- and other techniques for model improvement.
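As one concrete example of the features above, threshold tuning on a validation set can be sketched as follows. This is a minimal NumPy version; the function name and the threshold grid are illustrative, not taken from the repository:

```python
import numpy as np

def tune_threshold(probs, labels, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the decision threshold that maximizes F1 on a validation set.

    `probs` are predicted positive-class probabilities, `labels` are 0/1
    ground truth. The helper name is illustrative, not from the repo.
    """
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        preds = (probs >= t).astype(int)
        tp = int(np.sum((preds == 1) & (labels == 1)))
        fp = int(np.sum((preds == 1) & (labels == 0)))
        fn = int(np.sum((preds == 0) & (labels == 1)))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

With an imbalanced positive class, the best threshold is often far from the default 0.5, which is why tuning it on held-out data matters.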
We implemented the following metrics: Accuracy, Balanced Accuracy, pF1, F1, Macro F1, AUC-ROC, AUC-PR, Recall, and Precision.
Accuracy
Balanced Accuracy
Precision
Recall (Sensitivity or True Positive Rate)
F1 Score
Macro F1 Score
$$ \text{Macro F1} = \frac{1}{C} \sum_{i=1}^{C} F1_i $$ where $C$ is the number of classes.
Probabilistic F1 (pF1)
$$ \text{pF1} = \frac{2 \cdot p\text{Precision} \cdot p\text{Recall}}{p\text{Precision} + p\text{Recall}}, \qquad p\text{Precision} = \frac{\sum_{i} p_i y_i}{\sum_{i} p_i}, \qquad p\text{Recall} = \frac{\sum_{i} p_i y_i}{\sum_{i} y_i} $$
where
- $p_i \in [0,1]$ is the predicted probability for sample $i$
- $y_i \in \{0,1\}$ is the ground-truth label for sample $i$
AUC-ROC (Area Under the Receiver Operating Characteristic Curve)
AUC-PR (Area Under the Precision-Recall Curve)
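The pF1 metric can be computed directly from raw probabilities, without choosing a threshold. A minimal NumPy sketch (illustrative, not the repository's implementation):

```python
import numpy as np

def probabilistic_f1(probs, labels, eps=1e-9):
    """Threshold-free pF1: precision and recall computed from probabilities."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    ctp = float(np.sum(probs * labels))        # "soft" true positives
    p_precision = ctp / (np.sum(probs) + eps)  # soft TP / total predicted mass
    p_recall = ctp / (np.sum(labels) + eps)    # soft TP / actual positives
    if p_precision + p_recall == 0:
        return 0.0
    return 2 * p_precision * p_recall / (p_precision + p_recall)
```

Because every sample's probability contributes, overconfident wrong predictions are penalized even when a hard threshold would have classified them correctly.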
We usually chose pF1 as the main metric to select checkpoints and update the learning rate scheduler, since it is threshold-independent and represents performance on the positive class, which is crucial for cancer classification.
Because the classes are imbalanced, we relied on balanced accuracy instead of raw accuracy. However, a high balanced accuracy can be achieved by favoring recall over precision, which can lead to poor pF1. For example, one of our models (trained on only 3,000 samples, with positive samples added from external data as an experiment) achieved a balanced accuracy above 85% but an F1 below 20%. This shows that high balanced accuracy alone can be misleading. Therefore, pF1 was used as our main performance metric.
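A synthetic illustration of this effect (randomly generated data, not our actual results): on a ~2%-positive set, a recall-heavy classifier that also flags many negatives scores a strong balanced accuracy while its F1 stays low.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
labels = (rng.random(n) < 0.02).astype(int)  # ~2% positives, RSNA-like ratio
# A recall-heavy classifier: catches 90% of positives but flags 30% of negatives.
preds = np.where(labels == 1,
                 (rng.random(n) < 0.9).astype(int),
                 (rng.random(n) < 0.3).astype(int))

tp = np.sum((preds == 1) & (labels == 1))
fp = np.sum((preds == 1) & (labels == 0))
fn = np.sum((preds == 0) & (labels == 1))
tn = np.sum((preds == 0) & (labels == 0))

recall = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced_acc = (recall + specificity) / 2  # ~0.8, looks strong
f1 = 2 * tp / (2 * tp + fp + fn)           # ~0.1, actually poor
print(f"balanced accuracy: {balanced_acc:.2f}, F1: {f1:.2f}")
```

The false positives barely dent specificity (there are ~50× more negatives), but they dominate the precision term of F1.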
For the single-scan model, we experimented with the following architectures: EfficientNetB0, DenseNet121, ResNet50, ViT, and ConvNeXt-Tiny.
| Model | Architecture Type | Approx. Parameters | Year |
|---|---|---|---|
| EfficientNetB0 | CNN-based | 5.3M | 2019 |
| DenseNet121 | CNN-based | 8.1M | 2017 |
| ResNet50 | CNN-based | 25.6M | 2015 |
| ViT | Transformer-based | 86M | 2021 |
| ConvNeXt-Tiny | CNN-based | 28.6M | 2022 |
Multi-view model: treats each breast (rather than each scan) as a unit, since cancer-related findings may not be visible in every view of a positive case.
Each feature extractor used an EfficientNetV2-Small backbone with the last one or two blocks unfrozen (transfer learning), to reduce the number of trainable parameters while still allowing fine-tuning.
We experimented with the following techniques:
- proposed patching method
- soft labels
- weighted sampler
- adding positive samples from external datasets
- various augmentation methods
- and different loss functions such as BCE and Focal Loss
However, none of these approaches significantly improved the model’s performance.
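For reference, the binary focal loss we experimented with can be sketched as follows (a stand-alone NumPy version for illustration, not the repository's training code):

```python
import numpy as np

def binary_focal_loss(probs, labels, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss (Lin et al., 2017): down-weights easy examples
    so training focuses on hard, often minority-class, samples."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    labels = np.asarray(labels, dtype=float)
    # p_t is the probability the model assigned to the true class.
    p_t = np.where(labels == 1, probs, 1 - probs)
    alpha_t = np.where(labels == 1, alpha, 1 - alpha)
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))
```

With `gamma = 0` and `alpha = 0.5` this reduces to a scaled binary cross-entropy; raising `gamma` shrinks the loss contribution of well-classified samples.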
- Destructive patching with a size of 16×16 performed slightly better than no patching or 32×32, though the difference was not significant (less than 3% test F1).
- Model performance ranking: EfficientNetB0 > DenseNet121 > ResNet50 > ViT = ConvNeXt-Tiny. → Lighter models tended to perform better, while larger models overfitted more easily.
- No significant difference between using DICOM (converted to NumPy arrays) and PNG formats.
- Rebalancing techniques such as class weighting and oversampling did not improve performance on the positive class, likely because the accompanying augmentations (flip, rotation, brightness, noise, etc.) were not strong enough to make the repeated positive samples informative.
- Adding external positive samples to the training set did not help. The model tended to distinguish external data from the RSNA dataset and predict all external samples as positive, while performance on the RSNA positive cases remained poor. This is likely due to a distribution shift between datasets.
- Models trained on balanced data (for example, 21.37% positives in 3,000 samples) achieved a pF1 of about 0.48, but performance dropped sharply when tested on real-world imbalanced data.
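The weighted-sampling idea referenced in these findings can be sketched as follows (a NumPy illustration of inverse-frequency sampling; the function name is ours, not the repository's):

```python
import numpy as np

def balanced_sample_indices(labels, n_samples, seed=0):
    """Draw indices with replacement so each class contributes roughly
    equal probability mass, mimicking a weighted random sampler."""
    labels = np.asarray(labels)
    class_counts = np.bincount(labels)
    # Inverse-frequency weight per sample, normalized to a distribution.
    weights = 1.0 / class_counts[labels]
    weights /= weights.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(len(labels), size=n_samples, replace=True, p=weights)
```

With ~2% positives, batches drawn this way become roughly 50% positive, which is exactly the training/test distribution mismatch the last finding describes.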
The table below shows results from two experiments: (1) a Single Scan model using EfficientNet-B0, with checkpoint selection based on Balanced Accuracy, and (2) a Multiview model with an EfficientNetV2-S backbone trained via transfer learning, with checkpoint selection based on pF1 score.
| Metric | Single Scan (EfficientNet-B0) | Multiview (EffNetV2-S) |
|---|---|---|
| Accuracy | 0.8107 | 0.9735 |
| Balanced Accuracy | 0.7093 | 0.6162 |
| F1 Score | 0.1190 | 0.3614 |
| Macro F1 Score | 0.5065 | 0.6740 |
| Recall | 0.6034 | 0.2344 |
| Precision | 0.0660 | 0.7895 |
| AUC-ROC | 0.7703 | 0.6999 |
Confusion matrix table for the Multiview model:
| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| Actual: Positive | TP = 15 | FN = 49 |
| Actual: Negative | FP = 4 | TN = 1932 |
Overall, the models did not generalize well and tended to overfit the training data.
For a test set with around 2% positives (RSNA distribution), our model achieved a balanced accuracy of 70.9%, which is slightly higher than the 70.2% benchmark reported in the paper (using ResNet50). However, the main metric pF1 (or F1) did not exceed 15%, which is clearly not sufficient for practical use.
As mentioned earlier, our approach should have been different. Instead of focusing on model performance, we should have focused on the knowledge and insights that can be obtained from this dataset. For example:
- examining how each feature contributes to the presence of cancer
- identifying which scan views are more likely to show cancer
- determining which features are most important for detection
This direction would lead to more explainable and interpretable models, which are far more meaningful for clinical applications than simply achieving high accuracy. In short, we rushed into building models without first developing a deep understanding of the dataset itself.
These points should guide our future work - or serve as direction for future research students continuing this project.
