Bert-Based E-mail Classifier

Final Project for Technion's ECE Deep Learning Course (046211)

Getting Started

git clone https://github.com/GalGranot/BERT-EmailClassifier.git
pip install -r requirements.txt

from src.utils import (
    classify_email,
    get_top_tokens_from_text,
    clean_email,
    example_usage_clean)

Introduction

This repository contains an email proccessing transformer, based on the Bidirectional Encoder Representations from Transformers (BERT) language model. The model, based on the attention architecture, attempts to classify legitimate emails from malicious phishing attempts, and can be used as an additional layer of protection against these types of attacks.

This project was written as part of the Technion's ECE Deep Learning course (046211).

A short video summarizing the project can be found here.

We've used this Kaggle database as a baseline for performance - it contains 82k emails with roughly equal amounts of legitimate and phishing/spam emails.

Background

The goal of this project is to develop an effective deep learning model for phishing email detection. By leveraging transformer models, we aim to accurately classify emails and highlight key words that contribute most to the classification decision.

Phishing attacks continue to be a significant cybersecurity threat, targeting both individuals and organizations. Traditional rule-based systems often are not enough to defend the average email user from the evolving nature of phishing techniques, as well as provide semantic context to the user. Machine learning models, particularly large language models based on transformers, offer a promising solution due to their ability to capture contextual information in text. Furthermore, enhancing model explainability and providing transparency into the decision-making process as seen by the user is crucial for gaining the latter’s trust and ensuring that the model's decisions can be validated.

We fine-tuned a pre-trained BERT model for our phishing detection task. BERT is a transformer-based model designed to capture deep bidirectional context of each token by jointly conditioning on both left and right context in all layers. This enables BERT to capture rich contextual information, making it ideal for text classification tasks.

Our workflow consisted of using the pre-trained BERT model with an added classification layer and training it (while maintaining some weights as no-decay) on the email database. After conducting the hyperparameter search, we trained the optimized model on the dataset, validated it during training using the validation set, while trying not to overfit the model in order to improve generalization.

Results

Our model achieved an accuracy rate of 98% on the test dataset, in addition to highlighting critical words that contributed to the model’s decision-making process.

Attention visualization:

We've also visualized the attention mechanism using BertViz. In this tool the strength of the semantic connection between tokens is visualized as lines betweens tokens, with stronger lines indicating stronger connections. We've used this a visual aid to the user in providing explainability of the model's decisions.

Training & Validation Loss:

Confusion Matrices:

Out of 1312 datapoints in the test dataset, the model successfully classified 1286 datapoints. The false-negative and false-positive rates are very comparable (11 false-negatives and 15 false-positives), which suggests strong confidence in the predictions as well as a strong generalization ability.

How to run

email = '''
Hello, welcome to the Bert-Based E-mail Classifier! This is an example 
usage for the classify_email method. We'll clean this email and classify it
'''
print(classify_email(email))

File Overview

Directory	File	Usage
.	README.md	this file
.	requirements.txt	pip install requirements file
.	BERT Email Classifier.pdf	project report and details
.	BERT Email Classifier Presentation.pdf	project presentation
research	project.ipynb	notebook interface for running project
research	project_raw.ipynb	our process of developing the project
src	utils.py	APIs for classifying and preprocessing emails
src	train.py	training model functions
model	.	trained model weights and parameters

Steps to Set Up the Project

Installation Commands From Command Line

Step	Command
Install `torchdata`	`pip install torchdata`
Install `portalocker`	`pip install portalocker`
Install `kaggle`	`pip install kaggle`
Set up `kaggle.json`	`mkdir -p ~/.kaggle` `cp kaggle.json ~/.kaggle/` `chmod 600 ~/.kaggle/kaggle.json`
Download dataset	`kaggle datasets download -d subhajournal/phishingemails`
Install `bertviz`	`pip install bertviz`
Install `jupyterlab`	`pip install jupyterlab`
Install `ipywidgets`	`pip install ipywidgets`
Unzip dataset	`unzip /content/phishingemails.zip`

Libraries and Versions

Library	Version
`numpy`	`1.25.0`
`pandas`	`2.1.0`
`matplotlib`	`3.7.2`
`torch`	`2.0.1`
`scikit-learn`	`1.2.2`
`transformers`	`4.31.0`
`torchtext`	`0.16.0`
`bertviz`	`1.4.0`

Sources

• Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All You Need. Advances in Neural Information Processing Systems, 30. This paper introduces the Transformer architecture, which serves as the foundation for models like BERT.

• Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019, 4171-4186. This paper describes the BERT model used in our project for text classification tasks.

• Kaggle Phishing Emails Dataset

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Bert-Based E-mail Classifier

Final Project for Technion's ECE Deep Learning Course (046211)

Getting Started

Introduction

Background

Results

Attention visualization:

Training & Validation Loss:

Confusion Matrices:

How to run

File Overview

Steps to Set Up the Project

Installation Commands From Command Line

Libraries and Versions

Sources

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
images		images
model		model
research		research
src		src
.gitattributes		.gitattributes
BERT Email Classifier Presentation.pdf		BERT Email Classifier Presentation.pdf
BERT Email Classifier.pdf		BERT Email Classifier.pdf
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Bert-Based E-mail Classifier

Final Project for Technion's ECE Deep Learning Course (046211)

Getting Started

Introduction

Background

Results

Attention visualization:

Training & Validation Loss:

Confusion Matrices:

How to run

File Overview

Steps to Set Up the Project

Installation Commands From Command Line

Libraries and Versions

Sources

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages