Phishing Email Detection with Multimodal Deep Learning

ENPM703 — Fundamentals of AI and Deep Learning | Fall 2025

A dual-tower fusion deep learning system that detects phishing emails by jointly analyzing email text, embedded brand logos, and engineered metadata. It achieves 99.45% accuracy and an AUC of 0.999 on a balanced dataset of 76,346 emails.

Overview

Phishing attacks remain one of the most prevalent cyber threats. Text-only filters can miss attacks in which the email visually mimics a legitimate brand. This project addresses that gap with a multimodal approach that fuses three complementary signal types:

Modality	What it captures	Output dim
Email Text	Linguistic patterns, urgency, vocabulary	256-d
Brand Logo	Visual brand impersonation via embedded images	512-d
Metadata	URL count, capitalization ratio, keyword signals	64-d (projected from 20)

The text and image towers are pre-trained independently as specialists, then combined with the metadata MLP in a joint fusion classifier that reaches 99.45% accuracy.

Architecture

                    ┌─────────────────────────────────────────────────────┐
                    │                  EMAIL INPUT                        │
                    └──────────┬────────────────┬──────────────┬──────────┘
                               │                │              │
                    ┌──────────▼──────┐  ┌──────▼──────┐  ┌───▼────────────┐
                    │   TEXT TOWER    │  │ IMAGE TOWER │  │ METADATA MLP   │
                    │  (Custom CNN)   │  │ (Custom CNN)│  │  (20-dim feat) │
                    │                 │  │             │  │                │
                    │  Embed(128)     │  │ Conv 3→64   │  │  Linear(20,64) │
                    │  Conv1D ×4      │  │ Conv 64→128 │  │  ReLU          │
                    │  GlobalAvgPool  │  │ Conv128→256 │  │                │
                    │  Linear→256     │  │ Conv256→512 │  └───────┬────────┘
                    └──────────┬──────┘  │ AvgPool→512 │          │
                               │         └──────┬──────┘          │
                               │  256-d         │  512-d          │  64-d
                               └────────────────┴──────────────────┘
                                                │
                                         Concat (832-d)
                                                │
                                    ┌───────────▼───────────┐
                                    │    FUSION CLASSIFIER   │
                                    │  Linear(832→512) + BN  │
                                    │  Linear(512→256) + BN  │
                                    │  Linear(256→128) + BN  │
                                    │  Linear(128→2)         │
                                    └───────────┬────────────┘
                                                │
                                    ┌───────────▼────────────┐
                                    │  Phishing / Legitimate  │
                                    └────────────────────────┘
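
The fusion stage in the diagram can be sketched in PyTorch as follows. This is a minimal illustration, not the code in src/fusion_models.py: the class name FusionClassifierSketch is hypothetical, the towers are replaced by pre-computed feature tensors, and the ReLU activations and dropout rate between the linear layers are assumptions, since the diagram lists only Linear and BN.

```python
import torch
import torch.nn as nn


class FusionClassifierSketch(nn.Module):
    """Fuses pre-computed tower features; the towers themselves are not shown."""

    def __init__(self, text_dim=256, image_dim=512, meta_in=20, meta_dim=64,
                 num_classes=2, dropout=0.3):  # dropout rate is an assumption
        super().__init__()
        # Metadata MLP: project the 20 engineered features to 64-d.
        self.meta_mlp = nn.Sequential(nn.Linear(meta_in, meta_dim), nn.ReLU())
        fused_dim = text_dim + image_dim + meta_dim  # 256 + 512 + 64 = 832

        def block(n_in, n_out):
            # Linear + BatchNorm as in the diagram; ReLU and Dropout are assumed.
            return [nn.Linear(n_in, n_out), nn.BatchNorm1d(n_out),
                    nn.ReLU(), nn.Dropout(dropout)]

        self.classifier = nn.Sequential(
            *block(fused_dim, 512),
            *block(512, 256),
            *block(256, 128),
            nn.Linear(128, num_classes),
        )

    def forward(self, text_feat, image_feat, metadata):
        # Concatenate the three feature vectors along the feature axis.
        fused = torch.cat([text_feat, image_feat, self.meta_mlp(metadata)], dim=1)
        return self.classifier(fused)


model = FusionClassifierSketch().eval()
with torch.no_grad():
    logits = model(torch.randn(4, 256), torch.randn(4, 512), torch.randn(4, 20))
print(logits.shape)  # torch.Size([4, 2])
```

Concatenation keeps each modality's features intact and lets the classifier learn how to weight them, at the cost of a wider first layer (832 inputs).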

Training strategy:

Phase 1 — Text and image towers are trained independently as specialist classifiers.
Phase 2 — Email and logo datasets are aligned into a unified multimodal dataset.
Phase 3 — The pre-trained towers are loaded and frozen while the fusion classifier and metadata MLP are trained; the full network is then fine-tuned end-to-end.

Dataset

Emails

Source	Type	Approx. Count
CEAS_08	Spam / Phishing	~17,000
Enron	Legitimate	~18,000
Nazario	Phishing	~2,000
Nigerian Prince	Phishing	~4,000
Total (balanced)	50% phishing / 50% legitimate	76,346
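
The README does not state how the 50/50 balance is obtained. One common approach is to downsample the larger class, sketched below with the standard library only. The function name balance_by_downsampling and the "label" key are hypothetical and are not taken from src/email_ratio.py.

```python
import random


def balance_by_downsampling(rows, label_key="label", seed=0):
    """Downsample each class to the size of the smallest class."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Toy data: 10 phishing (label 1) and 20 legitimate (label 0) emails.
emails = [{"text": f"msg {i}", "label": 1 if i % 3 == 0 else 0} for i in range(30)]
balanced = balance_by_downsampling(emails)
counts = {label: sum(r["label"] == label for r in balanced) for label in (0, 1)}
print(counts)  # {0: 10, 1: 10}
```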

Brand Logos (Image Tower)

Source: OpenLogo dataset
72,652 brand logo images across 352 brand classes
Resized to 224×224 and normalized with ImageNet statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
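
As a worked example of the normalization step, the sketch below applies the ImageNet statistics to a single RGB pixel in plain Python. The helper name normalize_pixel is hypothetical; in a PyTorch pipeline the same arithmetic would normally be applied to whole image tensors.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


white = normalize_pixel((255, 255, 255))
black = normalize_pixel((0, 0, 0))
print([round(c, 3) for c in white])
print([round(c, 3) for c in black])
```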

Metadata Features (20-dim vector)

Text length, subject length, body length, URL count, shortened-URL flag, suspicious domain keywords, urgency-word count, action-phrase count, financial-keyword count, capitalization ratio, exclamation-mark count, dollar-sign count, word count, and 7 additional engineered binary/continuous signals.
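
A few of the listed features can be computed with the standard library alone, as in the sketch below. It covers only a subset of the 20 features, and the keyword lists, regular expression, and function name extract_metadata_subset are illustrative assumptions rather than the project's actual definitions.

```python
import re

URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)
SHORTENERS = ("bit.ly", "tinyurl.com")                  # illustrative list
URGENCY_WORDS = ("urgent", "immediately", "suspended")  # illustrative list


def extract_metadata_subset(subject, body):
    """Return a few of the metadata features as floats."""
    text = f"{subject} {body}"
    urls = URL_RE.findall(text)
    letters = [ch for ch in text if ch.isalpha()]
    lowered = text.lower()
    return {
        "text_length": float(len(text)),
        "subject_length": float(len(subject)),
        "body_length": float(len(body)),
        "url_count": float(len(urls)),
        "has_short_url": float(any(s in u.lower() for u in urls for s in SHORTENERS)),
        "urgency_count": float(sum(lowered.count(w) for w in URGENCY_WORDS)),
        # Share of alphabetic characters that are uppercase.
        "cap_ratio": sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0,
        "exclamation_count": float(text.count("!")),
        "dollar_count": float(text.count("$")),
        "word_count": float(len(text.split())),
    }


features = extract_metadata_subset(
    "URGENT: account suspended",
    "Verify immediately at http://bit.ly/abc123 or lose $500!",
)
print(features["url_count"], features["has_short_url"], features["urgency_count"])
# 1.0 1.0 3.0
```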

Results

Model Performance Comparison

Model	Accuracy	Notes
KNN (text features)	81.71%	Baseline
Logistic Regression	80.00%	Baseline
Custom Text CNN	98.96%	Phase 1 text specialist
Custom Image CNN	76.30%	Phase 1 image specialist
ResNet18 (transfer learning)	97.43%	Comparison baseline
Dual-Tower Fusion	99.45%	Final model

Fusion Model Detailed Metrics

Metric	Score
Accuracy	99.45%
AUC-ROC	0.999
Precision (phishing class)	99.5%
Recall (phishing class)	99.4%
F1-Score	99.4%
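
As a quick consistency check, the F1 score is the harmonic mean of precision and recall. Recomputing it from the rounded values in the table gives roughly 0.9945, which agrees with the reported F1 to within rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.995, 0.994)
print(f"{f1:.4f}")
```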

Repository Structure

Phishing-Email-Detection-Multimodal-Deep-Learning/
│
├── notebooks/                              # Training notebooks — run in order
│   ├── Final_CNN_Text_1.ipynb                    # Phase 1A: Train text CNN specialist
│   ├── Final_CNN_Images_Custom.ipynb             # Phase 1B: Train image CNN (custom)
│   ├── Final_CNN_Images_Resnet18.ipynb           # Phase 1B alt: ResNet18 comparison
│   ├── dual_tower_text_features.ipynb            # Phase 2: Build unified multimodal dataset
│   ├── train_fusion3.ipynb                       # Phase 3: Train dual-tower fusion model
│   └── baselines/
│       ├── Final_KNN_Text.ipynb                  # KNN baseline
│       └── text_tower_knn.ipynb                  # KNN text tower variant
│
├── src/                                    # Reusable Python modules
│   ├── fusion_models.py          # DualTowerFusionModel, TextFeatureExtractor, ImageFeatureExtractor
│   ├── fusion_dataset_v2.py      # PyTorch Dataset for multimodal training
│   ├── brand_extractor.py        # Extract brand names from email text
│   ├── brand_logo_mapper.py      # Map brand names to logo file paths
│   ├── build_brand_index.py      # Build brand→images JSON index
│   ├── email_ratio.py            # Email dataset balance utilities
│   └── __init__.py
│
├── inference/
│   └── preprocess_html_and_predict.py  # End-to-end inference on raw .html email files
│
├── data/
│   ├── vocab_text_1.json               # Text CNN vocabulary (word→index)
│   ├── class_to_idx_image_custom.json  # Image CNN label map (custom CNN)
│   ├── class_to_idx_image_resnet18.json# Image CNN label map (ResNet18)
│   ├── brand_to_images.json            # Brand→logo file paths index
│   ├── cleaned_combined_emails.csv     # Preprocessed email dataset           [git-lfs]
│   └── unified_multimodal_text.csv     # Unified multimodal training dataset  [git-lfs]
│
├── models/
│   ├── best_custom_cnn_text_1.pth      # Text specialist weights  (~104 MB)   [git-lfs]
│   ├── best_custom_cnn_image_custom.pth# Image specialist weights (~31 MB)    [git-lfs]
│   └── best_fusion_model.pth           # Final fusion model weights (~137 MB) [git-lfs]
│
├── docs/
│   ├── Project_Report.pdf              # Full project report
│   ├── Architecture_Report.pdf         # System architecture report
│   ├── Presentation.pdf                # Project presentation slides
│   ├── Contribution_Report.pdf         # Team contribution report
│   ├── Presentation_Recording.mp4      # Presentation recording
│   └── Demo_Recording.mp4              # Live demo recording
│
├── requirements.txt
├── .gitattributes                      # git-lfs tracking rules
├── .gitignore
└── README.md

Files marked [git-lfs] are tracked with Git Large File Storage. If they were not downloaded during cloning, run git lfs pull to fetch them.

Installation

Prerequisites

Python 3.10+
CUDA-capable GPU (recommended; CPU inference is supported but slower)
Git LFS for model weights and large datasets

Clone and Install

# 1. Install Git LFS (if not already installed)
git lfs install

# 2. Clone — LFS files download automatically
git clone https://github.com/akashsv01/Phishing-Email-Detection-Multimodal-Deep-Learning.git
cd Phishing-Email-Detection-Multimodal-Deep-Learning

# 3. If LFS files did not download automatically
git lfs pull

# 4. Create and activate a virtual environment
python -m venv venv
source venv/bin/activate        # Linux / macOS
# venv\Scripts\activate         # Windows

# 5. Install dependencies
pip install -r requirements.txt

Reproducing the Pipeline

Run the notebooks in order. Each phase produces artifacts consumed by the next.

Phase 1 — Specialist Pre-training

1A. Text Specialist CNN

Notebook: notebooks/Final_CNN_Text_1.ipynb

Input: data/cleaned_combined_emails.csv
Builds a character-level vocabulary and trains a 4-block 1D CNN on tokenized email text
Outputs: models/best_custom_cnn_text_1.pth, data/vocab_text_1.json
Result: 98.96% accuracy
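
The tokenization step can be sketched as a vocabulary lookup followed by truncation or padding to a fixed length. The README describes the vocabulary as character-level here but as word→index in the repository tree, so the sketch below splits on whitespace as an assumption; iterating over characters instead would give the character-level variant. The function name encode_text, the special-token indices, and max_len are hypothetical.

```python
def encode_text(text, vocab, max_len=8, pad_idx=0, unk_idx=1):
    """Map whitespace tokens to vocabulary indices, then truncate or pad."""
    ids = [vocab.get(tok, unk_idx) for tok in text.lower().split()]
    ids = ids[:max_len]
    return ids + [pad_idx] * (max_len - len(ids))


# Toy vocabulary; the real one would be loaded from data/vocab_text_1.json.
vocab = {"<pad>": 0, "<unk>": 1, "verify": 2, "your": 3, "account": 4, "now": 5}
encoded = encode_text("Verify your account NOW please", vocab)
print(encoded)  # [2, 3, 4, 5, 1, 0, 0, 0]
```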

1B. Image Specialist CNN

Notebook: notebooks/Final_CNN_Images_Custom.ipynb

Input: OpenLogo brand logo dataset (download separately — see note below)
Trains a VGG-style 4-block 2D CNN on 224×224 logo images
Outputs: models/best_custom_cnn_image_custom.pth, data/class_to_idx_image_custom.json
Result: 76.30% accuracy (352-class logo classification)

OpenLogo dataset: Not included due to size (~2 GB). Download from qmul-openlogo.github.io and set the path in the notebook.

Comparison baseline: notebooks/Final_CNN_Images_Resnet18.ipynb — uses ResNet18 transfer learning (97.43% accuracy on the email classification task).

Phase 2 — Multimodal Data Integration

Notebook: notebooks/dual_tower_text_features.ipynb

Extracts brand names from each email using src/brand_extractor.py
Maps each email to its most relevant brand logo file path via data/brand_to_images.json
Merges text features with image paths and metadata into a single aligned dataset
Output: data/unified_multimodal_text.csv
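
The alignment steps above might look roughly like the following sketch. The index contents, the functions extract_brand and build_row, and the rule of picking the most frequently mentioned brand are all assumptions for illustration; the actual logic lives in src/brand_extractor.py and src/brand_logo_mapper.py.

```python
import re

# Hypothetical index in the spirit of data/brand_to_images.json.
BRAND_TO_IMAGES = {
    "paypal": ["logos/paypal/001.jpg", "logos/paypal/002.jpg"],
    "amazon": ["logos/amazon/001.jpg"],
}


def extract_brand(text, known_brands):
    """Return the most frequently mentioned known brand, or None."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = {brand: words.count(brand) for brand in known_brands}
    best = max(counts, key=counts.get, default=None)
    return best if best is not None and counts[best] > 0 else None


def build_row(email_text, label, index):
    """Attach a logo path (or None) to an email record."""
    brand = extract_brand(email_text, index)
    image_path = index[brand][0] if brand else None
    return {"text": email_text, "label": label, "brand": brand, "image_path": image_path}


row = build_row("Your PayPal account is limited. Log in to PayPal now.", 1, BRAND_TO_IMAGES)
print(row["brand"], row["image_path"])  # paypal logos/paypal/001.jpg
```

Emails that mention no known brand get an image path of None in this sketch; how the real pipeline handles that case is not stated in the README.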

Phase 3 — Fusion Model Training

Notebook: notebooks/train_fusion3.ipynb

Loads pre-trained text and image tower weights from Phase 1
Stage 1: Freezes both towers; trains only the fusion classifier and metadata MLP
Stage 2: Unfreezes towers; fine-tunes the full network end-to-end
Output: models/best_fusion_model.pth
Result: 99.45% accuracy, AUC 0.999
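
The two-stage schedule can be expressed in PyTorch by toggling requires_grad on the tower parameters. The sketch below uses small stand-in modules rather than the real towers, and the helper set_trainable, the optimizer choice, and the learning rates are assumptions.

```python
import torch
import torch.nn as nn


def set_trainable(module, trainable):
    """Enable or disable gradients for every parameter in a module."""
    for param in module.parameters():
        param.requires_grad = trainable


# Stand-ins for the real towers and fusion head (input sizes are arbitrary here).
text_tower = nn.Linear(10, 256)
image_tower = nn.Linear(10, 512)
fusion_head = nn.Linear(256 + 512, 2)
model = nn.ModuleDict({"text": text_tower, "image": image_tower, "fusion": fusion_head})

# Stage 1: freeze the towers and optimize only what remains trainable.
set_trainable(model["text"], False)
set_trainable(model["image"], False)
stage1_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(stage1_params, lr=1e-3)  # learning rate is an assumption

# Stage 2: unfreeze everything and fine-tune with a smaller learning rate.
set_trainable(model, True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
stage2_count = sum(p.requires_grad for p in model.parameters())
print(len(stage1_params), stage2_count)  # 2 6
```

Training the new layers first keeps their initially random gradients from disturbing the pre-trained tower weights; the lower learning rate in Stage 2 serves the same purpose.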

Running Inference

Classify a raw .html email file using the trained fusion model:

python inference/preprocess_html_and_predict.py path/to/email.html

Pipeline (fully automatic):

Parse HTML → extract subject, body text, and embedded/linked images
Tokenize text using data/vocab_text_1.json
Decode and resize the first image to 224×224
Extract 20 metadata features (URL patterns, keyword signals, character statistics)
Run all three tensors through the fusion model
Print verdict + confidence + detected suspicious signals
Save a .txt report alongside the input file
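
The first pipeline step, parsing the HTML, can be approximated with Python's built-in html.parser, as sketched below. The class name EmailHTMLParser and the choice to read the subject from the title element are assumptions; inference/preprocess_html_and_predict.py may use a different library and different rules.

```python
from html.parser import HTMLParser


class EmailHTMLParser(HTMLParser):
    """Collects the <title>, visible text, and <img> sources from an HTML email."""

    def __init__(self):
        super().__init__()
        self.title_parts, self.text_parts, self.image_sources = [], [], []
        self._in_title = False
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_sources.append(src)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        (self.title_parts if self._in_title else self.text_parts).append(text)


html = """<html><head><title>Account suspended</title><style>p{color:red}</style></head>
<body><p>Verify your account now.</p><img src="https://example.com/logo.png"></body></html>"""
parser = EmailHTMLParser()
parser.feed(html)
subject = " ".join(parser.title_parts)
body = " ".join(parser.text_parts)
print(subject, "|", body, "|", parser.image_sources)
```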

Example output:

======================================================================
Analyzing: suspicious_email.html
======================================================================

Step 1: Parsing HTML...
   Subject: Your account has been suspended - Immediate action required...
   Body length: 2847 chars
   Images found: 2

Step 2: Processing text...
Step 3: Processing image...
Step 4: Extracting metadata...
Step 5: Running fusion model...

======================================================================
ANALYSIS RESULTS
======================================================================

Prediction: PHISHING
Confidence: 98.73%

Detected Issues:
  Contains shortened URLs (bit.ly, tinyurl)
  High urgency language (5 urgent keywords)
  Multiple call-to-action phrases
  Excessive capitalization (34.2%)
======================================================================
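
The verdict and confidence lines in a report like this are typically derived by applying a softmax to the model's two output logits. The sketch below shows that calculation in plain Python; the logit values are made up and the class order in LABELS is an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


LABELS = ("LEGITIMATE", "PHISHING")  # class order is an assumption

logits = [-2.0, 2.0]  # made-up model output for one email
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"Prediction: {LABELS[best]}")
print(f"Confidence: {probs[best]:.2%}")
```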

Live Demo

A Streamlit application is deployed on Hugging Face Spaces — no installation required:

https://huggingface.co/spaces/anilawork/phish-detection-ui-final

Upload any .html email file to receive an instant phishing verdict with confidence score and suspicious signal breakdown.

Team

References

OpenLogo Dataset, Queen Mary University of London: https://qmul-openlogo.github.io/
CEAS 2008 Spam Filtering Challenge
Phishing Email Dataset, Naser Abdullah Alam: https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset
PyTorch: https://pytorch.org
