
ReCAPTCHA V2 solver using DINOv2 and MambaVision

Abstract

In today’s digital world, safeguarding online platforms is increasingly vital due to the rise in automated attacks. ReCAPTCHA v2, a widely used security tool from Google, helps distinguish between human users and bots, thus protecting websites from various malicious activities. To assess the effectiveness of ReCAPTCHA v2 and identify potential vulnerabilities, our project investigates advanced machine learning techniques for solving ReCAPTCHA v2 challenges. We compare DINOv2, optimized through hyperparameter tuning, with MambaVision, which utilizes transfer learning. Our analysis reveals that DINOv2 outperforms MambaVision and other models in solving ReCAPTCHA v2 challenges, demonstrating its ability to produce robust, high-performance visual features for computer vision tasks without relying on transfer learning.

Dataset

The dataset used for this project was sourced from Kaggle and contains 40,022 images across 11 classes relevant to ReCAPTCHA challenges. We split the dataset into training, validation, and test sets, allocating 70% for training, 15% for validation, and 15% for testing. Pixel values were normalized (standardized per channel), and images were resized to 98x98 pixels to match the input requirements of our models.

Below is an overview of our dataset distribution:

| Class         | Train | Validation | Test | Total |
|---------------|------:|-----------:|-----:|------:|
| Bicycle       |  1705 |        365 |  366 |  2436 |
| Bridge        |  1267 |        271 |  273 |  1811 |
| Bus           |  5862 |       1256 | 1257 |  8375 |
| Car           |  6776 |       1452 | 1453 |  9681 |
| Chimney       |   272 |         58 |   59 |   389 |
| Crosswalk     |  2307 |        494 |  495 |  3296 |
| Hydrant       |  4872 |       1044 | 1045 |  6961 |
| Motorcycle    |   193 |         41 |   43 |   277 |
| Palm          |  1791 |        383 |  385 |  2559 |
| Stair         |   450 |         96 |   98 |   644 |
| Traffic Light |  2515 |        538 |  540 |  3593 |
| **Total**     | 28010 |       5998 | 6014 | 40022 |

DINOv2

DINOv2 is a self-supervised learning framework that builds on the original DINO (Distillation with No Labels) approach, leveraging Vision Transformers (ViTs) to extract meaningful features from images without requiring labeled data. The key innovations of DINOv2 include multi-crop training, which allows the model to handle images at various resolutions, and momentum encoders, which help the model maintain stable representations across training iterations. By focusing on these methods, DINOv2 is able to outperform traditional convolutional networks and supervised transformers in tasks such as image classification, object detection, and segmentation.

Additionally, DINOv2 uses a contrastive learning technique to distinguish between similar and dissimilar images in the feature space, which enhances its ability to categorize images accurately. With its powerful self-distillation process, DINOv2 can generate rich visual representations, making it highly effective for complex vision tasks.
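The self-distillation and momentum-encoder ideas described above can be illustrated with a toy student/teacher pair. This is a conceptual sketch with small linear networks, not DINOv2's actual ViT implementation; the temperature and momentum values are assumed for illustration.

```python
import torch
import torch.nn.functional as F

# DINO-style self-distillation sketch: a student network learns to match a
# momentum (EMA) teacher on two augmented "crops" of the same input.
torch.manual_seed(0)
dim_in, dim_out = 32, 8
student = torch.nn.Linear(dim_in, dim_out)
teacher = torch.nn.Linear(dim_in, dim_out)
teacher.load_state_dict(student.state_dict())   # start identical
for p in teacher.parameters():
    p.requires_grad_(False)                     # teacher receives no gradients

opt = torch.optim.SGD(student.parameters(), lr=0.1)
momentum = 0.99                                  # EMA rate (assumed value)

x = torch.randn(16, dim_in)                      # a batch of "images"
crop1 = x + 0.1 * torch.randn_like(x)            # two noisy "crops"
crop2 = x + 0.1 * torch.randn_like(x)

# Student predicts the teacher's sharpened distribution on the other crop.
t_out = F.softmax(teacher(crop1) / 0.04, dim=-1)     # teacher temperature
s_out = F.log_softmax(student(crop2) / 0.1, dim=-1)  # student temperature
loss = -(t_out * s_out).sum(dim=-1).mean()           # cross-entropy
opt.zero_grad(); loss.backward(); opt.step()

# Momentum-encoder update: teacher weights drift slowly toward the student,
# which keeps the distillation targets stable across training iterations.
with torch.no_grad():
    for tp, sp in zip(teacher.parameters(), student.parameters()):
        tp.mul_(momentum).add_((1 - momentum) * sp)
```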

MambaVision

MambaVision is a recent model that combines self-attention mechanisms with mixer blocks to process visual data. It introduces a selective scan algorithm, which filters out irrelevant information, along with a hardware-aware algorithm that optimizes memory use and processing performance. Unlike DINOv2 in our setup, MambaVision relies on transfer learning to adapt pre-trained weights to a specific task, which can make it better suited to cases where fine-tuning on smaller datasets is necessary.

One of MambaVision's core strengths is its ability to model global contexts in images, thanks to its hierarchical structure and selective SSM (State-Space Model). The architecture includes a mix of self-attention paths and convolutional layers, creating a flexible model capable of handling complex visual tasks.
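The selective-scan idea behind the SSM blocks can be sketched as a simple sequential recurrence whose parameters depend on the input. This toy reference implementation illustrates the mechanism only; it is not MambaVision's optimized hardware-aware kernel, and all shapes and projections here are invented for the example.

```python
import torch

# Toy selective state-space scan: the recurrence parameters depend on the
# input at each step, so the model can "select" what to keep in its state.
torch.manual_seed(0)
T, d_state = 10, 4                      # sequence length, state size
x = torch.randn(T)                      # a 1-D input sequence

# Input-dependent step size (the "selective" part); B and C stand in for
# per-step input/output projections.
delta = torch.sigmoid(x)                # per-step gate derived from the input
B = torch.randn(T, d_state)             # input projection per step
C = torch.randn(T, d_state)             # output projection per step

h = torch.zeros(d_state)
ys = []
for t in range(T):
    # Large delta -> absorb the current input; small delta -> keep the state.
    h = (1 - delta[t]) * h + delta[t] * B[t] * x[t]
    ys.append(torch.dot(C[t], h))
y = torch.stack(ys)
print(y.shape)  # torch.Size([10])
```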

Results

The table below summarizes the performance of the models used in this project:

| Model       | Technique                      | Test Accuracy |
|-------------|--------------------------------|--------------:|
| DINOv2      | Hyperparameter tuning (Optuna) |         95.9% |
| MambaVision | Fine-tuning (DoRA)             |         83.8% |
| ResNet18    | Fine-tuning (DoRA)             |         77.8% |

The results clearly show that DINOv2 is the most effective model for solving ReCAPTCHA challenges, even without using transfer learning. Its ability to process unlabeled data and generate rich visual representations allows it to surpass MambaVision and ResNet18 in accuracy. While MambaVision showed solid performance, it required more fine-tuning and was more sensitive to low-quality data compared to DINOv2.
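The DoRA fine-tuning technique listed in the table decomposes a frozen pretrained weight into a magnitude and a direction, applies a LoRA-style low-rank update to the direction, and trains only the magnitude vector and the low-rank factors. The sketch below shows this on a toy linear layer; it is an illustration of the idea, not the training code used in this repository.

```python
import torch

# DoRA (weight-decomposed low-rank adaptation) sketch on a toy weight matrix.
torch.manual_seed(0)
d_out, d_in, rank = 6, 5, 2
W0 = torch.randn(d_out, d_in)                   # frozen pretrained weight

m = W0.norm(dim=1, keepdim=True).clone()        # trainable magnitude (per row)
A = torch.randn(rank, d_in) * 0.01              # trainable low-rank factor
B = torch.zeros(d_out, rank)                    # zero-init: adapter starts at W0

def dora_weight(W0, m, B, A):
    V = W0 + B @ A                              # low-rank directional update
    return m * V / V.norm(dim=1, keepdim=True)  # renormalize, then rescale

W = dora_weight(W0, m, B, A)
# With B zero-initialized, the adapted weight equals the pretrained one,
# so fine-tuning starts exactly from the pretrained model.
print(torch.allclose(W, W0, atol=1e-5))  # True
```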

Prerequisites

Before using this project, ensure you have installed the following libraries and dependencies:

| Library       | Version        |
|---------------|----------------|
| Python        | 3.5.5 or later |
| torch         | 2.1.2 or later |
| torchvision   | 0.15.0         |
| causal-conv1d | 1.4.0          |
| mamba-ssm     | 2.2.2          |
| timm          | 0.9.2 or later |
| tensorboardX  | 2.6 or later   |
| einops        | 0.6.1 or later |
| transformers  | 4.42.3         |
| torchmetrics  | 0.10.3         |
| kornia        | 0.7.3          |
| matplotlib    | 3.7.2          |
| numpy         | 1.23.5         |
| pandas        | 2.1.1          |
| seaborn       | 0.13.0         |
| h5py          | 3.10.0         |
| librosa       | 0.10.2         |

Quick start

  1. Clone the repository:
    git clone https://github.com/Itamar-Horowitz/deep-recaptcha-solver.git
    
  2. Navigate to the models folder:
    cd deep-recaptcha-solver/models
    
  3. Run the desired model (example):
    jupyter notebook DINOv2Model.ipynb
    
