ReCAPTCHA V2 solver using DINOv2 and MambaVision

Abstract

Automated attacks make it increasingly important to safeguard online platforms. ReCAPTCHA v2, a widely used security tool from Google, distinguishes human users from bots and so protects websites from various malicious activities. To assess the effectiveness of ReCAPTCHA v2 and identify potential vulnerabilities, this project investigates advanced machine learning techniques for solving ReCAPTCHA v2 image challenges. We compare DINOv2, optimized through hyperparameter tuning, with MambaVision, adapted through transfer learning. In our experiments, DINOv2 outperforms both MambaVision and a ResNet18 baseline, demonstrating its ability to produce robust, high-performance visual features for computer vision tasks without relying on transfer learning.

Dataset

The dataset was sourced from Kaggle and contains 40,022 images in 11 classes relevant to ReCAPTCHA challenges. We split it into training, validation, and test sets, allocating 70% for training, 15% for validation, and 15% for testing. Images were normalized so that pixel values share a consistent scale across the dataset, and resized to 98x98 pixels to match the input requirements of our models.
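One plausible reading of the 98x98 choice, offered as an assumption rather than something the project states, is that the DINOv2 Vision Transformer backbones split the input into 14x14-pixel patches, so 98 pixels per side yields a whole 7x7 patch grid. The sketch below checks that arithmetic and shows the usual per-channel standardization, which shifts and rescales each channel. The ImageNet mean and standard deviation are common defaults and are not confirmed by this project; `patch_grid` and `standardize` are hypothetical helper names.

```python
# Illustrative helpers; names and constants are assumptions, not project code.
IMAGENET_MEAN = (0.485, 0.456, 0.406)  # common default, assumed here
IMAGENET_STD = (0.229, 0.224, 0.225)   # common default, assumed here


def patch_grid(image_size: int, patch_size: int = 14) -> int:
    """Return the number of patches per side; raise if the size does not divide evenly."""
    if image_size % patch_size != 0:
        raise ValueError(f"{image_size} is not a multiple of {patch_size}")
    return image_size // patch_size


def standardize(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Scale an 8-bit RGB pixel to [0, 1], then subtract the mean and divide by the std per channel."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std))


side = patch_grid(98)
print(side, side * side)             # patches per side, patches per image
print(standardize((124, 116, 104)))  # a pixel close to the assumed channel means
```

In a torchvision pipeline the same two steps would typically be a resize followed by a per-channel normalize transform; the exact transforms used in the notebooks are not shown in this README.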

Below is an overview of our dataset distribution:

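| Class | Train | Validation | Test | Total |
|---|---|---|---|---|
| Bicycle | 1705 | 365 | 366 | 2436 |
| Bridge | 1267 | 271 | 273 | 1811 |
| Bus | 5862 | 1256 | 1257 | 8375 |
| Car | 6776 | 1452 | 1453 | 9681 |
| Chimney | 272 | 58 | 59 | 389 |
| Crosswalk | 2307 | 494 | 495 | 3296 |
| Hydrant | 4872 | 1044 | 1045 | 6961 |
| Motorcycle | 193 | 41 | 43 | 277 |
| Palm | 1791 | 383 | 385 | 2559 |
| Stair | 450 | 96 | 98 | 644 |
| Traffic Light | 2515 | 538 | 540 | 3593 |
| Total | 28010 | 5998 | 6014 | 40022 |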
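The per-class counts in the table are consistent with a stratified 70/15/15 split in which the training and validation sizes are rounded down and the test set takes the remainder. That rule is inferred from the numbers, not stated by the project, and `split_sizes` is a hypothetical helper. A minimal sketch:

```python
# Class totals copied from the "Total" column of the table above.
CLASS_TOTALS = {
    "Bicycle": 2436, "Bridge": 1811, "Bus": 8375, "Car": 9681,
    "Chimney": 389, "Crosswalk": 3296, "Hydrant": 6961, "Motorcycle": 277,
    "Palm": 2559, "Stair": 644, "Traffic Light": 3593,
}


def split_sizes(total: int, train_pct: int = 70, val_pct: int = 15):
    """Return (train, validation, test) sizes; test absorbs the rounding remainder."""
    train = total * train_pct // 100  # integer arithmetic avoids float rounding
    val = total * val_pct // 100
    return train, val, total - train - val


for name, total in CLASS_TOTALS.items():
    print(name, split_sizes(total))
```

Splitting each class separately keeps the class proportions the same in all three sets, which matters here because the smallest class (Motorcycle, 277 images) is far smaller than the largest (Car, 9681 images).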

DINOv2

DINOv2 is a self-supervised learning framework that builds on the original DINO (Distillation with No Labels) approach, leveraging Vision Transformers (ViTs) to extract meaningful features from images without requiring labeled data. Key ingredients inherited from DINO include multi-crop training, in which the model sees crops of the same image at several resolutions, and a momentum encoder (a teacher network updated as a moving average of the student), which keeps representations stable across training iterations. With these methods, DINOv2 can outperform traditional convolutional networks and supervised transformers in tasks such as image classification, object detection, and segmentation.
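The momentum encoder idea can be illustrated with a small sketch: the teacher's weights are not trained directly but follow the student's weights as an exponential moving average. The code below uses plain Python lists in place of tensors, `ema_update` is a hypothetical name, and the momentum value 0.996 is the starting value reported for DINO, assumed here rather than taken from this project.

```python
def ema_update(teacher, student, momentum=0.996):
    """Move each teacher weight a small step toward the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Toy weights: the teacher drifts slowly toward the student over repeated updates.
teacher = [0.0, 0.0]
student = [1.0, 2.0]
for step in range(3):
    teacher = ema_update(teacher, student)
    print(step, teacher)
```

A momentum close to 1 makes the teacher change slowly, which is what gives the student a stable target.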

Rather than contrasting explicit positive and negative pairs, DINOv2 trains a student network to match the output distribution of the teacher on different augmented views of the same image. This self-distillation process pulls views of the same image together in feature space, which helps the model separate similar from dissimilar images and categorize them accurately. The resulting rich visual representations make DINOv2 highly effective for complex vision tasks.
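As a rough illustration of the self-distillation objective, the sketch below computes a DINO-style loss for one pair of views: the cross-entropy between a centered, sharpened teacher distribution and the student distribution. It is a simplified single-sample version in plain Python, not the project's training code. The temperatures (0.04 for the teacher, 0.1 for the student) follow values reported for DINO and are assumptions here, as are the function names.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def self_distillation_loss(student_logits, teacher_logits, center,
                           student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher distribution (target) and the student distribution."""
    # Centering and a low temperature on the teacher help avoid collapsed outputs.
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


center = [0.0, 0.0, 0.0]
agree = self_distillation_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], center)
disagree = self_distillation_loss([0.0, 2.0, 0.0], [2.0, 0.0, 0.0], center)
print(agree, disagree)  # the loss is lower when student and teacher agree
```

In actual training the teacher output is treated as a fixed target (no gradient flows through it), and the loss is averaged over many pairs of crops.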

MambaVision

MambaVision is a hybrid vision backbone that combines Mamba-based mixer blocks with self-attention to process visual data. It builds on Mamba's selective scan algorithm, which filters out irrelevant information, along with a hardware-aware algorithm that optimizes memory use and processing performance. In this project, unlike DINOv2, MambaVision is adapted through transfer learning, fine-tuning a pre-trained model for the specific task, which can make it more suitable when only a smaller dataset is available.

One of MambaVision's core strengths is its ability to model global context in images, thanks to its hierarchical structure and its selective state-space model (SSM). The architecture mixes self-attention paths with convolutional layers, creating a flexible model capable of handling complex visual tasks.
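To give a feel for what "selective" means in a selective SSM, the toy sketch below runs a one-dimensional state-space recurrence in which the step size and the input and output projections depend on the current input, so the model can decide per element how much to remember or forget. This is a deliberately simplified scalar illustration with made-up parameter names (`w_delta`, `w_b`, `w_c`); it is not MambaVision's implementation, which uses learned projections, multi-dimensional states, and a parallel hardware-aware scan.

```python
import math


def softplus(x):
    """Numerically stable softplus; always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def selective_scan(xs, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Sequential scan: h_t = a_bar_t * h_{t-1} + b_bar_t * x_t, y_t = c_t * h_t."""
    h = 0.0
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)  # input-dependent step size
        a_bar = math.exp(delta * a)    # decay factor, no larger than 1 when a < 0
        b_bar = delta * (w_b * x)      # input-dependent input projection
        h = a_bar * h + b_bar * x      # update the hidden state
        ys.append((w_c * x) * h)       # input-dependent output projection
    return ys


print(selective_scan([0.5, 1.0, -0.5, 2.0]))
```

Because each output depends only on the current and earlier inputs, the recurrence is causal; a larger step size makes the state forget its history faster and weight the current input more.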

Results

The table below summarizes the performance of the models used in this project:

| Model | Technique | Test Accuracy |
|---|---|---|
| DINOv2 | Hyperparameter tuning (Optuna) | 95.9% |
| MambaVision | Fine-tuning (DoRA) | 83.8% |
| ResNet18 | Fine-tuning (DoRA) | 77.8% |

DINOv2 is the most effective of the three models for solving ReCAPTCHA challenges, reaching 95.9% test accuracy without transfer learning, ahead of MambaVision (83.8%) and ResNet18 (77.8%). Its ability to learn from unlabeled data and generate rich visual representations allows it to surpass both models in accuracy. MambaVision performed solidly, but it required more fine-tuning and was more sensitive to low-quality data than DINOv2.
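The results table lists DoRA (weight-decomposed low-rank adaptation) as the fine-tuning technique for MambaVision and ResNet18. As a rough sketch of the idea, DoRA splits a pre-trained weight matrix into a per-column magnitude and a direction, applies a low-rank update `B @ A` to the direction, and rescales each column to a learned magnitude. The code below uses nested lists instead of tensors and assumes no column has zero norm; function names are hypothetical, and how the project configured DoRA (rank, target layers) is not stated in this README.

```python
import math


def column_norms(matrix):
    """Euclidean norm of each column."""
    return [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(len(matrix[0]))]


def matmul(left, right):
    """Plain matrix product of two nested lists."""
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*right)] for row in left]


def dora_weight(w0, lora_b, lora_a, magnitude):
    """Return magnitude * (W0 + B A) / ||W0 + B A||, with the norm taken per column."""
    delta = matmul(lora_b, lora_a)  # low-rank update
    directed = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)]
    norms = column_norms(directed)
    return [[m * v / n for v, m, n in zip(row, magnitude, norms)] for row in directed]


w0 = [[3.0, 0.0], [4.0, 2.0]]  # frozen pre-trained weight (toy values)
magnitude = column_norms(w0)   # magnitudes start at the original column norms
lora_b = [[0.0], [0.0]]        # B starts at zero, so training starts from W0
lora_a = [[1.0, 1.0]]
print(dora_weight(w0, lora_b, lora_a, magnitude))
```

Only the magnitude vector and the two small matrices are trained, which keeps the number of trainable parameters far below that of full fine-tuning.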

Prerequisites

Before using this project, ensure you have installed the following libraries and dependencies:

| Library | Version |
|---|---|
| Python | 3.5.5 or later |
| torch | 2.1.2 or later |
| torchvision | 0.15.0 |
| causal-conv1d | 1.4.0 |
| mamba-ssm | 2.2.2 |
| timm | 0.9.2 or later |
| tensorboardX | 2.6 or later |
| einops | 0.6.1 or later |
| transformers | 4.42.3 |
| torchmetrics | 0.10.3 |
| kornia | 0.7.3 |
| matplotlib | 3.7.2 |
| numpy | 1.23.5 |
| pandas | 2.1.1 |
| seaborn | 0.13.0 |
| h5py | 3.10.0 |
| librosa | 0.10.2 |
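One way to check an environment against the "or later" entries in the table is a short script built on the standard library's `importlib.metadata`. This is a sketch, not part of the project: `parse_version` and `check_requirements` are hypothetical helpers, the version parsing is a simplification that ignores pre-release tags, and only a subset of the table is listed.

```python
import re
from importlib import metadata

# Minimum versions copied from the table above (the "or later" rows only).
REQUIREMENTS = {"torch": "2.1.2", "timm": "0.9.2", "tensorboardX": "2.6", "einops": "0.6.1"}


def parse_version(text):
    """Turn '2.1.2+cu121' into (2, 1, 2); missing parts are padded with zeros."""
    parts = [int(p) for p in re.findall(r"\d+", text.split("+")[0])[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def check_requirements(requirements):
    """Return a dict mapping each package to 'ok', 'missing', or an 'outdated' note."""
    report = {}
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if parse_version(installed) >= parse_version(minimum):
            report[name] = "ok"
        else:
            report[name] = f"outdated ({installed} < {minimum})"
    return report


for name, status in check_requirements(REQUIREMENTS).items():
    print(f"{name}: {status}")
```

For the rows pinned to an exact version, an equality check on the parsed tuples would be the natural variation.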

Quick start

Clone the repository:
```
git clone https://github.com/Itamar-Horowitz/deep-recaptcha-solver.git
```
Navigate to the models folder inside the cloned repository:
```
cd deep-recaptcha-solver/models
```
Run the desired model (example):
```
jupyter notebook DINOv2Model.ipynb
```

References

MambaVision: https://github.com/NVlabs/MambaVision
DINOv2: https://github.com/facebookresearch/dinov2
Cracking ReCAPTCHA with CNNs: https://medium.com/analytics-vidhya/cracking-recaptchas-with-cnns-and-transfer-learning-edc26ab675ec
