Malicious vs Benign website - CatBoost classification

Type vs number of special characters
Type vs. URL Length
Mutual Information gain per feature with respect to "Type" column
Correlation Matrix

Inspiration

Malicious websites are a serious concern because it is impractical to analyze them one by one and to index each URL in a blacklist. Unfortunately, there is a lack of datasets that describe both malicious and benign website characteristics.

What it does

Takes in a set of features relevant to identifying a website (such as URL length, number of special characters used, and DNS query time) and trains a CatBoost classifier that predicts whether the website is malicious or benign.
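To make the kind of input concrete, here is a minimal sketch of how two of the features mentioned above could be derived from a raw URL. It is an illustration only: the function name, the dictionary keys, and the definition of "special character" (anything that is not a letter or digit) are assumptions, not taken from the project's dataset or code.

```python
def url_features(url: str) -> dict:
    """Compute two simple lexical features for a URL.

    Assumption: a "special character" is any character that is
    neither a letter nor a digit. The dataset may define it differently.
    """
    special = sum(1 for ch in url if not ch.isalnum())
    return {"url_length": len(url), "number_special_characters": special}


features = url_features("http://example.com/login?id=42")
print(features)
```

Features such as DNS query time cannot be computed from the string alone and would have to come from a network measurement.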

How we built it

Malicious_Website_Recognition

Classifying malicious websites from benign ones using a CatBoost classifier.
1) The data was heavily imbalanced: about 88% of the samples belonged to the benign class (Type=0) and only 12% were malicious websites (Type=1).
2) The process involves data exploration, data cleaning, resampling (to handle the highly imbalanced data), model implementation, and evaluation.
3) The CatBoost classifier turned out to be the most robust model, giving acceptable values for precision, recall, and F1 score.
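The resampling step can be sketched with plain random oversampling of the minority class. This is an assumption for illustration: the project lists imblearn under "Built With" but does not say which resampler it used, and the helper name `random_oversample` is hypothetical.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample with replacement to fill the gap up to the majority size
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data mirroring the 88% / 12% split described above
rows = [[i] for i in range(100)]
labels = [0] * 88 + [1] * 12
X_res, y_res = random_oversample(rows, labels)
print(y_res.count(0), y_res.count(1))
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the original class distribution.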

Challenges we ran into

1) Since the data was heavily imbalanced, I had to use an efficient resampling method.
2) To remove high multicollinearity, I wrote a custom function that drops columns that are highly collinear with other columns.
3) To find the features on which the target value depended the most, I used Mutual Information gain.
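The project's custom multicollinearity function is not shown, so the following is only one plausible sketch of the idea. It assumes pairwise Pearson correlation as the collinearity measure, a hypothetical threshold of 0.9, and a "keep the first column of each correlated group" policy; the column names are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)


def drop_collinear(columns, threshold=0.9):
    """Keep a column only if |r| with every already-kept column is <= threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept


columns = {
    "url_length": [10, 20, 30, 40, 50],
    "url_length_x2": [20, 40, 60, 80, 100],  # perfectly correlated with url_length
    "dns_query_time": [5, 1, 4, 2, 3],
}
reduced = drop_collinear(columns)
print(list(reduced))
```

Because columns are visited in order, the result depends on column order; a variance inflation factor check would be an alternative that looks at all columns jointly.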

Accomplishments that we're proud of

1) Efficiently pre-processed the data (including handling the high class imbalance, dealing with multicollinearity, and handling missing values).
2) Built a model with a high F1 score and high recall (we focused on correctly detecting the minority class and on minimizing the False Negative Rate).
3) Identified the most relevant features for the target variable using a robust statistical metric, Mutual Information gain.
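For reference, the metrics named above relate to each other as in the sketch below. The labels and predictions are toy values invented for illustration, not results from the project, and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 and false negative rate for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fnr = fn / (tp + fn) if tp + fn else 0.0  # equals 1 - recall
    return {"precision": precision, "recall": recall, "f1": f1, "fnr": fnr}


# Toy example: 4 malicious (1) and 6 benign (0) sites
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Since the false negative rate is one minus recall, maximizing recall on the malicious class and minimizing missed malicious sites are the same objective.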

What we learned

1) Extensive hyper-parameter tuning
2) In-depth understanding of the CatBoost classification algorithm
3) In-depth understanding of Mutual Information gain
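As a reminder of what Mutual Information measures, here is a small sketch for discrete values, using I(X;Y) = sum over (x, y) of p(x,y) * log(p(x,y) / (p(x) p(y))). It is not the project's code: continuous features such as URL length would first need binning or an estimator such as scikit-learn's `mutual_info_classif`, and the toy columns are invented.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x,y) / (p(x) * p(y)) simplifies to count * n / (px[x] * py[y])
        mi += (count / n) * log(count * n / (px[x] * py[y]))
    return mi


target = [0, 0, 1, 1]
informative = ["a", "a", "b", "b"]    # determines the target exactly
uninformative = ["a", "b", "a", "b"]  # independent of the target

mi_informative = mutual_information(informative, target)
mi_uninformative = mutual_information(uninformative, target)
print(mi_informative, mi_uninformative)
```

A feature that fully determines a balanced binary target scores log 2 (about 0.693 nats), while an independent feature scores 0, which is why ranking features by this value highlights the ones the target depends on most.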

What's next for Malicious vs Benign website - CatBoost classification

Deployment:

Storing all the relevant functions in .pkl format, then creating a web interface and deploying the model on a hosting platform like Heroku or a cloud platform like AWS.

Containerizing the application to make it robust
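The .pkl storage step mentioned above could look roughly like this sketch. The `artifacts` dictionary is a hypothetical stand-in for whatever the project would actually persist (for example a fitted model and its preprocessing settings), and the file name is an assumption.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the objects that would be persisted
artifacts = {
    "feature_names": ["url_length", "number_special_characters"],
    "decision_threshold": 0.5,
}

path = os.path.join(tempfile.mkdtemp(), "model_artifacts.pkl")

# Serialize to disk
with open(path, "wb") as fh:
    pickle.dump(artifacts, fh)

# Load it back, as a web service would do at startup
with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == artifacts)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.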

Built With

catboost
csv
imblearn
matplotlib
numpy
pandas
pipeline
python
scikit-learn
seaborn

Updates

Rakshit Sinha started this project — Feb 20, 2022 03:31 AM EST
