Inspiration
Malicious websites are a serious concern because it is impractical to analyze them one by one and index each URL in a blacklist. Unfortunately, there is a lack of datasets with malicious and benign web characteristics.
What it does
Takes in a set of features relevant to identifying a website (such as URL length, number of special characters, DNS query time, etc.) and uses a CatBoost classifier to predict whether the website is malicious or benign.
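As a rough illustration of what such features can look like, here is a minimal sketch that derives a few lexical features from a URL string. The function name `extract_url_features` and the feature keys are assumptions made for this example, not the project's actual column names, and DNS query time is left out because it requires a live network lookup.

```python
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Compute a few simple lexical features from a URL string.

    Hypothetical helper: the feature names below are illustrative only.
    """
    parsed = urlparse(url)
    # Count every character that is neither a letter nor a digit.
    special_chars = sum(1 for ch in url if not ch.isalnum())
    return {
        "url_length": len(url),
        "number_special_characters": special_chars,
        "hostname_length": len(parsed.netloc),
        "uses_https": int(parsed.scheme == "https"),
    }


features = extract_url_features("https://example.com/login?user=a&id=1")
print(features)
```

Each URL becomes one row of numeric features, which is the tabular form a gradient-boosting classifier expects.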
How we built it
Malicious_Website_Recognition
Classifying malicious websites from benign ones using a CatBoost classifier.
1) The data was heavily imbalanced: 88% of the samples belonged to the benign class (Type=0) and only 12% were malicious websites (Type=1).
2) The process involved data exploration, data cleaning, resampling (to handle the heavy class imbalance), model implementation and evaluation.
3) The CatBoost classifier turned out to be the most robust model, giving satisfactory values for precision, recall and F1 score.
Challenges we ran into
1) Since the data was heavily imbalanced, I had to use an efficient resampling method.
2) To remove high multicollinearity, I wrote a custom function that drops columns that are highly collinear with other columns.
3) To find the features on which the target value depended most, I used mutual information gain.
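The last two points can be illustrated with a short sketch. The project's custom function is not shown in this write-up, so the pairwise-correlation filter below is an assumed stand-in (the names `drop_collinear_columns` and `rank_by_mutual_information` and the 0.9 threshold are hypothetical). The mutual information ranking uses scikit-learn's `mutual_info_classif` on toy data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def drop_collinear_columns(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of every pair whose |Pearson r| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


def rank_by_mutual_information(df: pd.DataFrame, target, random_state: int = 0) -> pd.Series:
    """Return features sorted by estimated mutual information with the target."""
    mi = mutual_info_classif(df.values, target, random_state=random_state)
    return pd.Series(mi, index=df.columns).sort_values(ascending=False)


# Toy data: "b" is almost a copy of "a", "c" is pure noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=500),
    "c": rng.normal(size=500),
})
target = (a > 0).astype(int)

reduced = drop_collinear_columns(toy)
ranking = rank_by_mutual_information(reduced, target)
print(list(reduced.columns))
print(ranking)
```

A correlation filter only catches pairwise relationships; a variance inflation factor check would be an alternative if a column is explained by several others jointly.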
Accomplishments that we're proud of
1) Efficiently pre-processed the data (including handling the heavy class imbalance, dealing with multicollinearity, handling missing values, etc.).
2) Built a model with a high F1 score and a high recall score (we focused on correctly detecting the minority class and on minimizing the false negative rate).
3) Identified the features most relevant to the target variable using a robust statistical metric, mutual information gain.
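To make the link between recall and the false negative rate concrete, here is a small sketch that derives the metrics from confusion-matrix counts. The counts are made-up illustrative numbers, not results from this project, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive precision, recall, F1 and false negative rate from raw counts.

    tn is accepted for completeness but not used by these four metrics.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # Share of truly malicious sites that the model missed.
    false_negative_rate = fn / (fn + tp) if (fn + tp) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_negative_rate": false_negative_rate,
    }


# Illustrative counts only (positive class = malicious).
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=140)
print(metrics)
```

Because the false negative rate equals 1 minus recall, raising recall on the malicious class is the same goal as lowering the number of malicious sites that slip through.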
What we learned
1) Extensive hyper-parameter tuning
2) In-depth understanding of the CatBoost classification algorithm
3) In-depth understanding of mutual information gain
What's next for Malicious vs Benign website - CatBoost classification
Deployment:
Storing all the relevant functions in .pkl format and then creating a web interface to deploy the model on a hosting platform like Heroku or a cloud platform like AWS.
Containerizing the application to make it robust
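A minimal sketch of the .pkl step, assuming Python's standard `pickle` module. The `artifact` dictionary is a placeholder for whatever gets shipped (the trained model, feature list, preprocessing functions); the helper names and file name are hypothetical.

```python
import os
import pickle
import tempfile


def save_artifact(obj, path: str) -> None:
    """Serialize obj to path in pickle format."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_artifact(path: str):
    """Load a pickled object; only use on files from a trusted source."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Placeholder bundle: in practice the trained model object would be stored too.
artifact = {
    "feature_names": ["url_length", "number_special_characters"],
    "decision_threshold": 0.5,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "model_bundle.pkl")
    save_artifact(artifact, bundle_path)
    restored = load_artifact(bundle_path)

print(restored)
```

A web service would call the load step once at startup and reuse the restored objects for every prediction request.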
Built With
- catboost
- csv
- imblearn
- matplotlib
- numpy
- pandas
- pipeline
- python
- scikit-learn
- seaborn