Machine Learning to Detect Phishing Websites

Inspiration

I have been studying various datasets in network and internet security. I found a dataset from Neda Abdelhamid based on associative classification data mining, which I thought would make a valuable AI plugin for 5G applications to help protect users. Safe computing and protection against phishing are clearly valuable, and associative classification has been shown to be an effective method.

What it does

This microservice takes the parameters of a website as input and determines whether the website is legitimate, suspicious, or probably a phishing website. It could be deployed on servers that route users based on searches, protecting those users by detecting phishing websites before they are reported.
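The input and output described above can be sketched roughly as follows. This is only an illustration: the -1/0/1 encoding of features and labels is an assumption based on how the UCI dataset is commonly described, and `placeholder_predict` is a hypothetical stand-in for the trained model, not the classifier used in this project.

```python
from typing import Callable, Sequence

# Assumed label encoding: 1 = legitimate, 0 = suspicious, -1 = phishing.
VERDICTS = {1: "legitimate", 0: "suspicious", -1: "phishing"}
ALLOWED_VALUES = {-1, 0, 1}  # assumed encoding of every website parameter


def classify(features: Sequence[int],
             predict: Callable[[Sequence[int]], int]) -> str:
    """Validate the raw parameters, then turn a predicted label into a verdict."""
    if any(value not in ALLOWED_VALUES for value in features):
        raise ValueError("each parameter must be -1, 0 or 1")
    return VERDICTS[predict(features)]


def placeholder_predict(features: Sequence[int]) -> int:
    """Hypothetical stand-in for a trained model: the sign of the feature sum."""
    total = sum(features)
    return (total > 0) - (total < 0)


if __name__ == "__main__":
    print(classify([1, 1, 0, 1, -1], placeholder_predict))
```

In a real deployment the `predict` argument would be the trained model's prediction function rather than the placeholder.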

I had been working on a deeper front end that examines a URL and delivers these URL-associated parameters automatically. For this entry, however, I have exposed the raw data input, in case chaining of AI classifications is eventually built into various Acumos classifiers.

How I built it

The project was built using the machine learning algorithms in scikit-learn, and the data was retrieved from http://archive.ics.uci.edu/ml/datasets/Website+Phishing#

It is important to note that this dataset has 1353 examples of websites, with a fairly even distribution of phishing and normal sites. I manipulated the data into a form that scikit-learn algorithms could accept, and trained a model on it.
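As a rough idea of what getting the data into shape can look like, here is a minimal sketch that assumes the dataset is distributed as an ARFF file with the class label in the last column. The sample text, the attribute names, and the `parse_arff` helper are all made up for illustration; they are not the real file contents or the code used in this project.

```python
from collections import Counter

# Tiny made-up sample in ARFF style; names and values are hypothetical.
SAMPLE = """\
% made-up sample, not real data
@relation phishing_sample
@attribute feature_a {-1,0,1}
@attribute feature_b {-1,0,1}
@attribute feature_c {-1,0,1}
@attribute Result {-1,0,1}
@data
1,-1,1,-1
0,1,1,1
-1,0,-1,0
1,1,1,1
"""


def parse_arff(text):
    """Split simple integer-only ARFF text into attribute names, rows X and labels y."""
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            rows.append([int(value) for value in line.split(",")])
        elif line.lower().startswith("@attribute"):
            names.append(line.split()[1])
        elif line.lower().startswith("@data"):
            in_data = True
    X = [row[:-1] for row in rows]  # every column except the last
    y = [row[-1] for row in rows]   # last column is assumed to be the class
    return names, X, y


if __name__ == "__main__":
    names, X, y = parse_arff(SAMPLE)
    print(names)
    print(Counter(y))  # quick check of how balanced the classes are
```

The resulting `X` and `y` lists have the rows-by-features and labels layout that scikit-learn estimators accept in `fit(X, y)`.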

Challenges I ran into

Originally, I created a model that could reroute IP and ad hoc networks with higher efficiency than is currently available. I spent a large amount of time on this project, and it was successful in reducing the load on mobile ad hoc networks. Unfortunately, Acumos was limited to input and output data, and my model relied on custom training data for the user (namely signal dead spots for the AI to map around), while Acumos seemed to want pretrained models as microservices. It was unfortunate, but I began to look into datasets that could be turned into pretrained models. I found a dataset for phishing websites, and thought that network security would be a cool application for an AI microservice.

My first approach was to use a wrapper on TensorFlow called tflearn. Tflearn is really amazing, as it allows users to create highly complex and advanced networks in fewer lines of code than if they were written in TensorFlow directly. I created numerous tflearn models that used LSTM layers and convolutional layers for things like sentiment analysis, natural language processing, and image recognition. Unfortunately, tflearn saves the graph.meta data from TensorFlow to a place where Acumos could not find it, and I came to the conclusion that tflearn was not compatible with Acumos at this time. Therefore, I had to scrap those projects for Acumos.

Then I realized that TensorFlow models did not seem to work either. Originally, I wrote pretrained models and tried to have the Acumos client find the .tf files and metadata files of the network, but that did not seem to be within the functionality of Acumos (I posted a note about the issue on Stack Overflow and confirmed this). It seems Acumos intended TensorFlow models to be trained and uploaded in the same program. However, I found that TensorFlow, upon training, relies on data that does not get onboarded correctly. I attempted to onboard TensorFlow models from macOS, Windows, and Linux, and was ultimately unsuccessful. The .tar file that would load into Docker to run the microservice container failed because it could not find the graph.meta of the TensorFlow model (probably because it got lost during onboarding). I hypothesized that it could also have to do with Windows and macOS/Linux directory paths not being compatible (forward slash vs. backslash).
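The path-separator hypothesis is easy to illustrate in isolation, although nothing here confirms that it was the actual cause of the failure. The sketch below uses the standard library's pathlib; the relative path `model\graph.meta` is a hypothetical example, not a path taken from Acumos or TensorFlow.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical relative path written with Windows-style backslashes.
windows_style = "model\\graph.meta"

# On Windows rules the backslash separates two path components.
as_windows = PureWindowsPath(windows_style)

# On POSIX rules (Linux, macOS, most containers) the backslash is just
# another character, so the whole string is one odd file name.
as_posix = PurePosixPath(windows_style)

# Converting explicitly gives a form both worlds can read.
portable = as_windows.as_posix()

if __name__ == "__main__":
    print(as_windows.parts)
    print(as_posix.parts)
    print(portable)
```

A path recorded on Windows and later resolved inside a Linux container would behave like `as_posix` here, which is one way a file such as graph.meta could appear to be missing.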

I decided to go with scikit-learn. It seemed to be the toolkit used by the examples already on the marketplace, so I figured it must work (I didn't find any indication that a model was ever successfully onboarded and deployed with TensorFlow, despite the examples provided). I had never used this tool before, but it was pretty easy to learn, and once I trained it on the car data, I was able to onboard it successfully.

However, here is where things got really tricky. Once the model was onboarded successfully, I downloaded the .tar file and ran it in Docker. It worked (something I had only dreamed about), but when I attempted to make HTTP POST requests to the microservice running locally on my machine, they failed. It turns out that the Python Acumos platform has a bug. Basically, the HTTP POST must send bytes of data to the microservice, not a string. The data is sent using hexadecimal encoding for each byte, and each data point sent to the service has an index. I was sending 5 data points, and an index byte written as '\x20' in Python is treated as its ASCII equivalent (a space character for that value), which is not what the proto3 was looking for. Python does allow hard-coding bytes like '\x20', but any time they get treated like a string (for appending, slicing, etc.) they are converted to their ASCII equivalents. I had to develop a workaround for this issue, which was challenging and took a lot of time.
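To make the byte-versus-string issue concrete, here is a small sketch. It assumes the "index" on each data point corresponds to a protobuf field tag, which is computed as `(field_number << 3) | wire_type` in the protobuf wire format. The `field_tag` helper, the URL, the content type, and the payload are hypothetical and are not the workaround actually used; the request is built but never sent.

```python
import urllib.request


def field_tag(field_number: int, wire_type: int) -> bytes:
    """Return the one-byte protobuf tag for field numbers 1 to 15."""
    if not 1 <= field_number <= 15:
        raise ValueError("this sketch only handles single-byte tags")
    if not 0 <= wire_type <= 5:
        raise ValueError("wire type must be between 0 and 5")
    return bytes([(field_number << 3) | wire_type])


# The hex escape and the printable character are the same byte value,
# so Python displays b"\x20" as b' ' even though nothing was changed.
space_tag = field_tag(4, 0)    # field 4, varint  -> 0x20
newline_tag = field_tag(1, 2)  # field 1, length-delimited -> 0x0a

# Building the payload from integers keeps it as bytes from start to end.
payload = space_tag + bytes([1])  # hypothetical: field 4 set to the value 1

# Hypothetical local endpoint and content type, shown only to illustrate
# passing bytes (not str) as the POST body.
request = urllib.request.Request(
    "http://localhost:3330/classify",
    data=payload,
    headers={"Content-Type": "application/octet-stream"},
    method="POST",
)

if __name__ == "__main__":
    print(repr(space_tag), repr(newline_tag))
    print(payload.hex())
```

Inspecting the payload with `.hex()` rather than `print` avoids the confusion of seeing a space or newline where a hex byte was expected.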

Ultimately, I look forward to when Acumos can clean up these bugs and allow for more complex projects, because I think the idea behind it is actually pretty cool, and they are almost there with it.

Accomplishments that I'm proud of

Cybersecurity and safe computing are very important to me. Many computers and internet-connected devices have been harmed by malware and spoofed sites. I hope to help the computing environment with intelligent agents that eliminate this evil. I am hopeful that this AI-based microservice will help focus attention on safe computing, and I am proud to be part of the effort to provide a safe environment.

What I learned

I learned how to use cloud-based Docker and run programs in containers. I read multiple papers on phishing and malware, and found an interesting associative classification approach to providing a safe computing environment, which I was able to turn into a cloud application. Finally, I learned how to develop microservices for Acumos, so I can potentially submit more services to improve our 5G computing environment.

What's next for Machine Learning to Detect Phishing Websites

Network security is a really interesting niche for AI technologies. I believe that the algorithm can be improved if more data is used, and I think that attempting different model types might be beneficial. I hope that my algorithms might eventually be used to keep everyone safe as they surf the web.

Updates

Andrew James started this project — Aug 05, 2018 08:49 AM EDT
