PhishingRod

Inspiration

We were heavily interested in the intersection of the math behind LLMs and their applications in cybersecurity. Using fine-tuned LLMs to quickly detect common phishing attacks felt like a unique approach to the problem, since most current solutions use traditional ML instead.

How we built it

We fine-tuned an LLM using LoRA (low-rank adaptation) to make it more suitable for detecting phishing attempts. By combining LoRA with an LLM, we wanted to focus on identifying the patterns used in email communication. We fine-tuned BERT on a dataset of 500k data points. LoRA freezes the pretrained weights and trains only small low-rank update matrices, which cuts the number of trainable parameters and keeps fine-tuning on such a large dataset efficient.
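
To make the low-rank idea concrete, here is a minimal sketch of the arithmetic behind LoRA in plain Python. The sizes are assumptions chosen for illustration (a BERT-base-like hidden size of 768, 12 layers, adapters on two projection matrices per layer, rank 8) and are not necessarily the configuration used in PhishingRod. The helper names are hypothetical.

```python
# Illustrative LoRA arithmetic in pure Python (no deep-learning libraries).
# All sizes below are assumptions, not the project's actual configuration.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the weight LoRA effectively applies.

    w: frozen pretrained weight, shape (d_out, d_in)
    a: trainable factor, shape (rank, d_in)
    b: trainable factor, shape (d_out, rank)
    """
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A and B."""
    return rank * d_in + d_out * rank

def full_params(d_in, d_out):
    """Parameters in the dense weight matrix W."""
    return d_in * d_out

HIDDEN = 768           # assumed hidden size (BERT-base-like)
LAYERS = 12            # assumed number of transformer layers
ADAPTED_PER_LAYER = 2  # assumed: two adapted projections per layer
RANK = 8               # assumed LoRA rank

matrices = LAYERS * ADAPTED_PER_LAYER
frozen = matrices * full_params(HIDDEN, HIDDEN)
trainable = matrices * lora_trainable_params(HIDDEN, HIDDEN, RANK)

print(f"frozen weights in adapted matrices: {frozen}")
print(f"trainable LoRA weights:             {trainable}")
print(f"trainable fraction:                 {trainable / frozen:.4f}")
```

Under these assumed sizes, the trainable factors hold 2 * RANK / HIDDEN of the weights in the matrices they adapt, roughly 2%. The savings come from the model side, not from simplifying the dataset.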

Challenges we ran into

We ran into multiple issues with imports and installing dependencies in Python. We also struggled to get LoRA to work with binary text classification. The dataset we originally used was also better suited to weight-based ML applications than to parsing text with tokenization. We also had trouble converting it to a WebDataset format. The final problem was that our dataset was too large to be completely trained within the

Accomplishments that we're proud of

Given the complexity of detecting phishing attempts on social media and the web, we are proud that our team approached the problem in a structured manner. We loaded the large dataset, cleaned the data, and fine-tuned the BERT model. Our team came up with the idea of integrating LoRA with a Hugging Face LLM (BERT) as a phishing detection method. This was the right strategy for such a large dataset because LoRA trains far fewer parameters than full fine-tuning. Cybersecurity was a top priority during the entire design process.

What we learned

In the PhishingRod project, our team learned how linear algebra and AI work together by fine-tuning LLMs with LoRA for phishing detection. Working with a large dataset and hard cybersecurity problems, we dealt with a lot of technical complexity and kept a careful, security-first approach. Each step reinforced the value of structured, innovative AI solutions in cybersecurity, and of working through them as a team.

What's next for PhishingRod

We have a lot in store for PhishingRod! First, we would finish training our model and do a quick deployment as a Chrome extension or a Streamlit app. In the future, we want to create a custom secure dataset and use a secured server, with the help of Sandia National Labs, to improve prevention and guard against malicious injection attacks. We would love to convert the dataset to the WebDataset format to make our training runs more efficient, especially considering the large amount of data we were working with. Another benefit of a secure cloud is that we could keep our dataset in a virtual machine or a container and make it a lot more secure. We would also love to make the code fully open source so that developers can create custom deployments and integrate the model into their own builds so we have more.
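
As a rough idea of what the WebDataset conversion involves: the format stores samples as files inside tar archives ("shards"), and files that share a basename belong to the same sample. The sketch below uses only the Python standard library rather than the `webdataset` package. The `.txt`/`.cls` extensions, the shard name, the label encoding (1 = phishing, 0 = legitimate), and the two example messages are all assumptions made for illustration.

```python
import io
import os
import tarfile
import tempfile

def write_shard(path, samples):
    """Write (text, label) pairs into one tar shard, two files per sample."""
    with tarfile.open(path, "w") as tar:
        for index, (text, label) in enumerate(samples):
            key = f"{index:06d}"  # shared basename ties the two files together
            for suffix, payload in ((".txt", text), (".cls", str(label))):
                data = payload.encode("utf-8")
                info = tarfile.TarInfo(name=key + suffix)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))

def read_shard(path):
    """Group tar members by basename and return (text, label) pairs in file order."""
    grouped = {}
    with tarfile.open(path, "r") as tar:
        for member in tar:
            key, _, ext = member.name.partition(".")
            content = tar.extractfile(member).read().decode("utf-8")
            grouped.setdefault(key, {})[ext] = content
    return [(fields["txt"], int(fields["cls"])) for fields in grouped.values()]

# Example usage with two made-up messages (1 = phishing, 0 = legitimate).
shard_path = os.path.join(tempfile.mkdtemp(), "phishing-000000.tar")
write_shard(shard_path, [("Verify your account now", 1), ("Lunch at noon?", 0)])
records = read_shard(shard_path)
print(records)
```

Because each shard is read sequentially, a large dataset can be streamed from disk or cloud storage without loading everything into memory first, which is the main reason the format helps with training throughput.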

Built With

bert
huggingface
llm
numpy
pandas
python
pytorch
sklearn

Updates

Aditya Gollamudi started this project — Jan 28, 2024 01:00 PM EST
