Inspiration

Vulnerabilities are a huge threat in the cybersecurity domain, leading to severe data breaches, system compromises, and financial loss. In particular, the programming languages C and C++ account for over 50% of software vulnerabilities worldwide. We saw this as a huge problem, so to tackle it and make software safer for everyone, we decided to build a fully integrated AI Vulnerability Code Analyzer.

What it does

Using a machine learning model to flag possible vulnerabilities and then filtering the results further through Gemini’s API, we provide a report on which files and lines in a repository’s code could contain a vulnerability. Everything is integrated through a six-stage pipeline from front-end to back-end, ultimately giving the user a seamless and informative experience.

How we built it

The project is a six-stage pipeline from front-end to back-end. The front-end is built with HTML/CSS and JavaScript; the landing page showcases an outdoor environment, a goose conga line, and geese on the left and right advocating for the product.

The front-end and back-end communicate through Python’s FastAPI. To start a scan, the user enters a GitHub repository URL, which the front-end sanitizes. Once sanitization is done, the URL is passed to the back-end and the front-end switches to a loading screen. While loading, the front-end repeatedly polls a status file that the back-end updates dynamically. The status file describes what the back-end is currently doing, such as scanning snippets or chunks, giving the user a view into how the analysis is progressing.
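The status-file handshake can be sketched in a few lines. This is a minimal, stdlib-only illustration of the idea, not our actual code: the file name and JSON fields (`stage`, `detail`) are assumptions, and in the real app the front-end reads the file over HTTP rather than from disk.

```python
import json
from pathlib import Path

STATUS_FILE = Path("status.json")  # hypothetical name; the real file may differ

def update_status(stage: str, detail: str) -> None:
    """Back-end side: overwrite the status file with the current pipeline stage."""
    STATUS_FILE.write_text(json.dumps({"stage": stage, "detail": detail}))

def poll_status() -> dict:
    """Front-end side: read the latest status (the real app does an HTTP GET)."""
    return json.loads(STATUS_FILE.read_text())

# The back-end reports progress as it works through the pipeline;
# the loading screen polls this on an interval and renders `detail`.
update_status("scanning", "analyzing snippet 12 of 40")
status = poll_status()
print(status["stage"], "-", status["detail"])
```

Because the status is a plain JSON file, the back-end can update it from anywhere in the pipeline without the front-end holding an open connection.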

Within the back-end, the repository’s code is scanned by a machine learning model that flags possible vulnerabilities. This filtered set of candidates is then sent to Gemini’s API, which decides which of them are “actual” vulnerabilities. Once the API call finishes, the results are dumped into a JSON file for the front-end to read. At this point the status file is updated to “complete” and the front-end issues a GET request for the JSON file.
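The model-then-LLM funnel described above can be sketched as two stages feeding a JSON report. Both stage functions here are stand-ins with hypothetical names: `model_scan` uses a toy heuristic (flagging `strcpy` calls) in place of our trained model, and `gemini_filter` confirms every candidate instead of actually calling Gemini, so only the data flow is real.

```python
import json

def model_scan(files: dict) -> list:
    """Stage 1 stand-in: flag candidate vulnerable lines per file.
    A toy heuristic replaces the real ML model for illustration."""
    candidates = []
    for path, source in files.items():
        for lineno, line in enumerate(source.splitlines(), start=1):
            if "strcpy" in line:  # classic unbounded-copy red flag
                candidates.append({"file": path, "line": lineno,
                                   "snippet": line.strip()})
    return candidates

def gemini_filter(candidates: list) -> list:
    """Stage 2 stand-in: keep only findings the LLM confirms.
    The real system sends each candidate with context to Gemini's API;
    here every candidate is marked confirmed so the shape is visible."""
    return [dict(c, confirmed=True, category="buffer overflow")
            for c in candidates]

files = {"main.c": "int main(int argc, char **argv) {\n"
                   "  char buf[8];\n"
                   "  strcpy(buf, argv[1]);\n"
                   "}"}
report = gemini_filter(model_scan(files))
with open("report.json", "w") as fh:  # the front-end later GETs this file
    json.dump(report, fh, indent=2)
```

The model acts as a cheap pre-filter, so only a small set of suspicious snippets ever reaches the (slower, rate-limited) Gemini call.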

The front-end reads this JSON file and displays it to the user as the vulnerability report, allowing further analysis. A back button is included for convenience so the user can start a new scan and repeat the process.

Challenges we ran into

When providing context windows, we ran into a chunking problem: the window size was too small and chunk boundaries did not align with function boundaries, which made the code harder to read and lowered analysis accuracy. When training the machine learning model, we were also getting low accuracy numbers, and the data split was skewed by a large bias toward code with vulnerabilities. Additionally, integrating the front-end, back-end, and Vultr virtual machine caused multiple issues because the API endpoints were not communicating correctly and the pipeline was not running smoothly.
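The fix for the chunk-alignment problem is to only cut between top-level definitions. This is a simplified sketch under assumed names, not our exact chunker: it tracks brace depth in C/C++ source and flushes a chunk only when depth returns to zero, ignoring braces inside strings and comments, which a real implementation would have to handle.

```python
def chunk_by_function(source: str, max_lines: int = 40) -> list:
    """Split C/C++ source into chunks that never cut a function in half.
    A chunk is only flushed once brace depth returns to zero, so every
    chunk contains whole top-level definitions."""
    chunks, current, depth = [], [], 0
    for line in source.splitlines():
        current.append(line)
        depth += line.count("{") - line.count("}")
        # Flush only at a function boundary, once the chunk is big enough.
        if depth == 0 and len(current) >= max_lines:
            chunks.append("\n".join(current))
            current = []
    if current:  # trailing partial chunk
        chunks.append("\n".join(current))
    return chunks

src = "void f() {\n  g();\n}\nvoid h() {\n  i();\n}"
chunks = chunk_by_function(src, max_lines=3)
```

With `max_lines=3`, the six-line sample splits into two chunks, each holding one complete function, so the model never sees a function sliced mid-body.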

Accomplishments that we're proud of

We are proud of how the model successfully narrows the search window so the Gemini API isn’t strained by too much context. Furthermore, we love how the website looks! We never expected to fit all these features into the time frame we had, but through tenacity we got them done. One of the neatest features is how detailed the code analysis is: for each finding, the user gets the highlighted code, the type of vulnerability, and related official NVD identifiers, which we believe is very helpful for verifying and fixing potential vulnerabilities in a given repository. Combine this with the PDF download feature, and the user can share the report with anybody!

What we learned

Throughout this process, we learned about development, AI, and much more. We gained a deeper understanding of machine learning model development, front-end and back-end integration, and VM communication. We all believe these are skills we can build on to create future projects that have a positive impact on society.

What's next for HarnoldsEye

In the future, we would like HarnoldsEye to include path and branch detection. While our code already performs a “mini-simulation,” we would like to add more varied and fleshed-out test cases so the user can see how a malicious actor could exploit the flaw. We would also like to continuously improve the machine learning and AI models, raising their accuracy and giving them memory to learn from previous scans. Finally, we want to support scans of private repositories, opening the product up for everybody to use!
