Inspiration
Inspired by the frequent supply chain compromises in Python packages, this tool was developed to proactively detect suspicious changes in code commits. Leveraging advanced transformer models and clustering techniques, the project aims to safeguard the integrity of open-source software and prevent security breaches before they occur.
What it does
The tool automatically collects GitHub commit data from Python repositories, then uses transformer-based NLP and clustering techniques to analyze commit messages and code diffs. It combines these results with metadata (such as author activity) and static detection rules to generate a risk score for each commit, flagging those that may indicate supply chain compromises. This enables organizations to identify and remediate suspicious changes early, protecting the integrity of their open-source software.
How we built it
Data Collection:
- This tool gathers commit data from a specified GitHub repository using the PyGithub library. It extracts commit messages, code diffs (file patches), and author information over a configurable time period.
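This collection step can be sketched as below, using PyGithub's `get_repo`/`get_commits` API. The function names (`build_record`, `collect_commits`) and the exact record fields are illustrative, not the tool's actual code:

```python
from datetime import datetime, timedelta, timezone

def build_record(sha, message, author, patches):
    """Normalize one commit into the dict the analysis stages consume."""
    return {
        "sha": sha,
        "message": message,
        "author": author or "unknown",
        "patches": [p for p in patches if p],  # drop files with no patch text (e.g. binaries)
    }

def collect_commits(repo_full_name, token, days=30):
    """Fetch commit messages, patches, and authors from the last `days` days."""
    from github import Github  # PyGithub; imported lazily so build_record stays usable offline

    repo = Github(token).get_repo(repo_full_name)
    since = datetime.now(timezone.utc) - timedelta(days=days)
    return [
        build_record(
            c.sha,
            c.commit.message,
            c.author.login if c.author else None,
            [f.patch for f in c.files],
        )
        for c in repo.get_commits(since=since)
    ]
```

Keeping the normalization in a separate pure function makes the downstream analysis independent of the GitHub API shape.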
Commit Message Analysis:
- Text Embeddings: Commit messages (augmented with file patches) are converted into dense vector representations using Sentence Transformers (all-MiniLM-L6-v2 model).
- Clustering: DBSCAN groups similar commit messages and helps flag outliers that might indicate anomalous behavior.
- Zero-Shot Classification: The tool uses the facebook/bart-large-mnli model to evaluate the commit text against candidate labels (suspicious, malicious, benign, normal), producing a risk score for each commit message.
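The steps above can be sketched as follows. In practice the embeddings come from all-MiniLM-L6-v2; here they are passed in as plain arrays, and the `eps`/`min_samples` values and function names are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def flag_outlier_messages(embeddings, eps=0.3, min_samples=3):
    """Cluster commit-message embeddings; DBSCAN labels noise points -1, flagged as outliers."""
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(
        np.asarray(embeddings)
    )
    return [i for i, label in enumerate(labels) if label == -1]

def message_risk_score(text):
    """Zero-shot risk score for one commit message (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily; requires the transformers package

    clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    result = clf(text, candidate_labels=["suspicious", "malicious", "benign", "normal"])
    scores = dict(zip(result["labels"], result["scores"]))
    return scores["suspicious"] + scores["malicious"]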
Code Diff Analysis:
- Embedding Generation: For each commit's code diff, CodeBERT is used to generate embeddings that capture the semantic meaning of the code changes.
- Anomaly Detection: An Isolation Forest identifies unusual patterns in these embeddings, while a set of regex-based rules detects risky code patterns (e.g., requests.post, eval, etc.).
- Zero-Shot Classification for Code: The tool also classifies code diffs as "safe," "suspicious," or "malicious" using a zero-shot approach.
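The rule-based and anomaly-detection parts of this stage can be sketched as below. The regex list and `contamination` value are illustrative, and real embeddings would come from CodeBERT rather than being passed in directly:

```python
import re
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative risky-call patterns; the tool's actual rule set may differ.
RISKY_PATTERNS = [r"requests\.post", r"\beval\s*\(", r"\bexec\s*\(", r"base64\.b64decode"]

def regex_risk(diff_text):
    """Count how many risky patterns appear in a code diff."""
    return sum(bool(re.search(p, diff_text)) for p in RISKY_PATTERNS)

def anomalous_diffs(embeddings, contamination=0.05):
    """Isolation Forest over diff embeddings; returns indices predicted anomalous (-1)."""
    preds = IsolationForest(contamination=contamination, random_state=0).fit_predict(
        np.asarray(embeddings)
    )
    return [i for i, p in enumerate(preds) if p == -1]
```

The two signals are complementary: the regex rules catch known-bad constructs even in otherwise ordinary diffs, while the Isolation Forest catches diffs that are unusual for the repository without matching any fixed pattern.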
Ensemble Integration:
- The outputs from the commit message analysis, code diff analysis, and metadata (derived from author activity) are combined using a weighted sum to compute a final risk score.
- This final score classifies each commit as "High Risk" or "Normal," flagging potential supply chain compromises.
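The ensemble step reduces to a weighted sum plus a threshold. The weights and cutoff below are illustrative placeholders, not the tool's tuned values:

```python
def final_risk_score(message_score, diff_score, metadata_score,
                     weights=(0.4, 0.4, 0.2)):
    """Combine the three per-commit scores (each in [0, 1]) into one risk score."""
    w_msg, w_diff, w_meta = weights
    return w_msg * message_score + w_diff * diff_score + w_meta * metadata_score

def classify_commit(score, threshold=0.7):
    """Map the final score onto the two output labels."""
    return "High Risk" if score >= threshold else "Normal"
```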
Deployment:
- The entire pipeline is written in Python, with an optional Docker container. A Dockerfile is provided so that the tool can run on both Linux and Windows (using Linux containers), ensuring a consistent environment for all users.
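A minimal Dockerfile along these lines would work; the base image, `requirements.txt`, and entrypoint script name are assumptions, not the project's actual files:

```dockerfile
# Assumed base image and file names; adjust to the project's actual layout.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENTRYPOINT ["python", "analyze.py"]
```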
Challenges we ran into
The biggest challenge was making the project work across platforms, especially on Windows, where Python dependency issues were common. To solve this, a Docker option was added. However, the Docker option currently impacts performance significantly if the container is not allocated enough resources.
Accomplishments that we're proud of
Building a multi-phase tool that leverages transformer models and clustering to flag suspicious commits in Python repositories. Overcoming cross-platform challenges by using Docker to ensure smooth performance on both Windows and Linux.
What we learned
We learned how to integrate modern NLP and clustering techniques to detect potential supply chain compromises, gained valuable experience containerizing the solution with Docker, and embraced the challenges of a first hackathon.
What's next for GitHub Supply Chain Analysis Tool
Future enhancements will focus on expanding support beyond Python to additional programming languages and integrating with Azure DevOps, ensuring broader compatibility and improved functionality across platforms.
Built With
- all-minilm-l6-v2
- codebert
- docker
- facebook/bart-large-mnli
- github-api
- huggingface
- python
- scikit-learn