This tool analyzes GitHub repositories for potential supply chain compromises using a multi-phase pipeline.
Note: The analysis is currently optimized for Python repositories only.
- Commit Data Collection:
  Retrieves commit data (including commit messages, file diffs, and author information) from a specified GitHub repository.
- Commit Message Analysis:
  Uses transformer models (e.g., facebook/bart-large-mnli, all-MiniLM-L6-v2) and clustering to score commit messages for suspicious content.
- Code Diff Analysis:
  Generates code embeddings with CodeBERT and applies anomaly detection (Isolation Forest), rule-based risky pattern matching, and zero-shot classification to analyze code changes.
- Ensemble Integration:
  Combines commit message risk, code diff risk, and metadata (author-based risk) into a final risk score for each commit.
- Dockerized Deployment:
  The tool is containerized using Docker for easy deployment on Linux and Windows (using Linux containers). The Dockerfile pre-downloads required models during the build process to avoid repeated downloads at runtime.
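To make the rule-based pattern matching and the ensemble step more concrete, here is a minimal sketch. It is not the tool's actual implementation: the pattern set in `RISKY_PATTERNS`, the weights in `WEIGHTS`, and the function names `match_risky_patterns` and `ensemble_score` are all hypothetical, and the sketch assumes each component risk is already a number between 0 and 1.

```python
import re

# Hypothetical rule set; the patterns the tool actually checks may differ.
RISKY_PATTERNS = {
    "dynamic_exec": re.compile(r"\b(?:eval|exec)\s*\("),
    "shell_call": re.compile(r"\bos\.system\s*\(|\bsubprocess\.\w+\s*\("),
    "base64_decode": re.compile(r"\bbase64\.b64decode\s*\("),
}

# Assumed weights for illustration only.
WEIGHTS = {"message": 0.3, "diff": 0.5, "author": 0.2}


def match_risky_patterns(diff_text):
    """Return the sorted names of patterns found on added lines of a unified diff."""
    hits = set()
    for line in diff_text.splitlines():
        # Only inspect added lines; skip the "+++ b/file" header line.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in RISKY_PATTERNS.items():
            if pattern.search(line[1:]):
                hits.add(name)
    return sorted(hits)


def ensemble_score(message_risk, diff_risk, author_risk):
    """Combine three risks into one score in [0, 1] using a weighted average."""
    scores = {"message": message_risk, "diff": diff_risk, "author": author_risk}
    # Clamp each input to [0, 1] so one out-of-range value cannot dominate.
    total = sum(WEIGHTS[k] * min(max(v, 0.0), 1.0) for k, v in scores.items())
    return total / sum(WEIGHTS.values())
```

A weighted average keeps the final score on the same 0 to 1 scale as its inputs, which makes a threshold easy to reason about. Other combinations (for example taking the maximum of the three risks) would be equally valid choices.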
To run the tool locally:
- Clone the Repository:
  git clone https://github.com/mwilson877/repo-analysis
  cd repo-analysis
- Install Python Dependencies:
  pip install -r requirements.txt

To run the tool with Docker:
- Clone the Repository:
  git clone https://github.com/mwilson877/repo-analysis
  cd repo-analysis
- Build the Docker Image:
  docker build -t github-analysis-tool .
Run the main script with the required arguments:
  python github_analysis.py --repo <owner/repo> [--api-key YOUR_API_KEY] [-d, --days number] [-v, --verbose] [-w, --write] [-h, --help]
- --repo: (Required) GitHub repository in the format owner/repo (e.g., psf/requests).
- --api-key: (Optional) GitHub API key for increased rate limits.
- -d, --days: (Optional) Number of days to look back (default: 90).
- -v, --verbose: (Optional) Show full output (verbose mode).
- -w, --write: (Optional) Write JSON output files to disk.
- -h, --help: (Optional) Show help message and exit.
Example:
  python github_analysis.py --repo psf/requests --days 120 -v
To run with Docker, pass the same arguments to the container:
  docker run github-analysis-tool --repo psf/requests --days 120
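For readers who want to see how the documented flags fit together, here is a minimal `argparse` sketch that mirrors them. It is an illustration, not the parser in `github_analysis.py`: the function name `build_parser` and the help strings are assumptions, and only the flag names and the 90-day default come from the usage text above.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented command-line flags."""
    parser = argparse.ArgumentParser(
        prog="github_analysis.py",
        description="Analyze a GitHub repository for supply chain risk.",
    )
    parser.add_argument("--repo", required=True,
                        help="repository in owner/repo form, e.g. psf/requests")
    parser.add_argument("--api-key", default=None,
                        help="GitHub API key for increased rate limits")
    parser.add_argument("-d", "--days", type=int, default=90,
                        help="number of days to look back")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="show full output")
    parser.add_argument("-w", "--write", action="store_true",
                        help="write JSON output files to disk")
    # argparse adds -h/--help automatically.
    return parser


# Parse the same arguments as the example invocation.
args = build_parser().parse_args(["--repo", "psf/requests", "--days", "120", "-v"])
```

Note that `argparse` exposes `--api-key` as `args.api_key`, since hyphens in option names become underscores in attribute names.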
- Performance:
  Running transformer models on CPU can be slow. For better performance, consider using a GPU-enabled Docker setup. Alternatively, consider giving your Docker setup more CPU cores.
- API Rate Limits:
  If you run into GitHub API rate limits, use the --api-key argument with your GitHub API key. The primary rate limit is 60 requests per hour for unauthenticated requests and 5,000 requests per hour for authenticated users.
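The sketch below shows one way an API key and a look-back window could be attached to a GitHub commits request, and how the rate limit headers in a response could be read. It is an assumption about how such a client might look, not the tool's own code: the helper names `build_commits_request` and `seconds_until_reset` are hypothetical. The endpoint, the `since` parameter, the `Authorization: Bearer` header, and the `X-RateLimit-Remaining` / `X-RateLimit-Reset` headers are part of the GitHub REST API.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode
from urllib.request import Request


def build_commits_request(repo, days=90, api_key=None, now=None):
    """Build (but do not send) a request for commits from the last `days` days."""
    if now is None:
        now = datetime.now(timezone.utc)
    since = (now - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%SZ")
    query = urlencode({"since": since, "per_page": 100})
    url = f"https://api.github.com/repos/{repo}/commits?{query}"
    headers = {"Accept": "application/vnd.github+json"}
    if api_key:
        # Authenticated requests get the higher hourly limit.
        headers["Authorization"] = f"Bearer {api_key}"
    return Request(url, headers=headers)


def seconds_until_reset(response_headers, now_epoch):
    """Return how long to wait before retrying, or 0 if requests remain.

    `response_headers` is any mapping with a .get() method. A plain dict is
    case-sensitive, so this sketch expects the canonical header spelling.
    """
    remaining = int(response_headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0
    reset_epoch = int(response_headers.get("X-RateLimit-Reset", str(now_epoch)))
    return max(0, reset_epoch - now_epoch)
```

Sending the request (for example with `urllib.request.urlopen`) and paginating through results are left out to keep the sketch short.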