A Python web scraping tool that extracts comprehensive statistics from GitHub user profiles using Selenium WebDriver.
- 📊 Basic Profile Info: Extract name and username from GitHub profiles
- 🔗 Social Links: Scrape social media links from user profiles
- 📈 Repository Analysis: Get total repositories, stars, and commits
- 💻 Language Detection: Find programming languages used across repositories
- 👥 Social Stats: Followers, following count and ratio calculation
- 🤖 Headless Browsing: Automated Chrome browser in headless mode
- Python 3.6+
- Chrome browser installed
- ChromeDriver (managed by Selenium)
- Install required packages: `pip install selenium`
- Ensure Chrome is installed on your system
Run the script:
`python github_scraper.py`
Enter the GitHub username when prompted:
PASTE YOUR ACCOUNT'S URL : username
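
The prompt asks for a URL while the instructions ask for a username, so it may help to accept both. The following is a minimal sketch, not part of the original script: `profile_url` is a hypothetical helper, and the username rule it enforces (letters, digits, single hyphens, at most 39 characters) is an assumption about GitHub's naming policy.

```python
import re

# Assumed GitHub username rule: alphanumerics and single hyphens,
# no leading or trailing hyphen, at most 39 characters.
USERNAME_RE = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9]|-(?=[A-Za-z0-9])){0,38}$")


def profile_url(raw: str) -> str:
    """Turn a pasted username or profile URL into a canonical profile URL."""
    text = raw.strip()
    # Drop the scheme and host if a full URL was pasted.
    text = re.sub(r"^(?:https?://)?(?:www\.)?github\.com/", "", text, flags=re.IGNORECASE)
    # Keep only the first path segment, ignoring any query string or fragment.
    username = re.split(r"[/?#]", text.strip("/"))[0].lstrip("@")
    if not USERNAME_RE.match(username):
        raise ValueError(f"not a valid GitHub username: {raw!r}")
    return f"https://github.com/{username}"
```

Calling `profile_url(input("PASTE YOUR ACCOUNT'S URL : "))` would then reject malformed input before the browser is started.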
The script contains three main classes:
Handles basic profile information extraction.
Methods:
- `name()`: Extracts full name and username
- `socials()`: Gets social media links from profile
- `repo()`: Collects all repository URLs
Inherits from BasicInf0 and analyzes repository details.
Methods:
- `no_of_stars()`: Counts stars for current repository
- `no_of_commits()`: Extracts commit count using regex
- `no_of_languages()`: Identifies programming languages
- `all_repo_insider()`: Iterates through all repos for analysis
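
One detail of regex-based commit extraction is worth illustrating: a pattern like `r'\d+'` splits a string such as `1,250 Commits` into two separate numbers. The sketch below shows one way to treat the thousands separator as part of the number. `parse_count` is a hypothetical helper, not a method of the original classes, and it assumes the count is the first number in the element text.

```python
import re


def parse_count(text: str) -> int:
    """Return the first integer in `text`, treating commas as thousands separators."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return 0  # no digits at all, e.g. an empty repository
    return int(match.group().replace(",", ""))
```

The same helper could be reused for star counts, as long as the text is a plain number rather than an abbreviated one.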
Handles follower/following statistics and ratios.
Methods:
- `followers_and_follwing()`: Extracts follower/following counts
- `no_of_repos()`: Displays total repository count
- `follower_to_follwing_ratio()`: Calculates and prints ratio
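
The ratio step has two edge cases that a sketch can make concrete: counts that may be displayed in abbreviated form (for example `1.2k`, which is an assumption about how the profile page renders large numbers) and accounts that follow nobody. Both function names below are hypothetical and are not taken from the original script.

```python
from typing import Optional


def parse_abbreviated(text: str) -> int:
    """Convert counts such as '987', '1,024', '1.2k' or '3m' to integers."""
    cleaned = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    if cleaned and cleaned[-1] in multipliers:
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)


def follower_ratio(followers: int, following: int) -> Optional[float]:
    """Followers divided by following, or None when following is zero."""
    if following == 0:
        return None  # avoid ZeroDivisionError for accounts that follow nobody
    return round(followers / following, 2)
```

Returning `None` instead of raising lets the caller decide how to print the undefined case.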
| Data Type | CSS Selector Used | Description |
|---|---|---|
| Name | `span.p-name.vcard-fullname.d-block.overflow-hidden` | Full name |
| Username | `span.p-nickname.vcard-username.d-block` | GitHub username |
| Social Links | `li.vcard-detail.pt-1 > a` | Social media URLs |
| Repository Names | `h3.wb-break-all > a` | Repo names and links |
| Repository Descriptions | `div.col-10.col-lg-9.d-inline-block > div:nth-child(2)` | Repo descriptions |
| Stars | `a.Link.Link--muted > strong` | Star count per repo |
| Commits | `span.fgColor-default` | Commit count per repo |
| Languages | `span.color-fg-default.text-bold.mr-1` | Programming languages |
| Social Stats | `div.mb-3 > a > span` | Followers/following |
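
Keeping the selectors in one mapping makes them easier to update when GitHub changes its markup. The sketch below is one possible arrangement, not the original code: `SELECTORS` and `texts_for` are hypothetical names, and `driver` is assumed to be a Selenium WebDriver (anything exposing `find_elements(by, value)` works).

```python
# A subset of the selectors from the table above, keyed by data type.
SELECTORS = {
    "name": "span.p-name.vcard-fullname.d-block.overflow-hidden",
    "username": "span.p-nickname.vcard-username.d-block",
    "social_links": "li.vcard-detail.pt-1 > a",
    "languages": "span.color-fg-default.text-bold.mr-1",
}

# Same value as selenium.webdriver.common.by.By.CSS_SELECTOR, written out so
# this sketch has no import-time dependency on Selenium.
CSS_SELECTOR = "css selector"


def texts_for(driver, key: str) -> list:
    """Return the non-empty, stripped text of every element matching SELECTORS[key]."""
    elements = driver.find_elements(CSS_SELECTOR, SELECTORS[key])
    return [el.text.strip() for el in elements if el.text.strip()]
```

Because `find_elements` returns an empty list when nothing matches, a selector that no longer applies yields `[]` rather than an exception.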
John Doe @johndoe
Twitter : https://twitter.com/johndoe
LinkedIn : https://linkedin.com/in/johndoe
{'Python', 'JavaScript', 'HTML', 'CSS', 'Java'}
Total stars : 150
Total commits : 1250
5
1.75
- Initialize: Creates headless Chrome browser instance
- Get URL: Prompts user for GitHub username
- Basic Info: Scrapes name, username, and social links
- Repository Collection: Navigates to repositories tab and collects all repo URLs
- Repository Analysis:
  - Visits each repository individually
  - Extracts stars, commits, and languages
  - Accumulates totals across all repositories
- Social Analysis: Calculates follower-to-following ratio
- Display Results: Prints all collected statistics
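
The accumulation step in this workflow can be sketched independently of the browser. The names `Totals` and `accumulate` are hypothetical, and `analyze` stands in for whatever visits a repository page and returns its stars, commits, and languages.

```python
from dataclasses import dataclass, field


@dataclass
class Totals:
    stars: int = 0
    commits: int = 0
    languages: set = field(default_factory=set)


def accumulate(repo_urls, analyze) -> Totals:
    """Call `analyze(url)` for each repository and sum the results.

    `analyze` is an assumed callable returning (stars, commits, languages).
    """
    totals = Totals()
    for url in repo_urls:
        stars, commits, languages = analyze(url)
        totals.stars += stars
        totals.commits += commits
        totals.languages.update(languages)  # the set removes duplicates
    return totals
```

Separating the per-repository scrape from the summation also makes the totals easy to test with canned data.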
    op = webdriver.ChromeOptions()
    op.add_argument('headless')
    driver = webdriver.Chrome(options=op)

    # Collects repository URLs for later analysis
    self.to_store_repos.append(ele1.get_attribute('href'))

    # Extracts numbers from commit text
    numbers = re.findall(r'\d+', commits.text)
    commit_list = [int(nums) for nums in numbers]

    # Uses set to avoid duplicate languages
    self.total_lang = set()
    self.total_lang.add(a.text)

- No Error Handling: Original code lacks try/except blocks
- Fixed Selectors: CSS selectors may break if GitHub updates their layout
- No Rate Limiting: May overwhelm GitHub's servers with rapid requests
- Input Handling: Limited validation of user input
- Resource Management: Browser instance not properly closed
- Element Not Found: CSS selectors may fail on profiles with a different layout
- Dynamic Content: Some elements may not load immediately
- Private Repositories: Cannot access private repo data
- Rate Limiting: GitHub may block excessive requests
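
Several of the points above (missing error handling, no delay between requests, the browser not being closed) can be addressed with one small wrapper. This is a sketch under assumptions, not the original implementation: `visit_all` and `handle` are hypothetical names, the one-second default delay is arbitrary, and `driver` is assumed to expose Selenium's `get()` and `quit()`.

```python
import time


def visit_all(driver, urls, handle, delay_seconds=1.0):
    """Visit each URL in turn, pausing between requests, then quit the browser.

    Returns a dict mapping each URL to handle(driver), or None if that page failed.
    """
    results = {}
    try:
        for index, url in enumerate(urls):
            if index:
                time.sleep(delay_seconds)  # fixed pause between page loads
            try:
                driver.get(url)
                results[url] = handle(driver)
            except Exception as exc:  # broad on purpose: one bad page should not stop the run
                print(f"skipped {url}: {exc}")
                results[url] = None
    finally:
        driver.quit()  # always release the browser process
    return results
```

For elements that load late, Selenium's `WebDriverWait` with an expected condition is the usual tool and could be used inside `handle`.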
- The script navigates between different GitHub pages automatically
- Uses headless Chrome to avoid opening browser windows
- Processes all repositories sequentially, which may take a while for users with many repos
- Accumulates statistics across all public repositories
- Only accesses publicly available GitHub data
- Respects GitHub's public profile information
- Use responsibly and respect GitHub's terms of service
- Consider using GitHub's official API for production use
- Add error handling and exception management
- Implement delays between requests
- Add input validation
- Proper browser cleanup
- Export results to files
- Progress indicators for long operations
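
As an illustration of the export idea, the sketch below writes collected statistics to a JSON file. `export_stats` and the dictionary keys in the usage note are hypothetical. Sets such as the language set are converted to sorted lists because JSON has no set type.

```python
import json
from pathlib import Path


def export_stats(stats: dict, path: str) -> Path:
    """Write collected statistics to a JSON file and return its path."""
    serializable = {
        key: sorted(value) if isinstance(value, (set, frozenset)) else value
        for key, value in stats.items()
    }
    target = Path(path)
    target.write_text(
        json.dumps(serializable, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return target
```

A call such as `export_stats({"stars": 150, "languages": {"Python", "CSS"}}, "stats.json")` would produce a file that other tools can read back with `json.load`.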
- selenium
- re (built-in)
Make sure ChromeDriver is compatible with your installed Chrome version.