A data analytics application that analyzes GitHub issues from the Poetry open-source project to generate actionable insights for project maintainers and contributors.
This application provides three powerful analysis features:
- Label Resolution Time Analysis & Prediction (Feature 1) - Machine learning model that predicts issue resolution time based on labels and historical patterns
- Contributors Dashboard (Feature 2) - Comprehensive contributor behavior analysis with 7 interactive visualizations tracking engagement, lifecycle stages, and community health
- Priority & Complexity Prediction (Feature 3) - ML-based classification system that separates business urgency from technical complexity for intelligent issue triage
- data_loader.py: Loads GitHub issues from JSON data files into runtime data structures
- model.py: Implements data models and machine learning models for issue analysis
- config.py: Manages application configuration via the config.json file
- run.py: Main entry point that orchestrates feature execution based on command-line parameters
git clone https://github.com/akashsv01/project-application-template.git
cd project-application-template
Create a virtual environment, activate it, and install required packages:
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
- Download the poetry_issues_all.json data file from the course assignment
- Place it in the data/ directory
- Update config.json with the paths to your data and output files:
{
"ENPM611_PROJECT_DATA_PATH": "data/poetry_issues_all.json",
"ENPM611_PROJECT_OUTPUT_PATH": "output/"
}
Test your setup by running the example analysis:
python run.py --feature 0
This outputs basic information about the issues to the command line.
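As a rough sketch of how the two settings in config.json might be read at runtime, the snippet below loads the file and looks up a key with a fallback. The helper names `load_config` and `get_parameter` are hypothetical and chosen for illustration; the project's own config.py may work differently.

```python
import json
from pathlib import Path


def load_config(path="config.json"):
    """Read the JSON configuration file into a dict (empty if the file is missing)."""
    config_file = Path(path)
    if not config_file.exists():
        return {}
    with config_file.open(encoding="utf-8") as handle:
        return json.load(handle)


def get_parameter(config, name, default=None):
    """Return one setting, falling back to a default when the key is absent."""
    return config.get(name, default)
```

A caller could then resolve the data path with `get_parameter(load_config(), "ENPM611_PROJECT_DATA_PATH")`.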
To make the application easier to debug, VS Code run configurations are provided for each analysis. When you click the run button in the left-hand toolbar, you can choose to run one of the three analyses or the file you are currently viewing. These run configurations are specified in .vscode/launch.json, which you can modify.
The .vscode/settings.json file also customizes the VS Code user interface slightly to make navigation and debugging easier. This is a matter of preference; you can turn it off by removing the relevant settings.
Feature 1 predicts the approximate time needed to resolve open issues, using a machine learning model trained on closed issues. Several features derived from each issue were used to train the model. The analysis takes a label as input from the user.
Run the following command to run the Feature 1 analysis:
python run.py --feature 1 --label kind/bug
Overall Statistics:
• Total closed issues analyzed: 5033
• Unique labels found: 54
• Overall median resolution: 9.69 days
• Overall average resolution: 162.31 days
status/invalid - 0.04 days (n=22)
area/project/deps - 0.10 days (n=6)
kind/question - 0.20 days (n=263)
area/distribution - 0.21 days (n=1)
version/1.2.0 - 0.32 days (n=2)
status/duplicate - 0.41 days (n=318)
area/docs/faq - 1.13 days (n=29)
status/triage - 1.47 days (n=790)
status/external-issue - 1.56 days (n=143)
area/show - 4.44 days (n=1)
status/accepted - 788.46 days (n=3)
area/error-handling - 692.22 days (n=35)
area/ux - 503.40 days (n=32)
status/needs-consensus - 502.53 days (n=4)
area/publishing - 375.96 days (n=17)
status/wontfix - 358.98 days (n=10)
kind/enhancement - 323.88 days (n=30)
area/plugin-api - 323.20 days (n=8)
status/needs-reproduction - 290.73 days (n=59)
good first issue - 269.60 days (n=13)
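Per-label figures like those above, each paired with a sample count n, can be derived by grouping closed issues by label. The sketch below uses the median as the aggregate, which is an assumption (the application may report a different statistic), and the field names `created_at`, `closed_at`, and `labels` are illustrative rather than taken from the project's data model.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def label_median_days(issues):
    """Map each label to (median resolution in days, number of closed issues)."""
    durations = defaultdict(list)
    for issue in issues:
        if not issue.get("closed_at"):
            continue  # open issues have no resolution time yet
        created = datetime.fromisoformat(issue["created_at"])
        closed = datetime.fromisoformat(issue["closed_at"])
        days = (closed - created).total_seconds() / 86400
        for label in issue.get("labels", []):
            durations[label].append(days)
    return {
        label: (round(median(values), 2), len(values))
        for label, values in durations.items()
    }


# Hypothetical sample records, not real project data.
sample = [
    {"labels": ["kind/bug"], "created_at": "2024-01-01T00:00:00", "closed_at": "2024-01-03T00:00:00"},
    {"labels": ["kind/bug", "area/docs"], "created_at": "2024-01-01T00:00:00", "closed_at": "2024-01-05T00:00:00"},
    {"labels": ["area/docs"], "created_at": "2024-01-01T00:00:00", "closed_at": None},
]
stats = label_median_days(sample)
```

Labels with very small n, such as the n=1 entries above, are worth reading with caution because a single issue determines the whole statistic.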
- month: 0.341
- day_of_week: 0.304
- num_labels: 0.134
- has_feature_label: 0.083
- has_area_label: 0.068
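The five names above read as per-issue input features with their relative weights. A minimal sketch of how such features could be derived from one issue follows; the function name `extract_features` is hypothetical, and the label conventions it checks (`kind/feature` for feature requests, an `area/` prefix for area labels) are assumptions, not confirmed details of the model.

```python
from datetime import datetime


def extract_features(created_at, labels):
    """Build the five named features from one issue's creation time and labels."""
    created = datetime.fromisoformat(created_at)
    return {
        "month": created.month,            # 1-12
        "day_of_week": created.weekday(),  # 0 = Monday ... 6 = Sunday
        "num_labels": len(labels),
        # Assumed convention: feature requests carry a "kind/feature" label.
        "has_feature_label": int("kind/feature" in labels),
        # Assumed convention: area labels start with "area/".
        "has_area_label": int(any(label.startswith("area/") for label in labels)),
    }


features = extract_features("2024-03-15T10:30:00", ["kind/feature", "area/docs"])
```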
• Issue #9183: 0.8 days Labels: area/docs, status/triage
• Issue #9146: 4.4 days Labels: area/docs, status/triage
• Issue #7643: 21.5 days Labels: kind/bug, status/triage, area/windows
• Issue #7610: 21.5 days Labels: kind/bug, area/installer, status/triage
• Issue #9644: 25.7 days Labels: area/docs
Several graphs and analyses are generated from the predicted resolution times of the open issues.
• output/label_resolution_analysis.json - Complete analysis results
• output/label_statistics.json - Label-wise statistics
• output/open_issue_predictions.json - Predictions for open issues
• output/visualizations/ - All generated graphs
Feature 2 provides comprehensive contributor behavior analysis with 7 interactive visualizations that reveal engagement patterns, community health metrics, and temporal activity trends.
Run the analysis:
python run.py --feature 2
This feature analyzes contributor patterns across multiple dimensions to provide actionable insights for project maintainers and community managers.
What it shows: Yearly distribution of bug closures, comparing contributions from the top 5 bug fixers every year versus the broader community.
Why it matters:
- Identifies concentration of bug-fixing responsibility.
- Reveals potential bus factor risks (over-reliance on few contributors).
- Highlights years with strong community participation vs. maintainer-heavy periods.
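The yearly split between the top bug fixers and everyone else can be computed with a pair of counters. This is an illustrative sketch only: the function name and the `(year, closer)` input shape are assumptions, and the real analyzer may select bug issues and attribute closures differently.

```python
from collections import Counter, defaultdict


def top_fixers_vs_community(closures, top_n=5):
    """closures: iterable of (year, closer) pairs, one per closed bug issue.

    Returns {year: (closed by that year's top_n closers, closed by everyone else)}.
    """
    per_year = defaultdict(Counter)
    for year, closer in closures:
        per_year[year][closer] += 1
    result = {}
    for year, counts in per_year.items():
        top_total = sum(n for _, n in counts.most_common(top_n))
        result[year] = (top_total, sum(counts.values()) - top_total)
    return result
```

A year in which the first number dwarfs the second points to the bus factor risk described above.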
What it shows: Top 10 contributors ranked by number of feature requests, with stacked bars showing open vs. closed requests.
Why it matters:
- Highlights power users driving feature roadmap discussions.
- Shows which contributors' requests are being prioritized.
What it shows: Monthly counts of open vs. closed documentation issues (bar chart) with average number of unique commenters per doc issue (line overlay).
Why it matters:
- Documentation quality directly impacts project accessibility and adoption.
- High commenter counts indicate confusion or gaps in documentation.
- Growing open issues suggest documentation debt accumulation.
- Helps prioritize documentation sprints and improvements.
What it shows: Top 40 contributors ranked by total number of issues created.
Why it matters:
- Identifies active community contributors.
- Recognizes engaged users who are thoroughly testing and reporting.
What it shows: Interactive Plotly chart with yearly rankings of the top 10 most active contributors. Activity is the total number of issues a contributor created, closed, or commented on.
Why it matters:
- Highlights sustained engagement and contributor retention.
- Identifies emerging core contributors.
- Shows how the contributor base evolves as the project matures.
- Helps recognize long-term community members for maintainer roles.
What it shows: 2D heatmap showing contributor activity across days of week and hours of day, with color intensity representing activity volume.
Why it matters:
- Optimal timing for community events, release announcements, or live Q&A sessions.
- Understanding global contributor distribution (timezone patterns).
- Scheduling maintainer availability during high-activity periods.
- Planning automated processes during low-activity hours.
Sample CLI Output:
=== Overall Busiest Hours (across all days) ===
Hour 15: 7.19% (average share of a day's activity)
Hour 16: 6.62% (average share of a day's activity)
Hour 18: 5.90% (average share of a day's activity)
Hour 14: 5.83% (average share of a day's activity)
Hour 17: 5.80% (average share of a day's activity)
=== Top 3 Busy Hours Per Day ===
Mon:
  Hour 16 → 7.06% of Mon's activity
  Hour 15 → 7.01% of Mon's activity
  Hour 14 → 6.73% of Mon's activity
  Total (Top 3) → 20.80% of Mon's activity
...
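Percentages like "Hour 16 is 7.06% of Mon's activity" come from normalising hourly counts within each weekday. The sketch below shows one way to do that from a list of ISO timestamps; the function names are hypothetical, and it ignores timezone handling, which the real heatmap may treat differently.

```python
from collections import Counter, defaultdict
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def hourly_shares(timestamps):
    """Return {day: {hour: percent of that day's activity}} from ISO timestamps."""
    counts = defaultdict(Counter)
    for ts in timestamps:
        moment = datetime.fromisoformat(ts)
        counts[DAYS[moment.weekday()]][moment.hour] += 1
    return {
        day: {hour: 100 * n / sum(hours.values()) for hour, n in hours.items()}
        for day, hours in counts.items()
    }


def top_hours(shares, day, k=3):
    """Return the k busiest (hour, percent) pairs for one day, busiest first."""
    return sorted(shares[day].items(), key=lambda item: item[1], reverse=True)[:k]
```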
What it shows: Bar chart classifying contributors into four lifecycle stages:
- Newcomer: First activity within last 30 days
- 🧠 Core Maintainer: Sustained engagement for over 1 year
- 🌤️ Graduated Contributor: Inactive for 6+ months
- ⚡ Active: Regular contributors
Why it matters:
- High-level view of community health and contributor pipeline.
- Identifies retention issues if many contributors are "graduating".
- Shows whether the project is attracting new contributors.
- Helps plan mentorship programs for newcomers.
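The four lifecycle stages can be expressed as a small rule-based classifier. In this sketch the order in which the rules are checked, the 180-day reading of "6+ months", and the use of the span between first and last activity as a proxy for "sustained engagement" are all assumptions; the actual dashboard logic may differ.

```python
from datetime import datetime, timedelta


def lifecycle_stage(first_activity, last_activity, now):
    """Classify one contributor using the thresholds from the stage definitions."""
    if now - last_activity > timedelta(days=180):  # inactive for 6+ months
        return "Graduated Contributor"
    if now - first_activity <= timedelta(days=30):  # first seen within 30 days
        return "Newcomer"
    if last_activity - first_activity > timedelta(days=365):  # active for over a year
        return "Core Maintainer"
    return "Active"
```

For example, a contributor first seen on 2022-01-01 and last seen a month before `now` would be classified as a Core Maintainer under these rules.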
Feature 3 uses machine learning to predict both the priority and complexity of open issues. Unlike simple time-based predictions, this feature separates business urgency from technical complexity, providing actionable insights for project maintainers.
Run the following command to run the Feature 3 analysis:
python run.py --feature 3
Key Capabilities:
• Priority Classification: Categorizes issues as Critical/High/Medium/Low based on:
- Labels (bug, critical, security)
- Community engagement (comments, participants)
- Maintainer response time
- Historical resolution patterns
• Complexity Scoring: Calculates technical complexity (0-100) based on:
- Code depth and length
- Technical indicators (stack traces, code blocks)
- Multiple component involvement
- Technical scope (architecture, refactoring, performance)
• Independent Metrics: Priority and complexity are calculated separately, allowing identification of:
- 🔴 High Priority, Low Complexity: Simple urgent bugs
- 🟡 Low Priority, High Complexity: Technical refactors
- 🔵 High Priority, High Complexity: Critical architectural issues
- 🟢 Low Priority, Low Complexity: Minor fixes
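Because the two metrics are independent, each issue falls into one of four quadrants. The sketch below maps a predicted priority and a complexity score to a quadrant; treating Critical and High as "high priority" and using 50 as the complexity cut-off are assumptions made for illustration, not values taken from the application.

```python
def triage_quadrant(priority, complexity, complexity_threshold=50):
    """Place an issue in one of the four priority/complexity quadrants."""
    high_priority = priority in {"Critical", "High"}  # assumed grouping
    high_complexity = complexity >= complexity_threshold  # assumed cut-off
    quadrants = {
        (True, False): "Simple urgent bug",
        (False, True): "Technical refactor",
        (True, True): "Critical architectural issue",
        (False, False): "Minor fix",
    }
    return quadrants[(high_priority, high_complexity)]
```

Under these assumptions, a Low-priority issue scored 75/100 lands in the "Technical refactor" quadrant, while a Low-priority issue scored 5/100 is a "Minor fix".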
Training Data:
- Total closed issues analyzed: 5,256
- Valid training samples: 5,256
- Open issues predicted: 317
Resolution Time Statistics:
- Median: 13.8 days
- Mean: 210.2 days
- 75th percentile: 261.2 days
- 95th percentile: 1003.3 days
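The gap between the median (13.8 days) and the mean (210.2 days) indicates a strongly right-skewed distribution: a small number of very long-lived issues pull the mean up. The sketch below computes the same four summary statistics with the standard library; the choice of the "inclusive" interpolation method is an assumption, and the application may compute percentiles differently.

```python
from statistics import mean, median, quantiles


def resolution_summary(days):
    """Summarise resolution times (in days) as median, mean, and two percentiles."""
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1 to 99.
    cuts = quantiles(days, n=100, method="inclusive")
    return {
        "median": median(days),
        "mean": mean(days),
        "p75": cuts[74],
        "p95": cuts[94],
    }


# Hypothetical data: one long-lived issue dominates the mean.
summary = resolution_summary([1, 2, 3, 4, 100])
```

In this toy data set a single 100-day issue lifts the mean well above the median, which mirrors the pattern in the statistics above.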
Priority Distribution:
- Critical: ~5-8%
- High: ~15-20%
- Medium: ~35-40%
- Low: ~35-45%
Priority Classification:
- Overall Accuracy: 80%
- Top Feature: Number of comments (10.3% importance)
- Second Feature: Bug label (8.3% importance)
- Third Feature: Number of events (6.5% importance)
Top 5 Issues by Priority and Complexity:
- [Medium] #9780 - Complexity: 75/100
  - Unable to install PyTorch version 2.5.0 with CUDA 12.4
  - Confidence: 89.0%
  - Current activity: 1 comment
- [Medium] #9682 - Complexity: 75/100
  - Cannot install Monorepo deps without sourcecode for Dockerfile caching
  - Confidence: 65.0%
  - Current activity: 1 comment
- [Medium] #9634 - Complexity: 75/100
  - Poetry forgetting some dependencies (mix of extras, groups and version markers)
  - Confidence: 76.5%
  - Current activity: 6 comments
- [Low] #5138 - Complexity: 75/100
  - Poetry debugging with PyCharm not possible?
  - Confidence: 67.0%
  - Example of Low Priority but High Complexity
- [Low] #9161 - Complexity: 5/100
  - Add test coverage for tests/helpers.py
  - Confidence: 78.5%
  - Example of Low Priority and Low Complexity
• output/priority_predictions.json - Complete priority and complexity predictions for all 317 open issues
JSON Output Format:
{
"predicted_priority": "Medium",
"priority_confidence": 89.0,
"complexity_score": 75,
"number": 9780,
"title": "Issue title...",
"url": "https://github.com/...",
"labels": ["kind/bug", "status/triage"],
"num_comments": 1
}
- Triage Automation: Quickly identify which issues need immediate attention
- Resource Allocation: Match developers to issues based on complexity
- Sprint Planning: Balance high-priority items with complexity estimates
- Maintainer Insights: Understand which types of issues are most urgent vs. most complex
- Trend Analysis: Track how priority and complexity correlate over time
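For the triage and sprint-planning use cases, the prediction records can be sorted into a work queue. The sketch below assumes that output/priority_predictions.json holds a JSON array of records in the format shown earlier, which the source does not state explicitly; the function name `triage_order` and the sort order (priority first, then higher complexity first) are illustrative choices.

```python
import json

PRIORITY_ORDER = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}


def triage_order(predictions, min_confidence=0.0):
    """Sort prediction records by priority, then by descending complexity."""
    kept = [p for p in predictions if p["priority_confidence"] >= min_confidence]
    return sorted(
        kept,
        key=lambda p: (PRIORITY_ORDER[p["predicted_priority"]], -p["complexity_score"]),
    )


# Two records in the documented format (values taken from the sample output).
records = json.loads("""[
  {"number": 9161, "predicted_priority": "Low", "priority_confidence": 78.5, "complexity_score": 5},
  {"number": 9780, "predicted_priority": "Medium", "priority_confidence": 89.0, "complexity_score": 75}
]""")
ordered = [r["number"] for r in triage_order(records)]
```

Raising `min_confidence` drops low-confidence predictions from the queue before sorting.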
project-application-template/
├── analysis/
│   ├── contributors_analyzer.py        # Feature 2 analysis logic
│   └── priority_analyzer.py            # Feature 3 analysis logic
├── controllers/
│   ├── contributors_controller.py      # Feature 2 controller
│   ├── priority_controller.py          # Feature 3 controller
│   └── label_resolution_controller.py  # Feature 1 controller
├── visualization/
│   ├── visualizer.py                   # Feature 2 visualizations
│   └── label_resolution_visualizer.py  # Feature 1 visualizations
├── app/
│   └── feature_runner.py               # Main feature orchestrator
├── data/
│   └── poetry_issues_all.json          # Issue data
├── output/                             # Generated outputs
├── model.py                            # Data models & ML models
├── data_loader.py                      # Data loading utilities
├── config.py                           # Configuration management
├── run.py                              # Application entry point
└── requirements.txt                    # Python dependencies
- Feature 1: Label Resolution Time Analysis - Neel Patel
- Feature 2: Contributors Dashboard - Akash S Vora
- Feature 3: Priority & Complexity Prediction - Subiksha Jegadish