A production-ready Machine Learning pipeline and interactive Streamlit Dashboard tailored for HR Attrition analytics, emphasizing Explainable AI (XAI) with LIME and SHAP.
- Interactive HR Dashboard: A beautiful, dark-mode native Python Streamlit application (`app.py`) for live risk analysis.
- Dual Explainability Engines:
  - Local Risk (LIME): Native Matplotlib rendering of single-employee turnover factors.
  - Macro Trends (SHAP): Aggregated `TreeExplainer` visualizations mapping company-wide impacts.
- Advanced ML Pipeline:
  - `SimpleImputer` automated missing value handling.
  - `imbalanced-learn` SMOTE for synthetic minority over-sampling (addressing the 84% retention class imbalance).
  - Embedded `GridSearchCV` routines to automatically harvest the best Random Forest & XGBoost hyperparameter permutations for Precision-Recall AUC optimization.
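
To illustrate why the pipeline targets Precision-Recall AUC rather than plain accuracy, here is a minimal sketch using only the 84% retention figure quoted above. The 1,000-employee head count and the "nobody leaves" baseline are illustrative assumptions, not values taken from the dataset or the project code.

```python
# Illustrative counts only: 1,000 hypothetical employees at the 84% retention
# rate quoted above (840 stay, 160 leave). Label 1 means attrition.
y_true = [0] * 840 + [1] * 160

# A trivial baseline "model" that predicts that nobody ever leaves.
y_pred = [0] * len(y_true)

true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = true_pos / (true_pos + false_neg)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
# prints: accuracy=0.84 recall=0.00
```

The baseline looks accurate while finding none of the employees who leave, which is the failure mode that over-sampling the minority class and scoring on the precision-recall curve are meant to expose.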
Ensure you are running Python 3.10+ and install the dependencies:

```bash
pip install -r requirements.txt
```

The dashboard handles model training, grid-search tuning, and caching automatically on the first run.

```bash
streamlit run app.py
```

```
├── data/
│   └── WA_Fn-UseC_-HR-Employee-Attrition.csv   # IBM HR Dataset
├── notebooks/
│   ├── 01_lime_attrition_API.ipynb             # API Playground
│   ├── 02_lime_attrition_example.ipynb         # EDA and Model Compare Notebook
│   └── 03_lime_attrition_example_v2.ipynb
├── app.py                                      # Core Streamlit Dashboard UI
├── lime_attrition_utils.py                     # The comprehensive ML Wrapper API
├── requirements.txt
├── .gitignore
└── README.md
```
(Note: `artifacts.joblib` and trailing `.mov` files are ignored by Git.)
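
The first-run caching behind `artifacts.joblib` could look roughly like the sketch below. This is an assumption about the general pattern, not the code in `app.py`: `load_or_train` and `fake_training` are hypothetical names, and the dictionary stands in for whatever the real training step returns.

```python
import tempfile
from pathlib import Path

import joblib


def load_or_train(cache_path, train_fn):
    """Return cached artifacts if the file exists; otherwise train once and cache."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return joblib.load(str(cache_path))
    artifacts = train_fn()
    joblib.dump(artifacts, str(cache_path))
    return artifacts


calls = []


def fake_training():
    # Hypothetical stand-in for the expensive grid search and model fitting.
    calls.append(1)
    return {"best_params": {"max_depth": 3}}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "artifacts.joblib"
    first = load_or_train(path, fake_training)   # trains and writes the cache
    second = load_or_train(path, fake_training)  # reads the cache, no retraining

print(len(calls))  # training ran only once
```

Deleting the cached file forces a full retrain on the next launch, which is also why it makes sense to keep it out of version control.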
`lime_attrition_utils.py` abstracts away the boilerplate of wiring scikit-learn models into the LIME and SHAP engines.
```python
import lime_attrition_utils as utils

# Automatically loads, cleans, and splits data (handling target string matching)
config = utils.AttritionDataConfig()
raw_df, X, y, X_train, X_test, y_train, y_test, config = utils.load_and_prep_data()

# Defines a pipeline with SMOTE, Imputers, and GridSearchCV automatically tuning Random Forest
preprocessor = utils.build_preprocessor(X_train)
model_config = utils.ModelConfig(use_random_forest=True)
param_grids = {
    "random_forest": {"model__n_estimators": [100, 300], "model__max_depth": [3, 5]}
}
trained_models = utils.tune_and_train_models(X_train, y_train, preprocessor, model_config, param_grids)
```

- Sparse Matrix Crashes: scikit-learn's `OneHotEncoder` creates sparse arrays by default. Since LIME crashes on sparse matrices, the custom wrapper strictly enforces `sparse_output=False` throughout the pipeline.
- Unseen Categoricals: Handled via `handle_unknown="ignore"` during cross-validation tuning.
- Heavy Native LIME Execution: If LIME builds its own HTML D3.js visualization, the payload can exceed 1 MB per call. Our UI overrides this using `exp.as_pyplot_figure()`, mapped cleanly onto Streamlit.