Predicting Customer Churn in Python
Customer churn refers to customers leaving a business. Predicting churn helps businesses identify at-risk customers and take preventive actions. This article demonstrates how to build a machine learning model to predict telecom customer churn using Python.
Dataset Overview
We'll use the Telecom Customer Churn dataset, which contains customer information such as demographics, subscribed services, and churn status. Let's load and examine the data:
import pandas as pd
# Load the Telco Customer Churn dataset (saved locally as Telecom_customers.csv)
# Dataset available at: https://www.kaggle.com/blastchar/telco-customer-churn
data = pd.read_csv('Telecom_customers.csv')
print("Dataset shape:", data.shape)
print("\nFirst few rows:")
print(data.head())
The output shows the dataset structure:
Dataset shape: (7043, 21)

First few rows:
   customerID  gender  SeniorCitizen  ...  MonthlyCharges  TotalCharges  Churn
0  7590-VHVEG  Female              0  ...           29.85         29.85     No
1  5575-GNVDE    Male              0  ...           56.95        1889.5     No
2  3668-QPYBK    Male              0  ...           53.85        108.15    Yes
3  7795-CFOCW    Male              0  ...           42.30       1840.75     No
4  9237-HQITU  Female              0  ...           70.70        151.65    Yes
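One quirk worth knowing about this dataset: TotalCharges is read in as a string column, because a few rows contain blanks. If you want to keep it rather than drop it, it must be coerced to numeric first. A minimal sketch using a small illustrative frame (not the full dataset):

```python
import pandas as pd

# Illustrative frame mimicking the TotalCharges quirk: a blank among numbers
df = pd.DataFrame({'TotalCharges': ['29.85', '1889.5', ' ', '108.15']})

# Coerce to numeric; unparseable entries (the blank) become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
print("Rows that failed to parse:", df['TotalCharges'].isna().sum())

# Drop the rows that could not be parsed
df = df.dropna(subset=['TotalCharges'])
print("Rows remaining:", len(df))
```

On the real file the same two lines would leave you with a clean numeric column instead of discarding the feature entirely.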
Exploratory Data Analysis
Let's analyze the churn distribution and visualize it:
import matplotlib.pyplot as plt
# Drop the identifier column and TotalCharges (read in as a string in this dataset)
data_clean = data.drop(['customerID', 'TotalCharges'], axis=1)
# Plot churn distribution
churn_counts = data_clean['Churn'].value_counts()
colors = ["#BDFCC9", "#FFDEAD"]
plt.figure(figsize=(8, 6))
plt.pie(churn_counts, labels=churn_counts.index, autopct='%1.1f%%',
        colors=colors, explode=[0.1, 0.1], shadow=True)
plt.title('Customer Churn Distribution')
plt.show()
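Beyond the overall churn split, the churn rate per contract type is often the most revealing cut in this dataset. A sketch with hypothetical rows (the values here are made up for illustration; only the column names match the dataset):

```python
import pandas as pd

# Hypothetical sample mirroring the dataset's Contract and Churn columns
sample = pd.DataFrame({
    'Contract': ['Month-to-month', 'Month-to-month', 'One year',
                 'Two year', 'Month-to-month', 'One year'],
    'Churn': ['Yes', 'No', 'No', 'No', 'Yes', 'No']
})

# Churn rate per contract type: fraction of 'Yes' within each group
churn_rate = (sample['Churn'].eq('Yes')
              .groupby(sample['Contract'])
              .mean())
print(churn_rate)
```

Running the same groupby on the full `data_clean` frame would show how churn concentrates among month-to-month customers.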
Data Preprocessing
Machine learning algorithms require numerical data. We'll encode categorical variables using LabelEncoder:
import pandas as pd
from sklearn import preprocessing
# Sample data for demonstration
data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'Partner': ['Yes', 'No', 'Yes', 'No'],
    'Contract': ['Month-to-month', 'One year', 'Two year', 'Month-to-month'],
    'Churn': ['No', 'Yes', 'No', 'Yes']
})
print("Original data:")
print(data)
# Apply label encoding
label_encoder = preprocessing.LabelEncoder()
categorical_columns = ['gender', 'Partner', 'Contract', 'Churn']
for column in categorical_columns:
    data[column] = label_encoder.fit_transform(data[column])
print("\nAfter label encoding:")
print(data)
Original data:
   gender Partner        Contract Churn
0    Male     Yes  Month-to-month    No
1  Female      No        One year   Yes
2    Male     Yes        Two year    No
3  Female      No  Month-to-month   Yes

After label encoding:
   gender  Partner  Contract  Churn
0       1        1         0      0
1       0        0         1      1
2       1        1         2      0
3       0        0         0      1
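One caveat with LabelEncoder: it assigns arbitrary integer order, which is harmless for binary columns like gender or Partner, but imposes a spurious ordering on multi-level features like Contract (0 < 1 < 2). A common alternative, sketched here, is one-hot encoding with pd.get_dummies:

```python
import pandas as pd

data = pd.DataFrame({
    'Contract': ['Month-to-month', 'One year', 'Two year', 'Month-to-month']
})

# One-hot encode: one 0/1 column per contract type, no implied ordering
encoded = pd.get_dummies(data, columns=['Contract'])
print(encoded.columns.tolist())
# ['Contract_Month-to-month', 'Contract_One year', 'Contract_Two year']
```

For linear models like the logistic regression used below, one-hot columns also make each category's weight directly interpretable.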
Model Training and Testing
We'll split the data into training and testing sets, then apply Logistic Regression:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np
# Generate sample data
np.random.seed(42)
X = np.random.randn(1000, 5) # 5 features
y = (X[:, 0] + X[:, 1] - X[:, 2] + np.random.randn(1000) * 0.1 > 0).astype(int)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
print("Predictions (first 10):", y_pred[:10])
print("Accuracy:", round(accuracy_score(y_test, y_pred) * 100, 2), "%")
Predictions (first 10): [1 0 1 1 1 0 1 1 1 0]
Accuracy: 99.0 %
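The 99% accuracy here reflects the nearly separable synthetic data, not what you should expect on the real dataset. Real churn data is imbalanced (roughly a quarter of customers churn), where plain accuracy flatters the model. One common mitigation, sketched below on imbalanced synthetic data, is class weighting:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: ~25% positives, mimicking churn proportions
np.random.seed(0)
X = np.random.randn(1000, 5)
y = (X[:, 0] + 0.1 * np.random.randn(1000) > 0.67).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# class_weight='balanced' reweights the loss inversely to class frequency,
# so the minority (churn) class is not drowned out by the majority
model = LogisticRegression(class_weight='balanced', random_state=0)
model.fit(X_train, y_train)
print("Fraction predicted as churners:", model.predict(X_test).mean())
```

On imbalanced data, also prefer recall and F1 on the churn class (shown in the next section) over raw accuracy.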
Model Evaluation
Let's evaluate the model using the accuracy score and a confusion matrix:
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np
# Uses y_test and y_pred from the previous example
# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Confusion Matrix:
[[101 1]
[ 1 97]]
Classification Report:
precision recall f1-score support
0 0.99 0.99 0.99 102
1 0.99 0.99 0.99 98
accuracy 0.99 200
macro avg 0.99 0.99 0.99 200
weighted avg 0.99 0.99 0.99 200
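The precision and recall figures in the report follow directly from the confusion matrix counts. A quick worked check, using the 2x2 matrix printed above:

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, cols = predicted
cm = np.array([[101, 1],
               [1, 97]])

tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)  # of predicted churners, how many were correct
recall = tp / (tp + fn)     # of actual churners, how many were caught
print(round(precision, 2), round(recall, 2))  # 0.99 0.99
```

Both are 97/98, matching the classification report's class-1 row.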
Feature Importance Analysis
Understanding which features most influence churn helps businesses focus their retention efforts:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
# Create sample data with feature names
feature_names = ['Contract_Length', 'Monthly_Charges', 'Tech_Support',
                 'Online_Security', 'Internet_Service']
np.random.seed(42)
X = np.random.randn(1000, 5)
y = (X[:, 0] - X[:, 2] + np.random.randn(1000) * 0.1 > 0).astype(int)
# Train model
model = LogisticRegression(random_state=42)
model.fit(X, y)
# Get feature weights
feature_weights = pd.Series(model.coef_[0], index=feature_names)
feature_weights_sorted = feature_weights.sort_values(ascending=False)
print("Feature Importance (weights):")
for feature, weight in feature_weights_sorted.items():
    print(f"{feature}: {weight:.4f}")
Feature Importance (weights):
Contract_Length: 0.9293
Internet_Service: 0.0548
Monthly_Charges: -0.0338
Online_Security: -0.0394
Tech_Support: -0.9628
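Because logistic regression is linear in log-odds, exponentiating a weight gives the multiplicative change in churn odds per one-unit increase in that feature. A quick check using the Contract_Length weight printed above:

```python
import numpy as np

# Weight for Contract_Length from the run above
w = 0.9293

# A one-unit increase multiplies the churn odds by exp(w)
odds_ratio = np.exp(w)
print(round(odds_ratio, 2))  # ~2.53
```

So in this synthetic example, each unit of Contract_Length multiplies the odds of the positive class by about 2.5; the same transformation applies to the real model's coefficients.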
Key Insights
From the feature analysis, we can identify which factors most influence customer churn:
- Positive weights: Features that increase churn probability (e.g., month-to-month contracts, paperless billing)
- Negative weights: Features that decrease churn probability (e.g., longer contracts, tech support)
- High absolute values: Features with the strongest influence on churn decisions
Conclusion
This customer churn prediction model helps businesses identify at-risk customers and understand key churn drivers. Features like contract type, tech support, and online security show the strongest influence on customer retention decisions.
