Predicting Customer Churn in Python

Customer churn refers to customers leaving a business. Predicting churn helps businesses identify at-risk customers and take preventive actions. This article demonstrates how to build a machine learning model to predict telecom customer churn using Python.

Dataset Overview

We'll use the Telecom Customer Churn dataset, which contains customer information such as demographics, subscribed services, and churn status. Let's load and examine the data:

import pandas as pd

# Load the Telco customer churn dataset, saved locally as Telecom_customers.csv
# Dataset available at: https://www.kaggle.com/blastchar/telco-customer-churn
data = pd.read_csv('Telecom_customers.csv')
print("Dataset shape:", data.shape)
print("\nFirst few rows:")
print(data.head())

The output shows the dataset structure:

Dataset shape: (7043, 21)

First few rows:
   customerID  gender SeniorCitizen  ... MonthlyCharges TotalCharges Churn
0  7590-VHVEG  Female             0  ...          29.85        29.85    No
1  5575-GNVDE    Male             0  ...          56.95       1889.5    No
2  3668-QPYBK    Male             0  ...          53.85       108.15   Yes
3  7795-CFOCW    Male             0  ...          42.30      1840.75    No
4  9237-HQITU  Female             0  ...          70.70       151.65   Yes

Exploratory Data Analysis

Let's analyze the churn distribution and visualize it:

import matplotlib.pyplot as plt

# Drop the customer identifier and TotalCharges, which are not used in this analysis
data_clean = data.drop(['customerID', 'TotalCharges'], axis=1)

# Plot churn distribution
churn_counts = data_clean['Churn'].value_counts()
colors = ["#BDFCC9", "#FFDEAD"]

plt.figure(figsize=(8, 6))
plt.pie(churn_counts, labels=churn_counts.index, autopct='%1.1f%%', 
        colors=colors, explode=[0.1, 0.1], shadow=True)
plt.title('Customer Churn Distribution')
plt.show()
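
The pie chart shows the class balance visually. The same proportions can be read off numerically, which helps when judging whether accuracy alone is a trustworthy metric. The sketch below uses a small hypothetical frame (`sample`) in place of the real dataset, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for data_clean; the real dataset has 7043 rows
sample = pd.DataFrame(
    {'Churn': ['No', 'No', 'Yes', 'No', 'Yes', 'No', 'No', 'No']}
)

# normalize=True returns class proportions instead of raw counts
churn_share = sample['Churn'].value_counts(normalize=True)
print(churn_share)

# Fraction of customers labelled as churned
churn_rate = churn_share.get('Yes', 0.0)
print(f"Churn rate: {churn_rate:.1%}")
```

On the real data, replacing `sample` with `data_clean` gives the same percentages that the pie chart displays.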

Data Preprocessing

Most machine learning algorithms require numerical input. We'll encode the categorical variables using LabelEncoder, shown here on a small sample frame:

import pandas as pd
from sklearn import preprocessing

# Sample data for demonstration
data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'Partner': ['Yes', 'No', 'Yes', 'No'],
    'Contract': ['Month-to-month', 'One year', 'Two year', 'Month-to-month'],
    'Churn': ['No', 'Yes', 'No', 'Yes']
})

print("Original data:")
print(data)

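# Apply label encoding; each category is mapped to an integer in sorted order
label_encoder = preprocessing.LabelEncoder()
categorical_columns = ['gender', 'Partner', 'Contract', 'Churn']

for column in categorical_columns:
    # fit_transform refits the encoder per column, so earlier mappings are not kept
    data[column] = label_encoder.fit_transform(data[column])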

print("\nAfter label encoding:")
print(data)

The output is:

Original data:
  gender Partner        Contract Churn
0   Male     Yes  Month-to-month    No
1 Female      No        One year   Yes
2   Male     Yes        Two year    No
3 Female      No  Month-to-month   Yes

After label encoding:
   gender  Partner  Contract  Churn
0       1        1         0      0
1       0        0         1      1
2       1        1         2      0
3       0        0         0      1
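
Label encoding assigns integers in alphabetical order. For Contract that happens to match contract length, but for other nominal columns the integer order is arbitrary, and a linear model will still treat it as meaningful. One common alternative is one-hot encoding. The following is a minimal sketch using `pandas.get_dummies` on the same sample values; it is an alternative to the approach above rather than the method used in this article, and the names `sample` and `encoded` are chosen here for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'Partner': ['Yes', 'No', 'Yes', 'No'],
    'Contract': ['Month-to-month', 'One year', 'Two year', 'Month-to-month'],
    'Churn': ['No', 'Yes', 'No', 'Yes']
})

# Map the binary target to 0/1 explicitly so the mapping stays visible
sample['Churn'] = sample['Churn'].map({'No': 0, 'Yes': 1})

# One indicator column per category; drop_first removes a redundant column
encoded = pd.get_dummies(
    sample,
    columns=['gender', 'Partner', 'Contract'],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
print(encoded)
```

With `drop_first=True`, the first category of each column (for example Month-to-month) becomes the baseline that the remaining indicator columns are compared against.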

Model Training and Testing

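We'll split the data into training and testing sets, then fit a logistic regression model. To keep the example self-contained, it uses synthetic data with five features rather than the encoded telecom dataset, so the accuracy it reports says nothing about performance on real churn data: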

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Generate sample data
np.random.seed(42)
X = np.random.randn(1000, 5)  # 5 features
y = (X[:, 0] + X[:, 1] - X[:, 2] + np.random.randn(1000) * 0.1 > 0).astype(int)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

print("Predictions (first 10):", y_pred[:10])
print("Accuracy:", round(accuracy_score(y_test, y_pred) * 100, 2), "%")

The output is:

Predictions (first 10): [1 0 1 1 1 0 1 1 1 0]
Accuracy: 99.0 %
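
Retention teams usually want a ranked list of at-risk customers rather than hard 0/1 labels. Logistic regression exposes class probabilities through `predict_proba`, so a risk score is easy to obtain. The sketch below regenerates the same synthetic data so that it runs on its own; the 0.7 threshold is an arbitrary assumption, not a recommended value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same synthetic setup as in the example above
np.random.seed(42)
X = np.random.randn(1000, 5)
y = (X[:, 0] + X[:, 1] - X[:, 2] + np.random.randn(1000) * 0.1 > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(random_state=42).fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of the positive class (1)
churn_proba = model.predict_proba(X_test)[:, 1]

# Indices of test rows, highest estimated risk first
ranked = np.argsort(churn_proba)[::-1]
print("Top 5 risk scores:", np.round(churn_proba[ranked[:5]], 3))

# Flag rows above an illustrative threshold (0.7 is an assumption)
high_risk = churn_proba > 0.7
print("Flagged rows:", int(high_risk.sum()), "of", len(churn_proba))
```

Lowering the threshold flags more customers at the cost of more false alarms, so in practice it is chosen against the cost of a retention offer.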

Model Evaluation

Let's evaluate the model using a confusion matrix and a classification report:

from sklearn.metrics import confusion_matrix, classification_report

# Reuses y_test and y_pred from the previous example
# Generate confusion matrix (rows are actual classes, columns are predicted classes)
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

The output is:

Confusion Matrix:
[[101   1]
 [  1  97]]

Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       102
           1       0.99      0.99      0.99        98

    accuracy                           0.99       200
   macro avg       0.99      0.99      0.99       200
weighted avg       0.99      0.99      0.99       200
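
The numbers in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy, precision, recall, and F1 for class 1 from the matrix printed above, using scikit-learn's layout (rows are true labels, columns are predicted labels).

```python
import numpy as np

# Confusion matrix copied from the output above: rows = actual, columns = predicted
cm = np.array([[101, 1],
               [1, 97]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of rows predicted as class 1, how many are class 1
recall = tp / (tp + fn)           # of actual class 1 rows, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.99 precision=0.99 recall=0.99 f1=0.99
```

For churn, recall on the churn class is often the number to watch, because a missed churner (a false negative) is a customer who leaves without any retention attempt.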

Feature Importance Analysis

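Understanding which features most influence churn helps businesses focus their retention efforts. The example below again uses synthetic data, and the feature names are illustrative labels attached to random columns: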

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

# Create synthetic data; the feature names are illustrative labels, not real columns
feature_names = ['Contract_Length', 'Monthly_Charges', 'Tech_Support', 
                'Online_Security', 'Internet_Service']
np.random.seed(42)
X = np.random.randn(1000, 5)
y = (X[:, 0] - X[:, 2] + np.random.randn(1000) * 0.1 > 0).astype(int)

# Train model
model = LogisticRegression(random_state=42)
model.fit(X, y)

# Get feature weights
feature_weights = pd.Series(model.coef_[0], index=feature_names)
feature_weights_sorted = feature_weights.sort_values(ascending=False)

print("Feature Importance (weights):")
for feature, weight in feature_weights_sorted.items():
    print(f"{feature}: {weight:.4f}")

The output is:

Feature Importance (weights):
Contract_Length: 0.9293
Internet_Service: 0.0548
Monthly_Charges: -0.0338
Online_Security: -0.0394
Tech_Support: -0.9628

Key Insights

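The weights of a logistic regression model can be read as follows:

  • Positive weights: features that increase churn probability (month-to-month contracts and paperless billing are typical candidates)
  • Negative weights: features that decrease churn probability (longer contracts and tech support are typical candidates)
  • High absolute values: features with the strongest influence on the prediction, provided the features are on comparable scales

In the example above the data is synthetic, so the large weights on Contract_Length and Tech_Support simply reflect how the label was generated, not real customer behavior.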
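
A logistic regression weight is a change in log-odds, which is hard to read directly. Exponentiating it gives an odds ratio: a value above 1 means the odds of the positive class rise as the feature increases by one unit, and a value below 1 means they fall. The sketch below applies this to the weights printed above. It assumes the features are on comparable scales, which holds here because the synthetic columns are standard normal; on raw data the magnitudes would not be directly comparable.

```python
import numpy as np
import pandas as pd

# Weights copied from the output above
weights = pd.Series({
    'Contract_Length': 0.9293,
    'Internet_Service': 0.0548,
    'Monthly_Charges': -0.0338,
    'Online_Security': -0.0394,
    'Tech_Support': -0.9628,
})

# exp(weight) = multiplicative change in odds per one-unit feature increase
odds_ratios = np.exp(weights)

# Rank by absolute weight to surface the strongest effects in either direction
ranked = weights.abs().sort_values(ascending=False)

for name in ranked.index:
    print(f"{name}: weight={weights[name]:+.4f}, "
          f"odds ratio={odds_ratios[name]:.3f}")
```

Sorting by absolute value puts Tech_Support and Contract_Length at the top even though their signs differ, which matches the "high absolute values" reading above.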

Conclusion

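This customer churn prediction workflow helps businesses identify at-risk customers and understand key churn drivers. The examples here use a small sample frame and synthetic data to illustrate each step; applying the same steps to the full telecom dataset would show how strongly features such as contract type, tech support, and online security influence customer retention.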

Updated on: 2026-03-15T17:30:36+05:30
