Statsmodels Logistic Regression (Logit and Probit)

Sklearn’s LogisticRegression is great for pure prediction tasks, but when I want p-values, confidence intervals, and detailed statistical tests, I reach for Statsmodels instead.

The library gives you two main options for binary classification: Logit and Probit. Both model the probability that an observation belongs to class 1, but they use different link functions to transform the linear combination of predictors into probabilities between 0 and 1.

Why Statsmodels When Sklearn Exists?

Statsmodels comes from the statistics world, not the machine learning world. When I fit a model with statsmodels.api.Logit(), I get back an object that behaves like what you’d see in R or SAS. The output includes standard errors for each coefficient, z-scores, p-values, confidence intervals, and information criteria like AIC and BIC.

This matters when I’m doing actual statistical analysis rather than just building a classifier. If I need to say “this coefficient is statistically significantly different from zero at the 0.05 level,” or if I need to construct 95% confidence intervals for my odds ratios, Statsmodels gives me those tools directly.

Here’s the basic pattern I use:

import statsmodels.api as sm
import numpy as np
import pandas as pd

# Generate sample data
np.random.seed(42)
X = np.random.randn(1000, 3)
true_coef = np.array([0.5, -0.8, 1.2])
linear_combination = X @ true_coef
prob = 1 / (1 + np.exp(-linear_combination))
y = (np.random.random(1000) < prob).astype(int)

# Add constant term for intercept
X_with_const = sm.add_constant(X)

# Fit logistic regression
logit_model = sm.Logit(y, X_with_const)
result = logit_model.fit()

print(result.summary())

The result.summary() output looks like a regression table from a statistics textbook. Each row shows a predictor, its coefficient estimate, standard error, z-statistic, p-value, and confidence interval. This is fundamentally different from sklearn’s .coef_ attribute, which just gives you the coefficient values without any statistical context.
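
Beyond the printed table, the fitted result exposes each of these quantities as attributes, which is handy when I need them programmatically. A quick sketch using the result object from above:

# Pull individual statistics from the fitted result
print(result.params)           # coefficient estimates
print(result.bse)              # standard errors
print(result.pvalues)          # p-values for each coefficient
print(result.conf_int())       # 95% confidence intervals
print(result.aic, result.bic)  # information criteria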

Logit vs Probit: The Link Function Decision

The Logit model uses the logistic function as its link: p = 1 / (1 + exp(-Xβ)). The Probit model uses the cumulative distribution function of the standard normal distribution: p = Φ(Xβ), where Φ is the standard normal CDF.

In practice, I’ve found that Logit and Probit give nearly identical predictions for most datasets. The real difference shows up in how you interpret the coefficients and in extreme probability regions.

The logistic function has heavier tails than the normal CDF. When your linear predictor Xβ is very negative or very positive, the logistic function approaches 0 or 1 more slowly than the normal CDF does. This means Logit predicts probabilities closer to 0.5 in the tails compared to Probit.
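
To see the tail behavior directly, I sometimes evaluate the two link functions at a few values of the linear predictor. A small sketch using scipy (this compares the link shapes themselves; fitted coefficients rescale between the two models, roughly by a factor of 1.6 to 1.8):

from scipy.stats import norm

# Compare the two link functions at a few linear predictor values
for xb in [-4.0, -2.0, 0.0, 2.0, 4.0]:
    logistic_p = 1 / (1 + np.exp(-xb))
    probit_p = norm.cdf(xb)
    print(f"Xb = {xb:+.0f}: logistic = {logistic_p:.5f}, probit = {probit_p:.5f}")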

Here’s a concrete comparison:

# Fit both models on the same data
logit_result = sm.Logit(y, X_with_const).fit()
probit_result = sm.Probit(y, X_with_const).fit()

# Compare predictions
logit_probs = logit_result.predict(X_with_const)
probit_probs = probit_result.predict(X_with_const)

# Look at differences
diff = logit_probs - probit_probs
print(f"Mean absolute difference: {np.abs(diff).mean():.6f}")
print(f"Max absolute difference: {np.abs(diff).max():.6f}")

On typical datasets, the mean absolute difference in predicted probabilities is somewhere around 0.001 to 0.01, with the largest differences showing up away from 0.5, toward the more extreme predictions.
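
To confirm where the disagreement sits on my own data, I look up the observation where the two fits differ most (using diff from the snippet above):

# Locate the single largest disagreement between the two models
idx = np.abs(diff).argmax()
print(f"Logit: {logit_probs[idx]:.4f}, Probit: {probit_probs[idx]:.4f}")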

Coefficient Interpretation Differs Between Models

With Logit, the coefficients represent log-odds ratios. If I have a coefficient of 0.5 for a binary predictor, it means the odds of y=1 are multiplied by exp(0.5) ≈ 1.65 when that predictor changes from 0 to 1, holding everything else constant.

I can exponentiate the coefficients to get odds ratios directly:

# Get odds ratios from logit model
odds_ratios = np.exp(logit_result.params)
conf_intervals = np.exp(logit_result.conf_int())

print("Odds Ratios:")
for name, or_value, (lower, upper) in zip(
    ['const', 'x1', 'x2', 'x3'], 
    odds_ratios, 
    conf_intervals
):
    print(f"{name}: {or_value:.3f} (95% CI: {lower:.3f} - {upper:.3f})")

Probit coefficients don’t have this clean interpretation. A Probit coefficient tells you how much a one-unit change in the predictor shifts the z-score of the underlying latent variable. This is less intuitive for most audiences.
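
If I do need a probability statement from a Probit coefficient, I push the shift through the normal CDF by hand. A minimal sketch, assuming a hypothetical coefficient of 0.5 and a starting latent index of zero:

from scipy.stats import norm

# Hypothetical probit coefficient for a binary predictor
probit_coef = 0.5

# Probability change when the predictor flips from 0 to 1,
# starting from a latent index of 0 (probability 0.5)
baseline = norm.cdf(0)
shifted = norm.cdf(0 + probit_coef)
print(f"Probability moves from {baseline:.3f} to {shifted:.3f}")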

When I’m presenting results to non-statisticians, I choose Logit because I can say “this intervention increases the odds by 65%” rather than trying to explain z-score shifts. The odds ratio interpretation connects directly to how people think about risk.

Getting Marginal Effects for Both Models

Raw coefficients from either model are hard to interpret on the probability scale because the relationship is nonlinear. A one-unit change in a predictor has different effects on probability depending on where you start.

I use marginal effects to answer the question: “If I increase this predictor by one unit, how much does the predicted probability change on average?”

# Calculate average marginal effects
logit_margins = logit_result.get_margeff()
print(logit_margins.summary())

probit_margins = probit_result.get_margeff()
print(probit_margins.summary())

The marginal effects from Logit and Probit are typically very similar. This confirms that the choice between models matters more for coefficient interpretation than for actual predictions or effects.

Practical Workflow Details

Statsmodels requires you to manually add a constant term for the intercept. This catches me every time I switch from sklearn:

# WRONG - no intercept unless you add it
X_wrong = pd.DataFrame({'feature1': [1, 2, 3], 'feature2': [4, 5, 6]})
# RIGHT - add constant
X_right = sm.add_constant(X_wrong)

The library also expects X and y in a different order than sklearn. Statsmodels uses Model(y, X) while sklearn uses Model().fit(X, y). I’ve wasted debugging time on this swap more than once.
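
A side-by-side reminder of the argument order, fitting the same data both ways:

from sklearn.linear_model import LogisticRegression

# Statsmodels: endog (y) first, then exog (X), constant added by hand
sm_fit = sm.Logit(y, X_with_const).fit(disp=0)

# Sklearn: X first, then y, intercept handled automatically
sk_fit = LogisticRegression().fit(X, y)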

For model selection, I look at the AIC and BIC values in the summary output; lower values indicate a better fit after penalizing for model complexity. When comparing nested models (where one model's predictors are a subset of the other's), I use the likelihood ratio test:

# Fit reduced model without feature 3
X_reduced = X_with_const[:, :-1]
reduced_result = sm.Logit(y, X_reduced).fit()

# Likelihood ratio test
from scipy import stats
lr_stat = 2 * (logit_result.llf - reduced_result.llf)
p_value = stats.chi2.sf(lr_stat, df=1)
print(f"LR statistic: {lr_stat:.3f}, p-value: {p_value:.4f}")
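
The same pair of fits also gives me the information criteria for a quick side-by-side check, with lower values preferred:

# Compare information criteria between the full and reduced models
print(f"Full model:    AIC = {logit_result.aic:.1f}, BIC = {logit_result.bic:.1f}")
print(f"Reduced model: AIC = {reduced_result.aic:.1f}, BIC = {reduced_result.bic:.1f}")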

When I Choose Probit Over Logit

I default to Logit for most work, but I switch to Probit in specific situations. Probit is the right choice when I’m working with a dataset where the underlying data generation process involves normally distributed errors or latent variables.

The classic example is modeling survey responses where you assume there’s a continuous latent propensity that gets discretized at some threshold. If someone reports yes/no to “would you buy this product,” there’s probably an underlying continuous willingness-to-pay that crosses a threshold. The normal distribution assumption for that latent variable makes Probit the natural model.
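
A small simulation makes that latent-variable story concrete. This is a sketch, assuming a threshold of zero and standard normal noise on the latent index:

# Latent willingness-to-pay: linear index plus standard normal noise
latent = X @ true_coef + np.random.randn(1000)

# Observed yes/no response: the latent propensity crosses the threshold
y_latent = (latent > 0).astype(int)

# Probit recovers the coefficients of the latent index
latent_probit = sm.Probit(y_latent, X_with_const).fit(disp=0)
print(latent_probit.params)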

Probit also makes certain econometric models more tractable. When I’m building more complex models like bivariate probit or sample selection models, starting with Probit gives me computational advantages because the normal distribution has nice mathematical properties.

Handling Class Imbalance

Statsmodels' Logit doesn't have built-in class weights like sklearn does, and its fit method doesn't accept weights either. When I have imbalanced data, I switch to the GLM interface with a Binomial family, which takes observation weights directly:

# Create weights inversely proportional to class frequency
class_counts = pd.Series(y).value_counts()
weights = pd.Series(y).map({0: 1 / class_counts[0], 1: 1 / class_counts[1]}).to_numpy()

# Fit a weighted logistic regression through the GLM interface
weighted_result = sm.GLM(
    y, X_with_const,
    family=sm.families.Binomial(),
    var_weights=weights
).fit()

The var_weights argument scales each observation's contribution to the likelihood, which is how I approximate sklearn's class_weight behavior. GLM also accepts freq_weights, which treats each observation as if it were repeated that many times, but those have to be integer counts.

Prediction and Classification

Getting predictions from a fitted model requires calling the predict method with new data that includes the constant term:

# New data for prediction
X_new = np.array([[1.5, -0.5, 0.8]])
X_new_with_const = sm.add_constant(X_new, has_constant='add')

# Get predicted probability
prob = logit_result.predict(X_new_with_const)
print(f"Predicted probability: {prob[0]:.4f}")

# Convert to class prediction
threshold = 0.5
prediction = (prob > threshold).astype(int)

The result object doesn't have a predict_proba method or a class-label predict like sklearn. The predict method always returns probabilities, so you threshold them yourself when you need class labels.
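
Since any classification metric therefore means thresholding first, a quick in-sample check looks like this:

# Threshold the in-sample probabilities and compare against the labels
train_probs = logit_result.predict(X_with_const)
train_preds = (train_probs > 0.5).astype(int)
accuracy = (train_preds == y).mean()
print(f"In-sample accuracy: {accuracy:.3f}")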

Model Diagnostics I Actually Use

The summary table gives you basic model fit statistics, but I dig deeper when something seems off. McFadden's pseudo R-squared appears in the summary output as "Pseudo R-squ."; other variants like Cox-Snell and Nagelkerke you have to compute yourself. These aren't true R-squared values like in linear regression, but they give a rough sense of model fit.
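
McFadden's version is just one minus the ratio of the fitted log-likelihood to the intercept-only log-likelihood, and the result object exposes all three pieces:

# McFadden's pseudo R-squared: 1 - llf / llnull
print(f"Reported: {logit_result.prsquared:.4f}")
print(f"By hand:  {1 - logit_result.llf / logit_result.llnull:.4f}")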

I check for separation problems by looking at coefficient standard errors. If any standard error is extremely large (say, greater than 10) or if the optimization doesn’t converge, I probably have perfect or quasi-perfect separation. This happens when one predictor almost perfectly predicts the outcome.

# Check for convergence issues
if not logit_result.mle_retvals['converged']:
    print("Warning: Model did not converge")
    
# Look for suspiciously large standard errors
if (logit_result.bse > 10).any():
    print("Warning: Very large standard errors detected")
    print(logit_result.bse[logit_result.bse > 10])

For residual analysis, I use Pearson and deviance residuals:

pearson_resid = logit_result.resid_pearson
deviance_resid = logit_result.resid_dev

# Plot residuals against fitted values
import matplotlib.pyplot as plt
# For Logit, fittedvalues is the linear predictor Xb rather than the probability
plt.scatter(logit_result.fittedvalues, pearson_resid, alpha=0.3)
plt.xlabel('Linear Predictor (Xb)')
plt.ylabel('Pearson Residuals')
plt.axhline(y=0, color='r', linestyle='--')
plt.show()

Large residuals indicate observations that the model fits poorly. I investigate these cases to see if there’s a pattern the model misses or if they’re genuine outliers.
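
To pull out the worst-fit cases, I sort by the absolute Pearson residual (a quick sketch using the residuals computed above):

# Indices of the five observations with the largest absolute residuals
worst = np.argsort(np.abs(pearson_resid))[-5:]
print(pd.DataFrame({
    'observed': y[worst],
    'predicted_prob': logit_result.predict(X_with_const)[worst],
    'pearson_resid': pearson_resid[worst],
}))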

Converting Between Statsmodels and Sklearn

Sometimes I start with Statsmodels for analysis but need to deploy the model using sklearn’s pipeline infrastructure. The coefficient values transfer directly:

from sklearn.linear_model import LogisticRegression

# Get coefficients from Statsmodels (excluding intercept)
sm_coefs = logit_result.params[1:]
sm_intercept = logit_result.params[0]

# Create equivalent sklearn model
sklearn_model = LogisticRegression()
sklearn_model.coef_ = np.asarray(sm_coefs).reshape(1, -1)
sklearn_model.intercept_ = np.array([sm_intercept])
sklearn_model.classes_ = np.array([0, 1])

# Verify predictions match
X_test = np.random.randn(10, 3)
X_test_with_const = sm.add_constant(X_test)

sm_pred = logit_result.predict(X_test_with_const)
sklearn_pred = sklearn_model.predict_proba(X_test)[:, 1]

print(f"Max prediction difference: {np.abs(sm_pred - sklearn_pred).max():.10f}")

The predictions match to machine precision. This lets me do my statistical analysis in Statsmodels, verify the coefficient significance, then deploy with sklearn if needed.
