A complete guide to resampling methods
Resampling is a statistical technique for generating additional data samples to make inferences about populations or underlying processes. These methods are widely used when estimating population parameters from limited data or when traditional assumptions don't hold. Common resampling approaches include bootstrapping, jackknifing, and permutation testing, which help estimate standard errors, confidence intervals, and p-values without relying on distributional assumptions.
What is Bootstrapping?
Bootstrapping involves repeatedly sampling from a dataset with replacement to create new samples of the same size as the original. Each bootstrap sample is used to calculate a statistic of interest, and the distribution of these statistics estimates the sampling variability.
Advantages of Bootstrapping
Non-parametric: Makes no assumptions about population distribution
Robust: Resistant to outliers and non-normality
Versatile: Works with various statistics (mean, median, correlation, regression coefficients)
Accurate uncertainty estimates: Provides precise confidence intervals and hypothesis tests
Disadvantages of Bootstrapping
Computationally intensive: Especially with large datasets or complex statistics
Potential bias: Can introduce bias with small samples or highly skewed populations
Independence assumption: Not suitable for dependent data like time series
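The independence point matters in practice: resampling individual observations from a time series destroys its autocorrelation. One common workaround is the moving-block bootstrap, which resamples contiguous blocks instead of single points. A minimal sketch on a synthetic AR(1) series (the series and the block length of 10 are illustrative assumptions, not from this article):

```python
import numpy as np

# Synthetic AR(1) series (illustrative data)
rng = np.random.default_rng(0)
n = 200
series = np.zeros(n)
for t in range(1, n):
    series[t] = 0.8 * series[t - 1] + rng.normal()

def moving_block_bootstrap(data, block_len, rng):
    """Resample contiguous blocks with replacement to preserve
    short-range dependence, then trim to the original length."""
    n = len(data)
    n_blocks = int(np.ceil(n / block_len))
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    return np.concatenate([data[s:s + block_len] for s in starts])[:n]

# Bootstrap the mean while respecting autocorrelation
boot_means = np.array([
    moving_block_bootstrap(series, block_len=10, rng=rng).mean()
    for _ in range(1000)
])
print(f"Block-bootstrap SE of the mean: {boot_means.std():.3f}")
```

Because blocks preserve local dependence, this standard error is typically larger (and more honest) than the one a naive observation-level bootstrap would report for correlated data.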
Bootstrap Example
Let's estimate the confidence interval for mean sepal length using the iris dataset:
import numpy as np
from sklearn.datasets import load_iris

# Load iris dataset
iris = load_iris()
X = iris.data
sepal_length = X[:, 0]

# Bootstrap function
def bootstrap_statistic(data, n_bootstraps, statistic_func):
    """Generate bootstrap samples and calculate the statistic for each."""
    bootstrap_stats = np.zeros(n_bootstraps)
    for i in range(n_bootstraps):
        # Sample with replacement
        bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
        bootstrap_stats[i] = statistic_func(bootstrap_sample)
    return bootstrap_stats

# Calculate original mean and bootstrap confidence interval
original_mean = np.mean(sepal_length)
bootstrap_means = bootstrap_statistic(sepal_length, n_bootstraps=1000, statistic_func=np.mean)

# Calculate 95% confidence interval
lower, upper = np.percentile(bootstrap_means, [2.5, 97.5])
print(f"Original Mean: {original_mean:.3f}")
print(f"Bootstrap 95% CI: ({lower:.3f}, {upper:.3f})")
print(f"Bootstrap Standard Error: {np.std(bootstrap_means):.3f}")
Original Mean: 5.843
Bootstrap 95% CI: (5.719, 5.966)
Bootstrap Standard Error: 0.064
What are Permutation Tests?
Permutation tests create new samples by randomly rearranging (permuting) the values in the original dataset. Unlike bootstrapping, permutation tests are primarily used for hypothesis testing, especially for comparing groups or testing relationships between variables.
Advantages of Permutation Tests
Distribution-free: No assumptions about population distributions
Exact p-values: Provides precise significance estimates
Flexible: Can be applied to various statistical tests (t-tests, ANOVA, correlation)
Small sample performance: Often more powerful than traditional tests with limited data
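The "exact p-values" advantage can be made concrete: when the groups are small enough, you can enumerate every possible relabeling of the data rather than sampling permutations at random. A minimal sketch with two tiny illustrative groups (the numbers are made up for demonstration):

```python
import numpy as np
from itertools import combinations

# Two tiny illustrative groups
group_a = np.array([2.1, 2.5, 2.9])
group_b = np.array([3.8, 4.0, 4.4])
combined = np.concatenate([group_a, group_b])
observed = abs(group_a.mean() - group_b.mean())

# Enumerate every way to choose which 3 of the 6 values form "group A"
n, k = len(combined), len(group_a)
count = 0
total = 0
for idx in combinations(range(n), k):
    mask = np.zeros(n, dtype=bool)
    mask[list(idx)] = True
    diff = abs(combined[mask].mean() - combined[~mask].mean())
    if diff >= observed - 1e-12:  # count relabelings at least as extreme
        count += 1
    total += 1

exact_p = count / total  # C(6, 3) = 20 relabelings in total
print(f"Exact two-sided p-value: {exact_p:.3f}")  # 2/20 = 0.100
```

Only the two fully separated relabelings reach the observed difference, so the exact p-value is 2/20 = 0.100; with larger samples, enumeration becomes infeasible and random permutation (as below) approximates this distribution.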
Disadvantages of Permutation Tests
Computational cost: Can be expensive with large datasets or many permutations
Limited applicability: May not work well with missing values or extreme outliers
Complexity: Can be harder to explain than traditional statistical tests
Permutation Test Example
Let's test if there's a significant difference in petal length between setosa and versicolor iris species:
import numpy as np
from sklearn.datasets import load_iris

# Load iris dataset
iris = load_iris()
setosa_petal = iris.data[:50, 2]         # First 50 samples (setosa)
versicolor_petal = iris.data[50:100, 2]  # Next 50 samples (versicolor)

# Calculate observed difference in means
observed_diff = np.mean(setosa_petal) - np.mean(versicolor_petal)

# Permutation test
n_permutations = 10000
permuted_diffs = []

# Combine both groups
combined_data = np.concatenate([setosa_petal, versicolor_petal])

for i in range(n_permutations):
    # Randomly permute the combined data
    permuted_data = np.random.permutation(combined_data)
    # Split into two groups of the same size as the originals
    perm_group1 = permuted_data[:50]
    perm_group2 = permuted_data[50:100]
    # Calculate difference in means
    perm_diff = np.mean(perm_group1) - np.mean(perm_group2)
    permuted_diffs.append(abs(perm_diff))

# Calculate p-value (two-tailed test)
p_value = np.sum(np.array(permuted_diffs) >= abs(observed_diff)) / n_permutations
print(f"Observed difference: {observed_diff:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant at α=0.05: {p_value < 0.05}")
Observed difference: -2.798
P-value: 0.0000
Significant at α=0.05: True
Comparison of Methods
| Method | Primary Use | Sampling Strategy | Best For |
|---|---|---|---|
| Bootstrapping | Uncertainty estimation | With replacement | Confidence intervals, standard errors |
| Permutation Test | Hypothesis testing | Random rearrangement | Group comparisons, significance testing |
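The introduction also mentions the jackknife, which resamples by leaving out one observation at a time; it is cheaper than the bootstrap and commonly used for bias and standard-error estimates. A minimal sketch on the same iris sepal-length column used above:

```python
import numpy as np
from sklearn.datasets import load_iris

sepal_length = load_iris().data[:, 0]
n = len(sepal_length)

# Leave-one-out estimates of the mean
jackknife_means = np.array([
    np.delete(sepal_length, i).mean() for i in range(n)
])

# Jackknife standard error of the mean
jack_se = np.sqrt(
    (n - 1) / n * np.sum((jackknife_means - jackknife_means.mean()) ** 2)
)
print(f"Jackknife SE of the mean: {jack_se:.3f}")
```

For the sample mean, the jackknife standard error reduces algebraically to the familiar s/√n, so here it serves mainly as a sanity check against the bootstrap standard error reported earlier; its real value shows up with statistics that lack a closed-form standard error.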
Conclusion
Resampling methods are essential tools for robust statistical analysis when traditional assumptions don't hold. Bootstrapping excels at estimating uncertainty and confidence intervals, while permutation tests provide exact hypothesis testing without distributional assumptions. Both methods are computationally intensive but offer more reliable results than parametric alternatives in many real-world scenarios.
