How to Perform Grubbs Test in Python
The Grubbs test is a statistical hypothesis test for detecting a single outlier in a dataset. Outliers are observations that deviate markedly from the rest of the data; they can distort summary statistics and degrade the models fitted to that data. This article explains what the Grubbs test is and demonstrates how to run it in Python, first with a third-party library and then with a manual implementation of the formula.
What are Outliers?
Outliers are data points that are numerically distant from the other observations in a dataset. For normally distributed data, approximately 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three. A common rule of thumb flags a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.
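The quartile-based rule can be sketched in a few lines of NumPy. The 1.5 multiplier is the conventional choice, and the function name iqr_outliers and the sample values are assumptions made for this illustration:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the points lying more than k * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])  # first and third quartiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

sample = [12, 13, 14, 19, 21, 23, 45]
print(iqr_outliers(sample))  # only 45 lies above the upper fence
```

Unlike the Grubbs test, this rule is a heuristic and does not involve a significance level.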
Grubbs Statistical Hypothesis Test
The Grubbs test detects outliers by testing statistical hypotheses. It works with univariate datasets that follow an approximately normal distribution and contain at least seven observations. This test is also known as the extreme studentized deviate test or the maximum normalized residual test.
The Grubbs test uses the following hypotheses:
Null (H0): The dataset has no outliers
Alternate (H1): The dataset has exactly one outlier
The test can be performed as either a Two-Sided Test (detecting outliers on both ends) or a One-Sided Test (detecting outliers on one end only).
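For reference, the statistics behind the two variants can be written as follows. These are the standard textbook definitions rather than anything specific to one library; the symbols (sample mean \bar{x}, sample standard deviation s, sample size n, significance level \alpha) are introduced here for illustration:

```latex
% Two-sided statistic and the two one-sided statistics
G = \frac{\max_i \lvert x_i - \bar{x} \rvert}{s}, \qquad
G_{\max} = \frac{x_{\max} - \bar{x}}{s}, \qquad
G_{\min} = \frac{\bar{x} - x_{\min}}{s}

% Critical value; reject H0 when the statistic exceeds it
G_{\text{crit}} = \frac{n - 1}{\sqrt{n}} \sqrt{\frac{t^2}{n - 2 + t^2}}
```

Here t is the upper critical value of the t-distribution with n - 2 degrees of freedom, taken at significance level \alpha / (2n) for the two-sided test and \alpha / n for a one-sided test.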
Using the outliers Library
The third-party outlier_utils package, which is imported as outliers, provides ready-made functions for performing the Grubbs test. First, install it from a notebook cell:
!pip install outlier_utils
Two-Sided Grubbs Test
The two-sided test detects outliers from both the minimum and maximum sides of the dataset:
import numpy as np
from outliers import smirnov_grubbs as grubbs
# Define sample data with an outlier
data = np.array([5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40])
# Perform two-sided Grubbs test
result = grubbs.test(data, alpha=0.05)
print("Original data:", data)
print("After removing outliers:", result)
Original data: [ 5 14 15 15 14 19 17 16 20 22  8 21 28 11  9 29 40]
After removing outliers: [ 5 14 15 15 14 19 17 16 20 22  8 21 28 11  9 29]
One-Sided Grubbs Test
The one-sided test detects outliers on either the minimum side, using min_test(), or the maximum side, using max_test():
import numpy as np
from outliers import smirnov_grubbs as grubbs
data = np.array([5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40])
# Test for minimum outliers
min_result = grubbs.min_test(data, alpha=0.05)
print("Min test result:", min_result)
# Test for maximum outliers
max_result = grubbs.max_test(data, alpha=0.05)
print("Max test result:", max_result)
Min test result: [ 5 14 15 15 14 19 17 16 20 22  8 21 28 11  9 29 40]
Max test result: [ 5 14 15 15 14 19 17 16 20 22  8 21 28 11  9 29]
Manual Formula Implementation
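You can also implement the Grubbs test manually. The test statistic G is the largest absolute deviation from the sample mean divided by the sample standard deviation, and an outlier is flagged when G exceeds a critical value derived from the t-distribution: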
import numpy as np
import scipy.stats as stats

def grubbs_test(data, alpha=0.05):
    n = len(data)
    mean_x = np.mean(data)
    sd_x = np.std(data, ddof=1)  # Sample standard deviation

    # Test statistic: largest absolute deviation in units of the standard deviation
    g_calculated = np.max(np.abs(data - mean_x)) / sd_x

    # Two-sided critical value from the t-distribution with n - 2 degrees of freedom
    t_value = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_critical = ((n - 1) / np.sqrt(n)) * np.sqrt(t_value**2 / (n - 2 + t_value**2))

    print(f"Grubbs Calculated Value: {g_calculated:.4f}")
    print(f"Grubbs Critical Value: {g_critical:.4f}")
    if g_calculated > g_critical:
        print("Result: Outlier detected (reject null hypothesis)")
    else:
        print("Result: No outlier detected (fail to reject null hypothesis)")
    print()

# Test with data without outliers
data_no_outliers = np.array([12, 13, 14, 19, 21, 23])
print("Testing data without outliers:")
grubbs_test(data_no_outliers)

# Test with data containing outliers
data_with_outliers = np.array([12, 13, 14, 19, 21, 23, 45])
print("Testing data with outliers:")
grubbs_test(data_with_outliers)
Testing data without outliers:
Grubbs Calculated Value: 1.3031
Grubbs Critical Value: 1.8871
Result: No outlier detected (fail to reject null hypothesis)

Testing data with outliers:
Grubbs Calculated Value: 2.1076
Grubbs Critical Value: 2.0200
Result: Outlier detected (reject null hypothesis)
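The manual approach can also be extended to remove flagged points, which is roughly what the library's automatic removal appears to do. The following is a minimal sketch under stated assumptions, not the library's actual implementation: the helper names grubbs_critical and remove_outliers, the min_size cutoff, and the choice to stop when the standard deviation is zero are all hypothetical.

```python
import numpy as np
from scipy import stats

def grubbs_critical(n, alpha=0.05, two_sided=True):
    """Critical value of the Grubbs statistic for a sample of size n."""
    tail = alpha / (2 * n) if two_sided else alpha / n
    t = stats.t.ppf(1 - tail, n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

def remove_outliers(data, alpha=0.05, min_size=3):
    """Repeat the two-sided test, dropping one flagged point per round."""
    data = np.asarray(data, dtype=float)
    while len(data) >= min_size:
        sd = data.std(ddof=1)  # sample standard deviation
        if sd == 0:
            break  # all values identical, nothing to flag
        deviations = np.abs(data - data.mean())
        idx = int(np.argmax(deviations))  # most extreme observation
        if deviations[idx] / sd <= grubbs_critical(len(data), alpha):
            break  # most extreme point is not significant, so stop
        data = np.delete(data, idx)
    return data

cleaned = remove_outliers([12, 13, 14, 19, 21, 23, 45])
print(cleaned)  # 45 is dropped; the remaining six values pass the test
```

Because the Grubbs test is designed to detect a single outlier, repeating it changes the effective significance level, so results from a loop like this should be interpreted with care.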
Comparison of Methods
| Method | Pros | Cons | Best For |
|---|---|---|---|
| Library (outliers) | Easy to use, automatic removal | Less control over process | Quick outlier detection |
| Manual Implementation | Full control, understand statistics | More code required | Learning and customization |
Conclusion
The Grubbs test is an effective statistical method for detecting outliers in normally distributed datasets. You can use the outliers library for quick implementation or implement the formula manually for better understanding and control over the process.
