Handling Duplicate Values from Datasets in Python
Duplicate values are identical rows or records that appear multiple times in a dataset. They can occur due to data entry errors, system glitches, or data merging issues. In this article, we'll explore how to identify and handle duplicate values in Python using pandas.
What are Duplicate Values?
Duplicate values are data points that have identical values across all or specific columns. These duplicates can skew analysis results and create bias in machine learning models, making proper handling essential for data quality.
Identifying Duplicate Values
The first step in handling duplicates is identifying them. Pandas provides the duplicated() method to detect duplicate rows:
import pandas as pd
# Create a sample DataFrame with duplicate values
data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 25, 30, 25],
    'salary': [50000, 60000, 50000, 70000, 50000]
})
print("Original DataFrame:")
print(data)
print("\nDuplicate rows (True means duplicate):")
print(data.duplicated())
Original DataFrame:
name age salary
0 John 25 50000
1 Emily 28 60000
2 John 25 50000
3 Jane 30 70000
4 John 25 50000
Duplicate rows (True means duplicate):
0 False
1 False
2 True
3 False
4 True
dtype: bool
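Before deciding how to handle duplicates, it is often useful to know how many there are. Since duplicated() returns a boolean Series, summing it gives the duplicate count directly. A minimal sketch using the same sample data:

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 25, 30, 25],
    'salary': [50000, 60000, 50000, 70000, 50000]
})

# duplicated() returns a boolean Series; True counts as 1 when summed
num_duplicates = data.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")  # 2
```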
To view only the duplicate rows:
import pandas as pd
data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 25, 30, 25],
    'salary': [50000, 60000, 50000, 70000, 50000]
})
# Show only duplicate rows
duplicates = data[data.duplicated()]
print("Duplicate rows:")
print(duplicates)
Duplicate rows:
   name  age  salary
2  John   25   50000
4  John   25   50000
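Note that duplicated() marks only the second and later occurrences by default, so the first copy (row 0) is not shown above. The duplicated() method also accepts a keep parameter; passing keep=False flags every copy, which is handy when you want to inspect all rows involved in duplication:

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 25, 30, 25],
    'salary': [50000, 60000, 50000, 70000, 50000]
})

# keep=False marks every occurrence of a duplicated row, including the first
all_copies = data[data.duplicated(keep=False)]
print(all_copies)  # rows 0, 2 and 4
```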
Method 1: Removing All Duplicates
The simplest approach is to remove all duplicate rows using drop_duplicates():
import pandas as pd
data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 25, 30, 25],
    'salary': [50000, 60000, 50000, 70000, 50000]
})
# Remove duplicate rows (keeps first occurrence by default)
clean_data = data.drop_duplicates()
print("After removing duplicates:")
print(clean_data)
After removing duplicates:
name age salary
0 John 25 50000
1 Emily 28 60000
3 Jane 30 70000
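Notice that drop_duplicates() preserves the original index labels (0, 1, 3 above). If you want a clean zero-based index after removal, you can pass ignore_index=True (available since pandas 1.0) instead of calling reset_index(drop=True) afterwards. A small sketch:

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 25, 30, 25],
    'salary': [50000, 60000, 50000, 70000, 50000]
})

# ignore_index=True renumbers the result 0, 1, 2, ...
clean_data = data.drop_duplicates(ignore_index=True)
print(clean_data)  # index is 0, 1, 2
```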
Method 2: Keeping First or Last Occurrence
You can control which duplicate to keep using the keep parameter:
import pandas as pd
data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 25, 30, 25],
    'salary': [50000, 60000, 50000, 70000, 50000]
})
# Keep last occurrence of duplicates
keep_last = data.drop_duplicates(keep='last')
print("Keep last occurrence:")
print(keep_last)
# Remove all duplicates (keep none)
keep_none = data.drop_duplicates(keep=False)
print("\nKeep none (remove all duplicates):")
print(keep_none)
Keep last occurrence:
name age salary
1 Emily 28 60000
3 Jane 30 70000
4 John 25 50000
Keep none (remove all duplicates):
name age salary
1 Emily 28 60000
3 Jane 30 70000
Method 3: Handling Duplicates Based on Specific Columns
You can identify duplicates based on specific columns rather than all columns:
import pandas as pd
data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 30, 30, 25],
    'salary': [50000, 60000, 55000, 70000, 50000]
})
print("Original data:")
print(data)
# Remove duplicates based on 'name' column only
name_duplicates = data.drop_duplicates(subset=['name'])
print("\nRemove duplicates based on name:")
print(name_duplicates)
Original data:
name age salary
0 John 25 50000
1 Emily 28 60000
2 John 30 55000
3 Jane 30 70000
4 John 25 50000
Remove duplicates based on name:
name age salary
0 John 25 50000
1 Emily 28 60000
3 Jane 30 70000
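The subset parameter also accepts a list of several columns, and it can be combined with keep. For example, deduplicating on the (name, age) pair while keeping the last occurrence, using the same sample data:

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 30, 30, 25],
    'salary': [50000, 60000, 55000, 70000, 50000]
})

# Rows 0 and 4 share the same (name, age) pair; keep='last' retains row 4
result = data.drop_duplicates(subset=['name', 'age'], keep='last')
print(result)  # rows 1, 2, 3, 4
```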
Comparison of Methods
| Method | Parameters | Best For |
|---|---|---|
| drop_duplicates() | keep='first' (default) | Standard duplicate removal |
| drop_duplicates(keep='last') | keep='last' | When recent data is more accurate |
| drop_duplicates(keep=False) | keep=False | Removing all duplicated records |
| drop_duplicates(subset=['column']) | subset=['column'] | Column-specific deduplication |
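All of the calls above return a new DataFrame and leave the original untouched. If you prefer to modify a DataFrame directly, drop_duplicates() also accepts inplace=True, in which case it returns None. A short sketch:

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['John', 'Emily', 'John'],
    'age': [25, 28, 25]
})

# inplace=True mutates data directly instead of returning a new DataFrame
data.drop_duplicates(inplace=True)
print(len(data))  # 2
```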
Conclusion
Handling duplicate values is essential for data quality and accurate analysis. Use duplicated() to identify duplicates and drop_duplicates() with appropriate parameters to handle them based on your specific requirements.
