Handling Duplicate Values from Datasets in Python
Duplicate values are identical rows or records that appear multiple times in a dataset. They can occur due to data entry errors, system glitches, or data merging issues. In this article, we'll explore how to identify and handle duplicate values in Python using pandas.
What are Duplicate Values?
Duplicate values are data points that have identical values across all or specific columns. These duplicates can skew analysis results and create bias in machine learning models, making proper handling essential for data quality.
Identifying Duplicate Values
The first step in handling duplicates is identifying them. Pandas provides the duplicated() method to detect duplicate rows:
import pandas as pd
# Create a sample DataFrame with duplicate values
data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 25, 30, 25],
    'salary': [50000, 60000, 50000, 70000, 50000]
})
print("Original DataFrame:")
print(data)
print("\nDuplicate rows (True means duplicate):")
print(data.duplicated())
Original DataFrame:
name age salary
0 John 25 50000
1 Emily 28 60000
2 John 25 50000
3 Jane 30 70000
4 John 25 50000
Duplicate rows (True means duplicate):
0 False
1 False
2 True
3 False
4 True
dtype: bool
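Before deciding how to handle duplicates, it is often useful to know how many there are. Since duplicated() returns a boolean Series, summing it gives the duplicate count directly. A minimal sketch using the same sample data:

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 25, 30, 25],
    'salary': [50000, 60000, 50000, 70000, 50000]
})

# duplicated() returns a boolean Series; True counts as 1 when summed
num_duplicates = data.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")  # 2
```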
To view only the duplicate rows:
import pandas as pd
data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 25, 30, 25],
    'salary': [50000, 60000, 50000, 70000, 50000]
})
# Show only duplicate rows
duplicates = data[data.duplicated()]
print("Duplicate rows:")
print(duplicates)
Duplicate rows:
   name  age  salary
2  John   25   50000
4  John   25   50000
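Note that duplicated() marks only the second and later occurrences by default, so the first copy (row 0) is not shown above. The duplicated() method also accepts a keep parameter; passing keep=False flags every copy, which is handy when you want to inspect all rows involved in duplication:

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 25, 30, 25],
    'salary': [50000, 60000, 50000, 70000, 50000]
})

# keep=False marks every occurrence of a duplicated row, including the first
all_copies = data[data.duplicated(keep=False)]
print(all_copies)  # rows 0, 2 and 4
```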
Method 1: Removing All Duplicates
The simplest approach is to remove all duplicate rows using drop_duplicates():
import pandas as pd
data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 25, 30, 25],
    'salary': [50000, 60000, 50000, 70000, 50000]
})
# Remove duplicate rows (keeps first occurrence by default)
clean_data = data.drop_duplicates()
print("After removing duplicates:")
print(clean_data)
After removing duplicates:
name age salary
0 John 25 50000
1 Emily 28 60000
3 Jane 30 70000
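Notice that drop_duplicates() preserves the original index labels (0, 1, 3 above). If you want a clean zero-based index after removal, you can pass ignore_index=True (available since pandas 1.0) instead of calling reset_index(drop=True) afterwards. A small sketch:

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 25, 30, 25],
    'salary': [50000, 60000, 50000, 70000, 50000]
})

# ignore_index=True renumbers the result 0, 1, 2, ...
clean_data = data.drop_duplicates(ignore_index=True)
print(clean_data)  # index is 0, 1, 2
```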
Method 2: Keeping First or Last Occurrence
You can control which duplicate to keep using the keep parameter:
import pandas as pd
data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 25, 30, 25],
    'salary': [50000, 60000, 50000, 70000, 50000]
})
# Keep last occurrence of duplicates
keep_last = data.drop_duplicates(keep='last')
print("Keep last occurrence:")
print(keep_last)
# Remove all duplicates (keep none)
keep_none = data.drop_duplicates(keep=False)
print("\nKeep none (remove all duplicates):")
print(keep_none)
Keep last occurrence:
name age salary
1 Emily 28 60000
3 Jane 30 70000
4 John 25 50000
Keep none (remove all duplicates):
name age salary
1 Emily 28 60000
3 Jane 30 70000
Method 3: Handling Duplicates Based on Specific Columns
You can identify duplicates based on specific columns rather than all columns:
import pandas as pd
data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 30, 30, 25],
    'salary': [50000, 60000, 55000, 70000, 50000]
})
print("Original data:")
print(data)
# Remove duplicates based on 'name' column only
name_duplicates = data.drop_duplicates(subset=['name'])
print("\nRemove duplicates based on name:")
print(name_duplicates)
Original data:
name age salary
0 John 25 50000
1 Emily 28 60000
2 John 30 55000
3 Jane 30 70000
4 John 25 50000
Remove duplicates based on name:
name age salary
0 John 25 50000
1 Emily 28 60000
3 Jane 30 70000
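The subset parameter also accepts a list of several columns, and it can be combined with keep. For example, deduplicating on the (name, age) pair while keeping the last occurrence, using the same sample data:

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['John', 'Emily', 'John', 'Jane', 'John'],
    'age': [25, 28, 30, 30, 25],
    'salary': [50000, 60000, 55000, 70000, 50000]
})

# Rows 0 and 4 share the same (name, age) pair; keep='last' retains row 4
result = data.drop_duplicates(subset=['name', 'age'], keep='last')
print(result)  # rows 1, 2, 3, 4
```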
Comparison of Methods
| Method | Parameters | Best For |
|---|---|---|
| drop_duplicates() | keep='first' (default) | Standard duplicate removal |
| drop_duplicates(keep='last') | keep='last' | When recent data is more accurate |
| drop_duplicates(keep=False) | keep=False | Removing all duplicated records |
| drop_duplicates(subset=['column']) | subset=['column'] | Column-specific deduplication |
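All of the calls above return a new DataFrame and leave the original untouched. If you prefer to modify a DataFrame directly, drop_duplicates() also accepts inplace=True, in which case it returns None. A short sketch:

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['John', 'Emily', 'John'],
    'age': [25, 28, 25]
})

# inplace=True mutates data directly instead of returning a new DataFrame
data.drop_duplicates(inplace=True)
print(len(data))  # 2
```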
Conclusion
Handling duplicate values is essential for data quality and accurate analysis. Use duplicated() to identify duplicates and drop_duplicates() with appropriate parameters to handle them based on your specific requirements.
