Have you ever discovered that your database contains multiple identical records, causing data integrity issues and skewing your business reports?
Duplicate records represent one of the most common data quality challenges in Oracle databases.
These unwanted copies can emerge from various sources including data imports, application bugs, concurrent transactions, or human error during manual data entry.
Managing duplicate records requires a systematic approach that involves accurate identification, careful removal, and robust prevention strategies.
Oracle SQL provides several powerful techniques to address this challenge effectively.
What Exactly Are Duplicate Records in Oracle Databases?
Duplicate records occur when two or more rows in a table contain identical values across all columns or a specific subset of columns that should be unique.
Complete duplicates have identical values in every column, while partial duplicates share the same values only in certain key fields that define business uniqueness.

The impact of duplicate records extends beyond storage concerns, affecting data accuracy, query performance, and business decision-making processes.
How Can You Identify Duplicate Records Using Oracle SQL?
Oracle offers multiple approaches to detect duplicate records, each suited for different scenarios and performance requirements.
Data Preparation for Examples
Let me create sample data to demonstrate various duplicate detection techniques:
CREATE TABLE employees (
emp_id NUMBER,
first_name VARCHAR2(50),
last_name VARCHAR2(50),
email VARCHAR2(100),
department VARCHAR2(50),
hire_date DATE,
salary NUMBER
);
INSERT INTO employees VALUES (1, 'John', 'Smith', 'john.smith@company.com', 'Sales', DATE '2025-01-15', 50000); INSERT INTO employees VALUES (2, 'Jane', 'Doe', 'jane.doe@company.com', 'Marketing', DATE '2025-02-01', 55000); INSERT INTO employees VALUES (3, 'John', 'Smith', 'john.smith@company.com', 'Sales', DATE '2025-01-15', 50000); INSERT INTO employees VALUES (4, 'Mike', 'Johnson', 'mike.johnson@company.com', 'IT', DATE '2025-01-20', 60000); INSERT INTO employees VALUES (5, 'Jane', 'Doe', 'jane.doe@company.com', 'HR', DATE '2025-02-01', 55000); INSERT INTO employees VALUES (6, 'Sarah', 'Wilson', 'sarah.wilson@company.com', 'Finance', DATE '2025-01-25', 52000); INSERT INTO employees VALUES (7, 'John', 'Smith', 'john.different@company.com', 'Sales', DATE '2025-01-15', 50000);
Current data in the employees table:
SELECT * FROM employees ORDER BY emp_id;
| EMP_ID | FIRST_NAME | LAST_NAME | DEPARTMENT | HIRE_DATE | SALARY | |
|---|---|---|---|---|---|---|
| 1 | John | Smith | john.smith@company.com | Sales | 2025-01-15 | 50000 |
| 2 | Jane | Doe | jane.doe@company.com | Marketing | 2025-02-01 | 55000 |
| 3 | John | Smith | john.smith@company.com | Sales | 2025-01-15 | 50000 |
| 4 | Mike | Johnson | mike.johnson@company.com | IT | 2025-01-20 | 60000 |
| 5 | Jane | Doe | jane.doe@company.com | HR | 2025-02-01 | 55000 |
| 6 | Sarah | Wilson | sarah.wilson@company.com | Finance | 2025-01-25 | 52000 |
| 7 | John | Smith | john.different@company.com | Sales | 2025-01-15 | 50000 |
Method 1: Using GROUP BY and HAVING Clause
This query will identify duplicate records by grouping identical combinations and counting occurrences:
SELECT first_name, last_name, email, department, hire_date, salary, COUNT(*) as duplicate_count FROM employees GROUP BY first_name, last_name, email, department, hire_date, salary HAVING COUNT(*) > 1;
| FIRST_NAME | LAST_NAME | DEPARTMENT | HIRE_DATE | SALARY | DUPLICATE_COUNT | |
|---|---|---|---|---|---|---|
| John | Smith | john.smith@company.com | Sales | 2025-01-15 | 50000 | 2 |
| Jane | Doe | jane.doe@company.com | Marketing | 2025-02-01 | 55000 | 2 |
Method 2: Using ROW_NUMBER() Window Function
This query will assign row numbers to identify duplicate records and their positions:
SELECT emp_id, first_name, last_name, email, department, hire_date, salary,
ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email ORDER BY emp_id) as row_num
FROM employees;
| EMP_ID | FIRST_NAME | LAST_NAME | DEPARTMENT | HIRE_DATE | SALARY | ROW_NUM | |
|---|---|---|---|---|---|---|---|
| 2 | Jane | Doe | jane.doe@company.com | Marketing | 2025-02-01 | 55000 | 1 |
| 5 | Jane | Doe | jane.doe@company.com | HR | 2025-02-01 | 55000 | 2 |
| 7 | John | Smith | john.different@company.com | Sales | 2025-01-15 | 50000 | 1 |
| 1 | John | Smith | john.smith@company.com | Sales | 2025-01-15 | 50000 | 1 |
| 3 | John | Smith | john.smith@company.com | Sales | 2025-01-15 | 50000 | 2 |
| 4 | Mike | Johnson | mike.johnson@company.com | IT | 2025-01-20 | 60000 | 1 |
| 6 | Sarah | Wilson | sarah.wilson@company.com | Finance | 2025-01-25 | 52000 | 1 |
Method 3: Using EXISTS Subquery
This query will find records that have duplicates elsewhere in the table:
SELECT e1.emp_id, e1.first_name, e1.last_name, e1.email
FROM employees e1
WHERE EXISTS (
SELECT 1 FROM employees e2
WHERE e2.first_name = e1.first_name
AND e2.last_name = e1.last_name
AND e2.email = e1.email
AND e2.emp_id != e1.emp_id
);
| EMP_ID | FIRST_NAME | LAST_NAME | |
|---|---|---|---|
| 1 | John | Smith | john.smith@company.com |
| 2 | Jane | Doe | jane.doe@company.com |
| 3 | John | Smith | john.smith@company.com |
| 5 | Jane | Doe | jane.doe@company.com |
What Are the Best Methods to Remove Duplicate Records?
Oracle provides several strategies for removing duplicate records, each with specific use cases and performance characteristics.
Method 1: Using ROW_NUMBER() with DELETE
This approach will delete all duplicate records except the first occurrence based on a specified order:
DELETE FROM employees
WHERE emp_id IN (
SELECT emp_id FROM (
SELECT emp_id,
ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email ORDER BY emp_id) as rn
FROM employees
) WHERE rn > 1
);
Method 2: Using ROWID for Performance
This query will delete duplicates using ROWID for optimal performance:
DELETE FROM employees e1
WHERE ROWID > (
SELECT MIN(ROWID)
FROM employees e2
WHERE e1.first_name = e2.first_name
AND e1.last_name = e2.last_name
AND e1.email = e2.email
);
Method 3: Creating a Clean Table
This approach will create a new table with unique records and replace the original:
CREATE TABLE employees_clean AS SELECT DISTINCT * FROM employees;
DROP TABLE employees;
RENAME employees_clean TO employees;
How Can You Prevent Duplicate Records from Occurring?
Prevention strategies prove more effective than cleanup efforts after duplicates have accumulated in your database.
Primary Key Constraints
Primary key constraints automatically prevent duplicate records by ensuring uniqueness across specified columns.
ALTER TABLE employees ADD CONSTRAINT pk_employees PRIMARY KEY (emp_id);
Unique Constraints
Unique constraints prevent duplicates on specific column combinations while allowing null values.
ALTER TABLE employees ADD CONSTRAINT uk_employee_email UNIQUE (email);
Composite Unique Constraints
Composite constraints ensure uniqueness across multiple column combinations for business rules.
ALTER TABLE employees ADD CONSTRAINT uk_employee_name_dept UNIQUE (first_name, last_name, department);
Which Approach Offers the Best Performance for Large Tables?
Performance considerations become critical when dealing with large datasets containing millions of records.
The ROWID-based deletion method typically provides superior performance because ROWID represents the fastest way to access specific rows in Oracle.
Window functions like ROW_NUMBER() offer excellent readability and flexibility but may consume more memory for large result sets.
EXISTS subqueries can leverage indexes effectively but may require careful tuning for optimal performance.
How Do You Handle Partial Duplicates Based on Business Rules?
Business requirements often define duplicates based on specific column combinations rather than complete record matching.
Identifying Email Duplicates
This query will find records with duplicate email addresses regardless of other field differences:
SELECT email, COUNT(*) as count FROM employees GROUP BY email HAVING COUNT(*) > 1;
| COUNT | |
|---|---|
| jane.doe@company.com | 2 |
| john.smith@company.com | 2 |
Handling Name-Based Duplicates
This approach will identify potential duplicate persons based on name combinations using Oracle Listagg function:
SELECT first_name, last_name, COUNT(*) as count,
LISTAGG(email, '; ') WITHIN GROUP (ORDER BY email) as all_emails
FROM employees
GROUP BY first_name, last_name
HAVING COUNT(*) > 1;
| FIRST_NAME | LAST_NAME | COUNT | ALL_EMAILS |
|---|---|---|---|
| Jane | Doe | 2 | jane.doe@company.com; jane.doe@company.com |
| John | Smith | 3 | john.different@company.com; john.smith@company.com; john.smith@company.com |
What Advanced Techniques Help with Complex Duplicate Scenarios?
Complex duplicate scenarios require sophisticated approaches that consider multiple factors and business logic.
Using Analytical Functions for Ranking
This query will rank duplicates based on multiple criteria to determine which record to keep:
SELECT emp_id, first_name, last_name, email, department, hire_date,
RANK() OVER (PARTITION BY first_name, last_name, email ORDER BY hire_date DESC, emp_id) as keep_rank
FROM employees;
| EMP_ID | FIRST_NAME | LAST_NAME | DEPARTMENT | HIRE_DATE | KEEP_RANK | |
|---|---|---|---|---|---|---|
| 2 | Jane | Doe | jane.doe@company.com | Marketing | 2025-02-01 | 1 |
| 5 | Jane | Doe | jane.doe@company.com | HR | 2025-02-01 | 2 |
| 1 | John | Smith | john.smith@company.com | Sales | 2025-01-15 | 1 |
| 3 | John | Smith | john.smith@company.com | Sales | 2025-01-15 | 2 |
| 7 | John | Smith | john.different@company.com | Sales | 2025-01-15 | 1 |
| 4 | Mike | Johnson | mike.johnson@company.com | IT | 2025-01-20 | 1 |
| 6 | Sarah | Wilson | sarah.wilson@company.com | Finance | 2025-01-25 | 1 |
Fuzzy Matching for Similar Records
Fuzzy matching techniques help identify records that are similar but not exactly identical.
SELECT e1.emp_id, e1.first_name, e1.last_name, e1.email,
e2.emp_id as similar_emp_id, e2.first_name as similar_first_name,
e2.last_name as similar_last_name, e2.email as similar_email
FROM employees e1, employees e2
WHERE e1.emp_id < e2.emp_id
AND UPPER(e1.first_name) = UPPER(e2.first_name)
AND UPPER(e1.last_name) = UPPER(e2.last_name)
AND e1.email != e2.email;
| EMP_ID | FIRST_NAME | LAST_NAME | SIMILAR_EMP_ID | SIMILAR_FIRST_NAME | SIMILAR_LAST_NAME | SIMILAR_EMAIL | |
|---|---|---|---|---|---|---|---|
| 1 | John | Smith | john.smith@company.com | 7 | John | Smith | john.different@company.com |
| 3 | John | Smith | john.smith@company.com | 7 | John | Smith | john.different@company.com |
How Do You Monitor and Maintain Data Quality Ongoing?
Establishing ongoing monitoring ensures that duplicate records do not accumulate over time.
Creating Duplicate Detection Views
This view will continuously monitor for duplicate records:
CREATE OR REPLACE VIEW v_duplicate_employees AS SELECT first_name, last_name, email, COUNT(*) as duplicate_count FROM employees GROUP BY first_name, last_name, email HAVING COUNT(*) > 1;
Implementing Trigger-Based Prevention
Database triggers can prevent duplicate insertion at the database level.
CREATE OR REPLACE TRIGGER trg_prevent_duplicate_email
BEFORE INSERT ON employees
FOR EACH ROW
DECLARE
v_count NUMBER;
BEGIN
SELECT COUNT(*) INTO v_count
FROM employees
WHERE email = :NEW.email;
IF v_count > 0 THEN
RAISE_APPLICATION_ERROR(-20001, 'Email already exists: ' || :NEW.email);
END IF;
END;
What Tools and Utilities Support Duplicate Management?
Oracle provides several built-in utilities and features that facilitate duplicate record management at scale.
Oracle Data Pump enables exporting unique records while filtering duplicates during data migration processes.
Oracle SQL Developer includes data modeling tools that help identify potential duplicate issues during database design phases.
Enterprise Manager provides monitoring capabilities to track data quality metrics including duplicate detection across multiple databases.
See also: Oracle SQL Query to Implement Cohort Analysis for Customer Retention
Conclusion
Managing duplicate records in Oracle databases requires a comprehensive strategy that encompasses identification, removal, and prevention techniques.
The choice of method depends on factors such as data volume, performance requirements, and specific business rules defining what constitutes a duplicate.
