Oracle SQL Query to Identify and Handle Duplicate Records • Vinish.Dev

Have you ever discovered that your database contains multiple identical records, causing data integrity issues and skewing your business reports?

Duplicate records represent one of the most common data quality challenges in Oracle databases.

These unwanted copies can emerge from various sources including data imports, application bugs, concurrent transactions, or human error during manual data entry.

Managing duplicate records requires a systematic approach that involves accurate identification, careful removal, and robust prevention strategies.

Oracle SQL provides several powerful techniques to address this challenge effectively.


What Exactly Are Duplicate Records in Oracle Databases?

Duplicate records occur when two or more rows in a table contain identical values across all columns or a specific subset of columns that should be unique.

Complete duplicates have identical values in every column, while partial duplicates share the same values only in certain key fields that define business uniqueness.

[Illustration: identifying and cleaning duplicate database records, ending with a clean database.]

The impact of duplicate records extends beyond storage concerns, affecting data accuracy, query performance, and business decision-making processes.

How Can You Identify Duplicate Records Using Oracle SQL?

Oracle offers multiple approaches to detect duplicate records, each suited for different scenarios and performance requirements.

Data Preparation for Examples

Let me create sample data to demonstrate various duplicate detection techniques:

CREATE TABLE employees (
    emp_id NUMBER,
    first_name VARCHAR2(50),
    last_name VARCHAR2(50),
    email VARCHAR2(100),
    department VARCHAR2(50),
    hire_date DATE,
    salary NUMBER
);

INSERT INTO employees VALUES (1, 'John', 'Smith', 'john.smith@company.com', 'Sales', DATE '2025-01-15', 50000);
INSERT INTO employees VALUES (2, 'Jane', 'Doe', 'jane.doe@company.com', 'Marketing', DATE '2025-02-01', 55000);
INSERT INTO employees VALUES (3, 'John', 'Smith', 'john.smith@company.com', 'Sales', DATE '2025-01-15', 50000);
INSERT INTO employees VALUES (4, 'Mike', 'Johnson', 'mike.johnson@company.com', 'IT', DATE '2025-01-20', 60000);
INSERT INTO employees VALUES (5, 'Jane', 'Doe', 'jane.doe@company.com', 'HR', DATE '2025-02-01', 55000);
INSERT INTO employees VALUES (6, 'Sarah', 'Wilson', 'sarah.wilson@company.com', 'Finance', DATE '2025-01-25', 52000);
INSERT INTO employees VALUES (7, 'John', 'Smith', 'john.different@company.com', 'Sales', DATE '2025-01-15', 50000);

Current data in the employees table:

SELECT * FROM employees ORDER BY emp_id;

EMP_ID	FIRST_NAME	LAST_NAME	EMAIL	DEPARTMENT	HIRE_DATE	SALARY
1	John	Smith	john.smith@company.com	Sales	2025-01-15	50000
2	Jane	Doe	jane.doe@company.com	Marketing	2025-02-01	55000
3	John	Smith	john.smith@company.com	Sales	2025-01-15	50000
4	Mike	Johnson	mike.johnson@company.com	IT	2025-01-20	60000
5	Jane	Doe	jane.doe@company.com	HR	2025-02-01	55000
6	Sarah	Wilson	sarah.wilson@company.com	Finance	2025-01-25	52000
7	John	Smith	john.different@company.com	Sales	2025-01-15	50000

Method 1: Using GROUP BY and HAVING Clause

This query identifies complete duplicates by grouping on every column except emp_id (which differs for each row) and counting occurrences:

SELECT first_name, last_name, email, department, hire_date, salary, COUNT(*) as duplicate_count
FROM employees
GROUP BY first_name, last_name, email, department, hire_date, salary
HAVING COUNT(*) > 1;

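FIRST_NAME	LAST_NAME	EMAIL	DEPARTMENT	HIRE_DATE	SALARY	DUPLICATE_COUNT
John	Smith	john.smith@company.com	Sales	2025-01-15	50000	2

Only the John Smith rows (emp_id 1 and 3) match on every grouped column. The two Jane Doe rows differ in department (Marketing and HR), so this query does not report them; they are partial duplicates.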

Method 2: Using ROW_NUMBER() Window Function

This query will assign row numbers to identify duplicate records and their positions:

SELECT emp_id, first_name, last_name, email, department, hire_date, salary,
       ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email ORDER BY emp_id) as row_num
FROM employees;

EMP_ID	FIRST_NAME	LAST_NAME	EMAIL	DEPARTMENT	HIRE_DATE	SALARY	ROW_NUM
2	Jane	Doe	jane.doe@company.com	Marketing	2025-02-01	55000	1
5	Jane	Doe	jane.doe@company.com	HR	2025-02-01	55000	2
7	John	Smith	john.different@company.com	Sales	2025-01-15	50000	1
1	John	Smith	john.smith@company.com	Sales	2025-01-15	50000	1
3	John	Smith	john.smith@company.com	Sales	2025-01-15	50000	2
4	Mike	Johnson	mike.johnson@company.com	IT	2025-01-20	60000	1
6	Sarah	Wilson	sarah.wilson@company.com	Finance	2025-01-25	52000	1
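
In practice you usually want only the surplus copies rather than every row. A minimal sketch, assuming the same partition columns as above, wraps the window function in an inline view and filters on the row number:

```sql
-- List only the surplus copies (row_num > 1) within each name/email group.
SELECT emp_id, first_name, last_name, email, department, row_num
FROM (
    SELECT emp_id, first_name, last_name, email, department,
           ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email
                              ORDER BY emp_id) AS row_num
    FROM employees
)
WHERE row_num > 1
ORDER BY emp_id;
```

Against the sample data this returns emp_id 3 and 5, the second copy in each duplicated group.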

Method 3: Using EXISTS Subquery

This query will find records that have duplicates elsewhere in the table:

SELECT e1.emp_id, e1.first_name, e1.last_name, e1.email
FROM employees e1
WHERE EXISTS (
    SELECT 1 FROM employees e2
    WHERE e2.first_name = e1.first_name
    AND e2.last_name = e1.last_name
    AND e2.email = e1.email
    AND e2.emp_id != e1.emp_id
);

EMP_ID	FIRST_NAME	LAST_NAME	EMAIL
1	John	Smith	john.smith@company.com
2	Jane	Doe	jane.doe@company.com
3	John	Smith	john.smith@company.com
5	Jane	Doe	jane.doe@company.com

What Are the Best Methods to Remove Duplicate Records?

Oracle provides several strategies for removing duplicate records, each with specific use cases and performance characteristics.

Method 1: Using ROW_NUMBER() with DELETE

This approach will delete all duplicate records except the first occurrence based on a specified order:

DELETE FROM employees
WHERE emp_id IN (
    SELECT emp_id FROM (
        SELECT emp_id,
               ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email ORDER BY emp_id) as rn
        FROM employees
    ) WHERE rn > 1
);
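
Because a DELETE is hard to undo once committed, it can help to bracket it with checks. The following is one possible workflow, not a required procedure: count the rows that would be removed, run the DELETE shown above, confirm that no duplicate groups remain, and only then commit.

```sql
-- 1. Preview: count the rows the DELETE above would remove.
SELECT COUNT(*) AS rows_to_delete
FROM (
    SELECT ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email
                              ORDER BY emp_id) AS rn
    FROM employees
)
WHERE rn > 1;

-- 2. After running the DELETE, confirm that no duplicate groups remain.
SELECT first_name, last_name, email, COUNT(*) AS remaining
FROM employees
GROUP BY first_name, last_name, email
HAVING COUNT(*) > 1;

-- 3. If step 2 returns no rows, commit; otherwise undo the DELETE.
ROLLBACK;  -- replace with COMMIT once the result looks right
```

With the sample data, step 1 reports 2 rows to delete.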

Method 2: Using ROWID for Performance

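This statement keeps the row with the lowest ROWID in each group and deletes the rest. It needs no unique key column, although the correlated subquery benefits from an index on the matched columns: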

DELETE FROM employees e1
WHERE ROWID > (
    SELECT MIN(ROWID)
    FROM employees e2
    WHERE e1.first_name = e2.first_name
    AND e1.last_name = e2.last_name
    AND e1.email = e2.email
);
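
A common variation combines ROWID with ROW_NUMBER() so the table is ranked in one pass instead of running a correlated subquery for each row. This is a sketch under the same assumption as above, that first_name, last_name, and email define a duplicate:

```sql
-- Rank rows within each duplicate group, then delete by ROWID
-- everything except the first row (lowest emp_id) of each group.
DELETE FROM employees
WHERE ROWID IN (
    SELECT rid
    FROM (
        SELECT ROWID AS rid,
               ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email
                                  ORDER BY emp_id) AS rn
        FROM employees
    )
    WHERE rn > 1
);
```

One behavioral difference is worth knowing: PARTITION BY places NULL values in the same group, whereas the equality comparisons in the correlated version never match a NULL, so rows with a NULL in any of the three columns are handled differently by the two statements.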

Method 3: Creating a Clean Table

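This approach creates a new table with one row per distinct combination of values and then replaces the original. Because emp_id differs on every row, SELECT DISTINCT * would remove nothing from the sample data, so the query groups on the remaining columns and keeps the lowest emp_id. Note that DROP TABLE also discards the original table's indexes, constraints, triggers, and grants, which must be recreated on the new table:

CREATE TABLE employees_clean AS
SELECT MIN(emp_id) AS emp_id, first_name, last_name, email,
       department, hire_date, salary
FROM employees
GROUP BY first_name, last_name, email, department, hire_date, salary;

DROP TABLE employees;

RENAME employees_clean TO employees;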

How Can You Prevent Duplicate Records from Occurring?

Prevention strategies prove more effective than cleanup efforts after duplicates have accumulated in your database.

Primary Key Constraints

Primary key constraints prevent duplicate key values by enforcing uniqueness (and NOT NULL) on the specified columns. They do not stop two rows that differ only in emp_id from carrying the same business data.

ALTER TABLE employees ADD CONSTRAINT pk_employees PRIMARY KEY (emp_id);

Unique Constraints

Unique constraints prevent duplicates on a column or column combination while still allowing NULL values. Existing duplicates must be removed first; otherwise the ALTER TABLE statement fails.

ALTER TABLE employees ADD CONSTRAINT uk_employee_email UNIQUE (email);
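
A plain unique constraint compares values exactly, so 'John.Smith@company.com' and 'john.smith@company.com' would both be accepted. If your business rule treats email addresses as case-insensitive (an assumption, not something the sample data requires), a function-based unique index is one way to enforce that. The index name here is hypothetical:

```sql
-- Enforce uniqueness on the lower-cased email, so values that differ
-- only in letter case count as duplicates.
CREATE UNIQUE INDEX ux_employees_email_ci
    ON employees (LOWER(email));
```

As with the constraint, this statement fails while the table still contains rows whose lower-cased emails collide.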

Composite Unique Constraints

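Composite constraints enforce uniqueness on a combination of columns that together express a business rule, here one employee name per department. As with any unique constraint, it can only be added once the existing rows satisfy it.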

ALTER TABLE employees ADD CONSTRAINT uk_employee_name_dept UNIQUE (first_name, last_name, department);

Which Approach Offers the Best Performance for Large Tables?

Performance considerations become critical when dealing with large datasets containing millions of records.

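The ROWID-based deletion method often performs well because a ROWID points directly at a row's physical location, which makes it the fastest way to fetch a single row. The correlated subquery still has to find the minimum ROWID for each group, so an index on the matching columns matters on large tables.

Window functions like ROW_NUMBER() offer excellent readability and flexibility, but they need sort workspace and may spill to temporary tablespace on large tables.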

EXISTS subqueries can leverage indexes effectively but may require careful tuning for optimal performance.
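
Rather than guessing which method is faster, you can ask the optimizer. The sketch below creates a supporting index (the name is hypothetical) on the columns that define a duplicate and then displays the execution plan for the ROWID-based DELETE. EXPLAIN PLAN only generates the plan; it does not delete anything.

```sql
-- Hypothetical supporting index on the columns that define a duplicate.
CREATE INDEX ix_employees_dup_key
    ON employees (first_name, last_name, email);

-- Generate the plan for the ROWID-based DELETE without executing it.
EXPLAIN PLAN FOR
DELETE FROM employees e1
WHERE ROWID > (
    SELECT MIN(ROWID)
    FROM employees e2
    WHERE e1.first_name = e2.first_name
    AND e1.last_name = e2.last_name
    AND e1.email = e2.email
);

-- Display the plan that was just generated.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

Repeating the EXPLAIN PLAN step for the ROW_NUMBER() variant lets you compare the two plans on your own data volumes.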

How Do You Handle Partial Duplicates Based on Business Rules?

Business requirements often define duplicates based on specific column combinations rather than complete record matching.

Identifying Email Duplicates

This query will find records with duplicate email addresses regardless of other field differences:

SELECT email, COUNT(*) as count
FROM employees
GROUP BY email
HAVING COUNT(*) > 1;

EMAIL	COUNT
jane.doe@company.com	2
john.smith@company.com	2
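
The grouped query shows which emails repeat but not which rows carry them. One way to see the full rows, sketched here with an analytic COUNT, is:

```sql
-- Show every row whose email appears more than once in the table.
SELECT emp_id, first_name, last_name, email, department
FROM (
    SELECT e.*,
           COUNT(*) OVER (PARTITION BY email) AS email_count
    FROM employees e
)
WHERE email_count > 1
ORDER BY email, emp_id;
```

On the sample data this lists emp_id 2 and 5 (jane.doe@company.com) followed by emp_id 1 and 3 (john.smith@company.com), which makes the differing departments for Jane Doe visible.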

Handling Name-Based Duplicates

This approach identifies potential duplicate persons based on name combinations, using Oracle's LISTAGG function to show the emails in each group:

SELECT first_name, last_name, COUNT(*) as count,
       LISTAGG(email, '; ') WITHIN GROUP (ORDER BY email) as all_emails
FROM employees
GROUP BY first_name, last_name
HAVING COUNT(*) > 1;

FIRST_NAME	LAST_NAME	COUNT	ALL_EMAILS
Jane	Doe	2	jane.doe@company.com; jane.doe@company.com
John	Smith	3	john.different@company.com; john.smith@company.com; john.smith@company.com

What Advanced Techniques Help with Complex Duplicate Scenarios?

Complex duplicate scenarios require sophisticated approaches that consider multiple factors and business logic.

Using Analytical Functions for Ranking

This query will rank duplicates based on multiple criteria to determine which record to keep:

SELECT emp_id, first_name, last_name, email, department, hire_date,
       RANK() OVER (PARTITION BY first_name, last_name, email ORDER BY hire_date DESC, emp_id) as keep_rank
FROM employees;

EMP_ID	FIRST_NAME	LAST_NAME	EMAIL	DEPARTMENT	HIRE_DATE	KEEP_RANK
2	Jane	Doe	jane.doe@company.com	Marketing	2025-02-01	1
5	Jane	Doe	jane.doe@company.com	HR	2025-02-01	2
1	John	Smith	john.smith@company.com	Sales	2025-01-15	1
3	John	Smith	john.smith@company.com	Sales	2025-01-15	2
7	John	Smith	john.different@company.com	Sales	2025-01-15	1
4	Mike	Johnson	mike.johnson@company.com	IT	2025-01-20	1
6	Sarah	Wilson	sarah.wilson@company.com	Finance	2025-01-25	1
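
If you only need the identifier of the row to keep in each group, the same rule (latest hire_date, then lowest emp_id) can be expressed as an aggregate with KEEP (DENSE_RANK FIRST). This is an alternative sketch, not a replacement for the ranking query above:

```sql
-- For each name/email group, return the emp_id of the row to keep:
-- the lowest emp_id among the rows with the most recent hire_date.
SELECT first_name, last_name, email,
       MIN(emp_id) KEEP (DENSE_RANK FIRST ORDER BY hire_date DESC) AS keep_emp_id,
       COUNT(*) AS group_size
FROM employees
GROUP BY first_name, last_name, email;
```

The keep_emp_id values correspond to the rows with keep_rank = 1 in the result above.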

Fuzzy Matching for Similar Records

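Fuzzy matching techniques help identify records that are similar but not exactly identical. The query below is a simple form of this: it matches names case-insensitively and reports pairs whose email addresses differ.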

SELECT e1.emp_id, e1.first_name, e1.last_name, e1.email,
       e2.emp_id as similar_emp_id, e2.first_name as similar_first_name, 
       e2.last_name as similar_last_name, e2.email as similar_email
FROM employees e1, employees e2
WHERE e1.emp_id < e2.emp_id
AND UPPER(e1.first_name) = UPPER(e2.first_name)
AND UPPER(e1.last_name) = UPPER(e2.last_name)
AND e1.email != e2.email;

EMP_ID	FIRST_NAME	LAST_NAME	EMAIL	SIMILAR_EMP_ID	SIMILAR_FIRST_NAME	SIMILAR_LAST_NAME	SIMILAR_EMAIL
1	John	Smith	john.smith@company.com	7	John	Smith	john.different@company.com
3	John	Smith	john.smith@company.com	7	John	Smith	john.different@company.com
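
For names that differ by a typo rather than by letter case, Oracle's UTL_MATCH package offers string similarity functions. The sketch below uses UTL_MATCH.JARO_WINKLER_SIMILARITY, which returns a score from 0 to 100. The threshold of 90 is an arbitrary assumption that you would tune against your own data.

```sql
-- Pair up employees whose full names are nearly, but not necessarily
-- exactly, the same. The threshold (90) is an illustrative assumption.
SELECT e1.emp_id, e1.first_name, e1.last_name,
       e2.emp_id AS similar_emp_id,
       e2.first_name AS similar_first_name,
       e2.last_name AS similar_last_name,
       UTL_MATCH.JARO_WINKLER_SIMILARITY(
           UPPER(e1.first_name || ' ' || e1.last_name),
           UPPER(e2.first_name || ' ' || e2.last_name)) AS similarity
FROM employees e1
JOIN employees e2
  ON e1.emp_id < e2.emp_id
WHERE UTL_MATCH.JARO_WINKLER_SIMILARITY(
           UPPER(e1.first_name || ' ' || e1.last_name),
           UPPER(e2.first_name || ' ' || e2.last_name)) >= 90
ORDER BY similarity DESC;
```

Because every row is compared with every other row, the cost grows quadratically with table size, so on large tables it is worth restricting the join first, for example to rows that share a last name or department.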

How Do You Monitor and Maintain Data Quality Over Time?

Establishing ongoing monitoring ensures that duplicate records do not accumulate over time.

Creating Duplicate Detection Views

This view reports the current duplicate groups each time it is queried:

CREATE OR REPLACE VIEW v_duplicate_employees AS
SELECT first_name, last_name, email, COUNT(*) as duplicate_count
FROM employees
GROUP BY first_name, last_name, email
HAVING COUNT(*) > 1;
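
A view only reports what exists at the moment you query it. To keep a history, one option is to copy its output into a log table at regular intervals. The table name and columns below are hypothetical and exist only for illustration:

```sql
-- Hypothetical audit table that stores one row per duplicate group per check.
CREATE TABLE duplicate_audit_log (
    checked_at      DATE,
    first_name      VARCHAR2(50),
    last_name       VARCHAR2(50),
    email           VARCHAR2(100),
    duplicate_count NUMBER
);

-- Snapshot the current duplicate groups with a timestamp.
INSERT INTO duplicate_audit_log
    (checked_at, first_name, last_name, email, duplicate_count)
SELECT SYSDATE, first_name, last_name, email, duplicate_count
FROM v_duplicate_employees;

COMMIT;
```

The INSERT could be run from a scheduled job (DBMS_SCHEDULER is one possibility) so that a growing log signals when duplicates start to reappear.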

Implementing Trigger-Based Prevention

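Database triggers can reject duplicate inserts at the database level, although a unique constraint is simpler and more reliable. The trigger below cannot see rows that other sessions have inserted but not yet committed, so two sessions inserting the same email at the same time can both succeed. A multi-row INSERT ... SELECT will also raise a mutating-table error (ORA-04091).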

CREATE OR REPLACE TRIGGER trg_prevent_duplicate_email
BEFORE INSERT ON employees
FOR EACH ROW
DECLARE
    v_count NUMBER;
BEGIN
    SELECT COUNT(*) INTO v_count
    FROM employees
    WHERE email = :NEW.email;
    
    IF v_count > 0 THEN
        RAISE_APPLICATION_ERROR(-20001, 'Email already exists: ' || :NEW.email);
    END IF;
END;
/
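
An alternative that avoids these trigger limitations is to rely on the unique constraint and handle the resulting error in the calling code. The sketch below assumes the uk_employee_email constraint from the prevention section is in place; the inserted values are made up for illustration.

```sql
-- Attempt an insert and handle a unique-constraint violation (ORA-00001)
-- through the predefined DUP_VAL_ON_INDEX exception.
-- Assumes a unique constraint or index on email already exists.
BEGIN
    INSERT INTO employees
        (emp_id, first_name, last_name, email, department, hire_date, salary)
    VALUES
        (8, 'Jane', 'Doe', 'jane.doe@company.com', 'Marketing',
         DATE '2025-03-01', 55000);
EXCEPTION
    WHEN DUP_VAL_ON_INDEX THEN
        -- Shown only when server output is enabled in the client.
        DBMS_OUTPUT.PUT_LINE('Skipped: email already exists.');
END;
/
```

Because the constraint is enforced by the database itself, this approach stays correct when several sessions insert at the same time.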

What Tools and Utilities Support Duplicate Management?

Oracle provides several built-in utilities and features that facilitate duplicate record management at scale.

Oracle Data Pump can limit the rows it exports through its QUERY parameter, which lets you filter out duplicates during data migration.

Oracle SQL Developer includes data modeling tools that help identify potential duplicate issues during database design phases.

Enterprise Manager monitors multiple databases from one console and can track custom SQL-based metrics, such as a duplicate count, as part of data quality monitoring.

Conclusion

Managing duplicate records in Oracle databases requires a comprehensive strategy that encompasses identification, removal, and prevention techniques.

The choice of method depends on factors such as data volume, performance requirements, and specific business rules defining what constitutes a duplicate.