Oracle SQL Query to Identify and Handle Duplicate Records

Have you ever discovered that your database contains multiple identical records, causing data integrity issues and skewing your business reports?

Duplicate records represent one of the most common data quality challenges in Oracle databases.

These unwanted copies can emerge from various sources including data imports, application bugs, concurrent transactions, or human error during manual data entry.

Managing duplicate records requires a systematic approach that involves accurate identification, careful removal, and robust prevention strategies.

Oracle SQL provides several powerful techniques to address this challenge effectively.

What Exactly Are Duplicate Records in Oracle Databases?

Duplicate records occur when two or more rows in a table contain identical values across all columns or a specific subset of columns that should be unique.

Complete duplicates have identical values in every column, while partial duplicates share the same values only in certain key fields that define business uniqueness.

Illustration of identifying and cleaning duplicate database records, ending with a clean database.

The impact of duplicate records extends beyond storage concerns, affecting data accuracy, query performance, and business decision-making processes.

How Can You Identify Duplicate Records Using Oracle SQL?

Oracle offers multiple approaches to detect duplicate records, each suited for different scenarios and performance requirements.

Data Preparation for Examples

Let me create sample data to demonstrate various duplicate detection techniques:

CREATE TABLE employees (
    emp_id NUMBER,
    first_name VARCHAR2(50),
    last_name VARCHAR2(50),
    email VARCHAR2(100),
    department VARCHAR2(50),
    hire_date DATE,
    salary NUMBER
);
INSERT INTO employees VALUES (1, 'John', 'Smith', 'john.smith@company.com', 'Sales', DATE '2025-01-15', 50000);
INSERT INTO employees VALUES (2, 'Jane', 'Doe', 'jane.doe@company.com', 'Marketing', DATE '2025-02-01', 55000);
INSERT INTO employees VALUES (3, 'John', 'Smith', 'john.smith@company.com', 'Sales', DATE '2025-01-15', 50000);
INSERT INTO employees VALUES (4, 'Mike', 'Johnson', 'mike.johnson@company.com', 'IT', DATE '2025-01-20', 60000);
INSERT INTO employees VALUES (5, 'Jane', 'Doe', 'jane.doe@company.com', 'HR', DATE '2025-02-01', 55000);
INSERT INTO employees VALUES (6, 'Sarah', 'Wilson', 'sarah.wilson@company.com', 'Finance', DATE '2025-01-25', 52000);
INSERT INTO employees VALUES (7, 'John', 'Smith', 'john.different@company.com', 'Sales', DATE '2025-01-15', 50000);

Current data in the employees table:

SELECT * FROM employees ORDER BY emp_id;
EMP_IDFIRST_NAMELAST_NAMEEMAILDEPARTMENTHIRE_DATESALARY
1JohnSmithjohn.smith@company.comSales2025-01-1550000
2JaneDoejane.doe@company.comMarketing2025-02-0155000
3JohnSmithjohn.smith@company.comSales2025-01-1550000
4MikeJohnsonmike.johnson@company.comIT2025-01-2060000
5JaneDoejane.doe@company.comHR2025-02-0155000
6SarahWilsonsarah.wilson@company.comFinance2025-01-2552000
7JohnSmithjohn.different@company.comSales2025-01-1550000

Method 1: Using GROUP BY and HAVING Clause

This query will identify duplicate records by grouping identical combinations and counting occurrences:

SELECT first_name, last_name, email, department, hire_date, salary, COUNT(*) as duplicate_count
FROM employees
GROUP BY first_name, last_name, email, department, hire_date, salary
HAVING COUNT(*) > 1;
FIRST_NAMELAST_NAMEEMAILDEPARTMENTHIRE_DATESALARYDUPLICATE_COUNT
JohnSmithjohn.smith@company.comSales2025-01-15500002
JaneDoejane.doe@company.comMarketing2025-02-01550002

Method 2: Using ROW_NUMBER() Window Function

This query will assign row numbers to identify duplicate records and their positions:

SELECT emp_id, first_name, last_name, email, department, hire_date, salary,
       ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email ORDER BY emp_id) as row_num
FROM employees;
EMP_IDFIRST_NAMELAST_NAMEEMAILDEPARTMENTHIRE_DATESALARYROW_NUM
2JaneDoejane.doe@company.comMarketing2025-02-01550001
5JaneDoejane.doe@company.comHR2025-02-01550002
7JohnSmithjohn.different@company.comSales2025-01-15500001
1JohnSmithjohn.smith@company.comSales2025-01-15500001
3JohnSmithjohn.smith@company.comSales2025-01-15500002
4MikeJohnsonmike.johnson@company.comIT2025-01-20600001
6SarahWilsonsarah.wilson@company.comFinance2025-01-25520001

Method 3: Using EXISTS Subquery

This query will find records that have duplicates elsewhere in the table:

SELECT e1.emp_id, e1.first_name, e1.last_name, e1.email
FROM employees e1
WHERE EXISTS (
    SELECT 1 FROM employees e2
    WHERE e2.first_name = e1.first_name
    AND e2.last_name = e1.last_name
    AND e2.email = e1.email
    AND e2.emp_id != e1.emp_id
);
EMP_IDFIRST_NAMELAST_NAMEEMAIL
1JohnSmithjohn.smith@company.com
2JaneDoejane.doe@company.com
3JohnSmithjohn.smith@company.com
5JaneDoejane.doe@company.com

What Are the Best Methods to Remove Duplicate Records?

Oracle provides several strategies for removing duplicate records, each with specific use cases and performance characteristics.

Method 1: Using ROW_NUMBER() with DELETE

This approach will delete all duplicate records except the first occurrence based on a specified order:

DELETE FROM employees
WHERE emp_id IN (
    SELECT emp_id FROM (
        SELECT emp_id,
               ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email ORDER BY emp_id) as rn
        FROM employees
    ) WHERE rn > 1
);

Method 2: Using ROWID for Performance

This query will delete duplicates using ROWID for optimal performance:

DELETE FROM employees e1
WHERE ROWID > (
    SELECT MIN(ROWID)
    FROM employees e2
    WHERE e1.first_name = e2.first_name
    AND e1.last_name = e2.last_name
    AND e1.email = e2.email
);

Method 3: Creating a Clean Table

This approach will create a new table with unique records and replace the original:

CREATE TABLE employees_clean AS
SELECT DISTINCT * FROM employees;
DROP TABLE employees;
RENAME employees_clean TO employees;

How Can You Prevent Duplicate Records from Occurring?

Prevention strategies prove more effective than cleanup efforts after duplicates have accumulated in your database.

Primary Key Constraints

Primary key constraints automatically prevent duplicate records by ensuring uniqueness across specified columns.

ALTER TABLE employees ADD CONSTRAINT pk_employees PRIMARY KEY (emp_id);

Unique Constraints

Unique constraints prevent duplicates on specific column combinations while allowing null values.

ALTER TABLE employees ADD CONSTRAINT uk_employee_email UNIQUE (email);

Composite Unique Constraints

Composite constraints ensure uniqueness across multiple column combinations for business rules.

ALTER TABLE employees ADD CONSTRAINT uk_employee_name_dept UNIQUE (first_name, last_name, department);

Which Approach Offers the Best Performance for Large Tables?

Performance considerations become critical when dealing with large datasets containing millions of records.

The ROWID-based deletion method typically provides superior performance because ROWID represents the fastest way to access specific rows in Oracle.

Window functions like ROW_NUMBER() offer excellent readability and flexibility but may consume more memory for large result sets.

EXISTS subqueries can leverage indexes effectively but may require careful tuning for optimal performance.

How Do You Handle Partial Duplicates Based on Business Rules?

Business requirements often define duplicates based on specific column combinations rather than complete record matching.

Identifying Email Duplicates

This query will find records with duplicate email addresses regardless of other field differences:

SELECT email, COUNT(*) as count
FROM employees
GROUP BY email
HAVING COUNT(*) > 1;
EMAILCOUNT
jane.doe@company.com2
john.smith@company.com2

Handling Name-Based Duplicates

This approach will identify potential duplicate persons based on name combinations using Oracle Listagg function:

SELECT first_name, last_name, COUNT(*) as count,
       LISTAGG(email, '; ') WITHIN GROUP (ORDER BY email) as all_emails
FROM employees
GROUP BY first_name, last_name
HAVING COUNT(*) > 1;
FIRST_NAMELAST_NAMECOUNTALL_EMAILS
JaneDoe2jane.doe@company.com; jane.doe@company.com
JohnSmith3john.different@company.com; john.smith@company.com; john.smith@company.com

What Advanced Techniques Help with Complex Duplicate Scenarios?

Complex duplicate scenarios require sophisticated approaches that consider multiple factors and business logic.

Using Analytical Functions for Ranking

This query will rank duplicates based on multiple criteria to determine which record to keep:

SELECT emp_id, first_name, last_name, email, department, hire_date,
       RANK() OVER (PARTITION BY first_name, last_name, email ORDER BY hire_date DESC, emp_id) as keep_rank
FROM employees;
EMP_IDFIRST_NAMELAST_NAMEEMAILDEPARTMENTHIRE_DATEKEEP_RANK
2JaneDoejane.doe@company.comMarketing2025-02-011
5JaneDoejane.doe@company.comHR2025-02-012
1JohnSmithjohn.smith@company.comSales2025-01-151
3JohnSmithjohn.smith@company.comSales2025-01-152
7JohnSmithjohn.different@company.comSales2025-01-151
4MikeJohnsonmike.johnson@company.comIT2025-01-201
6SarahWilsonsarah.wilson@company.comFinance2025-01-251

Fuzzy Matching for Similar Records

Fuzzy matching techniques help identify records that are similar but not exactly identical.

SELECT e1.emp_id, e1.first_name, e1.last_name, e1.email,
       e2.emp_id as similar_emp_id, e2.first_name as similar_first_name, 
       e2.last_name as similar_last_name, e2.email as similar_email
FROM employees e1, employees e2
WHERE e1.emp_id < e2.emp_id
AND UPPER(e1.first_name) = UPPER(e2.first_name)
AND UPPER(e1.last_name) = UPPER(e2.last_name)
AND e1.email != e2.email;
EMP_IDFIRST_NAMELAST_NAMEEMAILSIMILAR_EMP_IDSIMILAR_FIRST_NAMESIMILAR_LAST_NAMESIMILAR_EMAIL
1JohnSmithjohn.smith@company.com7JohnSmithjohn.different@company.com
3JohnSmithjohn.smith@company.com7JohnSmithjohn.different@company.com

How Do You Monitor and Maintain Data Quality Ongoing?

Establishing ongoing monitoring ensures that duplicate records do not accumulate over time.

Creating Duplicate Detection Views

This view will continuously monitor for duplicate records:

CREATE OR REPLACE VIEW v_duplicate_employees AS
SELECT first_name, last_name, email, COUNT(*) as duplicate_count
FROM employees
GROUP BY first_name, last_name, email
HAVING COUNT(*) > 1;

Implementing Trigger-Based Prevention

Database triggers can prevent duplicate insertion at the database level.

CREATE OR REPLACE TRIGGER trg_prevent_duplicate_email
BEFORE INSERT ON employees
FOR EACH ROW
DECLARE
    v_count NUMBER;
BEGIN
    SELECT COUNT(*) INTO v_count
    FROM employees
    WHERE email = :NEW.email;
    
    IF v_count > 0 THEN
        RAISE_APPLICATION_ERROR(-20001, 'Email already exists: ' || :NEW.email);
    END IF;
END;

What Tools and Utilities Support Duplicate Management?

Oracle provides several built-in utilities and features that facilitate duplicate record management at scale.

Oracle Data Pump enables exporting unique records while filtering duplicates during data migration processes.

Oracle SQL Developer includes data modeling tools that help identify potential duplicate issues during database design phases.

Enterprise Manager provides monitoring capabilities to track data quality metrics including duplicate detection across multiple databases.

See also: Oracle SQL Query to Implement Cohort Analysis for Customer Retention

Conclusion

Managing duplicate records in Oracle databases requires a comprehensive strategy that encompasses identification, removal, and prevention techniques.

The choice of method depends on factors such as data volume, performance requirements, and specific business rules defining what constitutes a duplicate.

Vinish Kapoor
Vinish Kapoor

Vinish Kapoor is a seasoned software development professional and a fervent enthusiast of artificial intelligence (AI). His impressive career spans over 25+ years, marked by a relentless pursuit of innovation and excellence in the field of information technology. As an Oracle ACE, Vinish has distinguished himself as a leading expert in Oracle technologies, a title awarded to individuals who have demonstrated their deep commitment, leadership, and expertise in the Oracle community.

guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments