In many database-driven applications, tables often store multiple records for the same business entity. Whether it’s logs, transactions, status snapshots, or historical data, it’s common to encounter duplicate keys with different update timestamps.
To improve performance or prepare clean datasets for reporting, it’s often necessary to retain only the most recent record (based on a timestamp or date column) for each unique combination of identifiers.
This article walks you through how to:
- Select only the most up-to-date row per group.
- Delete older records while preserving the latest entry.
- Use SQL techniques compatible with Oracle and SQL Server (MSSQL).
Use Case
Imagine a table with these columns:
entity_id | category_id | status_code | updated_at | value |
---|---|---|---|---|
1001 | A1 | ACTIVE | 2024-12-01 08:00:00 | 1.23 |
1001 | A1 | ACTIVE | 2024-12-02 10:30:00 | 1.25 |
1001 | A1 | INACTIVE | 2024-12-03 11:00:00 | 1.20 |
Your goal is to:
- Keep only one row per unique combination of
entity_id
,category_id
, andstatus_code
. - Keep the row with the most recent
updated_at
.
Solution Overview
We use the ROW_NUMBER()
window function to rank records within each group and order them by date descending. Then we:
- Filter for the top-ranked (latest) rows.
- Optionally, delete the lower-ranked (older) entries.
This works in both Oracle and SQL Server.
SQL Example: Oracle
Filter: Get the Most Recent Row per Group
WITH RankedRecords AS (
SELECT
entity_id,
category_id,
status_code,
updated_at,
value,
ROW_NUMBER() OVER (
PARTITION BY entity_id, category_id, status_code
ORDER BY updated_at DESC
) AS rn
FROM records
)
SELECT *
FROM RankedRecords
WHERE rn = 1;
🗑️ Delete: Remove All But the Most Recent Record per Group
DELETE FROM records
WHERE (entity_id, category_id, status_code, updated_at) NOT IN (
WITH RankedRecords AS (
SELECT
entity_id,
category_id,
status_code,
updated_at,
ROW_NUMBER() OVER (
PARTITION BY entity_id, category_id, status_code
ORDER BY updated_at DESC
) AS rn
FROM records
)
SELECT
entity_id,
category_id,
status_code,
updated_at
FROM RankedRecords
WHERE rn = 1
);
🧱 SQL Example: Microsoft SQL Server
🔍 Filter: Select Most Updated Row per Group
WITH RankedRecords AS (
SELECT
entity_id,
category_id,
status_code,
updated_at,
value,
ROW_NUMBER() OVER (
PARTITION BY entity_id, category_id, status_code
ORDER BY updated_at DESC
) AS rn
FROM records
)
SELECT *
FROM RankedRecords
WHERE rn = 1;
🗑️ Delete: Remove Older Entries per Group
WITH CTE AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY entity_id, category_id, status_code
ORDER BY updated_at DESC
) AS RowNum
FROM records
)
DELETE FROM CTE
WHERE RowNum > 1;
✅ Why This Works
ROW_NUMBER()
creates a unique sequence per group.PARTITION BY
ensures separate rankings for each group.- Ordering by
updated_at DESC
ensures the most recent row is ranked first. - Filtering or deleting based on row number lets you efficiently de-duplicate by recency.
⚠️ Best Practices
- Ensure
updated_at
is non-null and properly indexed. - Always backup your table or test the query in a sandbox before deleting.
- Use composite indexes on grouping and date columns for better performance.
- When dealing with large datasets, consider batch deletes to avoid locks:
-- MSSQL batch delete example
WITH CTE AS (
SELECT TOP (1000)
*,
ROW_NUMBER() OVER (
PARTITION BY entity_id, category_id, status_code
ORDER BY updated_at DESC
) AS RowNum
FROM records
)
DELETE FROM CTE
WHERE RowNum > 1;
Conclusion
When you’re dealing with datasets that contain multiple versions of the same logical entity, it’s often critical to retain only the latest record per group. Using ROW_NUMBER()
with PARTITION BY
and ORDER BY
, you can elegantly solve this problem in Oracle and SQL Server.
This technique is widely applicable in data deduplication, reporting, data warehousing, and archiving scenarios — and it’s both performant and easy to maintain.