top of page

Deduplicating Data with SQL

  • Writer: Elisha Antunes
    Elisha Antunes
  • Jun 7
  • 3 min read

Updated: Jun 24

One of the earliest challenges I encountered while working with policy data was understanding why seemingly duplicate records were showing up in my reports. I would write what I thought was a simple join across tables like policies, quote_events, and quote_event_venues, only to receive far more rows than I anticipated.


At first, I suspected a data issue. But as I examined the structure more closely, I realized the data was fine — the problem was in how I was querying it.


In this post, I’ll walk through:

  • Why SQL joins can cause unexpected row multiplication

  • Why DISTINCT often isn’t enough to fix it

  • How to use STRING_AGG() and grouping to consolidate detail

  • How CTEs help manage complexity and improve query structure


The Problem

Imagine you're working with a dataset that includes policy-level data, event details, and venue information. These are classic one-to-many relationships — a single policy may have multiple events, and each event may be associated with multiple venues.


If you join these tables without managing the granularity, you’ll end up with a row for every combination of policy, event, and venue. For example:

SELECT *
FROM policies p
JOIN quote_events e ON p.policy_id = e.policy_id
JOIN quote_event_venues v ON e.event_id = v.event_id;

This join is technically correct, but it multiplies rows. If a policy has two events and each event has three venues, you’ll end up with six rows. The data isn’t wrong — it’s just more detailed than expected.


Why DISTINCT Isn’t a Long-Term Solution

Initially, I tried to resolve this issue by adding DISTINCT to my query:

SELECT DISTINCT p.policy_number, p.premium
FROM policies p
JOIN quote_events e ON p.policy_id = e.policy_id
JOIN quote_event_venues v ON e.event_id = v.event_id;

This may look like a safe fix — but here’s an example of what it might return:

policy_number

premium

event_type

venue_address

ABC100101

550.00

Concert

123 Main St, Austin, TX

ABC100101

550.00

Conference

456 Elm Rd, Dallas, TX

ABC100101

550.00

Festival

789 Oak Ave, Houston, TX

Even though the SELECT clause only includes policy_number and premium, the join still expands the dataset because each policy is associated with multiple venues through the event relationship. Without addressing that granularity, you're left with repeated rows — and limited insight.


This reduced the number of rows, but it also masked the detail I needed. I could no longer see which venues were associated with each policy. While DISTINCT can be useful, it’s not the right tool when you need to retain detail without duplication.


Aggregating the Right Way with STRING_AGG()

What I actually needed was a way to represent the data at the policy level while still capturing the underlying detail. That’s where STRING_AGG() came in. Instead of creating multiple rows, I could concatenate related details into a single string.

SELECT
  p.policy_number,
  p.premium,
  STRING_AGG(DISTINCT CONCAT_WS(', ', v.venue_name, v.venue_city, v.venue_state), ' | ') AS venue_list
FROM policies p
JOIN quote_events e ON p.policy_id = e.policy_id
JOIN quote_event_venues v ON e.event_id = v.event_id
GROUP BY p.policy_number, p.premium;

And here’s how the output changes with this aggregation in place:

policy_number

premium

venue_list

ABC100101

550.00

123 Main St, Austin, TX

This approach provided exactly what I needed: one row per policy with a readable list of venues.


Structuring the Query with CTEs

As queries become more complex, maintaining clarity becomes important. That’s when I started using Common Table Expressions (CTEs) to isolate each piece of logic. Here’s how I applied it in this context:

WITH venue_summary AS (
  SELECT e.policy_id,
         STRING_AGG(DISTINCT CONCAT_WS(', ', v.venue_name, v.venue_city, v.venue_state), ' | ') AS venues
  FROM quote_events e
  JOIN quote_event_venues v ON e.event_id = v.event_id
  GROUP BY e.policy_id
)
SELECT
  p.policy_number,
  p.premium,
  vs.venues
FROM policies p
LEFT JOIN venue_summary vs ON p.policy_id = vs.policy_id;

Breaking the query into modular parts made it more readable, easier to maintain, and less error-prone.


Why This Pattern Matters

This lesson helped me better understand how to manage one-to-many joins and preserve the level of detail that matters. It also reinforced the idea that reporting requirements often call for summarized detail — not a flattened join.


Since then, I’ve used this pattern in multiple places:

  • Summarizing event types for a policy

  • Displaying venue and coverage data on a single row

  • Preventing accidental inflation in aggregate values like premiums or attendance


Takeaways

When you encounter row multiplication in SQL, it’s usually a sign that your joins are bringing in more detail than your output format requires. Instead of masking the issue with DISTINCT, the better approach is to reshape the data using aggregation.


That mindset has helped me write cleaner queries, build more reliable reports, and avoid misleading summaries.


My key takeaways:

  • One-to-many joins increase row count unless managed explicitly

  • DISTINCT may reduce rows, but can also strip out necessary detail

  • Use STRING_AGG() and GROUP BY to retain insight while avoiding duplication

  • Use CTEs to modularize and clarify your query structure


Recent Posts

See All
Join Tables Without Inflating Data

One of the earliest SQL lessons that stuck with me — and one that continues to surface in many data quality investigations — is the...

 
 
Tracking Premium Changes Over Time

Premiums don’t always stay fixed after a policy is bound — especially in specialty insurance. As endorsements (change transactions) are...

 
 
bottom of page