Join Tables Without Inflating Data

Elisha Antunes
Jun 7
3 min read

Updated: Jun 22

One of the earliest SQL lessons that stuck with me — and one that continues to surface in many data quality investigations — is the importance of managing joins carefully. Especially in insurance data, where you’re often working with one-to-many relationships, it's easy to inflate totals or duplicate rows without realizing it.

In this post, I’ll walk through a real-world scenario I encountered while reporting on policy premiums. Everything looked fine until I joined the policies table to quote_events and quote_event_venues, and suddenly my total premiums were triple what they should have been.

The Problem: Multiplying Rows With Naive Joins

Let’s say you’re working with these tables:

policies: each row represents a bound transaction (ADD, CHG, CNL)
quote_events: one policy may have multiple events
quote_event_venues: one event may have multiple venues

At first glance, a simple join across these tables seems fine:

SELECT p.policy_number, p.premium, v.venue_name
FROM policies p
JOIN quote_events e ON p.policy_id = e.policy_id
JOIN quote_event_venues v ON e.event_id = v.event_id;

This query works — but it multiplies the number of rows. If a policy has two events and each event has three venues, you get six rows for one policy. That’s fine if you’re looking at venue-level data, but if you're aggregating premium, you've got a serious problem.

The Symptom: Inflated Premium Totals

Let’s say you try to calculate total premium per policy:

SELECT p.policy_number, SUM(p.premium)
FROM policies p
JOIN quote_events e ON p.policy_id = e.policy_id
JOIN quote_event_venues v ON e.event_id = v.event_id
GROUP BY p.policy_number;

Suddenly, your numbers are way higher than expected. You start second-guessing the data itself. But the truth is — your join is the problem.

Another related problem is the assumption that every policy’s premium is correctly represented by a single row. But in systems where ADD and CHG transactions both contribute to the policy history, a simple SUM(premium) can mislead.

If a policy had an initial ADD transaction for $500 and a subsequent CHG that raised it to $800, and both rows are included in the aggregation, the result could show $1,300 instead of the final premium of $800.

The Fix: Aggregate Before You Join

The solution is often to reduce the granularity before performing joins — especially if you don’t actually need row-level detail from the child tables.

Here’s a safer approach using subqueries or CTEs:

WITH venue_summary AS (
  SELECT e.policy_id,
         COUNT(DISTINCT v.venue_name) AS venue_count
  FROM quote_events e
  JOIN quote_event_venues v ON e.event_id = v.event_id
  GROUP BY e.policy_id
),
latest_premiums AS (
  SELECT policy_id, policy_number, premium
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY policy_id ORDER BY creation_date DESC) AS rn
    FROM policies
  ) t
  WHERE rn = 1
)
SELECT lp.policy_number, lp.premium, vs.venue_count
FROM latest_premiums lp
LEFT JOIN venue_summary vs ON lp.policy_id = vs.policy_id;

This structure ensures that we’re only pulling the most recent version of each policy’s premium, avoiding overcounting due to multiple transactions.

Here is an example of what this would look like:

policy_number	premium	venue_count
ABC100001	101	2
ABC100002	424	1
ABC100003	182	3
ABC100004	568	0
ABC100005	530	1

Practical Use Cases

Generate accurate reports on earned premiums
Reconcile submission volumes against actual bound policies
Pinpoint which CHG transactions affected policy financials
Prevent inflated row counts in exports or downstream reports

Key Takeaways

This is a common trap — not because the SQL is wrong, but because the data grain shifts quietly. One-to-many joins can multiply rows without warning, especially in complex policy data.

Be mindful of row multiplication when joining to detail tables
Control the data grain before aggregating
Use CTEs or subqueries to summarize before joining
Validate the joins by testing against a known single record