One of the most common mistakes when building Data Vault marts is joining satellites directly without filtering for the latest record. This produces duplicate rows and wrong results. This article explains why it happens, how to fix it, and how to build Point-in-Time (PIT) tables to solve it permanently.
The Problem: Satellite Joins Produce Duplicates
Satellites are insert-only. Every time a customer's name, email, or segment changes, a new row is added. The old row stays. This is what makes Data Vault auditable — but it also means a naive join returns too many rows.
-- sat_customer_details (insert-only history)
| CUSTOMER_HK | CUSTOMER_NAME | EMAIL | LOAD_DATETIME |
|-------------|---------------|------------------|---------------------|
| a1b2c3... | John Smith | john@old.com | 2025-01-15 06:00:00 |
| a1b2c3... | John Smith | john@new.com | 2025-06-01 06:00:00 | -- email changed
| a1b2c3... | John A. Smith | john@new.com | 2026-01-10 06:00:00 | -- name changed
-- WRONG: joining without filter
SELECT hub.CUSTOMER_ID, sat.CUSTOMER_NAME, sat.EMAIL
FROM hub_customer hub
JOIN sat_customer_details sat ON hub.CUSTOMER_HK = sat.CUSTOMER_HK;
-- Returns 3 rows for John Smith! Every historical version.
-- Your report shows the same customer 3 times.
This gets worse when you join multiple satellites. If one satellite holds 3 versions of a customer and another holds 4, the join returns 3 x 4 = 12 rows instead of 1.
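The blow-up is easy to reproduce. A minimal sketch using Python's sqlite3 as a stand-in warehouse (table and column names mirror the examples in this article; the data is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sat_customer_details (customer_hk TEXT, customer_name TEXT, load_datetime TEXT);
    CREATE TABLE sat_customer_address (customer_hk TEXT, city TEXT, load_datetime TEXT);
""")
# One customer: 3 versions in one satellite, 4 in the other
con.executemany("INSERT INTO sat_customer_details VALUES (?, ?, ?)",
                [("a1b2c3", f"name_v{i}", f"2025-0{i}-01") for i in range(1, 4)])
con.executemany("INSERT INTO sat_customer_address VALUES (?, ?, ?)",
                [("a1b2c3", f"city_v{i}", f"2025-0{i}-15") for i in range(1, 5)])

# Naive join: every version in one satellite pairs with every version in the other
rows = con.execute("""
    SELECT d.customer_name, a.city
    FROM sat_customer_details d
    JOIN sat_customer_address a ON d.customer_hk = a.customer_hk
""").fetchall()
print(len(rows))  # 12 rows (3 x 4) for a single customer
```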
Fix 1: ROW_NUMBER() Filter
The simplest fix — filter each satellite to only the latest record:
-- CORRECT: only the latest satellite record per customer
SELECT hub.CUSTOMER_ID, sat.CUSTOMER_NAME, sat.EMAIL
FROM hub_customer hub
JOIN sat_customer_details sat ON hub.CUSTOMER_HK = sat.CUSTOMER_HK
QUALIFY ROW_NUMBER() OVER (
PARTITION BY sat.CUSTOMER_HK
ORDER BY sat.LOAD_DATETIME DESC
) = 1;
-- Returns 1 row: "John A. Smith", "john@new.com"
-- The latest version only.
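The same latest-record filter can be checked end to end with sqlite3, which has window functions but no QUALIFY, so the ROW_NUMBER() moves into a subquery (data mirrors the satellite table above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sat_customer_details "
            "(customer_hk TEXT, customer_name TEXT, email TEXT, load_datetime TEXT)")
con.executemany("INSERT INTO sat_customer_details VALUES (?, ?, ?, ?)", [
    ("a1b2c3", "John Smith",    "john@old.com", "2025-01-15 06:00:00"),
    ("a1b2c3", "John Smith",    "john@new.com", "2025-06-01 06:00:00"),
    ("a1b2c3", "John A. Smith", "john@new.com", "2026-01-10 06:00:00"),
])
# Rank versions newest-first per customer, then keep only rank 1
rows = con.execute("""
    SELECT customer_name, email FROM (
        SELECT s.*, ROW_NUMBER() OVER (
            PARTITION BY customer_hk
            ORDER BY load_datetime DESC
        ) AS rn
        FROM sat_customer_details s
    ) WHERE rn = 1
""").fetchall()
print(rows)  # [('John A. Smith', 'john@new.com')]
```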
This works, but it has problems when you join multiple satellites:
-- Joining 3 satellites — QUALIFY gets messy
SELECT
hub.CUSTOMER_ID,
det.CUSTOMER_NAME,
addr.CITY,
risk.RISK_SCORE
FROM hub_customer hub
JOIN sat_customer_details det ON hub.CUSTOMER_HK = det.CUSTOMER_HK
JOIN sat_customer_address addr ON hub.CUSTOMER_HK = addr.CUSTOMER_HK
JOIN sat_customer_risk risk ON hub.CUSTOMER_HK = risk.CUSTOMER_HK
QUALIFY
det.LOAD_DATETIME = MAX(det.LOAD_DATETIME) OVER (PARTITION BY hub.CUSTOMER_HK)
AND addr.LOAD_DATETIME = MAX(addr.LOAD_DATETIME) OVER (PARTITION BY hub.CUSTOMER_HK)
AND risk.LOAD_DATETIME = MAX(risk.LOAD_DATETIME) OVER (PARTITION BY hub.CUSTOMER_HK);
-- Note: this compares against MAX() OVER, not ROW_NUMBER() = 1. After the join,
-- each satellite's timestamp repeats across many rows, so ROW_NUMBER() breaks
-- ties between identical timestamps arbitrarily. Requiring three independent
-- "= 1" flags to land on the same row can silently drop customers.
-- Even in this correct form:
-- 1. Hard to read
-- 2. Slow — 3 window functions running on every row
-- 3. Gets worse with every satellite you add
-- 4. You repeat this pattern in every mart query
Fix 2: Subquery Approach
An alternative that works on any database, including those without QUALIFY support (Postgres, MySQL, SQL Server):
-- Pre-filter each satellite with a subquery
SELECT hub.CUSTOMER_ID, det.CUSTOMER_NAME, addr.CITY, risk.RISK_SCORE
FROM hub_customer hub
JOIN sat_customer_details det ON hub.CUSTOMER_HK = det.CUSTOMER_HK
JOIN (
SELECT CUSTOMER_HK, MAX(LOAD_DATETIME) AS MAX_LDTS
FROM sat_customer_details GROUP BY CUSTOMER_HK
) det_latest
ON det.CUSTOMER_HK = det_latest.CUSTOMER_HK
AND det.LOAD_DATETIME = det_latest.MAX_LDTS
JOIN sat_customer_address addr ON hub.CUSTOMER_HK = addr.CUSTOMER_HK
JOIN (
SELECT CUSTOMER_HK, MAX(LOAD_DATETIME) AS MAX_LDTS
FROM sat_customer_address GROUP BY CUSTOMER_HK
) addr_latest
ON addr.CUSTOMER_HK = addr_latest.CUSTOMER_HK
AND addr.LOAD_DATETIME = addr_latest.MAX_LDTS
JOIN sat_customer_risk risk ON hub.CUSTOMER_HK = risk.CUSTOMER_HK
JOIN (
SELECT CUSTOMER_HK, MAX(LOAD_DATETIME) AS MAX_LDTS
FROM sat_customer_risk GROUP BY CUSTOMER_HK
) risk_latest
ON risk.CUSTOMER_HK = risk_latest.CUSTOMER_HK
AND risk.LOAD_DATETIME = risk_latest.MAX_LDTS;
-- Works, but now your query is 30 lines long for 3 satellites.
-- Imagine 8 satellites. This is not maintainable.
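The MAX()-plus-join pattern is portable enough to verify almost anywhere. A small sketch with sqlite3 and made-up data matching the earlier examples:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sat_customer_details "
            "(customer_hk TEXT, customer_name TEXT, email TEXT, load_datetime TEXT)")
con.executemany("INSERT INTO sat_customer_details VALUES (?, ?, ?, ?)", [
    ("a1b2c3", "John Smith",    "john@old.com", "2025-01-15 06:00:00"),
    ("a1b2c3", "John A. Smith", "john@new.com", "2026-01-10 06:00:00"),
])
# Join the satellite against its own per-key MAX(load_datetime)
rows = con.execute("""
    SELECT det.customer_name, det.email
    FROM sat_customer_details det
    JOIN (
        SELECT customer_hk, MAX(load_datetime) AS max_ldts
        FROM sat_customer_details GROUP BY customer_hk
    ) latest
      ON det.customer_hk = latest.customer_hk
     AND det.load_datetime = latest.max_ldts
""").fetchall()
print(rows)  # [('John A. Smith', 'john@new.com')]
```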
The Real Solution: PIT Tables (Point-in-Time)
A PIT table is a Business Vault table that pre-computes the latest satellite record for each hub entity. Instead of filtering satellites in every mart query, you do it once.
Building a PIT table in dbt
-- models/business_vault/pit_customer.sql
{{ config(materialized='table') }}
WITH hub AS (
SELECT CUSTOMER_HK, CUSTOMER_ID
FROM {{ ref('hub_customer') }}
),
-- Latest record from each satellite
latest_details AS (
SELECT *
FROM {{ ref('sat_customer_details') }}
QUALIFY ROW_NUMBER() OVER (
PARTITION BY CUSTOMER_HK
ORDER BY LOAD_DATETIME DESC
) = 1
),
latest_address AS (
SELECT *
FROM {{ ref('sat_customer_address') }}
QUALIFY ROW_NUMBER() OVER (
PARTITION BY CUSTOMER_HK
ORDER BY LOAD_DATETIME DESC
) = 1
),
latest_risk AS (
SELECT *
FROM {{ ref('sat_customer_risk') }}
QUALIFY ROW_NUMBER() OVER (
PARTITION BY CUSTOMER_HK
ORDER BY LOAD_DATETIME DESC
) = 1
)
SELECT
hub.CUSTOMER_HK,
hub.CUSTOMER_ID,
-- From sat_customer_details
det.CUSTOMER_NAME,
det.EMAIL,
det.PHONE,
det.LOAD_DATETIME AS DETAILS_LDTS,
-- From sat_customer_address
addr.STREET,
addr.CITY,
addr.STATE,
addr.COUNTRY,
addr.ZIP_CODE,
addr.LOAD_DATETIME AS ADDRESS_LDTS,
-- From sat_customer_risk
risk.RISK_SCORE,
risk.CREDIT_LIMIT,
risk.RISK_CATEGORY,
risk.LOAD_DATETIME AS RISK_LDTS,
-- The most recent update across all satellites. Each argument is wrapped in
-- COALESCE because GREATEST returns NULL if any input is NULL, and the LEFT
-- JOINs below leave NULLs for customers missing a satellite row.
GREATEST(
COALESCE(det.LOAD_DATETIME, '1900-01-01'::TIMESTAMP),
COALESCE(addr.LOAD_DATETIME, '1900-01-01'::TIMESTAMP),
COALESCE(risk.LOAD_DATETIME, '1900-01-01'::TIMESTAMP)
) AS LAST_UPDATED
FROM hub
LEFT JOIN latest_details det ON hub.CUSTOMER_HK = det.CUSTOMER_HK
LEFT JOIN latest_address addr ON hub.CUSTOMER_HK = addr.CUSTOMER_HK
LEFT JOIN latest_risk risk ON hub.CUSTOMER_HK = risk.CUSTOMER_HK
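One subtlety in a LAST_UPDATED column built this way: GREATEST() returns NULL as soon as any argument is NULL, and the LEFT JOINs do produce NULLs for a customer missing a satellite row, so each argument needs a COALESCE guard. SQLite's scalar MAX() shares the same NULL behavior, which makes the pitfall easy to demonstrate:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Simulate a customer with a details timestamp but no risk satellite row.
# SQLite's scalar MAX() behaves like GREATEST: one NULL poisons the result.
unguarded = con.execute("SELECT MAX('2026-01-10', NULL)").fetchone()[0]
print(unguarded)  # None: the whole LAST_UPDATED expression collapses

# COALESCE each argument to a floor date so the real maximum survives
guarded = con.execute(
    "SELECT MAX('2026-01-10', COALESCE(NULL, '1900-01-01'))"
).fetchone()[0]
print(guarded)  # 2026-01-10
```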
Now mart queries are simple
-- models/marts/dim_customer.sql
-- Clean, simple, no satellite filtering needed
SELECT
CUSTOMER_ID,
CUSTOMER_NAME,
EMAIL,
PHONE,
CITY,
STATE,
COUNTRY,
RISK_SCORE,
RISK_CATEGORY,
CREDIT_LIMIT,
LAST_UPDATED
FROM {{ ref('pit_customer') }}
-- That's it. No joins, no filtering. One row per customer.
-- models/marts/fct_sales.sql
-- Simple fact table using PIT
SELECT
hub_sale.SALE_ID,
pit_cust.CUSTOMER_NAME,
pit_cust.CITY,
pit_cust.RISK_SCORE,
pit_prod.PRODUCT_NAME,
pit_prod.CATEGORY,
sat_sale.AMOUNT,
sat_sale.SALE_DATE
FROM {{ ref('hub_sale') }} hub_sale
JOIN {{ ref('link_customer_sale') }} lcl ON hub_sale.SALE_HK = lcl.SALE_HK
JOIN {{ ref('pit_customer') }} pit_cust ON lcl.CUSTOMER_HK = pit_cust.CUSTOMER_HK
JOIN {{ ref('link_sale_product') }} lsp ON hub_sale.SALE_HK = lsp.SALE_HK
JOIN {{ ref('pit_product') }} pit_prod ON lsp.PRODUCT_HK = pit_prod.PRODUCT_HK
JOIN {{ ref('sat_sale_details') }} sat_sale ON hub_sale.SALE_HK = sat_sale.SALE_HK
QUALIFY ROW_NUMBER() OVER (
PARTITION BY sat_sale.SALE_HK
ORDER BY sat_sale.LOAD_DATETIME DESC
) = 1
Advanced: PIT Table with As-Of Date
The PIT table above shows the current state. But what if you need the state at a specific date? For example: "What did we know about this customer on January 15th?"
This is a PIT table with an as-of date:
-- models/business_vault/pit_customer_as_of.sql
-- Shows the customer state at any given snapshot date
{{ config(materialized='table') }}
-- Default to the run date. The value is spliced in as a quoted literal and
-- cast below, so it must be a real date string; the keyword CURRENT_DATE
-- inside quotes would fail the cast.
{% set as_of_date = var('as_of_date', run_started_at.strftime('%Y-%m-%d')) %}
WITH hub AS (
SELECT CUSTOMER_HK, CUSTOMER_ID
FROM {{ ref('hub_customer') }}
),
latest_details AS (
SELECT *
FROM {{ ref('sat_customer_details') }}
WHERE LOAD_DATETIME <= '{{ as_of_date }}'::TIMESTAMP
QUALIFY ROW_NUMBER() OVER (
PARTITION BY CUSTOMER_HK
ORDER BY LOAD_DATETIME DESC
) = 1
),
latest_address AS (
SELECT *
FROM {{ ref('sat_customer_address') }}
WHERE LOAD_DATETIME <= '{{ as_of_date }}'::TIMESTAMP
QUALIFY ROW_NUMBER() OVER (
PARTITION BY CUSTOMER_HK
ORDER BY LOAD_DATETIME DESC
) = 1
),
latest_risk AS (
SELECT *
FROM {{ ref('sat_customer_risk') }}
WHERE LOAD_DATETIME <= '{{ as_of_date }}'::TIMESTAMP
QUALIFY ROW_NUMBER() OVER (
PARTITION BY CUSTOMER_HK
ORDER BY LOAD_DATETIME DESC
) = 1
)
SELECT
hub.CUSTOMER_HK,
hub.CUSTOMER_ID,
det.CUSTOMER_NAME,
det.EMAIL,
addr.CITY,
addr.COUNTRY,
risk.RISK_SCORE,
risk.RISK_CATEGORY,
'{{ as_of_date }}'::DATE AS SNAPSHOT_DATE
FROM hub
LEFT JOIN latest_details det ON hub.CUSTOMER_HK = det.CUSTOMER_HK
LEFT JOIN latest_address addr ON hub.CUSTOMER_HK = addr.CUSTOMER_HK
LEFT JOIN latest_risk risk ON hub.CUSTOMER_HK = risk.CUSTOMER_HK
Run it with a specific date:
# What did we know about customers on January 15th?
dbt run --select pit_customer_as_of --vars '{"as_of_date": "2026-01-15"}'
# What did we know as of March 10th?
dbt run --select pit_customer_as_of --vars '{"as_of_date": "2026-03-10"}'
This is how you answer audit questions. "Show me every customer's risk score as of January 15th at the close of business" — one dbt run, one table, exact answer.
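Stripped of the dbt scaffolding, the as-of logic is just "filter to rows loaded on or before the snapshot date, then take the latest". A runnable sketch with sqlite3 and made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sat_customer_details "
            "(customer_hk TEXT, email TEXT, load_datetime TEXT)")
con.executemany("INSERT INTO sat_customer_details VALUES (?, ?, ?)", [
    ("a1b2c3", "john@old.com", "2025-01-15 06:00:00"),
    ("a1b2c3", "john@new.com", "2025-06-01 06:00:00"),
])

def email_as_of(as_of_date: str) -> str:
    # Latest record loaded on or before end-of-day on the as-of date
    return con.execute("""
        SELECT email FROM sat_customer_details
        WHERE customer_hk = 'a1b2c3' AND load_datetime <= ?
        ORDER BY load_datetime DESC LIMIT 1
    """, (as_of_date + " 23:59:59",)).fetchone()[0]

print(email_as_of("2025-03-01"))  # john@old.com: what we knew back then
print(email_as_of("2025-12-31"))  # john@new.com: includes the June update
```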
Using the dbtvault PIT Macro
The dbtvault package has a built-in pit macro that generates PIT tables automatically:
-- models/business_vault/pit_customer_auto.sql
{{ config(materialized='incremental') }}
{%- set yaml_metadata -%}
source_model: hub_customer
src_pk: CUSTOMER_HK
as_of_dates_table: as_of_date
satellites:
SAT_CUSTOMER_DETAILS:
pk:
PK: CUSTOMER_HK
ldts:
LDTS: LOAD_DATETIME
SAT_CUSTOMER_ADDRESS:
pk:
PK: CUSTOMER_HK
ldts:
LDTS: LOAD_DATETIME
SAT_CUSTOMER_RISK:
pk:
PK: CUSTOMER_HK
ldts:
LDTS: LOAD_DATETIME
# Stage table names below are illustrative; point them at your staging models
stage_tables_ldts:
STG_CUSTOMER_DETAILS: LOAD_DATETIME
STG_CUSTOMER_ADDRESS: LOAD_DATETIME
STG_CUSTOMER_RISK: LOAD_DATETIME
src_ldts: LOAD_DATETIME
{%- endset -%}
{% set metadata_dict = fromyaml(yaml_metadata) %}
{{ dbtvault.pit(
src_pk=metadata_dict["src_pk"],
as_of_dates_table=metadata_dict["as_of_dates_table"],
satellites=metadata_dict["satellites"],
stage_tables_ldts=metadata_dict["stage_tables_ldts"],
src_ldts=metadata_dict["src_ldts"],
source_model=metadata_dict["source_model"]
) }}
The dbtvault macro handles all the latest-record logic, as-of-date filtering, and incremental loading for you. Note that it expects an as-of dates table (here as_of_date): a simple calendar model listing every snapshot date the PIT should cover.
When to Use Each Approach
| Approach | When to use | Pros | Cons |
|---|---|---|---|
| ROW_NUMBER() in mart | Quick queries, 1-2 satellites | Simple, no extra tables | Duplicated logic, slow with many sats |
| Subquery filter | Databases without QUALIFY | Works everywhere | Verbose, hard to maintain |
| Manual PIT table | 3+ satellites per hub | Clean marts, fast queries, full control | Extra table to maintain |
| PIT with as-of date | Audit, compliance, historical analysis | Answer "what did we know at time X?" | Needs date parameter |
| dbtvault PIT macro | Production Data Vaults | Automated, incremental, battle-tested | Requires dbtvault package setup |
Summary
- Never join raw satellites without filtering — you'll get duplicate rows
- Use ROW_NUMBER() / QUALIFY for simple, one-off queries
- Build PIT tables when you have 3+ satellites per hub — filter once, use everywhere
- Add as-of date support when you need historical snapshots for audits or compliance
- Use the dbtvault PIT macro in production for automated, incremental PIT generation
PIT tables are the bridge between the Raw Vault (optimized for loading) and your marts (optimized for querying). They belong in the Business Vault layer and should be part of every Data Vault implementation.
Need help building your Data Vault?
We design and implement Data Vaults with proper PIT tables, Business Vault patterns, and clean marts — so your team gets fast, accurate queries from day one.
Get Started