One of the most common mistakes when building Data Vault marts is joining satellites directly without filtering for the latest record. This produces duplicate rows and wrong results. This article explains why it happens, how to fix it, and how to build Point-in-Time (PIT) tables to solve it permanently.
The Problem: Satellite Joins Produce Duplicates
Satellites are insert-only. Every time a customer's name, email, or segment changes, a new row is added. The old row stays. This is what makes Data Vault auditable — but it also means a naive join returns too many rows.
-- sat_customer_details (insert-only history)
| CUSTOMER_HK | CUSTOMER_NAME | EMAIL | LOAD_DATETIME |
|-------------|---------------|------------------|---------------------|
| a1b2c3... | John Smith | john@old.com | 2025-01-15 06:00:00 |
| a1b2c3... | John Smith | john@new.com | 2025-06-01 06:00:00 | -- email changed
| a1b2c3... | John A. Smith | john@new.com | 2026-01-10 06:00:00 | -- name changed
-- WRONG: joining without filter
SELECT hub.CUSTOMER_ID, sat.CUSTOMER_NAME, sat.EMAIL
FROM hub_customer hub
JOIN sat_customer_details sat ON hub.CUSTOMER_HK = sat.CUSTOMER_HK;
-- Returns 3 rows for John Smith! Every historical version.
-- Your report shows the same customer 3 times.
This gets worse when you join multiple satellites. If one satellite holds 3 versions of a customer and another holds 4, the join returns 3 x 4 = 12 rows instead of 1.
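The blow-up is easy to reproduce. A minimal sketch using Python's sqlite3 as a stand-in warehouse (table and column names mirror the examples in this article; the data is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sat_customer_details (customer_hk TEXT, customer_name TEXT, load_datetime TEXT);
    CREATE TABLE sat_customer_address (customer_hk TEXT, city TEXT, load_datetime TEXT);
""")
# One customer: 3 versions in one satellite, 4 in the other
con.executemany("INSERT INTO sat_customer_details VALUES (?, ?, ?)",
                [("a1b2c3", f"name_v{i}", f"2025-0{i}-01") for i in range(1, 4)])
con.executemany("INSERT INTO sat_customer_address VALUES (?, ?, ?)",
                [("a1b2c3", f"city_v{i}", f"2025-0{i}-15") for i in range(1, 5)])

# Naive join: every version in one satellite pairs with every version in the other
rows = con.execute("""
    SELECT d.customer_name, a.city
    FROM sat_customer_details d
    JOIN sat_customer_address a ON d.customer_hk = a.customer_hk
""").fetchall()
print(len(rows))  # 12 rows (3 x 4) for a single customer
```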
Fix 1: ROW_NUMBER() Filter
The simplest fix — filter each satellite to only the latest record:
-- CORRECT: only the latest satellite record per customer
SELECT hub.CUSTOMER_ID, sat.CUSTOMER_NAME, sat.EMAIL
FROM hub_customer hub
JOIN sat_customer_details sat ON hub.CUSTOMER_HK = sat.CUSTOMER_HK
QUALIFY ROW_NUMBER() OVER (
PARTITION BY sat.CUSTOMER_HK
ORDER BY sat.LOAD_DATETIME DESC
) = 1;
-- Returns 1 row: "John A. Smith", "john@new.com"
-- The latest version only.
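The same latest-record filter can be checked end to end with sqlite3, which has window functions but no QUALIFY, so the ROW_NUMBER() moves into a subquery (data mirrors the satellite table above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sat_customer_details "
            "(customer_hk TEXT, customer_name TEXT, email TEXT, load_datetime TEXT)")
con.executemany("INSERT INTO sat_customer_details VALUES (?, ?, ?, ?)", [
    ("a1b2c3", "John Smith",    "john@old.com", "2025-01-15 06:00:00"),
    ("a1b2c3", "John Smith",    "john@new.com", "2025-06-01 06:00:00"),
    ("a1b2c3", "John A. Smith", "john@new.com", "2026-01-10 06:00:00"),
])
# Rank versions newest-first per customer, then keep only rank 1
rows = con.execute("""
    SELECT customer_name, email FROM (
        SELECT s.*, ROW_NUMBER() OVER (
            PARTITION BY customer_hk
            ORDER BY load_datetime DESC
        ) AS rn
        FROM sat_customer_details s
    ) WHERE rn = 1
""").fetchall()
print(rows)  # [('John A. Smith', 'john@new.com')]
```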
This works, but it has problems when you join multiple satellites:
-- Joining 3 satellites — QUALIFY gets messy
SELECT
hub.CUSTOMER_ID,
det.CUSTOMER_NAME,
addr.CITY,
risk.RISK_SCORE
FROM hub_customer hub
JOIN sat_customer_details det ON hub.CUSTOMER_HK = det.CUSTOMER_HK
JOIN sat_customer_address addr ON hub.CUSTOMER_HK = addr.CUSTOMER_HK
JOIN sat_customer_risk risk ON hub.CUSTOMER_HK = risk.CUSTOMER_HK
QUALIFY
det.LOAD_DATETIME = MAX(det.LOAD_DATETIME) OVER (PARTITION BY hub.CUSTOMER_HK)
AND addr.LOAD_DATETIME = MAX(addr.LOAD_DATETIME) OVER (PARTITION BY hub.CUSTOMER_HK)
AND risk.LOAD_DATETIME = MAX(risk.LOAD_DATETIME) OVER (PARTITION BY hub.CUSTOMER_HK);
-- Note: this compares against MAX() OVER, not ROW_NUMBER() = 1. After the join,
-- each satellite's timestamp repeats across many rows, so ROW_NUMBER() breaks
-- ties between identical timestamps arbitrarily. Requiring three independent
-- "= 1" flags to land on the same row can silently drop customers.
-- Even in this correct form:
-- 1. Hard to read
-- 2. Slow — 3 window functions running on every row
-- 3. Gets worse with every satellite you add
-- 4. You repeat this pattern in every mart query
Fix 2: Subquery Approach
An alternative that works on any database, including those without QUALIFY support (Postgres, MySQL, SQL Server):
-- Pre-filter each satellite with a subquery
SELECT hub.CUSTOMER_ID, det.CUSTOMER_NAME, addr.CITY, risk.RISK_SCORE
FROM hub_customer hub
JOIN sat_customer_details det ON hub.CUSTOMER_HK = det.CUSTOMER_HK
JOIN (
SELECT CUSTOMER_HK, MAX(LOAD_DATETIME) AS MAX_LDTS
FROM sat_customer_details GROUP BY CUSTOMER_HK
) det_latest
ON det.CUSTOMER_HK = det_latest.CUSTOMER_HK
AND det.LOAD_DATETIME = det_latest.MAX_LDTS
JOIN sat_customer_address addr ON hub.CUSTOMER_HK = addr.CUSTOMER_HK
JOIN (
SELECT CUSTOMER_HK, MAX(LOAD_DATETIME) AS MAX_LDTS
FROM sat_customer_address GROUP BY CUSTOMER_HK
) addr_latest
ON addr.CUSTOMER_HK = addr_latest.CUSTOMER_HK
AND addr.LOAD_DATETIME = addr_latest.MAX_LDTS
JOIN sat_customer_risk risk ON hub.CUSTOMER_HK = risk.CUSTOMER_HK
JOIN (
SELECT CUSTOMER_HK, MAX(LOAD_DATETIME) AS MAX_LDTS
FROM sat_customer_risk GROUP BY CUSTOMER_HK
) risk_latest
ON risk.CUSTOMER_HK = risk_latest.CUSTOMER_HK
AND risk.LOAD_DATETIME = risk_latest.MAX_LDTS;
-- Works, but now your query is 30 lines long for 3 satellites.
-- Imagine 8 satellites. This is not maintainable.
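The MAX()-plus-join pattern is portable enough to verify almost anywhere. A small sketch with sqlite3 and made-up data matching the earlier examples:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sat_customer_details "
            "(customer_hk TEXT, customer_name TEXT, email TEXT, load_datetime TEXT)")
con.executemany("INSERT INTO sat_customer_details VALUES (?, ?, ?, ?)", [
    ("a1b2c3", "John Smith",    "john@old.com", "2025-01-15 06:00:00"),
    ("a1b2c3", "John A. Smith", "john@new.com", "2026-01-10 06:00:00"),
])
# Join the satellite against its own per-key MAX(load_datetime)
rows = con.execute("""
    SELECT det.customer_name, det.email
    FROM sat_customer_details det
    JOIN (
        SELECT customer_hk, MAX(load_datetime) AS max_ldts
        FROM sat_customer_details GROUP BY customer_hk
    ) latest
      ON det.customer_hk = latest.customer_hk
     AND det.load_datetime = latest.max_ldts
""").fetchall()
print(rows)  # [('John A. Smith', 'john@new.com')]
```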
The Real Solution: PIT Tables (Point-in-Time)
A PIT table is a Business Vault table that pre-computes the latest satellite record for each hub entity. Instead of filtering satellites in every mart query, you do it once.
Building a PIT table in dbt
-- models/business_vault/pit_customer.sql
{{ config(materialized='table') }}
WITH hub AS (
SELECT CUSTOMER_HK, CUSTOMER_ID
FROM {{ ref('hub_customer') }}
),
-- Latest record from each satellite
latest_details AS (
SELECT *
FROM {{ ref('sat_customer_details') }}
QUALIFY ROW_NUMBER() OVER (
PARTITION BY CUSTOMER_HK
ORDER BY LOAD_DATETIME DESC
) = 1
),
latest_address AS (
SELECT *
FROM {{ ref('sat_customer_address') }}
QUALIFY ROW_NUMBER() OVER (
PARTITION BY CUSTOMER_HK
ORDER BY LOAD_DATETIME DESC
) = 1
),
latest_risk AS (
SELECT *
FROM {{ ref('sat_customer_risk') }}
QUALIFY ROW_NUMBER() OVER (
PARTITION BY CUSTOMER_HK
ORDER BY LOAD_DATETIME DESC
) = 1
)
SELECT
hub.CUSTOMER_HK,
hub.CUSTOMER_ID,
-- From sat_customer_details
det.CUSTOMER_NAME,
det.EMAIL,
det.PHONE,
det.LOAD_DATETIME AS DETAILS_LDTS,
-- From sat_customer_address
addr.STREET,
addr.CITY,
addr.STATE,
addr.COUNTRY,
addr.ZIP_CODE,
addr.LOAD_DATETIME AS ADDRESS_LDTS,
-- From sat_customer_risk
risk.RISK_SCORE,
risk.CREDIT_LIMIT,
risk.RISK_CATEGORY,
risk.LOAD_DATETIME AS RISK_LDTS,
-- The most recent update across all satellites. Each argument is wrapped in
-- COALESCE because GREATEST returns NULL if any input is NULL, and the LEFT
-- JOINs below leave NULLs for customers missing a satellite row.
GREATEST(
COALESCE(det.LOAD_DATETIME, '1900-01-01'::TIMESTAMP),
COALESCE(addr.LOAD_DATETIME, '1900-01-01'::TIMESTAMP),
COALESCE(risk.LOAD_DATETIME, '1900-01-01'::TIMESTAMP)
) AS LAST_UPDATED
FROM hub
LEFT JOIN latest_details det ON hub.CUSTOMER_HK = det.CUSTOMER_HK
LEFT JOIN latest_address addr ON hub.CUSTOMER_HK = addr.CUSTOMER_HK
LEFT JOIN latest_risk risk ON hub.CUSTOMER_HK = risk.CUSTOMER_HK
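One subtlety in a LAST_UPDATED column built this way: GREATEST() returns NULL as soon as any argument is NULL, and the LEFT JOINs do produce NULLs for a customer missing a satellite row, so each argument needs a COALESCE guard. SQLite's scalar MAX() shares the same NULL behavior, which makes the pitfall easy to demonstrate:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Simulate a customer with a details timestamp but no risk satellite row.
# SQLite's scalar MAX() behaves like GREATEST: one NULL poisons the result.
unguarded = con.execute("SELECT MAX('2026-01-10', NULL)").fetchone()[0]
print(unguarded)  # None: the whole LAST_UPDATED expression collapses

# COALESCE each argument to a floor date so the real maximum survives
guarded = con.execute(
    "SELECT MAX('2026-01-10', COALESCE(NULL, '1900-01-01'))"
).fetchone()[0]
print(guarded)  # 2026-01-10
```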
Now mart queries are simple
-- models/marts/dim_customer.sql
-- Clean, simple, no satellite filtering needed
SELECT
CUSTOMER_ID,
CUSTOMER_NAME,
EMAIL,
PHONE,
CITY,
STATE,
COUNTRY,
RISK_SCORE,
RISK_CATEGORY,
CREDIT_LIMIT,
LAST_UPDATED
FROM {{ ref('pit_customer') }}
-- That's it. No joins, no filtering. One row per customer.
-- models/marts/fct_sales.sql
-- Simple fact table using PIT
SELECT
hub_sale.SALE_ID,
pit_cust.CUSTOMER_NAME,
pit_cust.CITY,
pit_cust.RISK_SCORE,
pit_prod.PRODUCT_NAME,
pit_prod.CATEGORY,
sat_sale.AMOUNT,
sat_sale.SALE_DATE
FROM {{ ref('hub_sale') }} hub_sale
JOIN {{ ref('link_customer_sale') }} lcl ON hub_sale.SALE_HK = lcl.SALE_HK
JOIN {{ ref('pit_customer') }} pit_cust ON lcl.CUSTOMER_HK = pit_cust.CUSTOMER_HK
JOIN {{ ref('link_sale_product') }} lsp ON hub_sale.SALE_HK = lsp.SALE_HK
JOIN {{ ref('pit_product') }} pit_prod ON lsp.PRODUCT_HK = pit_prod.PRODUCT_HK
JOIN {{ ref('sat_sale_details') }} sat_sale ON hub_sale.SALE_HK = sat_sale.SALE_HK
QUALIFY ROW_NUMBER() OVER (
PARTITION BY sat_sale.SALE_HK
ORDER BY sat_sale.LOAD_DATETIME DESC
) = 1
Advanced: PIT Table with As-Of Date
The PIT table above shows the current state. But what if you need the state at a specific date? For example: "What did we know about this customer on January 15th?"
This is a PIT table with an as-of date:
-- models/business_vault/pit_customer_as_of.sql
-- Shows the customer state at any given snapshot date
{{ config(materialized='table') }}
-- Default to the run date. The value is spliced in as a quoted literal and
-- cast below, so it must be a real date string; the keyword CURRENT_DATE
-- inside quotes would fail the cast.
{% set as_of_date = var('as_of_date', run_started_at.strftime('%Y-%m-%d')) %}
WITH hub AS (
SELECT CUSTOMER_HK, CUSTOMER_ID
FROM {{ ref('hub_customer') }}
),
latest_details AS (
SELECT *
FROM {{ ref('sat_customer_details') }}
WHERE LOAD_DATETIME <= '{{ as_of_date }}'::TIMESTAMP
QUALIFY ROW_NUMBER() OVER (
PARTITION BY CUSTOMER_HK
ORDER BY LOAD_DATETIME DESC
) = 1
),
latest_address AS (
SELECT *
FROM {{ ref('sat_customer_address') }}
WHERE LOAD_DATETIME <= '{{ as_of_date }}'::TIMESTAMP
QUALIFY ROW_NUMBER() OVER (
PARTITION BY CUSTOMER_HK
ORDER BY LOAD_DATETIME DESC
) = 1
),
latest_risk AS (
SELECT *
FROM {{ ref('sat_customer_risk') }}
WHERE LOAD_DATETIME <= '{{ as_of_date }}'::TIMESTAMP
QUALIFY ROW_NUMBER() OVER (
PARTITION BY CUSTOMER_HK
ORDER BY LOAD_DATETIME DESC
) = 1
)
SELECT
hub.CUSTOMER_HK,
hub.CUSTOMER_ID,
det.CUSTOMER_NAME,
det.EMAIL,
addr.CITY,
addr.COUNTRY,
risk.RISK_SCORE,
risk.RISK_CATEGORY,
'{{ as_of_date }}'::DATE AS SNAPSHOT_DATE
FROM hub
LEFT JOIN latest_details det ON hub.CUSTOMER_HK = det.CUSTOMER_HK
LEFT JOIN latest_address addr ON hub.CUSTOMER_HK = addr.CUSTOMER_HK
LEFT JOIN latest_risk risk ON hub.CUSTOMER_HK = risk.CUSTOMER_HK
Run it with a specific date:
# What did we know about customers on January 15th?
dbt run --select pit_customer_as_of --vars '{"as_of_date": "2026-01-15"}'
# What did we know as of March 10th?
dbt run --select pit_customer_as_of --vars '{"as_of_date": "2026-03-10"}'
This is how you answer audit questions. "Show me every customer's risk score as of January 15th at the close of business" — one dbt run, one table, exact answer.
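Stripped of the dbt scaffolding, the as-of logic is just "filter to rows loaded on or before the snapshot date, then take the latest". A runnable sketch with sqlite3 and made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sat_customer_details "
            "(customer_hk TEXT, email TEXT, load_datetime TEXT)")
con.executemany("INSERT INTO sat_customer_details VALUES (?, ?, ?)", [
    ("a1b2c3", "john@old.com", "2025-01-15 06:00:00"),
    ("a1b2c3", "john@new.com", "2025-06-01 06:00:00"),
])

def email_as_of(as_of_date: str) -> str:
    # Latest record loaded on or before end-of-day on the as-of date
    return con.execute("""
        SELECT email FROM sat_customer_details
        WHERE customer_hk = 'a1b2c3' AND load_datetime <= ?
        ORDER BY load_datetime DESC LIMIT 1
    """, (as_of_date + " 23:59:59",)).fetchone()[0]

print(email_as_of("2025-03-01"))  # john@old.com: what we knew back then
print(email_as_of("2025-12-31"))  # john@new.com: includes the June update
```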
Using the dbtvault PIT Macro
The dbtvault package has a built-in pit macro that generates PIT tables automatically:
-- models/business_vault/pit_customer_auto.sql
{{ config(materialized='incremental') }}
{%- set yaml_metadata -%}
source_model: hub_customer
src_pk: CUSTOMER_HK
as_of_dates_table: as_of_date
satellites:
SAT_CUSTOMER_DETAILS:
pk:
PK: CUSTOMER_HK
ldts:
LDTS: LOAD_DATETIME
SAT_CUSTOMER_ADDRESS:
pk:
PK: CUSTOMER_HK
ldts:
LDTS: LOAD_DATETIME
SAT_CUSTOMER_RISK:
pk:
PK: CUSTOMER_HK
ldts:
LDTS: LOAD_DATETIME
# Stage table names below are illustrative; point them at your staging models
stage_tables_ldts:
STG_CUSTOMER_DETAILS: LOAD_DATETIME
STG_CUSTOMER_ADDRESS: LOAD_DATETIME
STG_CUSTOMER_RISK: LOAD_DATETIME
src_ldts: LOAD_DATETIME
{%- endset -%}
{% set metadata_dict = fromyaml(yaml_metadata) %}
{{ dbtvault.pit(
src_pk=metadata_dict["src_pk"],
as_of_dates_table=metadata_dict["as_of_dates_table"],
satellites=metadata_dict["satellites"],
stage_tables_ldts=metadata_dict["stage_tables_ldts"],
src_ldts=metadata_dict["src_ldts"],
source_model=metadata_dict["source_model"]
) }}
The dbtvault macro handles all the latest-record logic, as-of-date filtering, and incremental loading for you. Note that it expects an as-of dates table (here as_of_date): a simple calendar model listing every snapshot date the PIT should cover.
When to Use Each Approach
| Approach | When to use | Pros | Cons |
|---|---|---|---|
| ROW_NUMBER() in mart | Quick queries, 1-2 satellites | Simple, no extra tables | Duplicated logic, slow with many sats |
| Subquery filter | Databases without QUALIFY | Works everywhere | Verbose, hard to maintain |
| Manual PIT table | 3+ satellites per hub | Clean marts, fast queries, full control | Extra table to maintain |
| PIT with as-of date | Audit, compliance, historical analysis | Answer "what did we know at time X?" | Needs date parameter |
| dbtvault PIT macro | Production Data Vaults | Automated, incremental, battle-tested | Requires dbtvault package setup |
Summary
- Never join raw satellites without filtering — you'll get duplicate rows
- Use ROW_NUMBER() / QUALIFY for simple, one-off queries
- Build PIT tables when you have 3+ satellites per hub — filter once, use everywhere
- Add as-of date support when you need historical snapshots for audits or compliance
- Use the dbtvault PIT macro in production for automated, incremental PIT generation
PIT tables are the bridge between the Raw Vault (optimized for loading) and your marts (optimized for querying). They belong in the Business Vault layer and should be part of every Data Vault implementation.
Need help building your Data Vault?
We design and implement Data Vaults with proper PIT tables, Business Vault patterns, and clean marts — so your team gets fast, accurate queries from day one.
Get Started