One of the most common Data Vault modeling questions is: "This thing in my source exists — does it need its own hub?" The answer is almost always decided by one rule: does it have a business key? If yes, it's a hub. If no, its descriptive data belongs on a link satellite. Get this wrong and you'll spend months fighting fan-out joins, duplicate rows, and broken audit trails.
This article walks through the rule using a real example from the FMCG sales demo — order lines. It shows when an entity earns its own hub, when it doesn't, and exactly what goes wrong when you put the descriptive data in the wrong place.
The Rule: No Business Key, No Hub
In Data Vault 2.0, a hub represents a business concept that the enterprise identifies by a business key (BK) — something the business uses to refer to the entity independently. Customer number, product SKU, invoice number, order ID. These are BKs.
A link represents a relationship between two or more hubs. It records the fact that these entities are associated.
A satellite stores descriptive attributes — attached either to a hub (describing the entity) or to a link (describing the relationship).
So the question "does this need a hub?" reduces to: does the business identify this thing with its own key?
Entity has a BK? → HUB + SATELLITE
Entity has no BK? → LINK + LINK SATELLITE
That's the whole rule. The interesting part is recognizing which case you're in.
The FMCG Example: Order Line Has a BK
In the FMCG demo, an order has one or more order lines. Each line represents one product, at one quantity, at one unit price, inside one order. The source system assigns each line its own identifier — line_id.
-- Source: erp.order_lines
| line_id | order_id | product_sku | qty | unit_price |
|---------|----------|-------------|-----|------------|
| 10001 | ORD-77 | SKU-A | 5 | 2.49 |
| 10002 | ORD-77 | SKU-B | 12 | 1.20 |
| 10003 | ORD-77 | SKU-A | 3 | 2.49 | -- same SKU, different line
| 10004 | ORD-78 | SKU-A | 1 | 2.49 |
Notice row 3: the same order contains SKU-A twice (maybe one is a promotional unit, or a line was split for pricing). Because each line has its own line_id, the source can tell them apart.
That line_id is the business key. It means order_line is its own thing the business identifies — so it gets a hub:
-- models/raw_vault/hubs/hub_order_line.sql
{% raw %}{%- set source_model = "stg_order_lines" -%}
{%- set src_pk = "ORDER_LINE_HK" -%}
{%- set src_nk = "line_id" -%}
{%- set src_ldts = "LOAD_DATETIME" -%}
{%- set src_source = "RECORD_SOURCE" -%}
{{ automate_dv.hub(
src_pk=src_pk,
src_nk=src_nk,
src_ldts=src_ldts,
src_source=src_source,
source_model=source_model
) }}{% endraw %}
With hub_order_line in place, the model is clean:
The satellite sat_order_line hangs off hub_order_line — it describes the line, not the relationship. Two links connect the line to its order and its product. Everything is keyed on the line's hash key, which comes from line_id.
Query the quantity for a specific line:
SELECT h.line_id, s.qty, s.unit_price
FROM hub_order_line h
JOIN sat_order_line s ON h.order_line_hk = s.order_line_hk
WHERE h.line_id = '10003';
-- Returns exactly one row. Clean.
What If There Was No line_id?
Now imagine a worse source. The ERP only exports order headers and a flat list of products per order — no line identifier. All you have is:
-- Source: erp.order_items (no line_id!)
| order_id | product_sku | qty | unit_price |
|----------|-------------|-----|------------|
| ORD-77 | SKU-A | 5 | 2.49 |
| ORD-77 | SKU-B | 12 | 1.20 |
| ORD-77 | SKU-A | 3 | 2.49 | -- same SKU, different line? no way to tell
| ORD-78 | SKU-A | 1 | 2.49 |
The business doesn't identify "a line" independently here. The only thing that exists is the relationship between an order and a product, plus some attributes (quantity, unit price).
In this case there is no hub for order_line. The order–product relationship is already captured by a link:
-- lnk_order_product (the relationship itself)
| ORDER_PRODUCT_HK | ORDER_HK | PRODUCT_HK | LOAD_DATETIME |
|------------------|----------|------------|---------------|
| hash(ORD-77,SKU-A) | ORD-77 | SKU-A | 2026-04-16 06:00:00 |
| hash(ORD-77,SKU-B) | ORD-77 | SKU-B | 2026-04-16 06:00:00 |
| hash(ORD-78,SKU-A) | ORD-78 | SKU-A | 2026-04-16 06:00:00 |
And the descriptive attributes — quantity, unit price — attach as a link satellite:
-- sat_order_product (link satellite — describes the relationship)
| ORDER_PRODUCT_HK | QTY | UNIT_PRICE | HASHDIFF | LOAD_DATETIME |
|--------------------|-----|------------|----------|---------------------|
| hash(ORD-77,SKU-A) | 8 | 2.49 | h1 | 2026-04-16 06:00:00 |
| hash(ORD-77,SKU-B) | 12 | 1.20 | h2 | 2026-04-16 06:00:00 |
| hash(ORD-78,SKU-A) | 1 | 2.49 | h3 | 2026-04-16 06:00:00 |
That's the link satellite pattern. No hub for the intersection — the link itself is the thing, and the satellite describes the relationship.
The Fan-Out Trap
Here's where the rule earns its keep. Look at the no-BK example again — the source had two lines of SKU-A on order ORD-77 (quantities 5 and 3, total 8). The link satellite stores quantity 8 — not 5 and 3 separately.
Why? Because the link is (order_hk, product_hk). There is exactly one row in the link for every unique order+product pair. If the source lists SKU-A on ORD-77 twice, your load job has to decide: sum the quantities, pick one, or raise an error. The Data Vault model cannot represent two distinct rows for the same pair, because the hash key would collide.
This is the fan-out trap. It's usually caught late — after someone runs the first revenue report and the numbers don't match the source by a few percent. The model silently absorbed the duplicates.
The exact same data with a line_id works perfectly:
-- With line_id BK → hub + 2 rows
hub_order_line: line_id=10001, line_id=10003
sat_order_line: qty=5, qty=3
(no collision, both lines preserved)
-- Without line_id → link sat must collapse
lnk_order_product: (ORD-77, SKU-A) -- one row only
sat_order_product: qty=8 -- fused, detail lost
Rule of thumb: if the source can produce multiple distinct facts for the same combination of hubs, you need a business key to tell them apart. No BK means no hub means no way to distinguish them.
Recognizing Degenerate Keys
Sometimes the BK is hiding in plain sight. If the source doesn't give you a line_id but each line is uniquely determined by something like (order_id, line_sequence), that composite is your BK. Don't throw it away.
-- Source has line_sequence instead of line_id
| order_id | line_seq | product_sku | qty |
|----------|----------|-------------|-----|
| ORD-77 | 1 | SKU-A | 5 |
| ORD-77 | 2 | SKU-B | 12 |
| ORD-77 | 3 | SKU-A | 3 |
-- Business key = order_id || '|' || line_seq
-- This IS a BK. Build hub_order_line on it.
The test isn't "is there one column called id?" It's "can the business (and the source) uniquely identify this entity?" If yes, you have a BK — even if you have to build it from two or three source columns.
Link Satellite vs. Hub+Satellite: Real Differences
Both patterns store descriptive data. Both are insert-only. Both use LOAD_DATETIME and HASHDIFF. So what actually changes?
| Concern | Hub + Satellite | Link Satellite |
|---|---|---|
| Entity addressable by BK? | Yes — lookup by line_id |
No — only by composite of parent hubs |
| Multiple rows per hub combination? | Yes — each hub PK is unique | No — link PK collides on duplicates |
| Can attach more relationships? | Easy — build new link on existing hub | Hard — have to introduce a hub later |
| Historical changes to descriptors | New rows in sat, keyed on hub_hk | New rows in link sat, keyed on link_hk |
| Mart query shape | JOIN hub → sat directly | JOIN two hubs via link → sat |
| Typical use case | Invoices, orders, products, customers | Pricing per (customer, product), contract terms per (party, account) |
The third row matters most in practice. Once you have a hub, you can attach as many links and satellites as you want later — the hub is stable. Once you commit to "this is a link satellite," adding any new relationship requires refactoring: you have to introduce a hub, migrate the link sat into it, and update every downstream query.
When a Link Satellite Is Genuinely the Right Answer
Link satellites aren't a second-class citizen. They're the correct model when the descriptive data really belongs to the relationship, not to an entity. Classic examples:
- Pricing per customer+product. A customer has a negotiated price for a specific product. There's no "price entity" — the price is a fact about the relationship. Link sat on
lnk_customer_product. - Contract terms per party+account. A counterparty has specific terms for each account they hold. The terms describe the relationship, not a standalone thing. Link sat on
lnk_party_account. - Effectivity. When and how long a relationship is valid — start/end dates. This is explicitly an attribute of the relationship itself. This is why
eff_sat(effectivity satellite) is almost always a link satellite. - Same-as and master-hub matching scores. The confidence score between two hub records belongs to the match, not to either record. Link sat on the same-as link. See the FMCG same-as link for a live example.
In each of these, the source has no BK for the intersection — and the business doesn't treat the intersection as its own entity. The link is the thing.
Decision Framework
When you're looking at a new source entity, walk through these questions in order:
Checklist
- Can I name a business key? Single column or composite — it counts either way. If yes → hub.
- If no BK, can the source produce multiple rows per parent-hub combo? If yes, you're in fan-out territory — go back to the source team. You need a key, even a sequence number.
- If no BK and no duplicates are possible, the descriptive data describes the relationship — use a link satellite.
- Does the business treat this as an independent entity? If the business names it, references it in reports, assigns ownership to it — that's a strong signal there should be a BK somewhere. Find it.
- Will anything else link to this entity later? If yes, you'll need a hub eventually. Better to build it now than refactor.
Common Mistakes
1. Creating a hub with a surrogate key as the BK
Sometimes teams try to force a hub by using a surrogate sequence from the staging layer as the BK. This works mechanically but breaks the audit trail — you can't trace a hub record back to a business identifier, and you can't dedupe across sources. If the source doesn't have a natural key, the answer is usually a link satellite, not a fake hub.
2. Putting transactional attributes on a hub satellite when they describe a relationship
A classic: putting "discount percentage" on sat_customer when it's actually negotiated per-product. The attribute has nothing to do with the customer alone — it describes (customer, product). That's a link sat.
3. Not catching fan-out in testing
The fan-out trap doesn't throw errors. It silently collapses rows. Always include a test that compares COUNT(*) between the source feed and the loaded link satellite. If they don't match, you have a key problem.
-- dbt test: catch fan-out on link load
SELECT
(SELECT COUNT(*) FROM {% raw %}{{ ref('stg_order_items') }}{% endraw %}) AS source_rows,
(SELECT COUNT(*) FROM {% raw %}{{ ref('lnk_order_product') }}{% endraw %}) AS link_rows
-- If source_rows > link_rows, you're losing data to hash collisions.
Summary
- Business key decides it. BK → hub. No BK → link satellite.
- Composite keys count.
(order_id, line_seq)is still a BK. Build the hub on the composite. - Fan-out trap is real. No BK + multiple distinct rows per hub combo = silent data loss. Insist on a key from the source.
- Link satellites are the right answer for pricing per (customer, product), contract terms per (party, account), effectivity, and match scores. Don't force a hub.
- Hubs are easier to extend later than link satellites. When in doubt and a BK is plausible, favor the hub.
The FMCG demo at dbtvault-solutions.tech/demo/fmcg-sales/schema/ shows both patterns live — hub_order_line uses the BK pattern, and the lnk_customer_same_as link satellite shows the relationship-only pattern. Click any table on the schema page to see the dbt model.
Need help modeling a tricky source?
Most Data Vault modeling pain comes from ambiguous business keys and fan-out. We help teams get the hub/link/satellite decision right from day one — so the first marts are clean and the audit trail is complete.
Get Started