Staging is the transformation layer between raw source data and the Data Vault. These are dbt views — no data is physically stored. They compute everything the vault needs on the fly.
What Staging Does
- Hash keys (HK) — MD5 hashes of business keys.
hub_customerneedscustomer_hk = MD5(customer_id). Hash keys are the join keys everywhere in the vault. - Hashdiffs (HD) — MD5 hashes of descriptive columns. Used by satellites to detect changes — if the hashdiff changes, a new satellite row is inserted.
- Composite keys — Hash keys for links combine multiple business keys:
link_order_customer_hk = MD5(order_id || customer_id) - Load metadata —
load_datetimeandrecord_sourcecolumns added to every row.
Staging Models
stg_customers — customer_hk, hashdiff for details + contact stg_products — product_hk, hashdiff for product attributes stg_orders — order_hk, customer_hk, hashdiff for status + financials stg_order_lines — order_line_hk, order_hk, product_hk stg_invoices — invoice_hk, order_hk, contract_hk, customer_hk stg_payments — payment_hk, invoice_hk, contract_hk, customer_hk stg_warehouses — warehouse_hk stg_managers — manager_hk stg_contracts — contract_hk, customer_hk, manager_hk
Why Views?
Staging models are materialized as views because they don't need to persist data. They are a virtual transformation that runs when downstream models read from them. This saves storage and avoids maintaining another copy of the data.
automate_dv Macros
We use automate_dv (formerly dbtvault) macros for staging. The stage macro generates all hashing, column mapping, and metadata in a single YAML config — no manual SQL needed.