Staging — FMCG Pipeline | Data Vault 2.0

Staging

Staging is the transformation layer between raw source data and the Data Vault. These are dbt views — no data is physically stored. They compute everything the vault needs on the fly.

What Staging Does

Hash keys (HK) — MD5 hashes of business keys. hub_customer needs customer_hk = MD5(customer_id). Hash keys are the join keys everywhere in the vault.
Hashdiffs (HD) — MD5 hashes of descriptive columns. Used by satellites to detect changes — if the hashdiff changes, a new satellite row is inserted.
Composite keys — Hash keys for links combine multiple business keys: link_order_customer_hk = MD5(order_id || customer_id)
Load metadata — load_datetime and record_source columns added to every row.

Staging Models

stg_customers      — customer_hk, hashdiff for details + contact
stg_products       — product_hk, hashdiff for product attributes
stg_orders         — order_hk, customer_hk, hashdiff for status + financials
stg_order_lines    — order_line_hk, order_hk, product_hk
stg_invoices       — invoice_hk, order_hk, contract_hk, customer_hk
stg_payments       — payment_hk, invoice_hk, contract_hk, customer_hk
stg_warehouses     — warehouse_hk
stg_managers       — manager_hk
stg_contracts      — contract_hk, customer_hk, manager_hk

Why Views?

Staging models are materialized as views because they don't need to persist data. They are a virtual transformation that runs when downstream models read from them. This saves storage and avoids maintaining another copy of the data.

automate_dv Macros

We use automate_dv (formerly dbtvault) macros for staging. The stage macro generates all hashing, column mapping, and metadata in a single YAML config — no manual SQL needed.

← ERP Source Hubs →

Want a pipeline like this on your warehouse?

Build and maintain production Data Vault pipelines on dbt + Airflow.

Get in touch

Connect on LinkedIn