It's Monday morning. You get a message: "The CRM export for January 15th had bad data. We've fixed it in the source system. Can you reload?"
Most teams handle this poorly — not because they do the wrong thing on purpose, but because they don't realize Airflow already has the right tool built in.
## The Problem: Source Data Was Corrected
Your daily pipeline ran on January 16th, processing data for the logical date of January 15th. The satellite captured this customer record:
| CUSTOMER_HK | HASHDIFF | NAME | TIER | LOAD_DATETIME | EFFECTIVE_FROM |
|---|---|---|---|---|---|
| a1b2c3... | x7y8z9... | Acme Corp | Silver | 2026-01-16 06:00:00 | 2026-01-15 |
Now the source team tells you: "Acme Corp should be Gold tier, not Silver. We've fixed it." You need to reload. What do most teams do?
## What Most Teams Do Wrong
In practice, teams rarely do a clean re-run. Instead, they reach for workarounds:
### 1. They do nothing
The most common reaction is "we'll catch it in the next run." But the next scheduled run processes tomorrow's data window, not January 15th. The correction sits in the source system and never reaches the vault. The bad data stays forever.
### 2. They build a manual fix
Someone writes a one-off script or a custom DAG with hardcoded parameters to reload just this one date. This works once, but:
- It's manual work every time a correction happens
- The custom DAG doesn't follow the same logic as the production pipeline — subtle differences creep in
- There's no record in Airflow's history that this logical date was reprocessed
- Next time someone has to figure out what happened, there's no audit trail
### 3. They use `CURRENT_TIMESTAMP()` as `LOAD_DATETIME`

Some teams do re-run the correct date, but their staging model stamps `CURRENT_TIMESTAMP()` or `{{ ts }}` (the logical date) as `LOAD_DATETIME`. This means:

- With `CURRENT_TIMESTAMP()` — different models within the same run get different timestamps, making it impossible to trace which records belong to which DAG run
- With `{{ ts }}` — the re-run produces the exact same `LOAD_DATETIME` as the original run, so you can't tell the data was ever reloaded
All three approaches share the same root cause: not using what Airflow already gives you.
## Rule #1: Every DAG Run Must Be Green
Before we talk about the right approach, there's a fundamental principle that makes all of this work:
> In a Data Vault pipeline, every DAG run must end in success (green). A red (failed) run is not just an ops issue — it's a data gap.
Here's why:
- Each DAG run corresponds to one logical date — one business day of data
- A failed run means that day's data never reached the vault
- Your satellites have a hole — queries for that date return stale or missing data
- PIT tables produce wrong results for any date in the gap
If you use `depends_on_past: True` in your DAG (and you should), a failed run blocks all future runs. The pipeline stops until you fix it. This feels painful, but it's correct — it forces you to deal with data gaps immediately instead of accumulating silent errors.
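This discipline can be sketched as a minimal set of DAG-level settings. The `dag_id`, schedule, and retry values below are illustrative assumptions, not taken from any real pipeline:

```python
from datetime import datetime, timedelta

# Illustrative values — adjust dag_id, schedule, and retries to your pipeline.
default_args = {
    "depends_on_past": True,       # a red run blocks every later logical date
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

dag_kwargs = dict(
    dag_id="data_vault_pipeline",  # hypothetical name
    start_date=datetime(2026, 1, 1),
    schedule="@daily",             # one run per business day
    catchup=True,                  # missed logical dates are created automatically
    max_active_runs=1,             # process logical dates strictly in order
    default_args=default_args,
)

# In a real deployment: with DAG(**dag_kwargs) as dag: ...
```

With `depends_on_past: True` plus `max_active_runs=1`, the grid fills strictly left to right — a red square halts everything after it until you Clear and re-run.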
The workflow is:
- DAG run fails → status is red
- You investigate and fix the root cause (source system down, credentials expired, schema changed, etc.)
- In the Airflow UI, you Clear the failed task → Airflow re-runs it
- Same logical date, new `dag_run.start_date`
- Satellite HASHDIFF checks for changes → loads if needed
- DAG run goes green → pipeline continues with the next date
Airflow's Clear button is not just an ops convenience — it's the correct mechanism for maintaining your Data Vault's integrity. Your Airflow Grid View should be a wall of green. Every green square = that business date is fully loaded. Every red square = a data gap that needs immediate attention.
This same mechanism — Clear and re-run — is exactly what you use when source data is corrected. Which brings us to the right approach.
## The Right Approach: Clear and Re-run
Back to our scenario: Acme Corp's tier was wrong on January 15th. The source team fixed it. You go to the Airflow UI, find the January 15th DAG run, and Clear it. Airflow re-runs the pipeline for that date. Here's what happens:
- The `logical_date` is still January 15th
- The `WHERE` clause selects the same source data window — but now with the corrected data
- `EFFECTIVE_FROM` = January 15th (unchanged — correct business date)
- `LOAD_DATETIME` = March 14th 10:30 (new — when the correction was loaded)
- The satellite's HASHDIFF detects that "Silver" changed to "Gold" and inserts a new record
Your satellite now looks like this:
| CUSTOMER_HK | HASHDIFF | TIER | LOAD_DATETIME | EFFECTIVE_FROM |
|---|---|---|---|---|
| a1b2c3... | x7y8z9... | Silver | 2026-01-16 06:00:00 | 2026-01-15 |
| a1b2c3... | m4n5o6... | Gold | 2026-03-14 10:30:00 | 2026-01-15 |
This tells the true story: "Customer was loaded as Silver on January 16th, but the data was corrected and reloaded on March 14th as Gold. Both records are effective from January 15th."
You have a complete audit trail. You can see what the data was, when it was corrected, and that both versions apply to the same business date.
## What If Nothing Changed?
If you re-run a DAG but the source data hasn't actually changed, nothing happens. The satellite macro compares the incoming HASHDIFF against the latest record. Same hash = no insert. The re-run is harmless.
This is why re-running is always safe:
| Scenario | HASHDIFF | Result |
|---|---|---|
| Source data unchanged | Same as latest record | No insert — satellite unchanged |
| Source data corrected | Different from latest record | New row inserted with new LOAD_DATETIME |
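The insert-if-changed behavior in the table above can be simulated in a few lines of plain Python. `hashdiff` and `load_satellite` are illustrative stand-ins for the real satellite macro and table, not actual dbt or Data Vault package code:

```python
import hashlib

def hashdiff(payload: dict) -> str:
    """Hash the descriptive attributes in a fixed column order."""
    concat = "||".join(str(payload[k]) for k in sorted(payload))
    return hashlib.md5(concat.encode()).hexdigest()

def load_satellite(satellite: list, hk: str, payload: dict,
                   load_dt: str, effective_from: str) -> bool:
    """Insert only if the incoming hashdiff differs from the latest record."""
    latest = max(
        (r for r in satellite if r["CUSTOMER_HK"] == hk),
        key=lambda r: r["LOAD_DATETIME"],
        default=None,
    )
    hd = hashdiff(payload)
    if latest is not None and latest["HASHDIFF"] == hd:
        return False  # same hash as latest record — re-run is a no-op
    satellite.append({
        "CUSTOMER_HK": hk, "HASHDIFF": hd, **payload,
        "LOAD_DATETIME": load_dt, "EFFECTIVE_FROM": effective_from,
    })
    return True

sat = []
# Original run on Jan 16: Silver is inserted
load_satellite(sat, "a1b2c3", {"NAME": "Acme Corp", "TIER": "Silver"},
               "2026-01-16 06:00:00", "2026-01-15")
# Re-run with unchanged data: nothing happens
assert not load_satellite(sat, "a1b2c3", {"NAME": "Acme Corp", "TIER": "Silver"},
                          "2026-02-01 08:00:00", "2026-01-15")
# Re-run after the correction: new row, same EFFECTIVE_FROM, new LOAD_DATETIME
assert load_satellite(sat, "a1b2c3", {"NAME": "Acme Corp", "TIER": "Gold"},
                      "2026-03-14 10:30:00", "2026-01-15")
assert len(sat) == 2
```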
## The Three Columns You Need
This pattern works because of three distinct columns in your staging model, each serving a different purpose:
| Column | Airflow Source | Meaning | Changes on Re-run? |
|---|---|---|---|
| LOGICAL_DATE | `{{ ds }}` | Which business day this batch belongs to | No |
| LOAD_DATETIME | `{{ dag_run.start_date }}` | When the data was physically loaded into the vault | Yes — new timestamp each run |
| EFFECTIVE_FROM | `{{ ds }}` (logical date) | From which business date this record is valid | No |
In your dbt staging model:
```sql
SELECT
    *,
    TO_DATE('{{ var('logical_date') }}') AS LOGICAL_DATE,
    TO_TIMESTAMP_NTZ('{{ var('load_datetime') }}') AS LOAD_DATETIME,
    TO_DATE('{{ var('logical_date') }}') AS EFFECTIVE_FROM
FROM staging
```
And in your Airflow DAG:
```python
dbt_staging = BashOperator(
    task_id="dbt_staging",
    bash_command=(
        f"cd {DBT_DIR} && dbt run "
        "--select tag:staging "
        "--vars '"
        '{"load_datetime": "'
        '{{ dag_run.start_date.astimezone(dag.timezone)'
        '.strftime("%Y-%m-%d %H:%M:%S") }}", '
        '"logical_date": "'
        '{{ execution_date.in_timezone("Europe/Moscow")'
        '.strftime("%Y-%m-%d") }}"}'
        "'"
    ),
)
```
Note the timezone handling: `dag_run.start_date` and `execution_date` are UTC by default (in Airflow 2.2+, `execution_date` is deprecated in favor of `logical_date`). Convert them to your warehouse timezone to avoid confusion.
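The conversion can be sketched with the standard library. `Europe/Moscow` is used here only because the DAG above uses it; substitute your warehouse timezone:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# dag_run.start_date arrives as a timezone-aware UTC datetime
start_date_utc = datetime(2026, 3, 14, 7, 30, tzinfo=timezone.utc)

# Convert to the warehouse timezone before stamping LOAD_DATETIME
load_datetime = start_date_utc.astimezone(ZoneInfo("Europe/Moscow"))
print(load_datetime.strftime("%Y-%m-%d %H:%M:%S"))  # 2026-03-14 10:30:00
```

Skipping this step stores the UTC wall clock instead, and the satellite's `LOAD_DATETIME` drifts from what operators see in the warehouse.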
## Backfill: When You Missed Multiple Days
What if your pipeline was down for 3 days? Use Airflow's backfill command:
```bash
airflow dags backfill data_vault_pipeline \
    --start-date 2026-03-10 \
    --end-date 2026-03-12 \
    --reset-dagruns
```
Airflow creates one DAG run per logical date. Each run:
- Gets its own `logical_date` (March 10, 11, 12)
- Gets `dag_run.start_date` = when the backfill actually runs (e.g., March 14th)
- Selects the correct source data window for that day
- HASHDIFF prevents duplicates if any of those days were already partially loaded
The same principle applies: the logical date defines what data you're processing; `dag_run.start_date` records when you processed it.
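The per-run split can be sketched in plain Python. `backfill_runs` is a hypothetical helper that mimics what the scheduler produces, not an Airflow API:

```python
from datetime import date, datetime, timedelta

def backfill_runs(start: date, end: date, started_at: datetime) -> list:
    """One run per logical date; all share the wall-clock start timestamp."""
    runs, d = [], start
    while d <= end:
        runs.append({
            "logical_date": d.isoformat(),  # which source window to process
            "load_datetime": started_at.strftime("%Y-%m-%d %H:%M:%S"),
        })
        d += timedelta(days=1)
    return runs

runs = backfill_runs(date(2026, 3, 10), date(2026, 3, 12),
                     datetime(2026, 3, 14, 10, 30))
print([r["logical_date"] for r in runs])
# ['2026-03-10', '2026-03-11', '2026-03-12']
```

Three distinct logical dates, one shared `load_datetime` — exactly the audit trail a backfill should leave behind.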
## Why This Matters for PIT Tables
Point-in-Time (PIT) tables answer questions like: "What did this customer look like on January 20th?"
A PIT query finds the latest satellite record where `EFFECTIVE_FROM <= 2026-01-20`. If `EFFECTIVE_FROM` were set to the `LOAD_DATETIME` (March 14th from the re-run), the PIT table would think the Gold tier only became effective in March — and would return Silver for January 20th. Wrong answer.
By using the logical date as `EFFECTIVE_FROM`, the PIT table correctly returns Gold for any query date on or after January 15th, regardless of when the correction was physically loaded.
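A toy version of that PIT lookup over the corrected satellite rows — a plain-Python stand-in for the real PIT query, relying on ISO date strings comparing correctly as text:

```python
def pit_lookup(satellite: list, hk: str, as_of: str):
    """Latest record effective on or before as_of; ties broken by LOAD_DATETIME."""
    candidates = [r for r in satellite
                  if r["CUSTOMER_HK"] == hk and r["EFFECTIVE_FROM"] <= as_of]
    return max(candidates,
               key=lambda r: (r["EFFECTIVE_FROM"], r["LOAD_DATETIME"]),
               default=None)

sat = [
    {"CUSTOMER_HK": "a1b2c3", "TIER": "Silver",
     "LOAD_DATETIME": "2026-01-16 06:00:00", "EFFECTIVE_FROM": "2026-01-15"},
    {"CUSTOMER_HK": "a1b2c3", "TIER": "Gold",
     "LOAD_DATETIME": "2026-03-14 10:30:00", "EFFECTIVE_FROM": "2026-01-15"},
]
# Both rows are effective from Jan 15; the corrected (latest-loaded) one wins.
print(pit_lookup(sat, "a1b2c3", "2026-01-20")["TIER"])  # Gold
```

Had `EFFECTIVE_FROM` for the Gold row been set to the March load date, the same query would have returned Silver — the wrong answer described above.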
## When to Re-run vs. New Run vs. Backfill
| Scenario | Action | Why |
|---|---|---|
| Source data corrected for a past date | Re-run that date's DAG run | Same logical date, new LOAD_DATETIME, HASHDIFF detects changes |
| Pipeline failed mid-run | Re-run the failed DAG run | Picks up where it left off, no duplicates |
| Pipeline was down for multiple days | Backfill the missed range | Creates one run per logical date with correct windows |
| Normal daily processing | Scheduled run (automatic) | New logical date, new data window |
| Need to add a new source table retroactively | Backfill from the start date | Loads historical data with correct dates per period |
## Summary
- Never create a new DAG run to fix past data — it assigns the wrong logical date and breaks your timeline
- Re-run the original DAG run — same logical date, same source window, new `LOAD_DATETIME`
- HASHDIFF makes re-runs safe — unchanged data is skipped, changed data gets a new satellite record
- `EFFECTIVE_FROM` must use the logical date — not `LOAD_DATETIME` — so PIT tables and timeline queries return correct results
- `LOAD_DATETIME` records when the data was physically loaded — giving you a complete audit trail of corrections and reloads
Need help building auditable Data Vault pipelines?
We design production-ready Data Vault 2.0 pipelines with Airflow and dbt, built for correctness from day one.
Get Started