It's Monday morning. You get a message: "The CRM export for January 15th had bad data. We've fixed it in the source system. Can you reload?"

Most teams handle this poorly — not because they do the wrong thing on purpose, but because they don't realize Airflow already has the right tool built in.

The Problem: Source Data Was Corrected

Your daily pipeline ran on January 16th, processing data for the logical date of January 15th. The satellite captured this customer record:

CUSTOMER_HK  HASHDIFF   NAME       TIER    LOAD_DATETIME        EFFECTIVE_FROM
a1b2c3...    x7y8z9...  Acme Corp  Silver  2026-01-16 06:00:00  2026-01-15

Now the source team tells you: "Acme Corp should be Gold tier, not Silver. We've fixed it." You need to reload. What do most teams do?

What Most Teams Do Wrong

In practice, teams rarely do a clean re-run. Instead, they reach for workarounds:

1. They do nothing

The most common reaction is "we'll catch it in the next run." But the next scheduled run processes tomorrow's data window, not January 15th. The correction sits in the source system and never reaches the vault. The bad data stays forever.

2. They build a manual fix

Someone writes a one-off script or a custom DAG with hardcoded parameters to reload just this one date. This works once, but:

  • It's manual work every time a correction happens
  • The custom DAG doesn't follow the same logic as the production pipeline — subtle differences creep in
  • There's no record in Airflow's history that this logical date was reprocessed
  • Next time someone has to figure out what happened, there's no audit trail

3. They use CURRENT_TIMESTAMP() as LOAD_DATETIME

Some teams do re-run the correct date, but their staging model stamps CURRENT_TIMESTAMP() or {{ ts }} (the logical date) as LOAD_DATETIME. This means:

  • With CURRENT_TIMESTAMP() — different models within the same run get different timestamps, making it impossible to trace which records belong to which DAG run
  • With {{ ts }} — the re-run produces the exact same LOAD_DATETIME as the original run, so you can't tell the data was ever reloaded

All three approaches share the same root cause: not using what Airflow already gives you.

Rule #1: Every DAG Run Must Be Green

Before we talk about the right approach, there's a fundamental principle that makes all of this work:

In a Data Vault pipeline, every DAG run must end in success (green). A red (failed) run is not just an ops issue — it's a data gap.

Here's why:

  • Each DAG run corresponds to one logical date — one business day of data
  • A failed run means that day's data never reached the vault
  • Your satellites have a hole — queries for that date return stale or missing data
  • PIT tables produce wrong results for any date in the gap

If you use depends_on_past: True in your DAG (and you should), a failed run blocks all future runs. The pipeline stops until you fix it. This feels painful, but it's correct — it forces you to deal with data gaps immediately instead of accumulating silent errors.
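The gating behavior can be modelled in a few lines of plain Python. This is a toy model of the scheduling decision, not Airflow's actual scheduler:

```python
def runnable(run_index, statuses, depends_on_past=True):
    """A run may start only if the previous logical date's run succeeded.

    statuses is an ordered list of run states, one per logical date.
    """
    if not depends_on_past or run_index == 0:
        return True
    return statuses[run_index - 1] == "success"

# Jan 14 green, Jan 15 red, Jan 16 not yet run
statuses = ["success", "failed", None]

# With depends_on_past, the Jan 16 run is blocked until Jan 15 is cleared
print(runnable(2, statuses))  # False

# After you Clear and re-run Jan 15 successfully, Jan 16 unblocks
statuses[1] = "success"
print(runnable(2, statuses))  # True
```

The point of the model: a red square does not just mean "one bad day", it means the whole timeline downstream of it is on hold until you fix it.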

The workflow is:

  1. DAG run fails → status is red
  2. You investigate and fix the root cause (source system down, credentials expired, schema changed, etc.)
  3. In Airflow UI, you Clear the failed task → Airflow re-runs it
  4. Same logical date, new dag_run.start_date
  5. Satellite HASHDIFF checks for changes → loads if needed
  6. DAG run goes green → pipeline continues with the next date

Airflow's Clear button is not just an ops convenience — it's the correct mechanism for maintaining your Data Vault's integrity. Your Airflow Grid View should be a wall of green. Every green square = that business date is fully loaded. Every red square = a data gap that needs immediate attention.

This same mechanism — Clear and re-run — is exactly what you use when source data is corrected. Which brings us to the right approach.

The Right Approach: Clear and Re-run

Back to our scenario: Acme Corp's tier was wrong on January 15th. The source team fixed it. You go to Airflow UI, find the January 15th DAG run, and Clear it. Airflow re-runs the pipeline for that date. Here's what happens:

  • The logical_date is still January 15th
  • The WHERE clause selects the same source data window — but now with the corrected data
  • EFFECTIVE_FROM = January 15th (unchanged — correct business date)
  • LOAD_DATETIME = March 14th 10:30 (new — when the correction was loaded)
  • The satellite's HASHDIFF detects that "Silver" changed to "Gold" and inserts a new record

Your satellite now looks like this:

CUSTOMER_HK  HASHDIFF   TIER    LOAD_DATETIME        EFFECTIVE_FROM
a1b2c3...    x7y8z9...  Silver  2026-01-16 06:00:00  2026-01-15
a1b2c3...    m4n5o6...  Gold    2026-03-14 10:30:00  2026-01-15

This tells the true story: "Customer was loaded as Silver on January 16th, but the data was corrected and reloaded on March 14th as Gold. Both records are effective from January 15th."

You have a complete audit trail. You can see what the data was, when it was corrected, and that both versions apply to the same business date.
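The change detection hinges on the HASHDIFF comparison. Here is a minimal sketch of the idea, assuming an MD5 hash over normalized descriptive attributes (your dbt package's satellite macro will differ in delimiter and normalization details):

```python
import hashlib

def hashdiff(*attrs):
    """Hash the concatenated, normalized descriptive attributes of a record."""
    payload = "||".join(a.strip().upper() for a in attrs)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

original  = hashdiff("Acme Corp", "Silver")  # hash stored on January 16th
corrected = hashdiff("Acme Corp", "Gold")    # hash computed by the re-run

# Different attribute values produce different hashes,
# so the satellite sees a change and inserts a new row
print(original != corrected)  # True
```

Because the hash is computed from the data itself, the mechanism does not care *why* the values differ, only *that* they differ. A correction in the source looks exactly like any other change.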

What If Nothing Changed?

If you re-run a DAG but the source data hasn't actually changed, nothing happens. The satellite macro compares the incoming HASHDIFF against the latest record. Same hash = no insert. The re-run is harmless.

This is why re-running is always safe:

Scenario               HASHDIFF                      Result
Source data unchanged  Same as latest record         No insert — satellite unchanged
Source data corrected  Different from latest record  New row inserted with new LOAD_DATETIME
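The table above is really just one comparison. A minimal in-memory sketch of the insert-only rule (field names are illustrative, this is not the dbt macro itself):

```python
def load_satellite(satellite, incoming):
    """Insert-only load: append a row only when the incoming hashdiff
    differs from the latest record (simplified to a single hub key)."""
    latest = satellite[-1] if satellite else None
    if latest is None or latest["hashdiff"] != incoming["hashdiff"]:
        satellite.append(incoming)
    return satellite

sat = [{"hashdiff": "x7y8z9", "tier": "Silver"}]

# Re-run with unchanged source data: same hashdiff, nothing happens
load_satellite(sat, {"hashdiff": "x7y8z9", "tier": "Silver"})

# Re-run with corrected source data: new hashdiff, new row
load_satellite(sat, {"hashdiff": "m4n5o6", "tier": "Gold"})

print(len(sat))  # 2
```

Running `load_satellite` any number of times with the same input leaves the satellite unchanged, which is exactly the idempotency property that makes Clear-and-re-run safe.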

The Three Columns You Need

This pattern works because of three distinct columns in your staging model, each serving a different purpose:

Column          Airflow Source             Meaning                                             Changes on Re-run?
LOGICAL_DATE    {{ ds }}                   Which business day this batch belongs to            No
LOAD_DATETIME   {{ dag_run.start_date }}   When the data was physically loaded into the vault  Yes — new timestamp each run
EFFECTIVE_FROM  {{ ds }} (logical date)    From which business date this record is valid       No

In your dbt staging model:

SELECT *,
       TO_DATE('{{ var('logical_date') }}')           AS LOGICAL_DATE,
       TO_TIMESTAMP_NTZ('{{ var('load_datetime') }}') AS LOAD_DATETIME,
       TO_DATE('{{ var('logical_date') }}')           AS EFFECTIVE_FROM
FROM staging

And in your Airflow DAG:

from airflow.operators.bash import BashOperator

dbt_staging = BashOperator(
    task_id="dbt_staging",
    bash_command=(
        f"cd {DBT_DIR} && dbt run "
        "--select tag:staging "
        "--vars '"
        '{"load_datetime": "'
        '{{ dag_run.start_date.astimezone(dag.timezone)'
        '.strftime("%Y-%m-%d %H:%M:%S") }}", '
        '"logical_date": "'
        '{{ execution_date.in_timezone("Europe/Moscow")'
        '.strftime("%Y-%m-%d") }}"}'
        "'"
    ),
)

Note the timezone handling: dag_run.start_date and execution_date (renamed logical_date in Airflow 2.2+) are UTC by default. Convert them to your warehouse timezone to avoid confusion.
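The conversion itself is ordinary datetime arithmetic. A sketch with the standard library (Airflow hands you pendulum datetimes in templates, but the logic is the same; Moscow is UTC+3):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# dag_run.start_date arrives as a timezone-aware UTC datetime
start_utc = datetime(2026, 3, 14, 7, 30, tzinfo=timezone.utc)

# Convert to the warehouse timezone before formatting for dbt vars
local = start_utc.astimezone(ZoneInfo("Europe/Moscow"))
print(local.strftime("%Y-%m-%d %H:%M:%S"))  # 2026-03-14 10:30:00
```

If you skip this step, a run that starts at 23:30 local time gets stamped with the previous day in UTC, and your audit trail quietly shifts by a day.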

Backfill: When You Missed Multiple Days

What if your pipeline was down for 3 days? Use Airflow's backfill command:

airflow dags backfill data_vault_pipeline \
    --start-date 2026-03-10 \
    --end-date 2026-03-12 \
    --reset-dagruns

Airflow creates one DAG run per logical date. Each run:

  • Gets its own logical_date (March 10, 11, 12)
  • Gets dag_run.start_date = when the backfill actually runs (e.g., March 14th)
  • Selects the correct source data window for that day
  • HASHDIFF prevents duplicates if any of those days were already partially loaded

The same principle applies: the logical date defines what data you're processing, dag_run.start_date records when you processed it.
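For a daily schedule, the set of runs a backfill creates can be sketched as a simple enumeration of logical dates, end date inclusive (a simplification of what the scheduler does):

```python
from datetime import date, timedelta

def backfill_dates(start, end):
    """Yield one logical date per missed day, end date inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)

# Three missed days -> three DAG runs, each with its own logical date,
# all sharing the same dag_run.start_date (the day the backfill runs)
print([d.isoformat() for d in backfill_dates(date(2026, 3, 10), date(2026, 3, 12))])
# ['2026-03-10', '2026-03-11', '2026-03-12']
```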

Why This Matters for PIT Tables

Point-in-Time (PIT) tables answer questions like: "What did this customer look like on January 20th?"

A PIT query finds the latest satellite record where EFFECTIVE_FROM <= 2026-01-20. If EFFECTIVE_FROM was set to the LOAD_DATETIME (March 14th from the re-run), the PIT table would think the Gold tier only became effective in March — and would return Silver for January 20th. Wrong answer.

By using the logical date as EFFECTIVE_FROM, the PIT table correctly returns Gold for any query date on or after January 15th, regardless of when the correction was physically loaded.
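The lookup rule can be sketched as a simplified, in-memory stand-in for a PIT query (field names are illustrative):

```python
from datetime import date

def pit_lookup(satellite, as_of):
    """Latest record whose EFFECTIVE_FROM is on or before the query date,
    breaking ties by LOAD_DATETIME so the most recent correction wins."""
    candidates = [r for r in satellite if r["effective_from"] <= as_of]
    return max(
        candidates,
        key=lambda r: (r["effective_from"], r["load_datetime"]),
        default=None,
    )

satellite = [
    {"tier": "Silver", "load_datetime": "2026-01-16 06:00:00", "effective_from": date(2026, 1, 15)},
    {"tier": "Gold",   "load_datetime": "2026-03-14 10:30:00", "effective_from": date(2026, 1, 15)},
]

# Both rows are effective from Jan 15, so the later-loaded correction wins
print(pit_lookup(satellite, date(2026, 1, 20))["tier"])  # Gold
```

Note the tie-break: when two records share an EFFECTIVE_FROM, the one with the later LOAD_DATETIME is the corrected version, which is exactly why both columns must exist.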

When to Re-run vs. New Run vs. Backfill

Scenario                                      Action                        Why
Source data corrected for a past date         Re-run that date's DAG run    Same logical date, new LOAD_DATETIME, HASHDIFF detects changes
Pipeline failed mid-run                       Re-run the failed DAG run     Picks up where it left off, no duplicates
Pipeline was down for multiple days           Backfill the missed range     Creates one run per logical date with correct windows
Normal daily processing                       Scheduled run (automatic)     New logical date, new data window
Need to add a new source table retroactively  Backfill from the start date  Loads historical data with correct dates per period

Summary

  1. Never create a new DAG run to fix past data — it assigns the wrong logical date and breaks your timeline
  2. Re-run the original DAG run — same logical date, same source window, new LOAD_DATETIME
  3. HASHDIFF makes re-runs safe — unchanged data is skipped, changed data gets a new satellite record
  4. EFFECTIVE_FROM must use the logical date — not LOAD_DATETIME — so PIT tables and timeline queries return correct results
  5. LOAD_DATETIME records when the data was physically loaded — giving you a complete audit trail of corrections and reloads

Need help building auditable Data Vault pipelines?

We design production-ready Data Vault 2.0 pipelines with Airflow and dbt, built for correctness from day one.

Get Started