Adrienne Vermorel

Incremental Models in dbt: The Complete Guide

Incremental models are dbt’s primary optimization strategy. Done right, they reduce processing costs by 10-200x and cut runtimes from hours to minutes. Done wrong, they create silent data quality issues that compound over time.

This guide covers everything you need to make incremental models work: strategy selection, warehouse-specific behaviors, late-arriving data patterns, and the decision frameworks that help you choose the right approach.

How Incremental Models Work

On first run, an incremental model behaves like a table materialization (it processes all source data). On subsequent runs, it processes only the rows you specify, then merges or inserts them into the existing table.

The is_incremental() macro returns True when three conditions are met:

  1. The model already exists as a table in the warehouse
  2. The --full-refresh flag isn’t passed
  3. The model is configured with materialized='incremental'

{{ config(materialized='incremental') }}

SELECT
    event_id,
    event_timestamp,
    user_id,
    event_type
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}

The generated SQL varies by your incremental_strategy. For merge strategies, dbt creates a temporary relation and generates a MERGE INTO statement. For insert_overwrite, it identifies affected partitions and replaces them atomically.

Configuration Options

unique_key

Defines the grain for update/insert decisions. Without it, most strategies default to append-only behavior.

-- Single column
{{ config(unique_key='order_id') }}
-- Composite key
{{ config(unique_key=['user_id', 'session_number']) }}

Columns in unique_key must never contain NULLs. NULL values cause match failures during merge operations, creating duplicates instead of updates.
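One way to protect the key is to coalesce nullable components into a surrogate key so the merge always has a non-NULL match column. A minimal sketch, assuming the dbt_utils package is installed; the model, ref, and column names are illustrative:

{{ config(
    materialized='incremental',
    unique_key='order_line_key'
) }}

SELECT
    -- generate_surrogate_key() coalesces NULL components to a placeholder before hashing,
    -- so the resulting key is never NULL
    {{ dbt_utils.generate_surrogate_key(['order_id', 'order_line_number']) }} AS order_line_key,
    order_id,
    order_line_number,
    updated_at
FROM {{ ref('base__shop__order_lines') }}

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}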

on_schema_change

Controls behavior when your model’s columns change:

| Value | Behavior |
|---|---|
| ignore (default) | New columns aren’t added; removed columns cause failure |
| fail | Error if schemas diverge |
| append_new_columns | Adds new columns but doesn’t remove missing ones |
| sync_all_columns | Full sync including additions and removals |

None of these options backfill historical records for newly added columns. If you add a column, existing rows will have NULL values.
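A minimal sketch of opting into additive schema evolution; the newly added device_type column is illustrative:

{{ config(
    materialized='incremental',
    unique_key='event_id',
    on_schema_change='append_new_columns'
) }}

SELECT
    event_id,
    event_timestamp,
    user_id,
    event_type,
    -- Newly added column: rows loaded before this change keep NULL here until backfilled
    device_type
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}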

incremental_predicates

Limits destination table scans during merge operations (critical for large tables):

{{ config(
    materialized='incremental',
    unique_key='id',
    incremental_strategy='merge',
    incremental_predicates=[
        "DBT_INTERNAL_DEST.created_at > dateadd(day, -7, current_date)"
    ]
) }}

This adds filters to the MERGE statement, enabling partition pruning on the destination table. Without predicates, merge scans the entire destination. With predicates, the warehouse prunes partitions and scans only recent data.
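The predicates end up in the merge join condition alongside the unique_key match. A simplified sketch of the statement generated from the config above:

-- Simplified: the predicate narrows the destination rows considered for matching
MERGE INTO target AS DBT_INTERNAL_DEST
USING tmp AS DBT_INTERNAL_SOURCE
ON (DBT_INTERNAL_DEST.created_at > dateadd(day, -7, current_date))
    AND DBT_INTERNAL_DEST.id = DBT_INTERNAL_SOURCE.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...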

Strategy Availability by Warehouse

| Strategy | BigQuery | Snowflake | Databricks |
|---|---|---|---|
| append | — | ✅ | ✅ |
| merge | ✅ Default | ✅ Default | ✅ Default |
| delete+insert | — | ✅ | ✅ (v1.11+) |
| insert_overwrite | ✅ | ✅ (full table only) | ✅ |
| microbatch | ✅ | ✅ | ✅ |
| replace_where | — | — | ✅ Delta only |

Note that BigQuery’s insert_overwrite operates on partitions, while Snowflake’s insert_overwrite replaces the entire table. This naming inconsistency catches many practitioners off guard.

Strategy Deep Dive

Merge

Uses SQL MERGE statements that match on unique_key, updating existing rows and inserting new ones.

-- Generated SQL (simplified)
MERGE INTO target AS DBT_INTERNAL_DEST
USING tmp AS DBT_INTERNAL_SOURCE
ON DBT_INTERNAL_DEST.id = DBT_INTERNAL_SOURCE.id
WHEN MATCHED THEN UPDATE SET col1 = DBT_INTERNAL_SOURCE.col1, ...
WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (...)

Merge works well for small-to-medium tables with row-level updates and CDC patterns where individual records change.

The catch: without incremental_predicates, the MERGE statement scans every row in your destination table to check for matches, even if you’re only inserting 1,000 new records into a 500 million row table. This full table scan happens on every run, meaning your “incremental” model still reads the entire table each time. The result is longer runtimes and higher costs that scale with table size rather than with the amount of new data. Above 100M rows, this becomes painfully slow and expensive. Add incremental_predicates to limit the scan to recent partitions (see the Configuration section above).

Delete+Insert

Executes two operations: DELETE matching records, then INSERT all new records.

-- Generated flow
1. Create temp table with query results
2. DELETE FROM target USING tmp WHERE tmp.unique_key = target.unique_key
3. INSERT INTO target SELECT * FROM tmp

This strategy shines on large Snowflake tables (3.4x faster than merge at 500M+ rows) and batch processing where entire time windows need replacement.

The tradeoff is atomicity. Because delete+insert runs two separate operations (DELETE then INSERT), a failure between them leaves your table in an inconsistent state: records are deleted but new ones aren’t inserted yet. Unlike MERGE, which is all-or-nothing, delete+insert can result in data loss if the job crashes, times out, or gets cancelled mid-execution. You’d need to run a full refresh to recover. For critical tables, consider whether this risk is acceptable, or wrap the operations in a transaction where your warehouse supports it. BigQuery doesn’t support this strategy at all; use merge or insert_overwrite instead.

Insert_Overwrite

Replaces entire partitions rather than individual rows.

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
        'field': 'event_date',
        'data_type': 'date'
    }
) }}

On BigQuery, this uses partition discovery to determine which partitions to replace:

-- Generated SQL (BigQuery, simplified)
DECLARE dbt_partitions_for_replacement ARRAY<DATE>;

SET (dbt_partitions_for_replacement) = (
    SELECT AS STRUCT ARRAY_AGG(DISTINCT DATE(partition_col))
    FROM tmp_table
);

MERGE INTO target USING tmp ON FALSE
WHEN NOT MATCHED BY SOURCE
    AND partition_col IN UNNEST(dbt_partitions_for_replacement)
    THEN DELETE
WHEN NOT MATCHED THEN INSERT ...

This strategy is ideal for time-partitioned fact tables, append-heavy workloads, and BigQuery cost optimization.

One major caveat: the name insert_overwrite means different things on different warehouses. On BigQuery, it replaces only the partitions that contain new data, exactly what you’d expect. On Snowflake, however, it replaces the entire table with whatever your query returns: if your incremental query only selects the last 3 days of data, you’ll lose everything older than that, a mistake that has caused catastrophic data loss in practice. On Snowflake, use delete+insert with a date-based unique_key instead to achieve partition-like behavior safely.

Microbatch (dbt 1.9+)

Processes data in time-based batches without requiring is_incremental() logic.

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session_start',
    begin='2020-01-01',
    batch_size='day',
    lookback=3
) }}

-- No is_incremental() block needed
SELECT
    session_id,
    session_start,
    user_id,
    session_duration
FROM {{ ref('base__analytics__sessions') }}

Key differences from traditional incremental:

| Aspect | Traditional | Microbatch |
|---|---|---|
| Query structure | Single SQL for all new data | Separate SQL per batch |
| Batch definition | User-defined in SQL | Configured via event_time, batch_size |
| Retry granularity | Entire model | Individual batches |
| Backfill | Custom logic required | Built-in --event-time-start/end |

Backfill commands:

# Targeted backfill
dbt run --event-time-start "2024-09-01" --event-time-end "2024-09-04"
# Retry only failed batches
dbt retry

Microbatch excels at large historical backfills, failure recovery scenarios, and teams wanting simpler incremental logic.

Three limitations to consider before adopting it:

  1. UTC timezone assumption: Microbatch uses UTC for all time calculations. If your event_time column is in a local timezone (like US/Eastern), batches will split at midnight UTC, not midnight local time. This can cause records to land in unexpected batches or create off-by-one-day errors in your data. Convert your event_time to UTC before using microbatch (see the sketch after this list), or accept that batch boundaries won’t align with your business day.

  2. No sub-hourly granularity: The smallest batch_size is hour. If you need to process data in 15-minute or 5-minute windows (common for real-time dashboards or alerting), microbatch won’t work. You’ll need traditional incremental logic with custom time filtering.

  3. Sequential execution by default: Each batch runs one after another, not in parallel. A 30-day backfill with batch_size='day' means 30 separate query executions. While this provides better failure isolation (only retry failed batches), it’s slower than a single query that processes all 30 days at once. For very large backfills, consider whether the retry benefits outweigh the speed tradeoff.
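A minimal sketch of the UTC conversion, assuming Snowflake and a hypothetical local-time column named session_start_local; the converted column is what event_time points at:

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session_start',
    begin='2020-01-01',
    batch_size='day'
) }}

SELECT
    session_id,
    -- Normalize the local timestamp to UTC so batch boundaries line up with UTC days
    CONVERT_TIMEZONE('America/New_York', 'UTC', session_start_local) AS session_start,
    user_id,
    session_duration
FROM {{ ref('base__analytics__sessions') }}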

Warehouse-Specific Behaviors

BigQuery

Merge:

  • Effectively requires unique_key (without it, merge degrades to insert-only behavior)
  • Does not support dynamic partition pruning; subqueries won’t trigger pruning
  • Full table scans on large tables without clustering

Insert_overwrite:

  • Requires partition_by configuration
  • Static partitions (recommended) are fastest and most cost-effective
  • Dynamic partitions add overhead from partition discovery queries
  • _dbt_max_partition is a BigQuery scripting variable, not Jinja
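For reference, a sketch of the dynamic-partition pattern that relies on _dbt_max_partition (model and columns follow the earlier examples):

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'}
) }}

SELECT
    event_id,
    event_date,
    user_id,
    event_type
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
-- _dbt_max_partition is declared by dbt's generated BigQuery script and holds
-- the maximum partition value already present in the destination table
WHERE event_date >= _dbt_max_partition
{% endif %}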

You cannot use require_partition_filter=true with the merge strategy unless you also set incremental_predicates.

Cost optimization pattern:

{% set partitions_to_replace = [
    'CURRENT_DATE()',
    'DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)',
    'DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)'
] %}

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    partitions=partitions_to_replace
) }}

SELECT
    event_id,
    event_date,
    user_id,
    event_type
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
WHERE event_date IN ({{ partitions_to_replace | join(',') }})
{% endif %}

Snowflake

Merge:

  • Default strategy with standard MERGE INTO syntax
  • Fails with “nondeterministic merge” if unique_key isn’t truly unique
  • Performance degrades past 100M rows

Delete+insert:

  • Requires tmp_relation_type: table when unique_key is defined
  • 3.4x faster than merge at 500M+ rows with partition-aligned updates

Insert_overwrite does NOT support partition-based overwrites on Snowflake. It replaces the entire table. Use delete+insert with partition predicates instead.

{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='date_day'
) }}

SELECT
    event_id,
    date_day,
    user_id,
    event_type
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
WHERE date_day >= DATEADD(DAY, -3, CURRENT_DATE)
{% endif %}

Databricks

Merge on Delta Lake:

  • ACID-compliant MERGE INTO
  • Supports schema evolution via merge_with_schema_evolution=true
  • Advanced options: matched_condition, not_matched_by_source_action
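A sketch of a merge configuration exercising these options, assuming the standard DBT_INTERNAL_SOURCE/DBT_INTERNAL_DEST aliases; the model and column names are illustrative:

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id',
    merge_with_schema_evolution=true,
    -- Only update rows that actually changed
    matched_condition='DBT_INTERNAL_SOURCE.updated_at > DBT_INTERNAL_DEST.updated_at',
    -- Delete destination rows that disappeared from the source (CDC-style)
    not_matched_by_source_action='delete'
) }}

SELECT
    order_id,
    order_status,
    updated_at
FROM {{ ref('base__shop__orders') }}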

Insert_overwrite:

  • Uses INSERT INTO ... REPLACE ON syntax on SQL Warehouses
  • Dynamic partition overwrite via partitionOverwriteMode='dynamic'
  • v1.8.0 regression caused full table replacement; fixed by setting spark.sql.sources.partitionOverwriteMode=DYNAMIC

Replace_where (Databricks exclusive):

{{ config(
    materialized='incremental',
    incremental_strategy='replace_where',
    incremental_predicates=["event_date >= '2024-01-01'"]
) }}

Uses INSERT INTO table REPLACE WHERE predicate for atomic replacement of rows matching predicates.
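The generated statement looks roughly like this (simplified):

-- Rows matching the predicate are atomically replaced by the new query results
INSERT INTO target
REPLACE WHERE event_date >= '2024-01-01'
SELECT * FROM tmp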

Handling Late-Arriving Data

Late-arriving data occurs when records arrive after other data from the same time period. Without handling, incremental models progressively drift from source truth.

The Lookback Window Pattern

Subtract a fixed time period from your cutoff:

{% set lookback_days = var('lookback_days', 3) %}

{% if is_incremental() %}
WHERE created_at >= (
    SELECT DATEADD(DAY, -{{ lookback_days }}, MAX(created_at))
    FROM {{ this }}
)
{% endif %}

Override for custom backfills: dbt run --select model --vars '{"lookback_days": 10}'

Trade-offs by window size:

| Window | Coverage | Cost Impact |
|---|---|---|
| 1 day | Minimal; misses most late data | Lowest |
| 3 days | Good balance; handles 99%+ of typical late arrivals | Moderate |
| 7 days | Catches nearly all | Higher |
| 14+ days | Maximum coverage | Significant |

Lookback windows don’t catch records arriving after the window closes. Schedule periodic full refreshes (weekly or monthly) to reset accumulated drift.
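For example, a weekly scheduled job can rebuild the model from scratch (the model name is illustrative):

# Weekly job: reprocess all source data for this model
dbt run --select fct_events --full-refresh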

Idempotency and Deduplication

Idempotent models produce identical results regardless of run count. Pre-deduplicate in your SELECT:

SELECT
    event_id,
    event_time,
    user_id,
    event_type,
    event_properties
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
WHERE event_time >= (SELECT MAX(event_time) - INTERVAL 3 DAY FROM {{ this }})
{% endif %}

QUALIFY ROW_NUMBER() OVER (
    PARTITION BY event_id
    ORDER BY event_time DESC
) = 1

The first run doesn’t apply merge deduplication, so duplicates can appear on initial load. Always include deduplication in your SELECT rather than relying solely on unique_key.

Decision Framework

When to Use Incremental Models

Use incremental when:

  • Source data has millions or billions of rows
  • Transformations are computationally expensive
  • dbt runs are becoming too slow or too costly

Thresholds from practitioners:

  • Tables under 10M rows rarely benefit from incremental complexity
  • Merge performance degrades noticeably above 100M rows
  • At 500M+ rows, merge can be 3x slower than optimized alternatives

Which Strategy to Choose

| Use Case | Recommended Strategy | Rationale |
|---|---|---|
| Small-medium tables with updates | merge | Simplest; atomic; handles updates |
| Large tables (>100M rows) on Snowflake | delete+insert | Avoids full table scans |
| Large tables on BigQuery | insert_overwrite | Partition replacement is efficient |
| Time-partitioned fact tables | insert_overwrite | Natural partition alignment |
| Append-only event data | append | No deduplication needed |
| Late-arriving data requiring row updates | merge with incremental_predicates | Updates specific records |
| CDC with deletes | merge on Databricks with not_matched_by_source_action='delete' | Handles inserts, updates, deletes |
| Large historical backfills | microbatch | Built-in backfill, failure recovery |

Common Anti-Patterns

  1. Not setting unique_key with merge: Results in append-only behavior with duplicates

  2. NULLs in unique_key columns: Causes row matching failures and silent duplicates (a test on the key column can catch this; see the sketch after this list)

  3. No lookback window: Late-arriving data is missed; models drift from source

  4. Using incremental for small tables: Adds complexity without measurable benefit

  5. Missing schema change handling: Adding columns without on_schema_change configuration causes failures

  6. No periodic full refresh: Data drift accumulates indefinitely

  7. Using merge for very large tables without predicates: Full table scans become expensive

  8. Relying solely on unique_key for deduplication: First run doesn’t deduplicate; include QUALIFY or ROW_NUMBER in SELECT
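Testing the grain of the model catches anti-patterns 1, 2, and 8 early. Generic unique and not_null tests in a properties YAML file are the idiomatic approach; the singular test below is an equivalent sketch (the model name fct_events is illustrative):

-- tests/assert_fct_events_grain.sql
-- Fails if event_id is ever NULL or duplicated in the incremental model
SELECT
    event_id,
    COUNT(*) AS row_count
FROM {{ ref('fct_events') }}
GROUP BY event_id
HAVING COUNT(*) > 1 OR event_id IS NULL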

Best Practices Checklist

  • Start with table materialization; switch to incremental when runs become slow
  • Always set unique_key for merge strategy; ensure no NULLs in key columns
  • Implement 3-day lookback windows as default for late-arriving data
  • Use incremental_predicates on tables >100M rows
  • Partition by time for event tables; cluster by frequently filtered columns
  • Schedule periodic full refreshes (weekly/monthly) to reset drift
  • Test both incremental and full-refresh modes before deployment
  • Set full_refresh: false for very large tables to prevent accidental rebuilds (see the sketch below)
  • Pre-deduplicate in SELECT rather than relying solely on unique_key
  • Monitor partition health (BigQuery) and clustering depth (Snowflake) regularly
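
A sketch of that guardrail in a model config (strategy and key are illustrative):

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='event_id',
    -- Ignore the --full-refresh flag for this model so a blanket full refresh can't rebuild it
    full_refresh=false
) }}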