Adrienne Vermorel

Incremental Models in dbt: The Complete Guide

Incremental models are dbt’s primary optimization strategy. Done right, they reduce processing costs by 10-200x and cut runtimes from hours to minutes. Done wrong, they create silent data quality issues that compound over time.

This guide covers everything you need to make incremental models work: strategy selection, warehouse-specific behaviors, late-arriving data patterns, and the decision frameworks that help you choose the right approach.

How Incremental Models Work

On first run, an incremental model behaves like a table materialization (it processes all source data). On subsequent runs, it processes only the rows you specify, then merges or inserts them into the existing table.

The is_incremental() macro returns True when three conditions are met:

  1. The model already exists as a table in the warehouse
  2. The --full-refresh flag isn’t passed
  3. The model is configured with materialized='incremental'

{{ config(materialized='incremental') }}

SELECT
    event_id,
    event_timestamp,
    user_id,
    event_type
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}

The generated SQL varies by your incremental_strategy. For merge strategies, dbt creates a temporary relation and generates a MERGE INTO statement. For insert_overwrite, it identifies affected partitions and replaces them atomically.

Configuration Options

unique_key

Defines the grain for update/insert decisions. Without it, most strategies default to append-only behavior.

-- Single column
{{ config(unique_key='order_id') }}
-- Composite key
{{ config(unique_key=['user_id', 'session_number']) }}

Columns in unique_key must never contain NULLs. NULL values cause match failures during merge operations, creating duplicates instead of updates.
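One way to protect the key is to coalesce nullable components into a surrogate key so the merge always has a non-NULL match column. A minimal sketch, assuming the dbt_utils package is installed; the model, ref, and column names are illustrative:

{{ config(
    materialized='incremental',
    unique_key='order_line_key'
) }}

SELECT
    -- generate_surrogate_key() coalesces NULL components to a placeholder before hashing,
    -- so the resulting key is never NULL
    {{ dbt_utils.generate_surrogate_key(['order_id', 'order_line_number']) }} AS order_line_key,
    order_id,
    order_line_number,
    updated_at
FROM {{ ref('base__shop__order_lines') }}

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}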

on_schema_change

Controls behavior when your model’s columns change:

| Value | Behavior |
|---|---|
| ignore (default) | New columns aren’t added; removed columns cause failure |
| fail | Error if schemas diverge |
| append_new_columns | Adds new columns but doesn’t remove missing ones |
| sync_all_columns | Full sync including additions and removals |

None of these options backfill historical records for newly added columns. If you add a column, existing rows will have NULL values.
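A minimal sketch of opting into additive schema evolution; the newly added device_type column is illustrative:

{{ config(
    materialized='incremental',
    unique_key='event_id',
    on_schema_change='append_new_columns'
) }}

SELECT
    event_id,
    event_timestamp,
    user_id,
    event_type,
    -- Newly added column: rows loaded before this change keep NULL here until backfilled
    device_type
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}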

incremental_predicates

Limits destination table scans during merge operations (critical for large tables):

{{ config(
    materialized='incremental',
    unique_key='id',
    incremental_strategy='merge',
    incremental_predicates=[
        "DBT_INTERNAL_DEST.created_at > dateadd(day, -7, current_date)"
    ]
) }}

This adds filters to the MERGE statement, enabling partition pruning on the destination table. Without predicates, merge scans the entire destination. With predicates, the warehouse prunes partitions and scans only recent data.
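The predicates end up in the merge join condition alongside the unique_key match. A simplified sketch of the statement generated from the config above:

-- Simplified: the predicate narrows the destination rows considered for matching
MERGE INTO target AS DBT_INTERNAL_DEST
USING tmp AS DBT_INTERNAL_SOURCE
ON (DBT_INTERNAL_DEST.created_at > dateadd(day, -7, current_date))
    AND DBT_INTERNAL_DEST.id = DBT_INTERNAL_SOURCE.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...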

Strategy Availability by Warehouse

| Strategy | BigQuery | Snowflake | Databricks |
|---|---|---|---|
| append | — | ✅ | ✅ |
| merge | ✅ Default | ✅ Default | ✅ Default |
| delete+insert | — | ✅ | ✅ (v1.11+) |
| insert_overwrite | ✅ | ✅ (full table only) | ✅ |
| microbatch | ✅ | ✅ | ✅ |
| replace_where | — | — | ✅ Delta only |

Note that BigQuery’s insert_overwrite operates on partitions, while Snowflake’s insert_overwrite replaces the entire table. This naming inconsistency catches many practitioners off guard.

Strategy Deep Dive

Merge

Uses SQL MERGE statements that match on unique_key, updating existing rows and inserting new ones.

-- Generated SQL (simplified)
MERGE INTO target AS DBT_INTERNAL_DEST
USING tmp AS DBT_INTERNAL_SOURCE
ON DBT_INTERNAL_DEST.id = DBT_INTERNAL_SOURCE.id
WHEN MATCHED THEN UPDATE SET col1 = DBT_INTERNAL_SOURCE.col1, ...
WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (...)

Merge works well for small-to-medium tables with row-level updates and CDC patterns where individual records change.

The catch: without incremental_predicates, the MERGE statement scans every row in your destination table to check for matches, even if you’re only inserting 1,000 new records into a 500 million row table. This full table scan happens on every run, meaning your “incremental” model still reads the entire table each time. The result is longer runtimes and higher costs that scale with table size rather than with the amount of new data. Above 100M rows, this becomes painfully slow and expensive. Add incremental_predicates to limit the scan to recent partitions (see the Configuration section above).

Delete+Insert

Executes two operations: DELETE matching records, then INSERT all new records.

-- Generated flow
1. Create temp table with query results
2. DELETE FROM target USING tmp WHERE tmp.unique_key = target.unique_key
3. INSERT INTO target SELECT * FROM tmp

This strategy shines on large Snowflake tables (3.4x faster than merge at 500M+ rows) and batch processing where entire time windows need replacement.

The tradeoff is atomicity. Because delete+insert runs two separate operations (DELETE then INSERT), a failure between them leaves your table in an inconsistent state: records are deleted but new ones aren’t inserted yet. Unlike MERGE, which is all-or-nothing, delete+insert can result in data loss if the job crashes, times out, or gets cancelled mid-execution. You’d need to run a full refresh to recover. For critical tables, consider whether this risk is acceptable, or wrap the operations in a transaction where your warehouse supports it. BigQuery doesn’t support this strategy at all; use merge or insert_overwrite instead.

Insert_Overwrite

Replaces entire partitions rather than individual rows.

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
        'field': 'event_date',
        'data_type': 'date'
    }
) }}

On BigQuery, this uses partition discovery to determine which partitions to replace:

-- Generated SQL (BigQuery, simplified)
DECLARE dbt_partitions_for_replacement ARRAY<DATE>;

SET (dbt_partitions_for_replacement) = (
    SELECT AS STRUCT ARRAY_AGG(DISTINCT DATE(partition_col))
    FROM tmp_table
);

MERGE INTO target USING tmp ON FALSE
WHEN NOT MATCHED BY SOURCE
    AND partition_col IN UNNEST(dbt_partitions_for_replacement)
    THEN DELETE
WHEN NOT MATCHED THEN INSERT ...

This strategy is ideal for time-partitioned fact tables, append-heavy workloads, and BigQuery cost optimization.

One major caveat: the name insert_overwrite means different things on different warehouses. On BigQuery, it replaces only the partitions that contain new data, exactly what you’d expect. On Snowflake, however, it replaces the entire table with whatever your query returns: if your incremental query only selects the last 3 days of data, you’ll lose everything older than that, a mistake that has caused catastrophic data loss in practice. On Snowflake, use delete+insert with a date-based unique_key instead to achieve partition-like behavior safely.

Microbatch (dbt 1.9+)

Processes data in time-based batches without requiring is_incremental() logic.

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session_start',
    begin='2020-01-01',
    batch_size='day',
    lookback=3
) }}

-- No is_incremental() block needed
SELECT
    session_id,
    session_start,
    user_id,
    session_duration
FROM {{ ref('base__analytics__sessions') }}

Key differences from traditional incremental:

| Aspect | Traditional | Microbatch |
|---|---|---|
| Query structure | Single SQL for all new data | Separate SQL per batch |
| Batch definition | User-defined in SQL | Configured via event_time, batch_size |
| Retry granularity | Entire model | Individual batches |
| Backfill | Custom logic required | Built-in --event-time-start/end |

Backfill commands:

# Targeted backfill
dbt run --event-time-start "2024-09-01" --event-time-end "2024-09-04"
# Retry only failed batches
dbt retry

Microbatch excels at large historical backfills, failure recovery scenarios, and teams wanting simpler incremental logic.

Three limitations to consider before adopting it:

  1. UTC timezone assumption: Microbatch uses UTC for all time calculations. If your event_time column is in a local timezone (like US/Eastern), batches will split at midnight UTC, not midnight local time. This can cause records to land in unexpected batches or create off-by-one-day errors in your data. Convert your event_time to UTC before using microbatch (see the sketch after this list), or accept that batch boundaries won’t align with your business day.

  2. No sub-hourly granularity: The smallest batch_size is hour. If you need to process data in 15-minute or 5-minute windows (common for real-time dashboards or alerting), microbatch won’t work. You’ll need traditional incremental logic with custom time filtering.

  3. Sequential execution by default: Each batch runs one after another, not in parallel. A 30-day backfill with batch_size='day' means 30 separate query executions. While this provides better failure isolation (only retry failed batches), it’s slower than a single query that processes all 30 days at once. For very large backfills, consider whether the retry benefits outweigh the speed tradeoff.
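A minimal sketch of the UTC conversion, assuming Snowflake and a hypothetical local-time column named session_start_local; the converted column is what event_time points at:

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session_start',
    begin='2020-01-01',
    batch_size='day'
) }}

SELECT
    session_id,
    -- Normalize the local timestamp to UTC so batch boundaries line up with UTC days
    CONVERT_TIMEZONE('America/New_York', 'UTC', session_start_local) AS session_start,
    user_id,
    session_duration
FROM {{ ref('base__analytics__sessions') }}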

Warehouse-Specific Behaviors

BigQuery

Merge:

  • Effectively requires unique_key (without it, merge degrades to insert-only behavior)
  • Does not support dynamic partition pruning; subqueries won’t trigger pruning
  • Full table scans on large tables without clustering

Insert_overwrite:

  • Requires partition_by configuration
  • Static partitions (recommended) are fastest and most cost-effective
  • Dynamic partitions add overhead from partition discovery queries
  • _dbt_max_partition is a BigQuery scripting variable, not Jinja
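For reference, a sketch of the dynamic-partition pattern that relies on _dbt_max_partition (model and columns follow the earlier examples):

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'}
) }}

SELECT
    event_id,
    event_date,
    user_id,
    event_type
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
-- _dbt_max_partition is declared by dbt's generated BigQuery script and holds
-- the maximum partition value already present in the destination table
WHERE event_date >= _dbt_max_partition
{% endif %}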

You cannot use require_partition_filter=true with the merge strategy unless you also set incremental_predicates.

Cost optimization pattern:

{% set partitions_to_replace = [
    'CURRENT_DATE()',
    'DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)',
    'DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)'
] %}

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    partitions=partitions_to_replace
) }}

SELECT
    event_id,
    event_date,
    user_id,
    event_type
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
WHERE event_date IN ({{ partitions_to_replace | join(',') }})
{% endif %}

Snowflake

Merge:

  • Default strategy with standard MERGE INTO syntax
  • Fails with “nondeterministic merge” if unique_key isn’t truly unique
  • Performance degrades past 100M rows

Delete+insert:

  • Requires tmp_relation_type: table when unique_key is defined
  • 3.4x faster than merge at 500M+ rows with partition-aligned updates

Insert_overwrite does NOT support partition-based overwrites on Snowflake. It replaces the entire table. Use delete+insert with partition predicates instead.

{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='date_day'
) }}

SELECT
    event_id,
    date_day,
    user_id,
    event_type
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
WHERE date_day >= DATEADD(DAY, -3, CURRENT_DATE)
{% endif %}

Databricks

Merge on Delta Lake:

  • ACID-compliant MERGE INTO
  • Supports schema evolution via merge_with_schema_evolution=true
  • Advanced options: matched_condition, not_matched_by_source_action
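A sketch of a merge configuration exercising these options, assuming the standard DBT_INTERNAL_SOURCE/DBT_INTERNAL_DEST aliases; the model and column names are illustrative:

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id',
    merge_with_schema_evolution=true,
    -- Only update rows that actually changed
    matched_condition='DBT_INTERNAL_SOURCE.updated_at > DBT_INTERNAL_DEST.updated_at',
    -- Delete destination rows that disappeared from the source (CDC-style)
    not_matched_by_source_action='delete'
) }}

SELECT
    order_id,
    order_status,
    updated_at
FROM {{ ref('base__shop__orders') }}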

Insert_overwrite:

  • Uses INSERT INTO ... REPLACE ON syntax on SQL Warehouses
  • Dynamic partition overwrite via partitionOverwriteMode='dynamic'
  • v1.8.0 regression caused full table replacement; fixed by setting spark.sql.sources.partitionOverwriteMode=DYNAMIC

Replace_where (Databricks exclusive):

{{ config(
    materialized='incremental',
    incremental_strategy='replace_where',
    incremental_predicates=["event_date >= '2024-01-01'"]
) }}

Uses INSERT INTO table REPLACE WHERE predicate for atomic replacement of rows matching predicates.
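The generated statement looks roughly like this (simplified):

-- Rows matching the predicate are atomically replaced by the new query results
INSERT INTO target
REPLACE WHERE event_date >= '2024-01-01'
SELECT * FROM tmp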

Handling Late-Arriving Data

Late-arriving data occurs when records arrive after other data from the same time period. Without handling, incremental models progressively drift from source truth.

The Lookback Window Pattern

Subtract a fixed time period from your cutoff:

{% set lookback_days = var('lookback_days', 3) %}

{% if is_incremental() %}
WHERE created_at >= (
    SELECT DATEADD(DAY, -{{ lookback_days }}, MAX(created_at))
    FROM {{ this }}
)
{% endif %}

Override for custom backfills: dbt run --select model --vars '{"lookback_days": 10}'

Trade-offs by window size:

| Window | Coverage | Cost Impact |
|---|---|---|
| 1 day | Minimal; misses most late data | Lowest |
| 3 days | Good balance; handles 99%+ of typical late arrivals | Moderate |
| 7 days | Catches nearly all | Higher |
| 14+ days | Maximum coverage | Significant |

Lookback windows don’t catch records arriving after the window closes. Schedule periodic full refreshes (weekly or monthly) to reset accumulated drift.
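For example, a weekly scheduled job can rebuild the model from scratch (the model name is illustrative):

# Weekly job: reprocess all source data for this model
dbt run --select fct_events --full-refresh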

Idempotency and Deduplication

Idempotent models produce identical results regardless of run count. Pre-deduplicate in your SELECT:

SELECT
    event_id,
    event_time,
    user_id,
    event_type,
    event_properties
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
WHERE event_time >= (SELECT MAX(event_time) - INTERVAL 3 DAY FROM {{ this }})
{% endif %}

QUALIFY ROW_NUMBER() OVER (
    PARTITION BY event_id
    ORDER BY event_time DESC
) = 1

The first run doesn’t apply merge deduplication, so duplicates can appear on initial load. Always include deduplication in your SELECT rather than relying solely on unique_key.

Decision Framework

When to Use Incremental Models

Use incremental when:

  • Source data has millions or billions of rows
  • Transformations are computationally expensive
  • dbt runs are becoming too slow or too costly

Thresholds from practitioners:

  • Tables under 10M rows rarely benefit from incremental complexity
  • Merge performance degrades noticeably above 100M rows
  • At 500M+ rows, merge can be 3x slower than optimized alternatives

Which Strategy to Choose

| Use Case | Recommended Strategy | Rationale |
|---|---|---|
| Small-medium tables with updates | merge | Simplest; atomic; handles updates |
| Large tables (>100M rows) on Snowflake | delete+insert | Avoids full table scans |
| Large tables on BigQuery | insert_overwrite | Partition replacement is efficient |
| Time-partitioned fact tables | insert_overwrite | Natural partition alignment |
| Append-only event data | append | No deduplication needed |
| Late-arriving data requiring row updates | merge with incremental_predicates | Updates specific records |
| CDC with deletes | merge on Databricks with not_matched_by_source_action='delete' | Handles inserts, updates, deletes |
| Large historical backfills | microbatch | Built-in backfill, failure recovery |

Common Anti-Patterns

  1. Not setting unique_key with merge: Results in append-only behavior with duplicates

  2. NULLs in unique_key columns: Causes row matching failures and silent duplicates (a test on the key column can catch this; see the sketch after this list)

  3. No lookback window: Late-arriving data is missed; models drift from source

  4. Using incremental for small tables: Adds complexity without measurable benefit

  5. Missing schema change handling: Adding columns without on_schema_change configuration causes failures

  6. No periodic full refresh: Data drift accumulates indefinitely

  7. Using merge for very large tables without predicates: Full table scans become expensive

  8. Relying solely on unique_key for deduplication: First run doesn’t deduplicate; include QUALIFY or ROW_NUMBER in SELECT
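Testing the grain of the model catches anti-patterns 1, 2, and 8 early. Generic unique and not_null tests in a properties YAML file are the idiomatic approach; the singular test below is an equivalent sketch (the model name fct_events is illustrative):

-- tests/assert_fct_events_grain.sql
-- Fails if event_id is ever NULL or duplicated in the incremental model
SELECT
    event_id,
    COUNT(*) AS row_count
FROM {{ ref('fct_events') }}
GROUP BY event_id
HAVING COUNT(*) > 1 OR event_id IS NULL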

Best Practices Checklist

  • Start with table materialization; switch to incremental when runs become slow
  • Always set unique_key for merge strategy; ensure no NULLs in key columns
  • Implement 3-day lookback windows as default for late-arriving data
  • Use incremental_predicates on tables >100M rows
  • Partition by time for event tables; cluster by frequently filtered columns
  • Schedule periodic full refreshes (weekly/monthly) to reset drift
  • Test both incremental and full-refresh modes before deployment
  • Set full_refresh: false for very large tables to prevent accidental rebuilds (see the sketch below)
  • Pre-deduplicate in SELECT rather than relying solely on unique_key
  • Monitor partition health (BigQuery) and clustering depth (Snowflake) regularly
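
A sketch of that guardrail in a model config (strategy and key are illustrative):

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='event_id',
    -- Ignore the --full-refresh flag for this model so a blanket full refresh can't rebuild it
    full_refresh=false
) }}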