Adrienne Vermorel
Incremental Models in dbt: The Complete Guide
Incremental models are dbt’s primary optimization strategy. Done right, they reduce processing costs by 10-200x and cut runtimes from hours to minutes. Done wrong, they create silent data quality issues that compound over time.
This guide covers everything you need to make incremental models work: strategy selection, warehouse-specific behaviors, late-arriving data patterns, and the decision frameworks that help you choose the right approach.
How Incremental Models Work
On first run, an incremental model behaves like a table materialization (it processes all source data). On subsequent runs, it processes only the rows you specify, then merges or inserts them into the existing table.
The is_incremental() macro returns True when three conditions are met:
- The model already exists as a table in the warehouse
- The `--full-refresh` flag isn't passed
- The model is configured with `materialized='incremental'`
```sql
{{ config(materialized='incremental') }}

SELECT
    event_id,
    event_timestamp,
    user_id,
    event_type
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}
```
The generated SQL varies by your incremental_strategy. For merge strategies, dbt creates a temporary relation and generates a MERGE INTO statement. For insert_overwrite, it identifies affected partitions and replaces them atomically.
Configuration Options
unique_key
Defines the grain for update/insert decisions. Without it, most strategies default to append-only behavior.
```sql
-- Single column
{{ config(unique_key='order_id') }}

-- Composite key
{{ config(unique_key=['user_id', 'session_number']) }}
```
Columns in unique_key must never contain NULLs. NULL values cause match failures during merge operations, creating duplicates instead of updates.
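One way to guard against NULLs is to build a NULL-safe key yourself before it reaches the merge. A minimal sketch, assuming the dbt_utils package is installed and an illustrative sessions model; generate_surrogate_key() coalesces NULL components, so the resulting key is never NULL:

```sql
{{ config(
    materialized='incremental',
    unique_key='session_key'
) }}

SELECT
    -- generate_surrogate_key hashes the parts and coalesces NULLs,
    -- so session_key is safe to use as unique_key
    {{ dbt_utils.generate_surrogate_key(['user_id', 'session_number']) }} AS session_key,
    user_id,
    session_number,
    session_start
FROM {{ ref('base__analytics__sessions') }}
```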
on_schema_change
Controls behavior when your model’s columns change:
| Value | Behavior |
|---|---|
| `ignore` (default) | New columns aren't added; removed columns cause failure |
| `fail` | Error if schemas diverge |
| `append_new_columns` | Adds new columns but doesn't remove missing ones |
| `sync_all_columns` | Full sync including additions and removals |
None of these options backfill historical records for newly added columns. If you add a column, existing rows will have NULL values.
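A minimal config sketch showing how the option is set, assuming an illustrative orders model that gains columns over time:

```sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='append_new_columns'
) }}
```

To backfill a newly added column for historical rows, run the model once with --full-refresh.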
incremental_predicates
Limits destination table scans during merge operations (critical for large tables):
```sql
{{ config(
    materialized='incremental',
    unique_key='id',
    incremental_strategy='merge',
    incremental_predicates=[
        "DBT_INTERNAL_DEST.created_at > dateadd(day, -7, current_date)"
    ]
) }}
```
This adds filters to the MERGE statement, enabling partition pruning on the destination table. Without predicates, merge scans the entire destination. With predicates, the warehouse prunes partitions and scans only recent data.
Strategy Availability by Warehouse
| Strategy | BigQuery | Snowflake | Databricks |
|---|---|---|---|
| append | ❌ | ✅ | ✅ |
| merge | ✅ Default | ✅ Default | ✅ Default |
| delete+insert | ❌ | ✅ | ✅ (v1.11+) |
| insert_overwrite | ✅ | ✅ (full table only) | ✅ |
| microbatch | ✅ | ✅ | ✅ |
| replace_where | ❌ | ❌ | ✅ Delta only |
Note that BigQuery’s insert_overwrite operates on partitions, while Snowflake’s insert_overwrite replaces the entire table. This naming inconsistency catches many practitioners off guard.
Strategy Deep Dive
Merge
Uses SQL MERGE statements that match on unique_key, updating existing rows and inserting new ones.
```sql
-- Generated SQL (simplified)
MERGE INTO target AS DBT_INTERNAL_DEST
USING tmp AS DBT_INTERNAL_SOURCE
ON DBT_INTERNAL_DEST.id = DBT_INTERNAL_SOURCE.id
WHEN MATCHED THEN UPDATE SET col1 = DBT_INTERNAL_SOURCE.col1, ...
WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (...)
```
Merge works well for small-to-medium tables with row-level updates and CDC patterns where individual records change.
The catch: without incremental_predicates, the MERGE statement scans every row in your destination table to check for matches, even if you’re only inserting 1,000 new records into a 500 million row table. This full table scan happens on every run, meaning your “incremental” model still reads the entire table each time. The result is longer runtimes and higher costs that scale with table size rather than with the amount of new data. Above 100M rows, this becomes painfully slow and expensive. Add incremental_predicates to limit the scan to recent partitions (see the Configuration section above).
Delete+Insert
Executes two operations: DELETE matching records, then INSERT all new records.
```sql
-- Generated flow
-- 1. Create temp table with query results
-- 2. DELETE FROM target USING tmp WHERE tmp.unique_key = target.unique_key
-- 3. INSERT INTO target SELECT * FROM tmp
```
This strategy shines on large Snowflake tables (3.4x faster than merge at 500M+ rows) and batch processing where entire time windows need replacement.
The tradeoff is atomicity. Because delete+insert runs two separate operations (DELETE then INSERT), a failure between them leaves your table in an inconsistent state: records are deleted but new ones aren’t inserted yet. Unlike MERGE, which is all-or-nothing, delete+insert can result in data loss if the job crashes, times out, or gets cancelled mid-execution. You’d need to run a full refresh to recover. For critical tables, consider whether this risk is acceptable, or wrap the operations in a transaction where your warehouse supports it. BigQuery doesn’t support this strategy at all; use merge or insert_overwrite instead.
Insert_Overwrite
Replaces entire partitions rather than individual rows.
```sql
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
        'field': 'event_date',
        'data_type': 'date'
    }
) }}
```
On BigQuery, this uses partition discovery to determine which partitions to replace:
```sql
-- Generated SQL (BigQuery)
DECLARE dbt_partitions_for_replacement ARRAY<DATE>;
SET (dbt_partitions_for_replacement) = (
    SELECT AS STRUCT ARRAY_AGG(DISTINCT DATE(partition_col))
    FROM tmp_table
);

MERGE INTO target USING tmp ON FALSE
WHEN NOT MATCHED BY SOURCE
    AND partition_col IN UNNEST(dbt_partitions_for_replacement)
    THEN DELETE
WHEN NOT MATCHED THEN INSERT ...
```
This strategy is ideal for time-partitioned fact tables, append-heavy workloads, and BigQuery cost optimization.
One major caveat: the name insert_overwrite means different things on different warehouses. On BigQuery, it intelligently replaces only the partitions that contain new data, exactly what you’d expect. On Snowflake, however, it replaces the entire table with whatever your query returns. If your incremental query only selects the last 3 days of data, you’ll lose everything older than that. This naming inconsistency has caught many practitioners off guard, sometimes resulting in catastrophic data loss. On Snowflake, use delete+insert with a date-based unique_key instead to achieve partition-like behavior safely.
Microbatch (dbt 1.9+)
Processes data in time-based batches without requiring is_incremental() logic.
```sql
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session_start',
    begin='2020-01-01',
    batch_size='day',
    lookback=3
) }}

-- No is_incremental() block needed
SELECT
    session_id,
    session_start,
    user_id,
    session_duration
FROM {{ ref('base__analytics__sessions') }}
```
Key differences from traditional incremental:
| Aspect | Traditional | Microbatch |
|---|---|---|
| Query structure | Single SQL for all new data | Separate SQL per batch |
| Batch definition | User-defined in SQL | Configured via event_time, batch_size |
| Retry granularity | Entire model | Individual batches |
| Backfill | Custom logic required | Built-in --event-time-start/end |
Backfill commands:
```bash
# Targeted backfill
dbt run --event-time-start "2024-09-01" --event-time-end "2024-09-04"

# Retry only failed batches
dbt retry
```
Microbatch excels at large historical backfills, failure recovery scenarios, and teams wanting simpler incremental logic.
Three limitations to consider before adopting it:
- UTC timezone assumption: Microbatch uses UTC for all time calculations. If your `event_time` column is in a local timezone (like US/Eastern), batches will split at midnight UTC, not midnight local time. This can cause records to land in unexpected batches or create off-by-one-day errors in your data. Convert your event_time to UTC before using microbatch (see the sketch after this list), or accept that batch boundaries won't align with your business day.
- No sub-hourly granularity: The smallest `batch_size` is `hour`. If you need to process data in 15-minute or 5-minute windows (common for real-time dashboards or alerting), microbatch won't work. You'll need traditional incremental logic with custom time filtering.
- Sequential execution by default: Each batch runs one after another, not in parallel. A 30-day backfill with `batch_size='day'` means 30 separate query executions. While this provides better failure isolation (only retry failed batches), it's slower than a single query that processes all 30 days at once. For very large backfills, consider whether the retry benefits outweigh the speed tradeoff.
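A minimal sketch of the UTC normalization mentioned above, assuming Snowflake's CONVERT_TIMEZONE and an illustrative source timezone; the goal is to hand microbatch a UTC column as event_time so batch boundaries line up:

```sql
SELECT
    session_id,
    -- Normalize the local-time column to UTC before it is used as event_time
    CONVERT_TIMEZONE('America/New_York', 'UTC', session_start) AS session_start,
    user_id,
    session_duration
FROM {{ ref('base__analytics__sessions') }}
```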
Warehouse-Specific Behaviors
BigQuery
Merge:
- Requires `unique_key` (without it, merge fails)
- Does not support dynamic partition pruning; subqueries won't trigger pruning
- Full table scans on large tables without clustering
Insert_overwrite:
- Requires `partition_by` configuration
- Static partitions (recommended) are fastest and most cost-effective
- Dynamic partitions add overhead from partition discovery queries
- `_dbt_max_partition` is a BigQuery scripting variable, not Jinja
You cannot use require_partition_filter=true with the merge strategy unless you also set incremental_predicates.
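A minimal sketch of that combination, assuming an illustrative events model; the predicate on DBT_INTERNAL_DEST supplies the partition filter the destination table demands:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='event_id',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    require_partition_filter=true,
    incremental_predicates=[
        "DBT_INTERNAL_DEST.event_date > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)"
    ]
) }}
```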
Cost optimization pattern:
```sql
{% set partitions_to_replace = [
    'CURRENT_DATE()',
    'DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)',
    'DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)'
] %}

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    partitions=partitions_to_replace
) }}

SELECT
    event_id,
    event_date,
    user_id,
    event_type
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
WHERE event_date IN ({{ partitions_to_replace | join(',') }})
{% endif %}
```
Snowflake
Merge:
- Default strategy with standard MERGE INTO syntax
- Fails with “nondeterministic merge” if unique_key isn’t truly unique
- Performance degrades past 100M rows
Delete+insert:
- Requires `tmp_relation_type: table` when unique_key is defined
- 3.4x faster than merge at 500M+ rows with partition-aligned updates
Insert_overwrite does NOT support partition-based overwrites on Snowflake. It replaces the entire table. Use delete+insert with partition predicates instead.
```sql
{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='date_day'
) }}

SELECT
    event_id,
    date_day,
    user_id,
    event_type
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
WHERE date_day >= DATEADD(DAY, -3, CURRENT_DATE)
{% endif %}
```
Databricks
Merge on Delta Lake:
- ACID-compliant `MERGE INTO`
- Supports schema evolution via `merge_with_schema_evolution=true`
- Advanced options: `matched_condition`, `not_matched_by_source_action` (see the sketch below)
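A minimal sketch of those options, assuming dbt-databricks 1.9+ on Delta Lake and an illustrative orders model; `not_matched_by_source_action='delete'` turns the merge into a CDC-style sync that also removes destination rows that no longer exist in the source:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id',
    merge_with_schema_evolution=true,
    not_matched_by_source_action='delete'
) }}

-- The query should return the full current snapshot of the source; if it only
-- selects recent rows, unmatched historical rows would be deleted.
SELECT order_id, customer_id, order_status, updated_at
FROM {{ ref('base__sales__orders') }}
```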
Insert_overwrite:
- Uses `INSERT INTO ... REPLACE ON` syntax on SQL Warehouses
- Dynamic partition overwrite via `partitionOverwriteMode='dynamic'`
- v1.8.0 regression caused full table replacement; fixed by setting `spark.sql.sources.partitionOverwriteMode=DYNAMIC`
Replace_where (Databricks exclusive):
```sql
{{ config(
    materialized='incremental',
    incremental_strategy='replace_where',
    incremental_predicates=["event_date >= '2024-01-01'"]
) }}
```
Uses `INSERT INTO table REPLACE WHERE predicate` for atomic replacement of rows matching predicates.
Handling Late-Arriving Data
Late-arriving data occurs when records arrive after other data from the same time period. Without handling, incremental models progressively drift from source truth.
The Lookback Window Pattern
Subtract a fixed time period from your cutoff:
```sql
{% set lookback_days = var('lookback_days', 3) %}

{% if is_incremental() %}
WHERE created_at >= (
    SELECT DATEADD(DAY, -{{ lookback_days }}, MAX(created_at))
    FROM {{ this }}
)
{% endif %}
```
Override for custom backfills: `dbt run --select model --vars '{"lookback_days": 10}'`
Trade-offs by window size:
| Window | Coverage | Cost Impact |
|---|---|---|
| 1 day | Minimal; misses most late data | Lowest |
| 3 days | Good balance; handles 99%+ typical late arrivals | Moderate |
| 7 days | Catches nearly all | Higher |
| 14+ days | Maximum coverage | Significant |
Lookback windows don’t catch records arriving after the window closes. Schedule periodic full refreshes (weekly or monthly) to reset accumulated drift.
Idempotency and Deduplication
Idempotent models produce identical results regardless of run count. Pre-deduplicate in your SELECT:
```sql
SELECT
    event_id,
    event_time,
    user_id,
    event_type,
    event_properties
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
WHERE event_time >= (SELECT MAX(event_time) - INTERVAL 3 DAY FROM {{ this }})
{% endif %}

QUALIFY ROW_NUMBER() OVER (
    PARTITION BY event_id
    ORDER BY event_time DESC
) = 1
```
The first run doesn't apply merge deduplication, so duplicates can appear on initial load. Always include deduplication in your SELECT rather than relying solely on unique_key.
Decision Framework
When to Use Incremental Models
Use incremental when:
- Source data has millions or billions of rows
- Transformations are computationally expensive
- dbt runs are becoming too slow or too costly
Thresholds from practitioners:
- Tables under 10M rows rarely benefit from incremental complexity
- Merge performance degrades noticeably above 100M rows
- At 500M+ rows, merge can be 3x slower than optimized alternatives
Which Strategy to Choose
| Use Case | Recommended Strategy | Rationale |
|---|---|---|
| Small-medium tables with updates | merge | Simplest; atomic; handles updates |
| Large tables (>100M rows) on Snowflake | delete+insert | Avoids full table scans |
| Large tables on BigQuery | insert_overwrite | Partition replacement is efficient |
| Time-partitioned fact tables | insert_overwrite | Natural partition alignment |
| Append-only event data | append | No deduplication needed |
| Late-arriving data requiring row updates | merge with incremental_predicates | Updates specific records |
| CDC with deletes | merge on Databricks with not_matched_by_source_action='delete' | Handles inserts, updates, deletes |
| Large historical backfills | microbatch | Built-in backfill, failure recovery |
Common Anti-Patterns
- Not setting unique_key with merge: Results in append-only behavior with duplicates
- NULLs in unique_key columns: Causes row matching failures and silent duplicates
- No lookback window: Late-arriving data is missed; models drift from source
- Using incremental for small tables: Adds complexity without measurable benefit
- Missing schema change handling: Adding columns without `on_schema_change` configuration causes failures
- No periodic full refresh: Data drift accumulates indefinitely
- Using merge for very large tables without predicates: Full table scans become expensive
- Relying solely on unique_key for deduplication: First run doesn't deduplicate; include QUALIFY or ROW_NUMBER in SELECT
Best Practices Checklist
- Start with table materialization; switch to incremental when runs become slow
- Always set `unique_key` for merge strategy; ensure no NULLs in key columns
- Implement 3-day lookback windows as default for late-arriving data
- Use `incremental_predicates` on tables >100M rows
- Partition by time for event tables; cluster by frequently filtered columns
- Schedule periodic full refreshes (weekly/monthly) to reset drift
- Test both incremental and full-refresh modes before deployment
- Set `full_refresh: false` for very large tables to prevent accidental rebuilds
- Pre-deduplicate in SELECT rather than relying solely on unique_key
- Monitor partition health (BigQuery) and clustering depth (Snowflake) regularly