Add snapshots to your DAG
What are snapshots?
Analysts often need to "look back in time" at previous data states in their mutable tables. While some source data systems are built in a way that makes accessing historical data possible, this is not always the case. dbt provides a mechanism, snapshots, which records changes to a mutable table over time.
Snapshots implement type-2 Slowly Changing Dimensions over mutable source tables. These Slowly Changing Dimensions (or SCDs) identify how a row in a table changes over time. Imagine you have an `orders` table where the `status` field can be overwritten as the order is processed.
id | status | updated_at |
---|---|---|
1 | pending | 2024-01-01 |
Now, imagine that the order goes from "pending" to "shipped". That same record will now look like:
id | status | updated_at |
---|---|---|
1 | shipped | 2024-01-02 |
This order is now in the "shipped" state, but we've lost the information about when the order was last in the "pending" state. This makes it difficult (or impossible) to analyze how long it took for an order to ship. dbt can "snapshot" these changes to help you understand how values in a row change over time. Here's an example of a snapshot table for the previous example:
id | status | updated_at | dbt_valid_from | dbt_valid_to |
---|---|---|---|---|
1 | pending | 2024-01-01 | 2024-01-01 | 2024-01-02 |
1 | shipped | 2024-01-02 | 2024-01-02 | null |
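Because each version of a row is bounded by `dbt_valid_from` and `dbt_valid_to`, you can reconstruct the state of the table at any point in time. A sketch of an "as of" query (the snapshot table name `orders_snapshot` is an assumption; adjust the timestamp literals to your platform's syntax):

```sql
-- Status of each order as it was on 2024-01-01
select id, status
from orders_snapshot
where dbt_valid_from <= '2024-01-01'
  and (dbt_valid_to > '2024-01-01' or dbt_valid_to is null)
```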
Configuring snapshots
Configure your snapshots in YAML files to tell dbt how to detect record changes. Define snapshot configurations in YAML files, alongside your models, for a cleaner, faster, and more consistent setup.
```yaml
snapshots:
  - name: string
    relation: relation # source('my_source', 'my_table') or ref('my_model')
    description: markdown_string
    config:
      database: string
      schema: string
      alias: string
      strategy: timestamp | check
      unique_key: column_name_or_expression
      check_cols: [column_name] | all
      updated_at: column_name
      snapshot_meta_column_names: dictionary
      dbt_valid_to_current: string
      hard_deletes: ignore | invalidate | new_record
```
The following table outlines the configurations available for snapshots:
Config | Description | Required? | Example |
---|---|---|---|
database | Specify a custom database for the snapshot | No | analytics |
schema | Specify a custom schema for the snapshot | No | snapshots |
alias | Specify an alias for the snapshot | No | your_custom_snapshot |
strategy | The snapshot strategy to use. Valid values: timestamp or check | Yes | timestamp |
unique_key | A primary key column(s) (string or array) or expression for the record | Yes | id or [order_id, product_id] |
check_cols | If using the check strategy, then the columns to check | Only if using the check strategy | ["status"] |
updated_at | A column in your snapshot query results that indicates when each record was last updated, used in the timestamp strategy. May support ISO date strings and unix epoch integers, depending on the data platform you use. | Only if using the timestamp strategy | updated_at |
dbt_valid_to_current | Set a custom indicator for the value of dbt_valid_to in current snapshot records (like a future date). By default, this value is NULL. When configured, dbt will use the specified value instead of NULL for dbt_valid_to for current records in the snapshot table. | No | string |
snapshot_meta_column_names | Customize the names of the snapshot meta fields | No | dictionary |
hard_deletes | Specify how to handle deleted rows from the source. Supported options are ignore (default), invalidate (replaces the legacy invalidate_hard_deletes=true ), and new_record . | No | string |
- In versions prior to v1.9, the `target_schema` (required) and `target_database` (optional) configurations defined a single schema or database to build a snapshot across users and environments. This created problems when testing or developing a snapshot, as there was no clear separation between development and production environments. In v1.9, `target_schema` became optional, allowing snapshots to be environment-aware. By default, without `target_schema` or `target_database` defined, snapshots now use the `generate_schema_name` or `generate_database_name` macros to determine where to build. Developers can still set a custom location with `schema` and `database` configs, consistent with other resource types.
- A number of other configurations are also supported (for example, `tags` and `post-hook`). For the complete list, refer to Snapshot configurations.
- You can configure snapshots from both the `dbt_project.yml` file and a `config` block. For more information, refer to the configuration docs.
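Snapshot configs can also be applied project-wide from `dbt_project.yml` using the `+` prefix. A minimal sketch (the project name `my_project` is hypothetical):

```yaml
# dbt_project.yml
snapshots:
  my_project:
    +schema: snapshots     # applies to every snapshot in the project
    +strategy: timestamp
    +updated_at: updated_at
```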
Add a snapshot to your project
To add a snapshot to your project, follow these steps. For users on versions 1.8 and earlier, refer to Configure snapshots in versions 1.8 and earlier.

1. Create a YAML file in your `snapshots` directory (`snapshots/orders_snapshot.yml`) and add your configuration details. You can also configure your snapshot from your `dbt_project.yml` file (docs).

   ```yaml
   # snapshots/orders_snapshot.yml
   snapshots:
     - name: orders_snapshot
       relation: source('jaffle_shop', 'orders')
       config:
         schema: snapshots
         database: analytics
         unique_key: id
         strategy: timestamp
         updated_at: updated_at
         dbt_valid_to_current: "to_date('9999-12-31')" # Specifies that current records should have `dbt_valid_to` set to `'9999-12-31'` instead of `NULL`.
   ```

2. Since snapshots focus on configuration, the transformation logic is minimal. Typically, you'd select all data from the source. If you need to apply transformations (like filters or deduplication), it's best practice to define an ephemeral model and reference it in your snapshot configuration.

   ```sql
   -- models/ephemeral_orders.sql
   {{ config(materialized='ephemeral') }}

   select * from {{ source('jaffle_shop', 'orders') }}
   ```

3. Check whether the result set of your query includes a reliable timestamp column that indicates when a record was last updated. For our example, the `updated_at` column reliably indicates record changes, so we can use the `timestamp` strategy. If your query result set does not have a reliable timestamp, you'll need to use the `check` strategy instead; more details on this below.

4. Run the `dbt snapshot` command. For our example, a new table will be created at `analytics.snapshots.orders_snapshot`. The `schema` config will utilize the `generate_schema_name` macro.

   ```
   $ dbt snapshot
   Running with dbt=1.9.0
   15:07:36 | Concurrency: 8 threads (target='dev')
   15:07:36 |
   15:07:36 | 1 of 1 START snapshot snapshots.orders_snapshot...... [RUN]
   15:07:36 | 1 of 1 OK snapshot snapshots.orders_snapshot..........[SELECT 3 in 1.82s]
   15:07:36 |
   15:07:36 | Finished running 1 snapshots in 0.68s.
   Completed successfully
   Done. PASS=2 ERROR=0 SKIP=0 TOTAL=1
   ```

5. Inspect the results by selecting from the table dbt created (`analytics.snapshots.orders_snapshot`). After the first run, you should see the results of your query, plus the snapshot meta fields as described later on.

6. Run the `dbt snapshot` command again and inspect the results. If any records have been updated, the snapshot should reflect this.

7. Select from the snapshot in downstream models using the `ref` function.

   ```sql
   -- models/changed_orders.sql
   select * from {{ ref('orders_snapshot') }}
   ```

8. Snapshots are only useful if you run them frequently, so schedule the `dbt snapshot` command to run regularly.
Configuration best practices
How snapshots work
When you run the `dbt snapshot` command:

- On the first run: dbt will create the initial snapshot table. This will be the result set of your `select` statement, with additional columns including `dbt_valid_from` and `dbt_valid_to`. All records will have a `dbt_valid_to = null` or the value specified in `dbt_valid_to_current` (available in dbt Core v1.9+) if configured.
- On subsequent runs: dbt will check which records have changed or if any new records have been created:
  - The `dbt_valid_to` column will be updated for any existing records that have changed.
  - The updated record and any new records will be inserted into the snapshot table. These records will now have `dbt_valid_to = null` or the value configured in `dbt_valid_to_current` (available in dbt Core v1.9+).
Note

- These column names can be customized to your team or organizational conventions using the `snapshot_meta_column_names` config.
- Use the `dbt_valid_to_current` config to set a custom indicator for the value of `dbt_valid_to` in current snapshot records (like a future date such as `9999-12-31`). By default, this value is `NULL`. When set, dbt will use this specified value instead of `NULL` for `dbt_valid_to` for current records in the snapshot table.
- Use the `hard_deletes` config to track hard deletes by adding a new record when rows become "deleted" in the source. Supported options are `ignore`, `invalidate`, and `new_record`.
Snapshots can be referenced in downstream models the same way as referencing models — by using the ref function.
Detecting row changes
Snapshot "strategies" define how dbt knows if a row has changed. There are two strategies built into dbt:

- Timestamp: Uses an `updated_at` column to determine if a row has changed.
- Check: Compares a list of columns between their current and historical values to determine if a row has changed.
Timestamp strategy (recommended)
The `timestamp` strategy uses an `updated_at` field to determine if a row has changed. If the configured `updated_at` column for a row is more recent than the last time the snapshot ran, then dbt will invalidate the old record and record the new one. If the timestamps are unchanged, then dbt will not take any action.

The `timestamp` strategy requires the following configurations:
Config | Description | Example |
---|---|---|
updated_at | A column which represents when the source row was last updated. May support ISO date strings and unix epoch integers, depending on the data platform you use. | updated_at |
Example usage:
```yaml
snapshots:
  - name: orders_snapshot_timestamp
    relation: source('jaffle_shop', 'orders')
    config:
      schema: snapshots
      unique_key: id
      strategy: timestamp
      updated_at: updated_at
```
Check strategy
The `check` strategy is useful for tables which do not have a reliable `updated_at` column. This strategy works by comparing a list of columns between their current and historical values. If any of these columns have changed, then dbt will invalidate the old record and record the new one. If the column values are identical, then dbt will not take any action.

The `check` strategy requires the following configurations:
Config | Description | Example |
---|---|---|
check_cols | A list of columns to check for changes, or all to check all columns | ["name", "email"] |
The `check` snapshot strategy can be configured to track changes to all columns by supplying `check_cols = 'all'`. It is better to explicitly enumerate the columns that you want to check. Consider using a surrogate key to condense many columns into a single column.
Example usage
```yaml
snapshots:
  - name: orders_snapshot_check
    relation: source('jaffle_shop', 'orders')
    config:
      schema: snapshots
      unique_key: id
      strategy: check
      check_cols:
        - status
        - is_cancelled
```
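One way to apply the surrogate-key suggestion above is to hash the tracked columns into a single column in an ephemeral model, then `check` only that column. A sketch, assuming the `dbt_utils` package is installed (the model and column names are hypothetical):

```sql
-- models/orders_with_row_hash.sql (hypothetical model)
{{ config(materialized='ephemeral') }}

select
    *,
    -- Hash the tracked columns into one value; a change in either column changes the hash
    {{ dbt_utils.generate_surrogate_key(['status', 'is_cancelled']) }} as row_hash
from {{ source('jaffle_shop', 'orders') }}
```

The snapshot config would then point at this model with `relation: ref('orders_with_row_hash')`, `strategy: check`, and `check_cols: [row_hash]`.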
Hard deletes (opt-in)
In dbt v1.9 and higher, the `hard_deletes` config replaces the `invalidate_hard_deletes` config to give you more control over how to handle deleted rows from the source. The `hard_deletes` config is not a separate strategy but an additional opt-in feature that can be used with any snapshot strategy.

The `hard_deletes` config has three options/fields:
Field | Description |
---|---|
ignore (default) | No action for deleted records. |
invalidate | Behaves the same as the existing invalidate_hard_deletes=true , where deleted records are invalidated by setting dbt_valid_to . |
new_record | Tracks deleted records as new rows using the dbt_is_deleted meta field when records are deleted. |
Example usage
```yaml
snapshots:
  - name: orders_snapshot_hard_delete
    relation: source('jaffle_shop', 'orders')
    config:
      schema: snapshots
      unique_key: id
      strategy: timestamp
      updated_at: updated_at
      hard_deletes: new_record # options are: 'ignore', 'invalidate', or 'new_record'
```

In this example, the `hard_deletes: new_record` config will add a new row for deleted records with the `dbt_is_deleted` column set to `True`.

Any restored records are added as new rows with the `dbt_is_deleted` field set to `False`.
The resulting table will look like this:
id | status | updated_at | dbt_valid_from | dbt_valid_to | dbt_is_deleted |
---|---|---|---|---|---|
1 | pending | 2024-01-01 10:47 | 2024-01-01 10:47 | 2024-01-01 11:05 | False |
1 | shipped | 2024-01-01 11:05 | 2024-01-01 11:05 | 2024-01-01 11:20 | False |
1 | deleted | 2024-01-01 11:20 | 2024-01-01 11:20 | 2024-01-01 12:00 | True |
1 | restored | 2024-01-01 12:00 | 2024-01-01 12:00 | | False |
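With `hard_deletes: new_record` configured, selecting the latest state of each row means filtering on both the open validity window and the `dbt_is_deleted` flag. A sketch (the table name `orders_snapshot` is an assumption; the flag values follow the example above):

```sql
-- Current, non-deleted version of each order
select id, status, updated_at
from orders_snapshot
where dbt_valid_to is null
  and dbt_is_deleted = 'False'
```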
Snapshot meta-fields
Snapshot tables will be created as a clone of your source dataset, plus some additional meta-fields*.
In dbt Core v1.9+ (or available sooner in the "Latest" release track in dbt Cloud):
- These column names can be customized to your team or organizational conventions using the `snapshot_meta_column_names` config.
- Use the `dbt_valid_to_current` config to set a custom indicator for the value of `dbt_valid_to` in current snapshot records (like a future date such as `9999-12-31`). By default, this value is `NULL`. When set, dbt will use this specified value instead of `NULL` for `dbt_valid_to` for current records in the snapshot table.
- Use the `hard_deletes` config to track deleted records as new rows with the `dbt_is_deleted` meta field when using the `hard_deletes='new_record'` field.
Field | Meaning | Notes | Example |
---|---|---|---|
dbt_valid_from | The timestamp when this snapshot row was first inserted and became valid. | This column can be used to order the different "versions" of a record. | snapshot_meta_column_names: {dbt_valid_from: start_date} |
dbt_valid_to | The timestamp when this row became invalidated. For current records, this is NULL by default or the value specified in dbt_valid_to_current . | The most recent snapshot record will have dbt_valid_to set to NULL or the specified value. | snapshot_meta_column_names: {dbt_valid_to: end_date} |
dbt_scd_id | A unique key generated for each snapshot row. | This is used internally by dbt. | snapshot_meta_column_names: {dbt_scd_id: scd_id} |
dbt_updated_at | The updated_at timestamp of the source record when this snapshot row was inserted. | This is used internally by dbt. | snapshot_meta_column_names: {dbt_updated_at: modified_date} |
dbt_is_deleted | A string value indicating if the record has been deleted. (True if deleted, False if not deleted). | Added when hard_deletes='new_record' is configured. | snapshot_meta_column_names: {dbt_is_deleted: is_deleted} |
All of these column names can be customized using the `snapshot_meta_column_names` config. Refer to this example for more details.
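For instance, a team that prefers `start_date`/`end_date` naming could rename the meta fields like this (the renamed values are illustrative, taken from the examples in the table above):

```yaml
snapshots:
  - name: orders_snapshot
    relation: source('jaffle_shop', 'orders')
    config:
      unique_key: id
      strategy: timestamp
      updated_at: updated_at
      snapshot_meta_column_names:
        dbt_valid_from: start_date
        dbt_valid_to: end_date
        dbt_scd_id: scd_id
        dbt_updated_at: modified_date
```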
*The timestamps used for each column are subtly different depending on the strategy you use:

- For the `timestamp` strategy, the configured `updated_at` column is used to populate the `dbt_valid_from`, `dbt_valid_to` and `dbt_updated_at` columns.
- For the `check` strategy, the current timestamp is used to populate each column. If configured, the `check` strategy uses the `updated_at` column instead, as with the `timestamp` strategy.
Configure snapshots in versions 1.8 and earlier
For information about configuring snapshots in dbt versions 1.8 and earlier, select 1.8 from the documentation version picker, and it will appear in this section.
To configure snapshots in versions 1.9 and later, refer to Configuring snapshots. The latest versions use an updated snapshot configuration syntax that optimizes performance.