In this tutorial, you will learn how to define and use sources in dbt. Sources allow you to reference raw data in your data warehouse in a structured and maintainable way, improving data lineage, documentation, and quality control.
This tutorial will guide you through step-by-step examples, making it easy to integrate sources into your dbt workflow. Whether you are new to dbt or looking to enhance your existing projects, this tutorial will provide the knowledge you need to organize and manage your raw data effectively.
What are sources in dbt ? #
Sources allow you to define and document the data that your Extract and Load (EL) tools bring into your data warehouse. By declaring tables as sources in dbt, you can:
- Reference source tables in your models using the
{{ source() }}
function, making data lineage clear. - Validate assumptions about your source data through tests.
- Monitor data freshness to ensure timely and accurate transformations.
Create a source #
To create a source, simply add a filename.yaml
file inside the models directory where your dbt models are defined.
In the example below, we define two sources: dev
and prod
.
- The
dev
source includes two tables from thedbt-demo
database within thedemo
schema. - The
prod
source contains one table.
version: 2 sources: - name: dev database: dbt-demo schema: demo tables: - name: orders - name: customers - name: prod tables: - name: payments
ℹ️ By default, the schema matches the source name. Include the schema only if the source name differs from the actual schema in your database.
Using a source #
A source in dbt serves three main purposes:
- Selecting data within models from a source.
- Testing data quality of the source.
- Checking the freshness of the source data.
In this tutorial, we’ll focus only on using sources for selecting data. Dedicated chapters will cover the concepts of data quality testing and source checking freshness.
In the previous tutorial about dbt model, we created the completed_orders
model by selecting data directly from demo.orders
.
SELECT order_id, customer_id, order_date, total_amount FROM demo.orders WHERE status = 'Completed'
Now, let’s utilize the dev
source we defined earlier. In the code below, the FROM
statement has been updated to reference the source instead of directly selecting from the table.
SELECT order_id, customer_id, order_date, total_amount FROM {{ source('dev', 'orders') }} WHERE status = 'Completed'