Apache Airflow is a workflow orchestration tool for automating and scheduling workflows such as ETL jobs and data pipelines. At its core, it consists of DAGs, Operators, and Tasks, which together define how workflows are executed.
To follow along, you can install Airflow in a local Python environment by following this tutorial.
Directed Acyclic Graph (DAG)
A DAG (Directed Acyclic Graph) is a collection of tasks that Airflow schedules and runs in a defined order. It ensures that tasks are executed as part of a single DAG run and follow dependencies correctly.
The DAG does not manage what happens inside each task; its role is to define how tasks are executed: their order, how many times to retry them, their timeouts, and other scheduling parameters.
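As a quick illustration, here is a minimal sketch (the DAG name and task are hypothetical) of how retry attempts and timeouts can be set for every task in a DAG via `default_args`:

```python
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from datetime import datetime, timedelta

# DAG-level defaults: every task in this DAG inherits these settings
dag = DAG(
    "dag_with_retries_and_timeouts",  # hypothetical name for illustration
    schedule="@daily",
    start_date=datetime(2025, 3, 1),
    catchup=False,
    default_args={
        "retries": 2,                                # retry a failed task twice
        "retry_delay": timedelta(minutes=5),         # wait 5 minutes between retries
        "execution_timeout": timedelta(minutes=30),  # fail tasks that run too long
    },
)

task = EmptyOperator(task_id="placeholder_task", dag=dag)
```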
Key Features of a DAG
- Directed: Tasks execute in a specific order.
- Acyclic: No circular dependencies (Task A → Task B → Task A is NOT allowed).
- Graph: Represents a workflow where tasks are nodes, and dependencies are edges.
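To make the graph structure concrete, here is a small sketch (the DAG and task names are hypothetical) in which one task fans out to two downstream tasks that then converge, forming nodes and edges with no cycles:

```python
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from datetime import datetime

with DAG(
    "dag_as_graph",  # hypothetical name for illustration
    schedule="@daily",
    start_date=datetime(2025, 3, 1),
    catchup=False,
) as dag:
    # Four nodes in the graph
    extract = EmptyOperator(task_id="extract")
    transform_a = EmptyOperator(task_id="transform_a")
    transform_b = EmptyOperator(task_id="transform_b")
    load = EmptyOperator(task_id="load")

    # Edges: extract fans out to both transforms, which converge on load.
    # No path leads back to extract, so the graph stays acyclic.
    extract >> [transform_a, transform_b] >> load
```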
Creating a DAG in Airflow
There are three ways to declare a DAG:
- Using the `DAG` constructor
```python
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from datetime import datetime

# Define DAG
dag = DAG(
    "dag_using_constructor",
    schedule="@daily",  # Run daily
    start_date=datetime(2025, 3, 1),
    catchup=False,
)

# Define tasks
start_task = EmptyOperator(task_id="start", dag=dag)
end_task = EmptyOperator(task_id="end", dag=dag)

# Define dependencies
start_task >> end_task
```
- Using a Python context manager (`with` statement)
```python
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from datetime import datetime

# Define DAG using a context manager
with DAG(
    "dag_using_context_manager",
    schedule="@daily",
    start_date=datetime(2025, 3, 1),
) as dag:
    # Tasks defined inside the context manager are
    # automatically attached to the DAG
    start_task = EmptyOperator(task_id="start")
    end_task = EmptyOperator(task_id="end")

    # Define dependencies
    start_task >> end_task
```
- Using the `@dag` decorator
```python
from airflow.decorators import dag
from airflow.operators.empty import EmptyOperator
from datetime import datetime

# Define DAG using the TaskFlow API
@dag(start_date=datetime(2025, 3, 1), schedule="@daily")
def generate_dag():
    EmptyOperator(task_id="task")

# Calling the decorated function registers the DAG
generate_dag()
```
When defining a DAG, three essential parameters must be provided:
- `dag_id`: The unique identifier for the DAG within an Airflow instance.
- `schedule`: Defines the execution frequency.
- `start_date`: Specifies when the DAG’s first execution will occur.
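The `schedule` parameter accepts several forms, including cron presets, cron expressions, and `timedelta` intervals. Here is a minimal sketch tying the three parameters together (the DAG name and cron expression are arbitrary):

```python
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from datetime import datetime, timedelta

dag = DAG(
    "dag_with_cron_schedule",         # dag_id: unique within the Airflow instance
    schedule="0 6 * * *",             # cron expression: every day at 06:00
    # schedule="@daily",              # alternative: a cron preset
    # schedule=timedelta(hours=6),    # alternative: a fixed time interval
    # schedule=None,                  # alternative: manual triggering only
    start_date=datetime(2025, 3, 1),  # earliest date a run can be scheduled
    catchup=False,
)

task = EmptyOperator(task_id="task", dag=dag)
```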
ℹ️ When using the decorator method to create a DAG, if `dag_id` is not provided, it defaults to the name of the function that defines the DAG. In the example above, since we did not specify a `dag_id`, the DAG’s name will be `generate_dag`.
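To use a different name, pass `dag_id` to the decorator explicitly. A minimal sketch (the id below is arbitrary):

```python
from airflow.decorators import dag
from airflow.operators.empty import EmptyOperator
from datetime import datetime

# An explicit dag_id overrides the function-name default
@dag(dag_id="my_custom_dag", start_date=datetime(2025, 3, 1), schedule="@daily")
def generate_dag():
    EmptyOperator(task_id="task")

generate_dag()
```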