Apache Airflow 101
In the fast-paced world of data engineering, keeping your workflows in sync is like conducting a digital symphony. That’s where orchestration tools come in – and Apache Airflow is one of the standout performers. But beyond the buzz, what is Airflow really all about, and how do you get your hands on the baton? Let’s unpack it from the ground up.
What is Airflow?
Apache Airflow is a flexible, open-source framework designed to let you build, schedule, and keep an eye on your data workflows. It transforms complex pipeline management into a streamlined, programmable experience, giving you full control from start to finish.
Think of it as a workflow orchestrator: you define tasks (like pulling data, cleaning it, storing it, sending a report), and Airflow handles when and how each task runs. It doesn’t process data itself – it just tells each part when to play its role, like a conductor in an orchestra.
Why use Airflow?
- Automation: Run your workflows automatically – on a schedule or when triggered.
- Visibility: Monitor task status easily with a clean web UI.
- Scalability: Handles everything from simple jobs to complex pipelines with hundreds of tasks.
- Modularity: Write workflows as Python code (DAGs), making them reusable and versionable.
- Integration: Works smoothly with AWS, GCP, Azure, and other tools via operators and hooks – for example, uploading a file to AWS S3 or triggering BigQuery jobs.
- Parallel Execution: Efficiently runs multiple tasks in parallel using available workers.
Architecture
Airflow is made up of several core components that work together to define, schedule, and execute workflows. Here's how they fit together:
🧠 1. Scheduler
The Scheduler is Airflow’s brain. It reads DAGs, checks task schedules, and decides what to run when. Think of it like your workflow’s calendar assistant.
⚙️ 2. Executor
The Executor actually runs the tasks. It can run them:
- On the same machine (LocalExecutor)
- On distributed workers (CeleryExecutor / KubernetesExecutor)
The Scheduler delegates → the Executor executes.
🌐 3. Web Server (UI)
The Airflow Web UI is your control room. It's clean, intuitive, and super helpful. In the UI you can:
- View DAG structures
- Monitor task status
- Trigger or re-run tasks
📦 4. Metadata Database
Everything runs through this Metadata DB – it's the central source of truth. Airflow uses a relational DB (like PostgreSQL/MySQL) to store:
- DAG definitions
- Task run histories
- Scheduling info
💻 5. Workers (for distributed setups)
In scalable environments (e.g., Celery/Kubernetes), tasks run on separate worker nodes, enabling parallel execution across machines.
📂 6. DAG Directory – The Workflow Vault
When the Airflow Scheduler starts up, it scans the DAG Directory to discover new DAGs, parse them, and decide which tasks need to run based on their schedule. For example, if your DAGs are hosted on GCP Cloud Composer, the DAG directory is a path to a folder in a GCS bucket.
Core Concepts
DAGs / How does Airflow orchestrate workflows?
In Airflow, workflows are defined as DAGs, which stands for Directed Acyclic Graphs.
That might sound technical at first, but it's just a fancy way of describing a set of tasks that are connected in a specific order – and that order doesn’t loop back on itself (hence “acyclic”).
DAG
- Directed: The workflow has a clear direction. One task runs before or after another.
- Acyclic: There are no cycles – meaning a task won’t end up calling itself again, directly or indirectly.
- Graph: It’s a collection of nodes (tasks) and edges (dependencies).
Task
- Each task in a DAG represents a single unit of work, like downloading a file, transforming data, or sending an email.
- The DAG defines how these tasks are connected – for example, “task B should only run after task A finishes successfully”.
Airflow uses this DAG to orchestrate the workflow, deciding how your tasks are structured, when each task runs, in what order, and how retries, failures, and dependencies are handled.
So when we say Airflow is a workflow orchestrator, what we really mean is: it manages and runs your DAGs, making sure every task happens at the right time and in the right way.
In code, a DAG is defined using Python. Here’s a super basic example:
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from datetime import datetime

with DAG(
    dag_id="example_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # Run once a day
    catchup=False
) as dag:
    start = DummyOperator(task_id="start")
    end = DummyOperator(task_id="end")

    start >> end  # Define the task dependency
This defines a simple DAG that runs once per day and has two tasks: start and end, where end should only run after start finishes successfully.
Key things a DAG defines:
- Schedule: How often the DAG should run.
- Start date: When the DAG should begin scheduling.
- Tasks: What needs to be done, broken into steps.
- Dependencies: How tasks depend on each other.
It’s important to remember that a DAG doesn’t do any real work itself. It’s just a blueprint. The actual work happens in the tasks that the DAG organizes and runs.
Operators
If a DAG is the blueprint of your workflow, then Operators are the building blocks that define what each task does.
In Airflow, an Operator is one of the ways to define a task. You can think of Operators as tools in a toolbox – each one is built for a specific kind of job.
What is an Operator?
An Operator is a predefined class in Airflow that performs a specific action. When you instantiate an Operator, you're actually creating a task in a DAG.
Types of Operators
Action Operator (executes something) | Transfer Operator (moves data from one place to another) | Sensor Operator (waits for something) |
---|---|---|
BashOperator (Runs a shell command) | S3ToGCSOperator (To move data from S3 to GCS) | S3KeySensor (Waits for a specific file or key to appear in an S3 bucket) |
EmailOperator (Sends an email) | LocalFilesystemToGCSOperator (To move data from local to GCS) | FileSensor (Waits for a file to exist at a specified local or mounted path) |
PythonOperator (Runs a Python function) | S3ToRedshiftOperator (Transfer data from S3 to Redshift table) | HttpSensorAsync (Asynchronously waits for an HTTP endpoint to return a valid response) |
… | … | … |
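As a concrete illustration, here's a minimal sketch (the DAG name, task IDs, and commands are just placeholders) that instantiates two of the action operators from the table above:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def greet():
    print("Hello from Airflow!")

with DAG(
    dag_id="operator_examples",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False
) as dag:
    # Each Operator instantiation becomes a task in this DAG
    say_hello = PythonOperator(task_id="say_hello", python_callable=greet)
    list_files = BashOperator(task_id="list_files", bash_command="ls -l")

    say_hello >> list_files

Each instantiation creates one task; the DAG simply wires them together.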
Hooks
In Airflow, Hooks act as bridges between your workflows and the outside world – whether it's a database, cloud service, API, or another external system. They handle the behind-the-scenes connections, so your pipelines can talk to other platforms without breaking a sweat.
While Operators define what a task does, Hooks handle how it connects to something outside of Airflow.
What does a Hook do?
A Hook knows how to:
- Authenticate with an external service (like AWS, GCP, or a SQL database)
- Establish a connection
- Run commands or queries (e.g., run a SQL query, upload to S3, call an API)
For example, the PostgresHook knows how to connect to a Postgres database and run SQL commands, the S3Hook knows how to talk to Amazon S3.
Here’s an example of using a Hook inside a PythonOperator:
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.operators.python import PythonOperator

def query_postgres():
    hook = PostgresHook(postgres_conn_id="my_postgres_conn")
    records = hook.get_records("SELECT * FROM users")
    print(records)

query_task = PythonOperator(
    task_id="query_users",
    python_callable=query_postgres
)
In this example:
- PostgresHook uses a connection defined in Airflow (my_postgres_conn)
- It handles authentication and connection pooling
- You don’t have to write raw connection logic—Airflow does it for you
Hooks Power Operators
Many built-in Operators (like S3ToRedshiftOperator, BigQueryOperator, etc.) use Hooks under the hood. As a developer, you’ll rarely use Hooks directly unless you’re writing custom logic – but it’s useful to know they’re doing the heavy lifting.
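For instance, here's a simplified sketch (not the real implementation of any built-in operator) of how a custom operator might use a Hook internally; the class name, table, and connection ID are hypothetical:

from airflow.models.baseoperator import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

class RowCountOperator(BaseOperator):
    """Hypothetical operator that logs the row count of a table via PostgresHook."""

    def __init__(self, table, postgres_conn_id="my_postgres_conn", **kwargs):
        super().__init__(**kwargs)
        self.table = table
        self.postgres_conn_id = postgres_conn_id

    def execute(self, context):
        # The Hook handles the connection details; the Operator just defines the task logic
        hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        count = hook.get_first(f"SELECT COUNT(*) FROM {self.table}")[0]
        self.log.info("Table %s has %s rows", self.table, count)
        return count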
Hooks vs Operators
Both are Python classes, but they differ in their level of abstraction: an Operator is a higher-level building block that defines a complete task, while a Hook is a lower-level interface to an external system that Operators (or your own code) can use.
Which one should we use?
- If an operator is available to perform the task, use the operator first.
- If not, use a hook if one is available.
- Only as a last resort, write your own code to accomplish the task.
Sensors
Sensors are a special type of Operator that are designed to do exactly one thing – wait for something to occur.
from airflow.sensors.filesystem import FileSensor

sensor_task = FileSensor(
    task_id="check_for_file",
    filepath="/path/to/file.txt",
    mode="poke",            # Default: 'poke'
    # mode="reschedule",    # Switch to reschedule mode
    poke_interval=60,       # Default: 60 seconds
    timeout=600,            # Default: 60*60*24*7 (7 days)
    soft_fail=True,         # Default: False
    deferrable=False,       # Default: False
)
In this example:
- FileSensor is the sensor operator that waits for a file at '/path/to/file.txt'.
- mode='poke': the sensor checks for the file every poke_interval seconds (here, 60s); more details in the next section.
- timeout: how long to keep checking before the task is marked as failed or skipped (here, 10 minutes).
- soft_fail=True: marks the task as skipped instead of failed if the file isn't found in time.
- deferrable=False: uses a regular sensor instead of the newer async deferrable ones (more details in the following section).
Sensor Modes
- poke (default): The Sensor takes up a worker slot for its entire runtime.
- reschedule: The Sensor takes up a worker slot only while it is checking, and sleeps for a set duration between checks (a short sketch of this mode follows).
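To make the difference concrete, here's a minimal sketch of a FileSensor in reschedule mode (the path, interval, and timeout are just placeholders):

from airflow.sensors.filesystem import FileSensor

wait_for_file = FileSensor(
    task_id="wait_for_file",
    filepath="/path/to/file.txt",
    mode="reschedule",   # release the worker slot between checks
    poke_interval=300,   # re-check every 5 minutes
    timeout=60 * 60,     # give up after 1 hour
)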
Deferrable Operators & Triggers
- When a task has nothing to do but wait, an operator can suspend itself and free up the worker for other processes by deferring.
- When an operator defers, execution moves to the triggerer.
- Tasks in a deferred state don’t occupy pool slots (a minimal sketch follows).
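A minimal sketch, assuming an Airflow version where FileSensor supports the deferrable flag and a triggerer process is running alongside the scheduler:

from airflow.sensors.filesystem import FileSensor

wait_for_file = FileSensor(
    task_id="wait_for_file_deferred",
    filepath="/path/to/file.txt",
    deferrable=True,   # the wait is handed off to the triggerer, freeing the worker slot
)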
Branching and Triggers
In real-life data workflows, the path forward isn’t always linear. Sometimes, your DAG needs to make decisions – run different tasks depending on some condition. That’s where Branching and Trigger Rules come in.
Branching: Choosing the Path at Runtime
Airflow’s BranchPythonOperator lets you dynamically decide which downstream task(s) to run based on logic defined in a Python function.
Trigger Rules: Controlling Task Execution Logic
By default, a task will run only if all upstream tasks succeed. This is controlled by the trigger_rule parameter.
Here are a few commonly used trigger rules:
Trigger Rule | Meaning |
---|---|
all_success | Run if all upstream tasks succeeded |
one_success | Run if any one upstream task succeeded |
all_failed | Run only if all upstream tasks failed |
all_done | Run once all upstream tasks are complete (success or fail) |
… | … |
Here’s an example which uses branching:
from airflow import DAG
from airflow.operators.python import BranchPythonOperator
from airflow.operators.empty import EmptyOperator
from airflow.utils.dates import days_ago

def choose_branch():
    # Logic to decide the path
    return "task_a"

with DAG(
    "branching_example",
    start_date=days_ago(1),
    schedule_interval="@daily",
    catchup=False
) as dag:
    branch = BranchPythonOperator(
        task_id="branching",
        python_callable=choose_branch
    )

    task_a = EmptyOperator(task_id="task_a")
    task_b = EmptyOperator(task_id="task_b")
    final = EmptyOperator(
        task_id="final",
        trigger_rule="none_failed_min_one_success"  # run even though the unfollowed branch is skipped
    )

    branch >> [task_a, task_b]  # only one will run
    [task_a, task_b] >> final
BranchPythonOperator decides which path (task) to run next. Only one branch is followed (task_a in this case), the other is skipped. Use this when your workflow needs to choose a path at runtime.
Here’s an example which uses triggers:
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.trigger_rule import TriggerRule
from airflow.utils.dates import days_ago

with DAG(
    "trigger_rule_example",
    start_date=days_ago(1),
    schedule_interval="@daily",
    catchup=False
) as dag:
    t1 = EmptyOperator(task_id="task_1")
    t2 = EmptyOperator(task_id="task_2")

    t3 = EmptyOperator(
        task_id="task_3",
        trigger_rule=TriggerRule.ONE_SUCCESS  # Runs if either t1 or t2 succeeds
    )

    [t1, t2] >> t3
By default, a task runs only if all upstream tasks succeed (all_success). trigger_rule="one_success" lets task_3 run if at least one of task_1 or task_2 succeeds. Useful when merging branches or handling optional paths.
Catch up & Backfilling
Airflow’s scheduler is designed to run tasks on a defined schedule – daily, hourly, weekly, etc.
But what happens if you start a DAG today that was supposed to begin running a week ago? That’s where Catchup and Backfilling come in.
What is Catchup?
When catchup=True (which is the default), Airflow will automatically run all the past DAG runs between the start_date and now – one run per scheduled interval.
Example:
from airflow.models.dag import DAG
from pendulum import datetime

dag = DAG(
    dag_id="example_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=True
)
If you launch this DAG on April 1, 2025, Airflow will try to "catch up" by creating DAG runs for every day from Jan 1, 2024 to March 31, 2025. That’s 456 runs!
Disabling Catchup
If you don’t want to run all those past DAGs, you can turn off catchup:
catchup=False
Now Airflow will only schedule runs from the current time forward.
What is Backfilling?
Backfilling is basically the manual version of catchup. You can trigger backfills using the CLI or UI.
Example via CLI:
airflow dags backfill -s 2024-01-01 -e 2024-01-10 example_dag
This runs your DAG for each day between Jan 1 and Jan 10, regardless of its current state.
Another way to backfill a fixed historical window is to provide an end_date along with the start_date when you initialize the DAG class (keeping catchup=True), so Airflow only schedules runs within that window – as shown below.
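Here's a hedged sketch of that approach (the dag_id and dates are just placeholders):

from airflow.models.dag import DAG
from pendulum import datetime

dag = DAG(
    dag_id="example_dag_with_window",
    start_date=datetime(2024, 1, 1),
    end_date=datetime(2024, 1, 10),   # no runs are scheduled after this date
    schedule_interval="@daily",
    catchup=True                      # creates a run for every interval in the window
)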
Why You Should Care
- Catchup can overload your system if you forget to turn it off on high-frequency DAGs.
- Backfilling is useful for reruns or historical data processing.
- Knowing when to use them (or not) helps you avoid surprises and manage Airflow more effectively.
Applications of Airflow
Apache Airflow is used across industries to automate and manage complex workflows. Here’s where it shines:
- Data Pipelines: Automate ETL/ELT workflows, data validation, and batch processing tasks.
- Machine Learning: Schedule model training, data preprocessing, evaluation, and deployment workflows.
- System Integrations: Move files, call APIs, and orchestrate services across cloud and on-prem systems.
- DevOps Automation: Perform system checks, manage database backups, and automate infrastructure tasks.
- Business Workflows: Power marketing campaigns, generate financial reports, and run scheduled business operations.
With features like dynamic workflows, scalable execution, and rich monitoring, Airflow is a go-to choice for building reliable, production-grade pipelines.
Real World Examples
Healthcare
Airflow can manage complex workflows for EHR data processing, regulatory compliance, and predictive diagnostics using ML pipelines. Reduces manual workloads by automating insights from patient data to improve care delivery.
eCommerce / Retail
Airflow can automate inventory tracking, customer behavior analysis, and personalized marketing pipelines to boost operational efficiency and customer engagement. Coordinate data across vendors and logistics for better demand forecasting and supply chain performance.