
Data Orchestration with Airflow

Harshit Singhai

Airflow is an open-source solution for developing and monitoring workflows.

Apache Airflow is used for building data pipelines. It lets us build scheduled data pipelines using a flexible Python framework, while also providing many building blocks that allow us to stitch together the many different technologies found in modern data landscapes.

Airflow is not a data processing tool in itself; it orchestrates the different components responsible for processing data in our pipelines.

Data Pipelines

Data pipelines generally consist of several tasks or actions that need to be executed to achieve the desired result.

One way to make dependencies between tasks more explicit is to draw the data pipeline as a graph. In this graph-based representation, tasks are represented as nodes, while dependencies between tasks are represented by directed edges between the task nodes. The direction of an edge indicates the direction of the dependency: an edge pointing from task A to task B means that task A needs to be completed before task B can start.
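As a rough, Airflow-independent sketch, here is how such a graph of tasks and dependencies could be represented and executed in order (the task names are made up for illustration):

```python
# Toy sketch: a pipeline as a directed graph of tasks.
# Task names ("fetch", "clean", "report") are hypothetical.
dependencies = {
    "fetch": [],          # no upstream tasks
    "clean": ["fetch"],   # "fetch" must finish before "clean" starts
    "report": ["clean"],  # "clean" must finish before "report" starts
}

completed = set()
while len(completed) < len(dependencies):
    for task, upstream in dependencies.items():
        # A task can run once all of its upstream tasks are done.
        if task not in completed and all(u in completed for u in upstream):
            print(f"running {task}")
            completed.add(task)
```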

[Figure: data engineering pipeline as a graph of tasks]

Another useful property of Airflow is that it clearly separates pipelines into small incremental tasks rather than having one monolithic script or process that does all the work.


Airflow is not the only data orchestration tool; a number of alternatives exist.

[Figure: Airflow alternatives]

Airflow: A Bird's-Eye View

In Airflow, we define our DAGs using Python code in DAG files, which are essentially Python scripts that describe the structure of the corresponding DAG. As such, each DAG file typically describes the set of tasks for a given DAG and the dependencies between the tasks, which are then parsed by Airflow to identify the DAG structure.

Besides this, DAG files typically contain some additional metadata about the DAG, telling Airflow how and when it should be executed.
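As a minimal sketch of what such a DAG file can look like (the DAG id, schedule, and commands below are made-up examples, assuming Airflow 2.x with the bundled BashOperator):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal illustrative DAG file: two tasks and one dependency.
# The dag_id, schedule, and commands are hypothetical.
with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # metadata telling Airflow when to run the DAG
    catchup=False,
) as dag:
    fetch = BashOperator(task_id="fetch_data", bash_command="echo fetching data")
    clean = BashOperator(task_id="clean_data", bash_command="echo cleaning data")

    fetch >> clean  # fetch_data must complete before clean_data starts
```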

One advantage of defining Airflow DAGs in Python code is that this programmatic approach provides a lot of flexibility for building DAGs. For example, we can use Python code to dynamically generate optional tasks depending on certain conditions, or even generate entire DAGs based on external metadata or configuration files. This flexibility allows a great deal of customization in how we build our pipelines, letting us fit Airflow to our needs and build arbitrarily complex pipelines.
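For instance, here is a rough sketch of generating tasks from a configuration list (the table names and commands are hypothetical; in practice the list could come from an external metadata store or config file):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical configuration driving the DAG's structure.
TABLES = ["customers", "orders", "payments"]

with DAG(
    dag_id="example_dynamic_exports",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    previous = None
    for table in TABLES:
        # One export task is generated per configured table.
        export = BashOperator(
            task_id=f"export_{table}",
            bash_command=f"echo exporting {table}",
        )
        # Chain the generated tasks sequentially; they could equally be
        # left independent so they run in parallel.
        if previous is not None:
            previous >> export
        previous = export
```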

Airflow Ecosystem

Many Airflow extensions have been developed that enable us to execute tasks across a wide variety of systems, including external databases, big data technologies, and various cloud services, allowing us to build complex data pipelines that bring together data processes across many different systems.
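For example, assuming the Postgres provider package (apache-airflow-providers-postgres) is installed and a connection with id my_postgres is configured, a task that runs SQL against an external database might look roughly like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Sketch of using a provider-supplied operator to talk to an external system.
# The connection id ("my_postgres") and the SQL statement are hypothetical.
with DAG(
    dag_id="example_provider_task",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    refresh_summary = PostgresOperator(
        task_id="refresh_summary_table",
        postgres_conn_id="my_postgres",
        sql="SELECT 1;",  # placeholder query
    )
```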

Components of Airflow

  1. The Airflow scheduler — Parses DAGs, checks their schedule interval, and (if the DAGs’ schedule has passed) starts scheduling the DAGs’ tasks for execution by passing them to the Airflow workers.

  2. The Airflow workers — Pick up tasks that are scheduled for execution and execute them. As such, the workers are responsible for actually “doing the work.”

  3. The Airflow webserver — Visualizes the DAGs parsed by the scheduler and provides the main interface for users to monitor DAG runs and their results.
