How to Set up Airflow on Kubernetes?
This blog walks you through the steps to deploy Airflow on Kubernetes. If you want to jump to the code directly, here's the GitHub repo.
What is Airflow?
Airflow is a platform created by the community to programmatically author, schedule, and monitor workflows.
Airflow lets you define a workflow as a directed acyclic graph (DAG) written in a Python file. The most common use case for Airflow is data/machine learning engineers building data pipelines that perform transformations.
Related reads: An Introduction to Apache Airflow, Airflow Celery vs Kubernetes Executor, Airflow with Kubernetes
There are several advantages to running Airflow on Kubernetes.
Scalability
Airflow runs one worker pod per Airflow task, enabling Kubernetes to spin up and destroy pods depending on the load.
Resource Optimization
Kubernetes spins up worker pods only when there is a new job, whereas alternatives such as Celery always have worker pods running to pick up tasks as they arrive.
Prerequisites
Kubectl
Docker
A Docker image registry to push your Docker images
A Kubernetes cluster on GCP/AWS
Airflow Architecture
Airflow has 3 major components.
Webserver - Serves the fancy UI with the list of DAGs, logs, and tasks
Scheduler - Runs in the background, schedules tasks, and manages them
Workers/Executors - The processes that execute the tasks. Worker processes are spun up by the scheduler and tracked through completion
Apart from these, there are
DAG folders
Log folders
Database
There are different kinds of Executors one can use with Airflow.
LocalExecutor - Used mostly for playing around on the local machine
CeleryExecutor - Uses Celery workers to run the tasks
KubernetesExecutor - Uses Kubernetes pods to run the worker tasks
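The executor is chosen in airflow.cfg (the executor setting under [core]) or through the equivalent environment variable. For this setup, a minimal sketch would be:

```sh
# Select the Kubernetes executor via Airflow's environment-variable convention
# (equivalent to setting `executor = KubernetesExecutor` under [core] in airflow.cfg).
export AIRFLOW__CORE__EXECUTOR=KubernetesExecutor
```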
Airflow with Kubernetes
When you schedule a task with the Kubernetes executor, the scheduler spins up a pod and runs the task in it. Once the task completes, the pod is killed. This ensures maximum utilization of resources, unlike Celery, which must keep a minimum number of workers running at all times.
Building the Docker Image
The core part of building the Docker image is doing a pip install of Airflow and its dependencies.
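A minimal sketch of that step, assuming Python 3.8 and the kubernetes and postgres extras; the base image, extras, and Airflow version in the repo may differ:

```dockerfile
FROM python:3.8-slim

# Install Airflow with the extras this setup assumes;
# pin the version that matches your deployment.
RUN pip install --no-cache-dir "apache-airflow[kubernetes,postgres]==2.3.0"
```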
We also need a script that runs either the webserver or the scheduler, based on the argument passed to the Kubernetes pod or container. We have a file called bootstrap.sh to do the same.
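A minimal sketch of what bootstrap.sh might look like, assuming Airflow 2.x and that the container receives webserver or scheduler as its first argument; the script in the repo is the reference:

```sh
#!/usr/bin/env bash
# bootstrap.sh - start the requested Airflow component.
set -e

# Initialize/upgrade the metadata database before starting either component.
airflow db upgrade

case "$1" in
  webserver)
    exec airflow webserver
    ;;
  scheduler)
    exec airflow scheduler
    ;;
  *)
    echo "Usage: $0 {webserver|scheduler}" >&2
    exit 1
    ;;
esac
```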
Let's add it to the Dockerfile too.
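Continuing the sketch above, the Dockerfile copies the script in and makes it the entrypoint, so the container argument decides which component runs:

```dockerfile
# Copy the bootstrap script and use it as the entrypoint,
# so the argument passed to the container picks webserver or scheduler.
COPY bootstrap.sh /bootstrap.sh
RUN chmod +x /bootstrap.sh
ENTRYPOINT ["/bootstrap.sh"]
```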
Let's build and push the image
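For example, with a placeholder registry path (substitute your own registry and tag):

```sh
# Build the image and push it to your registry so the cluster can pull it.
docker build -t <your-registry>/airflow:latest .
docker push <your-registry>/airflow:latest
```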
Kubernetes Configuration
This section explains the various parts of build/airflow.yaml; a condensed sketch of these resources follows the list below.
A Kubernetes Deployment running a pod with both the webserver and scheduler containers
A Service whose external IP is mapped to the Airflow webserver
A ServiceAccount with a Role that can spin up and delete pods. This gives the Airflow scheduler the permissions it needs to launch the worker pods
Two persistent volumes for storing DAGs and logs
An Airflow config file is created as a Kubernetes ConfigMap and attached to the pod. Check out build/configmaps.yaml
The Postgres configuration is handled via a separate Deployment
Secrets, like the Postgres password, are created using Kubernetes Secrets
If you want to add extra environment variables, use a Kubernetes ConfigMap
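Below is a condensed, illustrative sketch of the kind of resources build/airflow.yaml declares. The resource names, image path, namespace, and port are placeholders, and the DAG/log persistent volumes and the ConfigMap/Secret wiring are omitted for brevity; the file in the repo is the source of truth.

```yaml
# Illustrative sketch only - names, image, and namespace are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: airflow
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airflow-pod-launcher
rules:
  # Lets the scheduler create, watch, and delete worker pods.
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airflow-pod-launcher
subjects:
  - kind: ServiceAccount
    name: airflow
    namespace: airflow   # assumed namespace used when applying the configs
roleRef:
  kind: Role
  name: airflow-pod-launcher
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: airflow
  template:
    metadata:
      labels:
        app: airflow
    spec:
      serviceAccountName: airflow
      containers:
        # Both containers use the image built above; bootstrap.sh picks
        # the component based on the argument.
        - name: webserver
          image: <your-registry>/airflow:latest
          args: ["webserver"]
          ports:
            - containerPort: 8080
        - name: scheduler
          image: <your-registry>/airflow:latest
          args: ["scheduler"]
---
apiVersion: v1
kind: Service
metadata:
  name: airflow
spec:
  # Exposes the webserver on an external IP.
  type: LoadBalancer
  selector:
    app: airflow
  ports:
    - port: 8080
      targetPort: 8080
```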
Deployment
You can deploy the Airflow pods in two modes:
Use a persistent volume to store DAGs
Use Git to pull DAGs from a repository
To set up the pods, we need to run a deploy.sh script (sketched below) that does the following:
Converts the templatized config under templates into Kube config files under build
Deletes existing pods and deployments, if any, in the namespace
Creates new pods, deployments, and other Kube resources
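A rough sketch of what such a script can look like, assuming the templates use a {{IMAGE}} placeholder and an airflow namespace; both are assumptions, and the deploy.sh in the repo is the reference:

```sh
#!/usr/bin/env bash
# deploy.sh - regenerate Kube configs from templates and (re)create resources.
set -e

NAMESPACE=airflow
IMAGE=<your-registry>/airflow:latest

# 1. Render the templatized configs under templates/ into build/.
mkdir -p build
for tpl in templates/*.yaml; do
  sed "s|{{IMAGE}}|${IMAGE}|g" "$tpl" > "build/$(basename "$tpl")"
done

# 2. Delete existing resources in the namespace, if any.
kubectl delete -f build/ --namespace "$NAMESPACE" --ignore-not-found

# 3. Create the new pods, deployments, and other resources.
kubectl apply -f build/ --namespace "$NAMESPACE"
```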
Testing the Setup
By default, this setup copies all the example DAGs into the DAGs folder; we can just run one of them and see if everything is working fine.
Get the Airflow URL by running kubectl get services
Log in to Airflow with the username airflow and password airflow. You can change these values in airflow-test-init.sh
Pick one of the DAG files listed
On your terminal, run kubectl get pods --watch to notice when worker pods are created
Click on Trigger DAG to trigger one of the jobs
On the graph view, you can see the tasks running, and on your terminal, new pods are created and shut down as they complete the tasks
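Assuming the airflow namespace from the deployment sketch above, the terminal side of this check boils down to:

```sh
# Find the external IP of the Airflow webserver service.
kubectl get services --namespace airflow

# Watch worker pods get created and torn down while a DAG runs.
kubectl get pods --namespace airflow --watch
```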
Maintenance and Modification
Once it is deployed, you don't have to run this script every time. You can use basic kubectl commands to delete or restart pods.
Got a Question?
Raise it as an issue on the GitHub repo
Next Up...
Dynamic Task Mapping
Sending Email Alerts in Apache Airflow with Sendgrid