Airflow Operators - A Comparison
Airflow provides a variety of operators to wrap your business logic into executable tasks in a workflow. It is often confusing to decide which one to use when. In this article, we will discuss the pros and cons of each in detail.
Using the PythonOperator, all the business logic and its associated code reside in the Airflow DAG directory. The PythonOperator imports and runs them during execution:
airflow
├── dags
│   ├── classification_workflow.py
│   └── tweet_classification
│       ├── preprocess.py
│       ├── predict.py
│       └── __init__.py
├── logs
└── airflow.cfg
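For illustration, classification_workflow.py could look like the minimal sketch below. The callables preprocess_tweets and classify_tweets are hypothetical stand-ins for whatever preprocess.py and predict.py actually expose, and the import path shown is the Airflow 2 one (in 1.10 it was airflow.operators.python_operator):

# classification_workflow.py - a minimal sketch
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# hypothetical callables exposed by the sibling package shown above
from tweet_classification.preprocess import preprocess_tweets
from tweet_classification.predict import classify_tweets

with DAG(
    dag_id='tweet_classification',
    start_date=datetime(2021, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    preprocess = PythonOperator(task_id='preprocess', python_callable=preprocess_tweets)
    predict = PythonOperator(task_id='predict', python_callable=classify_tweets)

    preprocess >> predict  # run preprocessing before prediction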
Pros:
1. Best option when the code is in the same repo as Airflow
2. Simple and easy to use
3. Works well for small teams
Cons:
1. Couples Airflow code with business logic
2. Any business logic change means redeploying the Airflow code
3. Sharing a single Airflow instance across multiple projects becomes a nightmare
4. Can run only Python code, well, duh.
When using the DockerOperator, all the business logic and its associated code reside in a Docker image. During execution, Airflow:
1. pulls the specified image,
2. spins up a container, and
3. executes the respective command.
# Airflow 2 provider import; in Airflow 1.10 this was airflow.operators.docker_operator
from airflow.providers.docker.operators.docker import DockerOperator

DockerOperator(
    dag=dag,
    task_id='docker_task',
    image='gcr.io/project-predict/predict-api:v1',  # a pullable registry reference
    auto_remove=True,
    docker_url='unix://var/run/docker.sock',
    command='python extract_from_api_or_something.py',
)
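Here docker_url points to the Docker daemon socket on the worker (unix://var/run/docker.sock is the default), and auto_remove=True deletes the container once the command exits, so finished runs do not accumulate on the worker.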
Pros:
1. Works well across cross-functional teams
2. Can run projects that are not built in Python

Cons:
1. Needs Docker installed on the worker machine
2. Depending on the resources available, the load on the worker machine can get heavy when multiple containers run at the same time
When using the KubernetesPodOperator, all the business logic and its associated code reside in a Docker image. During execution, Airflow spins up a worker pod, which pulls the specified Docker image and executes the respective command:

# Airflow 2 provider import; in Airflow 1.10 this was airflow.contrib.operators.kubernetes_pod_operator
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

KubernetesPodOperator(
    dag=dag,
    task_id='classify_tweets',
    name='classify_tweets',  # name given to the spawned pod
    cmds=['python', 'app/classify.py'],
    namespace='airflow',  # Kubernetes namespace the pod runs in
    image='gcr.io/tweet_classifier/dev:0.0.1',
)
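Since every task gets its own pod, the Kubernetes scheduler handles placement and each task can declare its own resource requests and limits. Worth noting: if Airflow itself runs inside the cluster, passing in_cluster=True lets the operator authenticate with the pod's service account instead of a kubeconfig file.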
Pros:
1. Works well across cross-functional teams
2. A single Airflow instance can be shared across teams without hassle
3. Decouples the DAG from the business logic

Cons:
1. Complex on the infrastructure side, since it requires both Docker and Kubernetes.