Distributed Data-pipelines in Python
PyCon Sweden (2021) & IWD (2020)
With ML being the shiny new object, no one really talks about the data pipelines that make the data consumable in the first place. A single streamlined data pipeline may sound cozy, but no one wants to sit through 10 hours of ingesting 2 million records, do you? Trust me, you don't. The solution? Distributed data pipelines.
Airflow is an open-source platform to programmatically create, schedule, and monitor workflows. In this talk, we will explore:
- What is Airflow?
- How do you create data pipelines?
- How can Kubernetes be leveraged for performance?
- Pros and cons of the approach
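As a taste of what the talk covers, here is a minimal sketch of an Airflow pipeline definition, assuming Airflow 2.x; the DAG id, task names, and the `ingest` callable are hypothetical placeholders:

```python
# Hypothetical DAG sketch: a daily pipeline with one ingestion task.
# Under the KubernetesExecutor, each task instance runs in its own pod,
# which is how Airflow scales work out across a cluster.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest(**context):
    # Placeholder for real extraction/ingestion logic.
    print("ingesting records...")


with DAG(
    dag_id="example_ingest",          # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest", python_callable=ingest)
```

Splitting the ingestion into many such tasks, rather than one monolithic job, is what lets the scheduler fan the work out over Kubernetes pods instead of running it serially.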