DLTHub First Look
Businesses produce and accumulate data. Before making sense of this data, we must bring them to a centralized place—a data lake. Whenever you can think of data, movement tools like Fivetran comes into
Should you use DLTHub for your ELT usecase?
Businesses produce and accumulate data. Before making sense of this data, we must bring them to a centralized place—a data lake. Whenever you can think of data, movement tools like Fivetran and Airbyte comes into the picture. As I discussed in the Airbyte article, the problem with the current ELT ecosystem is that they come with a platform dependency.
One of the main advantage of ELT tools is it's ability to connect different data sources and destination. The platform or language dependency comes in the way of extending/custqomizing the connectors to match your needs.
I've personally faced with problem in more than one occasion
Problems with ELT Tools
Airbyte
Customization
Some of the connectors in Airbyte are written in Java. If I have to extend a connector I'll be limited to the language it's written in.
Open Source??
Airbyte is acumilating the market share by using open-source as it's advantage. But they have two different APIs for the open-source and the cloud version. If you try to automate anything with the open-source version, you'll be limited to the API, which is not well documented. Here is my failed attempt to write a custom Python SDK for Airbyte.
Airflow Providers
Customization
My alternative to Airbyte fiasco was to use Airflow without it's platform, just the providers. As you might know the provider ecosystem is huge. All of them are in Python so it's pretty straightforward to extend them.
Dependency Hell
But the hope and excitement to use Airflow providers was short-lived. Airflow is known for it's dependency hell. For one, we were using poetry which is not supported. Providers cannot be installed without installing airflow itself. That means I'm brining way too many libraries and it's assoicated constraints into my project and are limited to the version of Airflow I'm using.
DLTHub
DLTHub was the next ray of hope. It's purely opensource and has no platform dependency. It's all Python and comes with the flexibility of run it anywhere, from your local, server, airflow.
Installation
Poetry the python dependency manager is something I am heavily using for all my client projects. So support for poetry was a big plus for me. Setting up the dev environment was as simple as running
Ease of Use
The example from the documentation is pretty straightforward. I was able to run the example without any issues. Let's start by defining some data
Ingest it to duckdb. Why duckdb? because it's inmemory and pretty easy to setup
But it's not ideal to have data as a dict, in realtime it can be from any data source like Github, salesforce or another db. Here is my modified version of the example with reading from duckdb and ingesting it back
Last updated