DLTHub First Look
Should you use DLTHub for your ELT use case?
Businesses produce and accumulate data. Before making sense of this data, we must bring it to a centralized place, a data lake. Whenever you think of data movement, tools like Fivetran and Airbyte come into the picture. As I discussed in the Airbyte article, the problem with the current ELT ecosystem is that it is platform-dependent.
One of the main advantages of ELT tools is their ability to connect different data sources and destinations. Platform or language dependency prevents you from extending/customizing the connectors to match your needs.
I've personally faced this problem on more than one occasion.
Problems with ELT Tools
Airbyte
Customization
Some of the connectors in Airbyte are written in Java. If I have to extend a connector, I'm limited to the language it's written in.
Open Source??
Airbyte is gaining market share by using open source as its advantage, but it has two different APIs for the open-source and cloud versions. If you try to automate anything with the open-source version, you'll be limited to its API, which is not well documented. Here is my failed attempt to write a custom Python SDK for Airbyte.
Airflow Providers
Customization
My alternative to the Airbyte fiasco was to use Airflow without its platform, just the providers. As you might know, the provider ecosystem is huge. All of them are in Python so it's pretty straightforward to extend them.
Dependency Hell
However, the hope and excitement about using Airflow providers was short-lived. Airflow is known for its dependency hell. For one, we were using Poetry, which is not supported. Providers also cannot be installed without installing Airflow itself. That means I'm bringing way too many libraries and their associated constraints into my project, and I'm limited to the version of Airflow I'm using.
DLTHub
DLTHub was the next ray of hope. It's purely open source and has no platform dependency. It's all Python and comes with the flexibility to run anywhere, from your local machine to a server to Airflow.
Installation
Poetry, the Python dependency manager, is something I use heavily for all my client projects, so support for Poetry was a big plus for me. Setting up the dev environment was as simple as running
Ease of Use
The example from the documentation is pretty straightforward. I was able to run the example without any issues. Let's start by defining some data
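As a minimal sketch, the sample data can be a plain list of dictionaries, one per row. The field names and values here are made up for illustration and are not taken from any real dataset.

```python
# Hypothetical sample records: each dict is one row, each key one column.
data = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
]

# Every row shares the same keys, which keeps the inferred table schema simple.
column_names = sorted(data[0].keys())
```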
Ingest it into DuckDB. Why DuckDB? Because it is an in-process database and pretty easy to set up
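A rough sketch of the load step, assuming dlt is installed together with its DuckDB extra. The function name load_users and the pipeline, dataset, and table names are placeholders chosen for illustration.

```python
def load_users(rows, table_name="users"):
    """Load an iterable of dicts into a local DuckDB database using dlt."""
    # Imported inside the function so this sketch can be imported even where
    # dlt is not installed. Assumes dlt was installed with its DuckDB extra.
    import dlt

    pipeline = dlt.pipeline(
        pipeline_name="quick_start",  # placeholder pipeline name
        destination="duckdb",
        dataset_name="mydata",  # placeholder; used as the schema in DuckDB
    )
    # run() extracts, normalizes, and loads the rows, then returns load info.
    return pipeline.run(rows, table_name=table_name)
```

Calling load_users with a list of dictionaries and printing the result should show a summary of the load. By default, dlt is expected to write the database to a file named after the pipeline in the working directory.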
But hardcoding the data as a dict is not realistic. In real-world use it can come from any data source, such as GitHub, Salesforce, or another database. Here is my modified version of the example, which reads from DuckDB and ingests the data back
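One way such a round trip could look, as a sketch rather than the exact code from this post. It assumes an existing DuckDB file (quick_start.duckdb) that contains a mydata.users table with id and name columns. Those names, and the helper functions, are hypothetical.

```python
def rows_to_dicts(columns, rows):
    """Pair each result tuple with its column names so dlt can infer a schema."""
    return [dict(zip(columns, row)) for row in rows]


def read_users(db_path):
    """Read id and name from an assumed mydata.users table in a DuckDB file."""
    import duckdb  # imported lazily; only needed when the function is called

    con = duckdb.connect(db_path, read_only=True)
    try:
        cur = con.execute("SELECT id, name FROM mydata.users")
        columns = [col[0] for col in cur.description]
        return rows_to_dicts(columns, cur.fetchall())
    finally:
        con.close()


def copy_users(db_path="quick_start.duckdb"):
    """Load the rows read from DuckDB back into DuckDB via a second pipeline."""
    import dlt  # imported lazily; assumes dlt with its DuckDB extra

    pipeline = dlt.pipeline(
        pipeline_name="duckdb_copy",  # placeholder; separate from the source file
        destination="duckdb",
        dataset_name="mydata_copy",  # placeholder dataset name
    )
    # "replace" overwrites the target table on each run instead of appending.
    return pipeline.run(
        read_users(db_path),
        table_name="users_copy",
        write_disposition="replace",
    )
```

The source is read through a read-only connection and the copy goes through a differently named pipeline, so the write lands in a separate database file and does not compete with the read.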