Separation of compute and storage was initially an architectural choice by database vendors to make large-scale data warehousing cheaper. Data platform vendors didn't foresee that years later this choice would force them to be more open about where companies store their data and how they query it. Fast forward a couple of years and most data platforms now let users bring their own storage to the platform. This trend is accelerated by open table formats like Iceberg, Delta and Hudi. The next frontier will be querying data with a query engine of choice. The walled gardens of data platforms are starting to come down.
An alternative title for this post was "bring your own query engine". I will take some time to talk about query engines and why there seem to be a lot of good options all of a sudden. Polars, Velox, DuckDB's engine, and DataFusion are notable open source projects, and there are similar proprietary engines inside Snowflake et al. All of these query engines are built on the same core principle: vectorized execution. Some engines are faster on certain types of queries, but overall they are becoming increasingly similar in capability. Platform benchmarks are almost meaningless nowadays; no one has the secret sauce. Each engine has a few optimization tricks of its own, but with enough time, the same patterns get implemented across all of them. This is good for the customer: they don't need to pay a premium for something that is widely available.
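To make the idea of vectorized execution concrete, here is a minimal pure-Python sketch (not from any of the engines named above; function names and the toy table are mine). The point is the shape of the computation: instead of evaluating a predicate one row at a time, a vectorized engine slices a column into batches and runs one tight pass over each batch, amortizing dispatch overhead across many values. Real engines do this over columnar memory with SIMD, but the control flow looks like this:

```python
# Toy table of (id, amount) rows -- stands in for a columnar batch source.
ROWS = [(i, i * 2.0) for i in range(10)]

def filter_tuple_at_a_time(rows, threshold):
    """Classic Volcano-style iteration: predicate evaluated per row."""
    out = []
    for row in rows:
        if row[1] > threshold:  # one dispatch per row
            out.append(row)
    return out

def filter_vectorized(rows, threshold, batch_size=4):
    """Vectorized style: evaluate the predicate over a whole column batch."""
    out = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        amounts = [r[1] for r in batch]          # columnar slice of the batch
        mask = [a > threshold for a in amounts]  # one pass over the column
        out.extend(r for r, keep in zip(batch, mask) if keep)
    return out

# Both strategies produce the same result; only the execution shape differs.
assert filter_vectorized(ROWS, 9.0) == filter_tuple_at_a_time(ROWS, 9.0)
```

The convergence mentioned above follows from this: once an engine processes columnar batches, the remaining differences are mostly in which per-batch kernels and optimizations it has implemented, and those spread quickly across projects.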
As data platforms struggle to stand out in terms of speed, there is still much room for improvement in how easy they are to work with. This is where dbt came in. dbt has successfully innovated the user experience for building data models, which is arguably the most common workflow in the data warehouse. Today, dbt is the de facto interface for building models, and much of the complexity of using different storage formats and query engines can be hidden behind the abstraction it provides. One can use the platform's SQL engine for general-purpose data modeling in dbt SQL models, then use Polars for last-mile transformations in Python before finally training an ML model. This type of workflow is already very natural with our dbt-fal adapter, which hides all of that complexity behind dbt's abstractions.
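As a sketch of what that workflow can look like, here is an illustrative dbt Python model that consumes an upstream SQL model and does a last-mile transformation with Polars. The model and column names are hypothetical, and the exact handoff depends on the adapter; this assumes the dbt Python model convention where `dbt.ref(...)` yields a pandas-compatible DataFrame, which is then converted to Polars:

```python
# models/orders_features.py -- illustrative last-mile transformation.
# Model name, columns, and the ref'd upstream model are made up for this example.
import polars as pl

def model(dbt, session):
    # Upstream "orders_cleaned" is an ordinary dbt SQL model that ran on the
    # platform's SQL engine; here we assume dbt.ref returns a pandas DataFrame.
    orders = pl.from_pandas(dbt.ref("orders_cleaned"))

    # Last-mile feature engineering in Polars before model training.
    features = (
        orders
        .group_by("customer_id")
        .agg(
            pl.col("amount").sum().alias("total_spend"),
            pl.col("order_id").count().alias("order_count"),
        )
    )

    # dbt Python models return a DataFrame to be materialized as a table.
    return features.to_pandas()
```

The appeal is that the switch from the warehouse's SQL engine to an in-process engine like Polars happens entirely inside the dbt project; downstream models just `ref` the result as usual.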
This is all possible today, but one can dream of even more flexibility. If dbt allowed multiple adapters in the same project, referencing a model from one platform could trigger the most efficient data transfer between two platforms. The ubiquity of dbt could push data platforms to adopt more open data transfer protocols like Arrow Flight. Different platforms could then be queried with technologies like DataFusion, processing data in a streaming fashion as it arrives for the most efficient outcome.
In a couple of years, an end user could be combining multiple data storage formats, plain S3 storage, different query engines, Arrow Flight, and technologies that have not even been invented yet. dbt's well-thought-out abstractions will simplify these complex underlying systems, letting users interact with dbt in the same way regardless of the technologies in use. Each query engine and storage source would need to be built as an independent dbt adapter. Of course, there is always the option of a simpler and cheaper data platform, and many great teams are working towards this future. After all, nobody wants to pay a premium for technologies that have become commodities.
Yes, the data warehouse is also unbundling. Metrics and semantic layers are good bets, but dbt's real potential is to double down as the abstraction layer on top of the composable data warehouse.