Python vs SQL is a frequent debate in the data community and it usually revolves around which language is better for data transformations. At Features and Labels, we think about it a little differently. In our opinion, today, all transformations that are supported by SQL should be done in SQL. Not because we think query languages are superior to writing imperative code, but because the developer experience provided by dbt in combination with a modern data warehouse is unmatched in its accessibility and scalability.
Until a similar experience can be offered for Python, we believe data teams should reach for other languages only as a last resort, when they really can't achieve their tasks with SQL. That being said, we are extremely excited about the next generation of Python compute environments that seem to be just around the corner, promising a delightful developer experience for Python transformations and other workloads at scale.
When we first launched fal, we were thinking about the specific cases where Python is unavoidable, such as data science workloads with libraries like xgboost. We solved this case by adding support for running Python workloads before and after a dbt run. But what about the situations where you need to step out of the SQL world, do some Python, and go back to SQL?
At first we weren’t sure about this case. Without a cloud environment to support the scale of compute, it would be a mistake to encourage users to use Python for transformations that could be achieved with SQL inside a data warehouse.
But after talking to dozens of fal users, we realized that sometimes you really do need to run Python between SQL queries. These situations do not involve transformations in the traditional sense; they are more complex. For example, you might need to run a clustering ML model in between dbt nodes. A pattern we observed is that a Python workflow is usually an enrichment of a previous dbt transformation, and such a workflow doesn't necessarily generate new assets.
And so we decided to add support for this precise use case: Python nodes that only enrich existing dbt models and do not create new tables that can later be depended on. We made this use case possible with the fal flow command:
In the scenario represented in the diagram above, the clustering.py script enriches the dataset.sql model by populating a column with cluster IDs. The reporting.sql model depends on this enrichment. In order to associate clustering.py with dataset.sql, we need to specify it in a meta tag in the model's YAML configuration:
    models:
      - name: dataset
        meta:
          fal:
            scripts:
              - clustering.py
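A clustering.py script along these lines could perform the enrichment. In an actual fal script, the model's data would typically be loaded with fal's ref() helper and persisted back with write_to_model(); treat those names as assumptions, since the sketch below is kept standalone with toy data and a hand-rolled k-means instead:

```python
import numpy as np
import pandas as pd


def kmeans_labels(X: np.ndarray, k: int, iters: int = 25, seed: int = 0) -> np.ndarray:
    """Plain NumPy k-means; returns one cluster id per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each row to its nearest center, then move each center
        # to the mean of the rows assigned to it.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels


# Toy stand-in for the output of dataset.sql; in a real fal script this
# would be something like `df = ref("dataset")` (hypothetical here).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "x": np.concatenate([rng.normal(0, 0.3, 50), rng.normal(5, 0.3, 50)]),
    "y": np.concatenate([rng.normal(0, 0.3, 50), rng.normal(5, 0.3, 50)]),
})

# Enrich the existing model with a cluster_id column -- no new table is
# created. A real script would end with write_to_model(df) (hypothetical).
df["cluster_id"] = kmeans_labels(df[["x", "y"]].to_numpy(), k=2)
```

The point of the pattern is visible in the last line: the script adds a column to an existing model's data rather than producing a new asset for downstream nodes to reference.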
Previously, if a Python script was added to a model configuration, fal flow run would run that script after the entire dbt run. Starting from version 0.2.12, when the fal flow command is run, fal constructs a new graph of dbt nodes and Python nodes and runs them in their topological order.
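Conceptually, that ordering step can be sketched with Python's standard-library graphlib. The node names below mirror the hypothetical scenario above (an extra other_model.sql is added purely to show that independent branches are ordered too); this is an illustration of topological sorting, not fal's actual internals:

```python
from graphlib import TopologicalSorter

# Mixed dependency graph: each key maps a node to the set of nodes it
# depends on. clustering.py depends on dataset.sql, and reporting.sql
# depends on the enrichment performed by clustering.py.
graph = {
    "dataset.sql": set(),
    "clustering.py": {"dataset.sql"},
    "reporting.sql": {"clustering.py"},
    "other_model.sql": {"dataset.sql"},
}

# static_order() yields the nodes so that every node appears after
# everything it depends on.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Running the graph in this order guarantees that dataset.sql is materialized before clustering.py enriches it, and that reporting.sql only runs once the enrichment is in place.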
We particularly like this solution because it maintains the simplicity of the dbt developer experience while adding the unique capabilities of the Python ecosystem to dbt workflows.
A critical moment for a data team is when it decides it needs a language other than SQL and a compute environment other than the data warehouse. At this moment, the complexity of the project jumps up, new tooling needs to be introduced, and existing workflows are disturbed. Years of technical debt can be added in a matter of days. At Features and Labels, our objective is to help teams control the entropy of their project when they reach that point. We want to do this by making data warehouses and Python runtimes interoperable.