Populate dbt models with CSV data. Part 2: the power of dbt-fal

Last year, we published a post on how to load CSV files into a dbt project using fal and a model pre-hook. This post is its sequel: it presents a more structured and convenient way to define dbt models that load data from CSV files.

Loading data from CSV files into a data warehouse is a common task for many businesses and organizations. It's so common that many tools now offer this functionality. But of course, not all of them are created equal. This is especially true for dbt projects and, as you might have guessed, this is where dbt-fal comes in.

dbt has its own CSV-loading command, dbt seed. This command reads CSV files inside your dbt project and loads them into the data warehouse. However, dbt does not recommend using seed to load raw data, as it does not perform well with large files. This doesn't mean that dbt is not useful for loading CSV data: when used in conjunction with dbt-fal, dbt can become an effective tool for loading even large amounts of CSV data into data warehouses.

dbt-fal is a dbt adapter that allows users to run Python models with any data warehouse. In this blog post, we will explore how dbt-fal and dbt can be combined to create a streamlined and efficient process for loading CSV data into a data warehouse, even when dealing with large datasets.

Load CSV data

Say we have some customer order data CSV that looks like this:

Order ID, Customer ID, Order Date, Product Name, Quantity, Total Price
1, 1, 2021-01-01, Phone, 2, 1000
2, 2, 2021-02-15, Laptop, 1, 2000
3, 3, 2021-03-10, Headphones, 3, 600
4, 4, 2021-04-05, Smartwatch, 1, 300
5, 5, 2021-05-20, Tablet, 2, 800

To load this CSV data, we create a Python model, orders_raw.py, inside our models directory. Inside orders_raw.py, we define the model function:

import pandas as pd

def model(dbt, fal):
  dbt.config(materialized="table")
  # Read the CSV into a DataFrame; the returned DataFrame
  # is materialized in the warehouse as this model's table
  df = pd.read_csv('orders.csv')
  return df

That's all it takes to load a CSV file into a data warehouse! You can define downstream models that depend on orders_raw by using ref('orders_raw') in SQL models and dbt.ref('orders_raw') in Python models.
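
For example, here's a minimal sketch of a downstream Python model; the model name orders_by_customer.py and the aggregation are hypothetical, but dbt.ref is how the dependency on orders_raw is declared:

def model(dbt, fal):
  dbt.config(materialized="table")
  # dbt.ref loads the upstream model as a DataFrame and
  # registers orders_raw as a dependency of this model
  orders = dbt.ref('orders_raw')
  # Hypothetical aggregation: total revenue per customer
  return orders.groupby('Customer ID', as_index=False)['Total Price'].sum()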

One quick improvement to the example above is parsing dates. As written, the resulting table will have dates as strings. We could, of course, write another model that converts these strings into Date types, but there's a more efficient way:

import pandas as pd

def model(dbt, fal):
  dbt.config(materialized="table")
  df = pd.read_csv('orders.csv', parse_dates=["Order Date"])
  return df

By passing the parse_dates argument, we're telling pandas to parse the "Order Date" column as a date.

The read_csv function in pandas is very versatile. It lets you work with both local and remote CSV files, define column names manually, choose which parsing engine to use, and much more; you can find the details in the pandas documentation. For example, this is how easy it is to load a CSV file from AWS S3:

import pandas as pd

def model(dbt, fal):
  dbt.config(materialized="table")
  # pandas reads s3:// URLs directly (this requires the s3fs
  # package and valid AWS credentials in the environment)
  df = pd.read_csv('s3://my_data_bucket/orders.csv')
  return df
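
Because read_csv is so flexible, the same pattern covers messier inputs too. Here's a sketch of loading a headerless file; the file name orders_no_header.csv is hypothetical, and we supply the column names ourselves:

import pandas as pd

def model(dbt, fal):
  dbt.config(materialized="table")
  # header=None tells pandas the file has no header row,
  # so the names list provides the column names instead
  df = pd.read_csv(
    'orders_no_header.csv',
    header=None,
    names=["Order ID", "Customer ID", "Order Date",
           "Product Name", "Quantity", "Total Price"],
    parse_dates=["Order Date"],
  )
  return df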

What about scale?

The scale of possible CSV imports is limited by the compute environment in which dbt is running. If that's your local computer, loading gigabytes of CSV data might make it run very hot. Wouldn't it be nice to run Python models in a serverless environment? What if you could run such model code on a flexible, scalable platform that's easy to use? That's exactly what we're building with dbt-fal Cloud. Here's a preview of how it will work:

1. In your profiles.yml, edit the dbt-fal profile to include the cloud host and credentials:

fal_test:
  target: prod
  outputs:
    prod:
      type: fal
      db_profile: db
      host: cloud # This is new
      key_secret: my_secret
      key_id: my_secret_id
    db:
      type: redshift
      host: my_redshift_host
      ...

2. Run dbt run

3. That’s it! There are no more steps! Your Python models will automatically run in the dbt-fal Cloud runtime.

In the coming weeks, we will be talking more about dbt-fal Cloud, so stay tuned. You can also sign up for our waiting list.

Summary

By combining the capabilities of dbt-fal and dbt, users can create a simple and efficient process for loading CSV data into their data warehouses. Whether you are working with small or large datasets, this approach can help streamline your data loading process and reduce the risk of errors.