Last year, we published a post on how to load CSV files into a dbt project by using fal and a model pre-hook. This is a sequel to that post and it provides a more structured and convenient way to define dbt models that contain data from CSV files.
Loading data from CSV files into a data warehouse is a common task for many businesses and organizations. It's so common that many tools now offer this functionality. But of course, not all of them are created equal. This is especially true for dbt projects and, as you might have guessed, this is where dbt-fal comes in.
dbt has its own CSV loading command, `dbt seed`. This command reads CSV files inside your dbt project and loads them into the data warehouse. dbt does not recommend using `seed` for loading raw data, though, as it is claimed to be not performant for large files. However, this doesn't mean that dbt is not useful when it comes to loading CSV data. When used in conjunction with dbt-fal, dbt can become an effective tool for loading even large amounts of CSV data into data warehouses.
dbt-fal is a dbt adapter that allows users to run Python models with any data warehouse. In this blog post, we will explore how dbt-fal and dbt can be combined to create a streamlined and efficient process for loading CSV data into a data warehouse, even when dealing with large datasets.
Load CSV data
Say we have some customer order data CSV that looks like this:
```
Order ID, Customer ID, Order Date, Product Name, Quantity, Total Price
1, 1, 2021-01-01, Phone, 2, 1000
2, 2, 2021-02-15, Laptop, 1, 2000
3, 3, 2021-03-10, Headphones, 3, 600
4, 4, 2021-04-05, Smartwatch, 1, 300
5, 5, 2021-05-20, Tablet, 2, 800
```
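If you want to follow along, this sample data can be written to disk with a few lines of pandas (the filename `orders.csv` matches the models below; the rest is just the table above):

```python
import pandas as pd

# Recreate the sample order data from the table above
orders = pd.DataFrame({
    "Order ID": [1, 2, 3, 4, 5],
    "Customer ID": [1, 2, 3, 4, 5],
    "Order Date": ["2021-01-01", "2021-02-15", "2021-03-10",
                   "2021-04-05", "2021-05-20"],
    "Product Name": ["Phone", "Laptop", "Headphones", "Smartwatch", "Tablet"],
    "Quantity": [2, 1, 3, 1, 2],
    "Total Price": [1000, 2000, 600, 300, 800],
})

# Write it out so the models below have a file to read
orders.to_csv("orders.csv", index=False)
```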
In order to load such CSV data, we just create a Python model `orders_raw.py` inside our models directory. Inside `orders_raw.py` we define the model function:
```python
import pandas as pd

def model(dbt, fal):
    dbt.config(materialized="table")
    df = pd.read_csv('orders.csv')
    return df
```
That's all it takes to load a CSV file into a data warehouse! You can define downstream models that depend on `orders_raw.py` by using `ref('orders_raw')` in SQL models, and `dbt.ref('orders_raw')` in Python models.
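To make that concrete, here is a sketch of what a downstream Python model could look like. The model name `orders_summary` and the revenue-per-customer aggregation are assumptions for illustration; inside a real dbt project, `dbt.ref('orders_raw')` returns the upstream table as a DataFrame:

```python
import pandas as pd

# Sketch of a downstream Python model, e.g. models/orders_summary.py.
# dbt passes in the dbt context object; dbt.ref() resolves upstream models.
def model(dbt, fal):
    dbt.config(materialized="table")
    orders = dbt.ref("orders_raw")
    # Total revenue per customer
    summary = (orders.groupby("Customer ID", as_index=False)["Total Price"]
                     .sum())
    return summary
```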
One quick improvement to the example above is parsing dates. At the moment, the resulting table will have dates as strings. Of course, we could write another model that converts these strings into Date types, but there's a more efficient way to do this:
```python
import pandas as pd

def model(dbt, fal):
    dbt.config(materialized="table")
    df = pd.read_csv('orders.csv', parse_dates=["Order Date"])
    return df
```
With the `parse_dates` argument, we're telling pandas to treat the "Order Date" column as a date.
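You can verify the difference outside of dbt with a small inline CSV: without `parse_dates` the column comes back as plain strings (`object` dtype), with it pandas converts the column to `datetime64`:

```python
import io
import pandas as pd

csv_text = "Order ID,Order Date\n1,2021-01-01\n2,2021-02-15\n"

# Without parse_dates, dates stay as strings (object dtype)
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, pandas converts the column to datetime64
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Order Date"])

print(raw["Order Date"].dtype)     # object
print(parsed["Order Date"].dtype)  # datetime64[ns]
```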
The `read_csv` method of pandas is very versatile. It lets you work with both local and remote CSV files, manually define column names, choose which parsing engine to use, and much more. You can find out more about it in the pandas documentation. For example, this is how easy it is to load a CSV file from AWS S3:
```python
import pandas as pd

def model(dbt, fal):
    dbt.config(materialized="table")
    # Reading from s3:// URLs requires the s3fs package to be installed
    df = pd.read_csv('s3://my_data_bucket/orders.csv')
    return df
```
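As a small illustration of the other `read_csv` options mentioned above, here is a sketch that reads a headerless CSV, supplies column names manually, and picks the pure-Python parsing engine (the column names are assumptions made up for this example):

```python
import io
import pandas as pd

# A headerless CSV fragment
csv_text = "1,2021-01-01,Phone\n2,2021-02-15,Laptop\n"

# names= supplies the missing header; engine= selects the parser
df = pd.read_csv(
    io.StringIO(csv_text),
    names=["order_id", "order_date", "product"],
    engine="python",
)

print(df["product"].tolist())  # ['Phone', 'Laptop']
```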
What about scale?
The scale of possible CSV imports is limited by the compute environment in which dbt is running. If you're running it on your local computer, loading gigabytes of CSV files might make it run very hot. Wouldn't it be nice to run Python models in a serverless environment? What if you could run such model code on a flexible and scalable platform that's easy to use? That's exactly what we're building with dbt-fal Cloud. Here's a preview of how it will work:
- In your `profiles.yml`, edit the dbt-fal profile to include the cloud host and credentials:
```yaml
fal_test:
  target: prod
  outputs:
    prod:
      type: fal
      db_profile: db
      host: cloud # This is new
      key_secret: my_secret
      key_id: my_secret_id
    db:
      type: redshift
      host: my_redshift_host
      ...
```
- That’s it! There are no more steps! Your Python models will automatically be run in the dbt-fal Cloud runtime.
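Until then, if a CSV file is too big to load comfortably in one `pd.read_csv` call on your machine, one common way to keep memory bounded is pandas' `chunksize` option, which reads the file in fixed-size pieces. A sketch (using an in-memory buffer to stand in for a large file):

```python
import io
import pandas as pd

# Stand-in for a large CSV file
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# chunksize makes read_csv yield DataFrames of at most 4 rows each,
# so only one chunk is in memory at a time
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=4)
total = sum(chunk["value"].sum() for chunk in chunks)

print(total)  # 90
```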
In the coming weeks, we will be talking more about dbt-fal Cloud, so stay tuned. You can also sign up for our waiting list.
By combining the capabilities of dbt-fal and dbt, users can create a simple and efficient process for loading CSV data into their data warehouses. Whether you are working with small or large datasets, this approach can help streamline your data loading process and reduce the risk of errors.