Build and Deploy Machine Learning Models from Jupyter Notebooks with fal and dbt

Introduction
Machine Learning (ML) plays an increasingly important role in data-driven decision making, so it pays to use modern tools and techniques that streamline ML workflows. This is where dbt and fal come in: together they make it easy to manage and deploy machine learning models in a scalable and reproducible way. In this blog post, we'll walk you through how to use fal and dbt to train and store a logistic regression model, make predictions on fresh data, and store those predictions in a dbt model. By the end of this post, you'll be equipped with the skills and knowledge to apply these tools to your own ML projects.
Setup
We have prepared an example project that you can play with as you read this blog post. It contains a dbt project with some synthetic data and an example Jupyter notebook. You can clone it:
git clone https://github.com/fal-ai/dbt_fal_ml_example
We are using dbt-fal as a Python adapter; it's the easiest way to run dbt Python models. This project uses BigQuery as its data warehouse, but you can edit the requirements.txt file to suit your own. You can then install the project requirements in a new Python environment by running:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Let's also create a profiles.yml file inside our project directory and fill it with the necessary credentials:
example_shop:
  target: staging
  outputs:
    staging:
      type: fal
      db_profile: db
    db:
      type: bigquery
      method: service-account-json
      ...
The db output should contain your data warehouse credentials.
Finally, let's start the Jupyter notebook:
jupyter notebook notebooks/Experiments.ipynb
This will print a URL in your terminal that you can open in a browser to access the "Experiments.ipynb" notebook.
Our example dataset simulates customer orders and order returns in a retail setting. The dataset contains information on customer ages, total order prices, and whether or not each order was returned.
Data exploration and preparation
In our notebook, we start by importing all of the necessary modules:
import pickle
import uuid
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from fal import FalDbt
Jupyter notebooks allow us to run shell commands, so we can run dbt seed and dbt run:
!dbt seed --profiles-dir ..
!dbt run --select customer_orders customer_orders_labeled --profiles-dir ..
As you can see, we're building two dbt models, customer_orders and customer_orders_labeled. As the names suggest, one dataset contains labeled data, whereas the other has unlabeled "fresh" data.
Let's have a look at the customer_orders_labeled model. To access it from the notebook, we first instantiate FalDbt:
faldbt = FalDbt(project_dir="..", profiles_dir="..")
Now we can download the customer_orders_labeled model as a pandas DataFrame and print the top rows:
orders_df = faldbt.ref("customer_orders_labeled")
orders_df.head()
This prints a table that looks like this:
order_id customer_id total_price age return
0 210.0 488.0 187.861698 18.0 0.0
1 263.0 578.0 628.745330 18.0 0.0
2 360.0 578.0 99.154886 18.0 0.0
3 482.0 818.0 393.284591 18.0 0.0
4 594.0 656.0 339.542104 18.0 0.0
The return column is numeric: 0.0 means that the order has not been returned and 1.0 means that it has. Since the return column holds the value we want to predict, we call it the label; the other columns are features. Let's assume that the features order_id and customer_id play no role in whether or not an order is returned. This leaves us with total_price and age.
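Before plotting, a quick optional sanity check (not in the original notebook) is to see how balanced the label is:
# Share of returned (1.0) vs. not returned (0.0) orders
orders_df['return'].value_counts(normalize=True)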
A good way to visualize the relationship between features and labels is to make a plot. We can do this easily with the matplotlib library:
plot_data = orders_df.sample(frac=0.1, random_state=123)
colors = ['red' if r else 'blue' for r in plot_data['return']] # assign colors based on whether or not order was returned
plt.scatter(plot_data['age'], plot_data['total_price'], c=colors)
plt.xlabel('Age')
plt.ylabel('Total Price')
plt.show()
Here's the resulting plot:

Red dots correspond to orders that have been returned. We can see from our plot that orders in the top left tend to be returned more often than other orders.
ML model training and evaluation
The ML model type that we will train and evaluate is called logistic regression. Logistic regression is suitable for this problem because the target label (return) is binary (0 or 1), and logistic regression models output probabilities that lie between 0 and 1. In our case, the output of the logistic regression model will be the probability of an order being returned, given the customer's age and the total order price.
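To make that concrete, here is a minimal sketch of how logistic regression turns our two features into a probability. It is illustrative only: the weights and bias are placeholders that the training step below will learn.
import numpy as np

def return_probability(age, total_price, w_age, w_price, bias):
    # A linear score from the features, squashed into (0, 1) by the sigmoid
    score = w_age * age + w_price * total_price + bias
    return 1 / (1 + np.exp(-score))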
We start by splitting the dataset into training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(
orders_df[['age', 'total_price']],
orders_df['return'],
test_size=0.2,
random_state=42)
This will split the dataset into training and testing sets, with 80% of the data used for training and 20% of the data used for testing.
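One optional refinement, not used in the original notebook: since returned orders are the minority class, you may want to pass stratify so that both splits keep the same class ratio.
# Alternative split that preserves the return/no-return ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    orders_df[['age', 'total_price']],
    orders_df['return'],
    test_size=0.2,
    random_state=42,
    stratify=orders_df['return'])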
Next, let's train a logistic regression model on the training set. We will use the LogisticRegression class from scikit-learn, which provides a quick and simple implementation of logistic regression:
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, y_train)
The fit method of the lr_model object does the training. Once this cell finishes running, lr_model will be trained on our data.
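If you're curious what was learned, the fitted estimator exposes one weight per feature plus an intercept; a quick optional way to inspect them:
# Inspect the learned weights (one per feature) and the intercept
print(dict(zip(X_train.columns, lr_model.coef_[0])))
print(lr_model.intercept_[0])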
Once a model is trained, we can evaluate its performance on test data using the predict method:
# Make predictions on the test data
y_pred = lr_model.predict(X_test)
# Print a classification report
print(classification_report(y_test, y_pred))
This will output a classification report that summarizes the performance of the model:
precision recall f1-score support
0.0 0.87 0.97 0.91 227
1.0 0.85 0.53 0.66 73
accuracy 0.86 300
macro avg 0.86 0.75 0.79 300
weighted avg 0.86 0.86 0.85 300
The classification report shows that the model has an accuracy of 0.86 on the test data. The precision of the model is 0.87 for class 0 (no return) and 0.85 for class 1 (return). The recall of the model is 0.97 for class 0 and 0.53 for class 1. The F1-score of the model is 0.91 for class 0 and 0.66 for class 1.
We can see that the model has good precision and recall for class 0 (no return), but slightly lower precision and markedly lower recall for class 1 (return). This suggests that the model is better at predicting orders that will not be returned than orders that will be. Nonetheless, the model has an overall accuracy of 0.86, indicating that it can make reasonably accurate predictions on new data.
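One way to dig into this imbalance, as a quick optional check, is a confusion matrix:
from sklearn.metrics import confusion_matrix

# Rows are true classes (0.0, 1.0), columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_test, y_pred))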
Automate ML training by using a dbt model
Storing the model training workflow in a dbt model allows us to version the model data and share it with other users. We start by creating a new dbt Python model, order_return_prediction_models.py, in the models directory, adapting the notebook code above into the model definition:
import pickle
import uuid
import datetime

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


def model(dbt, fal):
    dbt.config(materialized="table")
    orders_df = dbt.ref("customer_orders_labeled")
    X = orders_df[['age', 'total_price']]
    y = orders_df['return']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123)

    print("Model init")
    lr_model = LogisticRegression(random_state=123)

    print("Model fitting")
    lr_model.fit(X_train, y_train)

    # Test model
    y_pred = lr_model.predict(X_test)

    print("Preparing the classification report")
    # Create a report and put it in a DataFrame,
    # keyed by a unique model name
    model_name = str(uuid.uuid4())
    y_test = y_test.astype(float)
    report = classification_report(y_test, y_pred, output_dict=True)
    report["model_name"] = model_name
    report["date"] = datetime.datetime.now()
    output_df = pd.DataFrame([report])
    output_df = output_df.rename(columns={"0.0": "target_0", "1.0": "target_1"})

    print("Saving the model")
    # Save model weights
    with open(f"ml_models/{model_name}.pkl", "wb") as f:
        pickle.dump(lr_model, f)

    return output_df
The order_return_prediction_models model gets the labeled orders data and trains a logistic regression model, lr_model. The model is then evaluated, and the report is stored in a DataFrame along with a unique model name. Next, we save the model weights to local storage. You can modify this step to store the model weights on a cloud storage platform, such as S3. Finally, the output DataFrame is returned, and its contents are therefore persisted in our data warehouse.
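For instance, here is a minimal sketch of the S3 variant using boto3. The bucket name is a placeholder, and AWS credentials must be configured separately; this is one possible approach, not part of the example project.
import boto3

s3 = boto3.client("s3")
# "my-ml-models" is a hypothetical bucket name; replace with your own
s3.upload_file(
    Filename=f"ml_models/{model_name}.pkl",
    Bucket="my-ml-models",
    Key=f"models/{model_name}.pkl",
)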
We can run this dbt model:
dbt run --select order_return_prediction_models
Making predictions with a stored model
First, we try making a prediction in a Jupyter notebook and then we make a Python model to do this automatically.
In our Jupyter notebook, we can easily find the model with the best accuracy:
models_df = faldbt.ref("order_return_prediction_models")
best_model_name = models_df[
    models_df.accuracy == models_df.accuracy.max()
].model_name.iloc[0]
We then load this model from local storage (or a cloud storage provider):
with open(f"../ml_models/{best_model_name}.pkl", "rb") as f:
    loaded_model = pickle.load(f)
We also load the new order data and check what it looks like:
orders_new_df = faldbt.ref("customer_orders")
orders_new_df.head()
This prints out a table:
order_id customer_id total_price age
0 1037.0 981.0 193.460803 19.0
1 1027.0 940.0 680.986976 21.0
2 1039.0 123.0 952.906524 22.0
3 1043.0 860.0 545.791012 22.0
4 1046.0 316.0 887.003551 24.0
As we see, this DataFrame has a similar shape to customer_orders_labeled, except it lacks the return column. This is the column that we would like to predict.
So, let's make a prediction:
predictions = loaded_model.predict(orders_new_df[["age", "total_price"]])
orders_new_df["predicted_return"] = predictions
orders_new_df.head()
In the above snippet, we're first making a prediction and then attaching the generated predictions to the orders_new_df DataFrame. This is what the head of orders_new_df should look like:
order_id customer_id total_price age predicted_return
0 1037.0 981.0 193.460803 19.0 0.0
1 1027.0 940.0 680.986976 21.0 1.0
2 1039.0 123.0 952.906524 22.0 1.0
3 1043.0 860.0 545.791012 22.0 1.0
4 1046.0 316.0 887.003551 24.0 1.0
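Since logistic regression produces probabilities, we could also attach the raw return probability instead of (or alongside) the hard 0/1 label. A small optional variation, with a hypothetical column name:
# Probability of class 1.0 (order being returned) for each new order
probabilities = loaded_model.predict_proba(orders_new_df[["age", "total_price"]])
orders_new_df["return_probability"] = probabilities[:, 1]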
Let's plot our predictions to see if they make sense:
plot_data = orders_new_df.sample(frac=0.5, random_state=123)
colors = ['red' if r else 'blue' for r in plot_data['predicted_return']]
plt.scatter(plot_data['age'], plot_data['total_price'], c=colors)
plt.xlabel('Age')
plt.ylabel('Total Price')
plt.show()
Here's the resulting plot:

If the predicted_return values look good to us, we can create another dbt Python model that runs these predictions automatically. This new dbt model will first pick the best logistic regression model, use it to predict whether or not orders will be returned, and finally store its predictions in our data warehouse.
Here's the definition for our new model, order_return_predictions.py:
import pickle


def model(dbt, fal):
    dbt.config(materialized="table")

    # Pick the stored model with the best accuracy
    models_df = dbt.ref("order_return_prediction_models")
    best_model_name = models_df[
        models_df.accuracy == models_df.accuracy.max()
    ].model_name.iloc[0]

    # Load the model weights from local storage
    with open(f"ml_models/{best_model_name}.pkl", "rb") as f:
        loaded_model = pickle.load(f)

    # Predict returns for the fresh, unlabeled orders
    orders_new_df = dbt.ref("customer_orders")
    predictions = loaded_model.predict(orders_new_df[["age", "total_price"]])
    orders_new_df["predicted_return"] = predictions
    return orders_new_df
We can run this dbt model:
dbt run --select order_return_predictions
Conclusion
In this blog post, we have walked through how to use fal and dbt to manage and deploy machine learning models in a scalable and reproducible way. We used a synthetic shopping dataset to train a logistic regression model that predicts the probability of an order being returned. We automated the ML training process in a dbt model, and used the generated ML models to make predictions on new data. Finally, we were able to store the resulting predictions in a new dbt model. All of this can now run automatically.
Have questions? Reach out to us on our Discord server or raise an issue in our GitHub repository. If you're a member of the dbt Slack community, you can always find us in the #tools-fal channel.