Build and Deploy Machine Learning Models from Jupyter Notebooks with fal and dbt

Introduction

Machine Learning (ML) plays an increasingly central role in data-driven decision making, so it's worth using modern tools and techniques to streamline ML workflows. This is where dbt and fal come in: together they make it easy to manage and deploy machine learning models in a scalable and reproducible way. In this blog post, we'll walk you through how to use fal and dbt to train and store a logistic regression model, make predictions on fresh data, and store those predictions in a dbt model. By the end of this post, you'll be equipped with the skills and knowledge to apply these tools to your own ML projects.

Setup

We have prepared an example project that you can play with as you read this blog post. It contains both a dbt project with some synthetic data and an example Jupyter notebook. You can clone it:

git clone https://github.com/fal-ai/dbt_fal_ml_example

We are using dbt-fal as a Python adapter. It's the easiest way to run a dbt Python model. This project also uses BigQuery as a data warehouse. You can edit the requirements.txt file to suit your own data warehouse. You can then install the project requirements in a new Python environment by running:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Let's also create a profiles.yml file inside our project directory and fill it with the necessary credentials:

example_shop:
  target: staging
  outputs:
    staging:
      type: fal
      db_profile: db
    db:
      type: bigquery
      method: service-account-json
      ...

The db output should contain your data warehouse credentials.
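For reference, here is a rough sketch of what the db output might look like with the service-account-json method; the project ID and dataset below are placeholders, and the exact fields depend on how you authenticate:

    db:
      type: bigquery
      method: service-account-json
      project: my-gcp-project    # placeholder project ID
      dataset: example_shop      # placeholder dataset
      threads: 4
      keyfile_json:
        # contents of your service account JSON file
        type: service_account
        project_id: my-gcp-project
        private_key: "..."
        client_email: "..."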

Finally, let's start the Jupyter notebook:

jupyter notebook notebooks/Experiments.ipynb

This prints a URL in your terminal that you can open in a browser to access the "Experiments.ipynb" notebook.

Our example dataset simulates customer orders and order returns in a retail setting. The dataset contains information on customer ages, total order prices, and whether or not each order was returned.

Data exploration and preparation

In our notebook, we start by importing all of the necessary modules:

import pickle
import uuid
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from fal import FalDbt

Jupyter notebooks let us run shell commands, so we can execute dbt seed and dbt run directly:

!dbt seed --profiles-dir ..
!dbt run --select customer_orders customer_orders_labeled --profiles-dir ..

As you can see, we're calculating two dbt models, customer_orders and customer_orders_labeled. As the names suggest, one dataset contains labeled data, whereas the other holds unlabeled "fresh" data.

Let's have a look at the customer_orders_labeled model. First, we instantiate FalDbt:

faldbt = FalDbt(project_dir="..", profiles_dir="..")

Now we can download the customer_orders_labeled model as a pandas DataFrame and print the top rows:

orders_df = faldbt.ref("customer_orders_labeled")
orders_df.head()

This prints a table that looks like this:

   order_id  customer_id  total_price   age  return
0     210.0        488.0   187.861698  18.0     0.0
1     263.0        578.0   628.745330  18.0     0.0
2     360.0        578.0    99.154886  18.0     0.0
3     482.0        818.0   393.284591  18.0     0.0
4     594.0        656.0   339.542104  18.0     0.0

The return column is numeric, where 0.0 means that the order was not returned and 1.0 means that it was. Since the return column holds the value we want to predict, we call it the label; the other columns are features. Let's assume that the features order_id and customer_id play no role in whether or not an order is returned. This leaves us with total_price and age.
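Before visualizing, it can also be worth checking how balanced the labels are, since heavily imbalanced classes affect both training and evaluation. A quick check with pandas:

orders_df['return'].value_counts(normalize=True)  # fraction of orders returned vs. not returned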

A good way to visualize a relationship between features and labels is to make a plot. We can do this easily by using the matplotlib library:

plot_data = orders_df.sample(frac=0.1, random_state=123)

colors = ['red' if r else 'blue' for r in plot_data['return']]  # assign colors based on whether or not order was returned

plt.scatter(plot_data['age'], plot_data['total_price'], c=colors)
plt.xlabel('Age')
plt.ylabel('Total Price')
plt.show()

Here's the resulting plot:

Age and price distribution of returned (red) and not returned (blue) orders

Red dots correspond to orders that have been returned. We can see from our plot that the orders in the top left tend to be returned more often than other orders.

ML model training and evaluation

The ML model type that we will train and evaluate is called logistic regression. Logistic regression is suitable for this problem because the target label (return) is binary (0 or 1), and logistic regression models can output probabilities that lie between 0 and 1. In our case, the output of the logistic regression model will be the probability of an order being returned, given the customer's age and the total order price.
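Concretely, the model learns a weight for each feature plus a bias term, and passes their weighted sum through the logistic (sigmoid) function to squash it into a probability. Here's a minimal sketch of the idea; the weights are learned during training, not values we pick:

import numpy as np

def return_probability(age, total_price, w_age, w_price, bias):
    # Weighted sum of the features, squashed into (0, 1) by the sigmoid
    z = w_age * age + w_price * total_price + bias
    return 1 / (1 + np.exp(-z))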

We start by splitting the dataset into training and testing sets:

X_train, X_test, y_train, y_test = train_test_split(
  orders_df[['age', 'total_price']],
  orders_df['return'],
  test_size=0.2,
  random_state=42)

This splits the dataset into training and testing sets, with 80% of the data used for training and 20% used for testing.
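Since returns are the minority class, you may also want to pass stratify so that both splits keep roughly the same proportion of returned orders; here's a variant of the split above:

X_train, X_test, y_train, y_test = train_test_split(
  orders_df[['age', 'total_price']],
  orders_df['return'],
  test_size=0.2,
  stratify=orders_df['return'],  # preserve the return/no-return ratio in both splits
  random_state=42)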

Next, let's train a logistic regression model on the training set. We will use the LogisticRegression class from scikit-learn, which provides a quick and simple implementation of logistic regression:

lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, y_train)

The fit method of the lr_model object does the training. Once this cell finishes running, lr_model will be trained on our data.
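If you're curious about what was learned, the fitted coefficients are easy to inspect; the sign of each weight tells you whether that feature pushes the prediction toward a return:

print(lr_model.coef_)       # one weight per feature: age, total_price
print(lr_model.intercept_)  # the bias term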

Once a model is trained, we can evaluate its performance on test data using the predict method:

# Make predictions on the test data
y_pred = lr_model.predict(X_test)

# Print a classification report
print(classification_report(y_test, y_pred))

This will output a classification report that summarizes the performance of the model:

              precision    recall  f1-score   support

         0.0       0.87      0.97      0.91       227
         1.0       0.85      0.53      0.66        73

    accuracy                           0.86       300
   macro avg       0.86      0.75      0.79       300
weighted avg       0.86      0.86      0.85       300

The classification report shows that the model has an accuracy of 0.86 on the test data. The precision of the model is 0.87 for class 0 (no return) and 0.85 for class 1 (return). The recall of the model is 0.97 for class 0 and 0.53 for class 1. The F1-score of the model is 0.91 for class 0 and 0.66 for class 1.

We can see that the model has good precision and recall for class 0 (no return), but lower precision and recall for class 1 (return). This suggests that the model is better at predicting orders that will not be returned than orders that will. Nonetheless, the model has an overall accuracy of 0.86, indicating that it can make reasonably accurate predictions on new data.
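If recall on returned orders matters for your use case, one common remedy (not used in the rest of this post) is to weight the minority class more heavily during training:

lr_balanced = LogisticRegression(class_weight='balanced', random_state=42)
lr_balanced.fit(X_train, y_train)
print(classification_report(y_test, lr_balanced.predict(X_test)))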

Automate ML training by using a dbt model

Storing the model training workflow in a dbt model allows us to version the model data and share it with other users. We start by creating a new dbt model in the models directory: order_return_prediction_models.py. This is a Python model, and we adapt the notebook code above into the model definition:

import pickle
import uuid
import pandas as pd
import datetime
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def model(dbt, fal):
    dbt.config(materialized="table")
    orders_df = dbt.ref("customer_orders_labeled")
    X = orders_df[['age', 'total_price']]
    y = orders_df['return']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

    print("Model init")
    lr_model = LogisticRegression(random_state=123)

    print("Model fitting")
    lr_model.fit(X_train, y_train)

    # Test model
    y_pred = lr_model.predict(X_test)

    print("Preparing the classification report")
    # Create a report and put it in a DataFrame
    model_name = str(uuid.uuid4())
    y_test = y_test.astype(float)
    report = classification_report(y_test, y_pred, output_dict=True)
    report["model_name"] = model_name
    report["date"] = datetime.datetime.now()
    output_df = pd.DataFrame([report])
    output_df = output_df.rename(columns={"0.0": "target_0", "1.0": "target_1"})
    # Keep model_name as a regular column so downstream models can filter on it

    print("Saving the model")
    # Save model weights
    with open(f"ml_models/{model_name}.pkl", "wb") as f:
        pickle.dump(lr_model, f)

    return output_df

The order_return_prediction_models model gets labeled orders data and trains a logistic regression model lr_model. The model is then evaluated and the report is stored in a DataFrame along with a unique model name. Next, we save the model weights to local storage. You can modify this step to store the model weights on a cloud storage platform, such as S3. Finally, the output DataFrame is returned and its contents are therefore persisted in our data warehouse.
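As a rough sketch of the cloud storage variant, the pickled weights could be uploaded with boto3; here, model_name is the unique name generated in the model above, the bucket name is hypothetical, and your credential setup may differ:

import boto3

s3 = boto3.client("s3")
# Upload the pickled model under its unique name; "my-ml-models" is a placeholder bucket
s3.upload_file(f"ml_models/{model_name}.pkl", "my-ml-models", f"{model_name}.pkl")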

We can run this dbt model:

dbt run --select order_return_prediction_models

Making predictions with a stored model

First, we'll try making predictions in a Jupyter notebook; then we'll write a dbt Python model that does this automatically.

In our Jupyter notebook, we can easily find the model with the best accuracy:

models_df = faldbt.ref("order_return_prediction_models")
best_model_name = models_df[
    models_df.accuracy == models_df.accuracy.max()
].model_name[0]

We then load this model from local storage (or a cloud storage provider):

with open(f"../ml_models/{model_name}.pkl", "rb") as f:
    loaded_model = pickle.load(f)

We also load the new order data and check what it looks like:

orders_new_df = faldbt.ref("customer_orders")
orders_new_df.head()

This prints out a table:

   order_id  customer_id  total_price   age
0    1037.0        981.0   193.460803  19.0
1    1027.0        940.0   680.986976  21.0
2    1039.0        123.0   952.906524  22.0
3    1043.0        860.0   545.791012  22.0
4    1046.0        316.0   887.003551  24.0

As we can see, this DataFrame has a similar shape to customer_orders_labeled, except it lacks the return column. This is the column we would like to predict.

So, let's make a prediction:

predictions = loaded_model.predict(orders_new_df[["age", "total_price"]])
orders_new_df["predicted_return"] = predictions
orders_new_df.head()

In the above snippet, we first generate predictions and then attach them to the orders_new_df DataFrame. This is what the head of orders_new_df should look like:

   order_id  customer_id  total_price   age  predicted_return
0    1037.0        981.0   193.460803  19.0               0.0
1    1027.0        940.0   680.986976  21.0               1.0
2    1039.0        123.0   952.906524  22.0               1.0
3    1043.0        860.0   545.791012  22.0               1.0
4    1046.0        316.0   887.003551  24.0               1.0
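If you'd rather work with probabilities than hard 0/1 labels, for example to apply your own decision threshold, the loaded model also exposes predict_proba:

probabilities = loaded_model.predict_proba(orders_new_df[["age", "total_price"]])
orders_new_df["return_probability"] = probabilities[:, 1]  # probability of class 1 (return)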

Let's plot our predictions to see if they make sense:

plot_data = orders_new_df.sample(frac=0.5, random_state=123)

colors = ['red' if r else 'blue' for r in plot_data['predicted_return']]
plt.scatter(plot_data['age'], plot_data['total_price'], c=colors)
plt.xlabel('Age')
plt.ylabel('Total Price')
plt.show()

Here's the resulting plot:

Age and price distribution of orders predicted to be returned (red) and not returned (blue)

If the predicted_return values look good to us, we can create another dbt Python model that will run these predictions automatically. This new dbt model will first pick the best logistic regression model, use it to predict whether or not orders will be returned, and finally store its predictions in our data warehouse.

Here's the definition for our new model, order_return_predictions.py:

import pickle
def model(dbt, fal):
    dbt.config(materialized="table")
    models_df = dbt.ref("order_return_prediction_models")
    best_model_name = models_df[
        models_df.accuracy == models_df.accuracy.max()].model_name[0]
    with open(f"ml_models/{best_model_name}.pkl", "rb") as f:
        loaded_model = pickle.load(f)
    orders_new_df = dbt.ref("customer_orders")
    predictions = loaded_model.predict(orders_new_df[["age", "total_price"]])
    orders_new_df["predicted_return"] = predictions
    return orders_new_df

We can run this dbt model:

dbt run --select order_return_predictions
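Once this model has run, the predictions live in the warehouse like any other dbt model, so you can pull them back into a notebook (or reference them from downstream dbt models) the same way as before:

predictions_df = faldbt.ref("order_return_predictions")
predictions_df.head()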

Conclusion

In this blog post, we have walked through how to use fal and dbt to manage and deploy machine learning models in a scalable and reproducible way. We used a synthetic shopping dataset to train a logistic regression model that predicts the probability of an order being returned. We automated the ML training process in a dbt model, and used the generated ML models to make predictions on new data. Finally, we were able to store the resulting predictions in a new dbt model. All of this can now run automatically.

Have questions? Reach out to us on our Discord server or raise an issue in our GitHub repository. If you're a member of the dbt Slack community, you can always find us in the #tools-fal channel.