Feature Selection for ML with dbt and fal

We explore how to use fal together with dbt for feature selection: starting with synthetic data, then analyzing and selecting relevant features, and finally training an ML model.

Feature selection is a critical step in building accurate and efficient predictive machine learning (ML) models. While dbt excels at managing data transformations, integrating dbt models into ML pipelines can be challenging. fal is designed to bridge the gap between dbt and Python, making it easy to incorporate dbt models into ML projects. In this blog post, we'll explore how to use fal together with dbt for feature selection: we start with synthetic data, analyze it, select the relevant features, and train an ML model.

The accompanying example project is available on GitHub. If you'd like to clone and set up this project, make sure to configure your profiles.yml file and install the required dependencies by running pip install -r requirements.txt from the project directory. After that, execute dbt seed and dbt run to populate your database with synthetic data. The example project is set up to work with a Postgres database, but you can adjust the profile and requirements for your own adapter.

In ML, features represent the input attributes used to train a model, while labels represent the output the model aims to predict. Columns within dbt models can be thought of as potential features and labels. Our example project focuses on predicting product returns for a shop, with synthetic data divided into two tables: 'orders' and 'customer_data'. For instance, the 'orders' table contains columns like 'quantity' and 'total_amount', while the 'customer_data' table includes 'average_order_value', 'return_rate', and 'customer_segment'. These columns can be used as features to train a predictive model, with the 'returned' column in the 'orders' table serving as the target label.
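As a minimal sketch of the feature/label split described above, the snippet below builds a tiny stand-in for the 'orders' table. The column names come from the post, but the rows are invented purely for illustration and are not taken from the example project.

```python
import pandas as pd

# Hypothetical miniature of the 'orders' table; the values are made up.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "quantity": [1, 3, 2, 5],
    "total_amount": [20.0, 75.0, 40.0, 150.0],
    "returned": [0, 1, 0, 1],
})

# Features are the model inputs; the label is the column we want to predict.
feature_columns = ["quantity", "total_amount"]
X = orders[feature_columns]
y = orders["returned"]

print(X.shape, y.shape)
```

Identifier columns such as 'order_id' are left out of X because they carry no predictive signal of their own.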

Figure: A two-dimensional feature space, where each axis represents a feature and each colored point symbolizes an example with a specific label.

The 'customer_orders' model consolidates information from the 'orders' and 'customer_data' tables, providing us with a comprehensive dataset containing all possible features. However, it's important to recognize that not all features may be relevant for creating an effective ML model. Including too many features can lead to overfitting and increased complexity, resulting in a model that doesn't generalize well to new data. As a result, it's crucial to analyze the available features in the 'customer_orders' model and select only the most relevant ones to build a robust and accurate predictive model for product returns.
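In the example project this consolidation happens in a dbt model. The pandas sketch below only illustrates the idea; the join key ('customer_id'), the left join, and all row values are assumptions made for this illustration, not the project's actual SQL.

```python
import pandas as pd

# Hypothetical rows standing in for the two source tables.
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 2, 1],
    "quantity": [2, 1, 4],
    "total_amount": [40.0, 15.0, 90.0],
    "returned": [0, 0, 1],
})
customer_data = pd.DataFrame({
    "customer_id": [1, 2],
    "average_order_value": [65.0, 15.0],
    "return_rate": [0.5, 0.0],
    "customer_segment": ["loyal", "new"],
})

# One row per order, enriched with the attributes of the ordering customer.
# validate="many_to_one" raises if customer_data has duplicate customer_id values.
customer_orders = orders.merge(
    customer_data, on="customer_id", how="left", validate="many_to_one"
)

print(customer_orders)
```

The result keeps one row per order, so the 'returned' label stays aligned with both order-level and customer-level features.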

There are several methods available for feature selection, including filter methods, wrapper methods, and embedded methods. Filter methods score each feature by its statistical relationship with the target variable, independently of any model, often using correlation coefficients or tests such as Chi-square or ANOVA. Wrapper methods evaluate different feature subsets with a specific ML algorithm and select the subset that yields the best performance. Embedded methods perform feature selection as part of model training itself; LASSO (L1-regularized) regression, for example, can shrink the coefficients of uninformative features to exactly zero. Ridge (L2) regression, by contrast, only shrinks coefficients toward zero and does not by itself remove features.
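To make the three families concrete, here is a rough sketch using scikit-learn on synthetic data. The dataset, the choice of three features to keep, and the regularization strength C=0.05 are all assumptions picked for illustration; none of them come from the example project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic classification data: 8 features, of which 3 are informative.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)

# Filter: score each feature with an ANOVA F-test and keep the top 3.
filter_selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
filter_mask = filter_selector.get_support()

# Wrapper: recursive feature elimination repeatedly refits the model
# and drops the weakest feature until 3 remain.
wrapper_selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
wrapper_mask = wrapper_selector.get_support()

# Embedded: an L1 (LASSO-style) penalty zeroes out some coefficients during
# training; SelectFromModel keeps the features with non-zero coefficients.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
embedded_selector = SelectFromModel(l1_model).fit(X, y)
embedded_mask = embedded_selector.get_support()

print("filter:  ", filter_mask)
print("wrapper: ", wrapper_mask)
print("embedded:", embedded_mask)
```

Each mask is a boolean array with one entry per feature. The three approaches need not agree, which is one reason to compare more than one of them on real data.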

In this blog post, we will use a correlation matrix for feature selection, which is a filter method that allows us to quantify the linear relationship between each feature and the target variable, as well as the relationships between features themselves. By analyzing the correlation matrix, we can identify features that have a strong linear relationship with the target variable and are thus more likely to be relevant for the predictive model. It's important to note that correlation does not imply causation, and a high correlation between two variables might not always mean that one variable directly influences the other. Nonetheless, a correlation matrix is a useful starting point for determining which features are likely to contribute the most to the model's predictive power.
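As a small illustration of using correlation as a filter, the sketch below ranks features by the absolute value of their correlation with the target. The data is synthetic and the column names ('strong_feature', 'weak_feature', 'noise_feature', 'target') are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)

# Three synthetic features with decreasing dependence on the target.
df = pd.DataFrame({
    "strong_feature": signal + rng.normal(scale=0.5, size=n),
    "weak_feature": signal + rng.normal(scale=3.0, size=n),
    "noise_feature": rng.normal(size=n),
    "target": (signal > 0).astype(int),
})

# Take the target's column of the correlation matrix, drop the trivial
# self-correlation, and rank features by absolute correlation.
target_corr = df.corr()["target"].drop("target")
ranking = target_corr.abs().sort_values(ascending=False)

print(ranking)
```

Ranking by absolute value treats strong negative correlations as just as informative as strong positive ones.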

In the example Jupyter notebook, we begin by importing the necessary modules:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from fal import FalDbt

Next, we initialize a FalDbt object called faldbt by specifying the paths to the dbt project directory and the directory that contains profiles.yml.

faldbt = FalDbt(project_dir=".", profiles_dir=".")

# Print models and statuses
models = faldbt.list_models()
for model in models:
    print(f"model: {model.name}, status: {model.status}")

# Load the customer_orders model as a pandas DataFrame and preview it
customer_orders_df = faldbt.ref("customer_orders")
customer_orders_df.head()

The list_models() method retrieves a list of dbt models, and we iterate through the list, printing the name and status of each model. This helps us ensure that all models are ready for use in our feature selection process. We then use the faldbt.ref() method to fetch the 'customer_orders' model and store it as a pandas DataFrame called customer_orders_df. By calling the head() method on this DataFrame, we can quickly preview the first few rows of the 'customer_orders' model.

It's time to build our correlation matrix:

# Work on a copy so the DataFrame loaded from dbt stays unchanged
data = customer_orders_df.copy()

# Encode the 'customer_segment' categorical variable as integer codes
data['customer_segment'] = pd.Categorical(data['customer_segment']).codes

# Exclude ID columns
columns_to_exclude = ['customer_id', 'order_id']
data_filtered = data.drop(columns=columns_to_exclude)

# Compute the correlation matrix
corr_matrix = data_filtered.corr()

# Visualize the correlation matrix using a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

Since the 'customer_segment' variable is categorical ("loyal", "new" or "infrequent"), we start by encoding it numerically so that it can be included in the correlation matrix. Note that pd.Categorical assigns integer codes in alphabetical order of the category names, so the numeric order of the codes is arbitrary rather than meaningful. We then compute the correlation matrix using the corr() method on the DataFrame. This matrix contains pairwise correlation coefficients between all variables in the dataset. To visualize it, we use Seaborn's heatmap function.
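For a categorical variable without a natural order, one-hot encoding is a common alternative to integer codes. The sketch below is not part of the example project and uses invented rows; it shows both encodings side by side so the difference is visible.

```python
import pandas as pd

# Hypothetical sample; the rows are invented for illustration.
df = pd.DataFrame({
    "customer_segment": ["loyal", "new", "infrequent", "loyal", "new", "infrequent"],
    "returned": [0, 1, 1, 0, 0, 1],
})

# Integer codes follow alphabetical category order: infrequent=0, loyal=1, new=2.
codes = pd.Categorical(df["customer_segment"]).codes

# One-hot encoding creates one 0/1 indicator column per segment instead.
dummies = pd.get_dummies(df["customer_segment"], prefix="segment", dtype=int)

# Correlate each indicator with the label separately.
encoded = pd.concat([dummies, df["returned"]], axis=1)
per_segment_corr = encoded.corr()["returned"].drop("returned")

print(list(codes))
print(per_segment_corr)
```

With indicator columns, each segment gets its own correlation with the label, so the result does not depend on how the categories happen to be ordered.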

The correlation coefficients range from -1 to 1, where -1 indicates a perfect negative linear correlation, 1 indicates a perfect positive linear correlation, and values close to 0 indicate little to no linear correlation.
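By default, pandas' corr() computes the Pearson correlation coefficient. As a sketch of what that number means, the hypothetical helper pearson_r below implements the textbook formula (covariance divided by the product of the standard deviations) and checks it on toy data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)

x = [1, 2, 3, 4, 5]
r_increasing = pearson_r(x, [2, 4, 6, 8, 10])   # y rises linearly with x
r_decreasing = pearson_r(x, [10, 8, 6, 4, 2])   # y falls linearly with x
r_mixed = pearson_r(x, [3, 1, 4, 1, 5])         # no clean linear pattern

print(r_increasing, r_decreasing, r_mixed)
```

The first two results sit at the ends of the range (1 and -1, up to floating-point rounding), while the third lands somewhere in between.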

From the correlation matrix, we can observe several relationships between the features and the 'returned' label. For instance, 'total_amount' (0.206168) and 'quantity' (0.183583) have a weak positive correlation with the 'returned' label, indicating that as these values increase, the likelihood of a product being returned tends to increase as well. The most notable observation is the strong positive correlation between 'return_rate' and 'returned' (0.680370), which makes sense, since customers with a higher return rate are more likely to return a product. On the other hand, 'customer_segment' has a weak negative correlation with the 'returned' label (-0.219652). Because the segment codes are arbitrary integers assigned to unordered categories, this value suggests that the likelihood of a return differs between segments, not that returns fall as the segment 'increases'.
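The same matrix can also be read for relationships between the features themselves: two features that are highly correlated with each other carry largely redundant information. The sketch below flags such pairs on synthetic data; the 0.8 threshold and the generated values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
quantity = rng.integers(1, 10, size=n).astype(float)

# Synthetic features: total_amount is built to track quantity closely.
features = pd.DataFrame({
    "quantity": quantity,
    "total_amount": quantity * 20 + rng.normal(scale=5, size=n),
    "average_order_value": rng.normal(loc=60, scale=15, size=n),
})

corr = features.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
redundant_pairs = pairs[pairs > 0.8]

print(redundant_pairs)
```

When a pair is flagged, a common choice is to keep whichever of the two features correlates more strongly with the label.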

To illustrate how different features affect the accuracy of a logistic regression model, we can train the model with various combinations of features and compare the results. Here's a Python script that demonstrates this process using scikit-learn:

# Encode the 'customer_segment' categorical variable (skip if already encoded above)
data['customer_segment'] = pd.Categorical(data['customer_segment']).codes

# Define the label
y = data['returned']

# Define feature sets with different combinations of features
feature_sets = [
    ['quantity', 'total_amount'],
    ['quantity', 'total_amount', 'average_order_value'],
    ['quantity', 'total_amount', 'return_rate'],
    ['quantity', 'total_amount', 'customer_segment'],
    ['quantity', 'total_amount', 'average_order_value', 'return_rate', 'customer_segment'],
]

for features in feature_sets:
    # Select features for this iteration
    X = data[features]

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Create and train the logistic regression model
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Calculate the accuracy of the model
    accuracy = accuracy_score(y_test, y_pred)

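    # Build the classification report (precision, recall, F1-score) and the confusion matrix.
    # Note: roc_auc_score receives hard class predictions here, which makes it equal to
    # balanced accuracy; pass model.predict_proba(X_test)[:, 1] for a threshold-independent ROC-AUC.
    report = classification_report(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred)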

    # Print the results
    print(f"Features: {features}")
    print(f"Accuracy: {accuracy}")
    print(f"ROC-AUC: {roc_auc}")
    print(f"Classification report:\n{report}")
    print(f"Confusion matrix:\n{cm}\n")

This script trains logistic regression models with different combinations of features and evaluates them on a number of model metrics. By comparing these metrics, we can get an idea of how different features impact the model's performance. Here is an example result:
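When comparing several feature sets, it can be easier to collect the metrics in a table than to read them from printed reports. The following is a minimal sketch on synthetic data; the feature names f0 to f3 and the choice of accuracy and F1 as metrics are assumptions for illustration, not results from the example project.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer_orders data.
X_arr, y = make_classification(
    n_samples=600, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
data = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

feature_sets = [["f0"], ["f0", "f1"], ["f0", "f1", "f2", "f3"]]

rows = []
for features in feature_sets:
    # Stratify so both classes keep the same proportions in train and test.
    X_train, X_test, y_train, y_test = train_test_split(
        data[features], y, test_size=0.3, random_state=42, stratify=y
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rows.append({
        "features": ", ".join(features),
        "n_features": len(features),
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    })

# One row per feature set, best accuracy first.
results = pd.DataFrame(rows).sort_values("accuracy", ascending=False)
print(results)
```

Using the same random_state for every split keeps the test rows identical across feature sets, so differences in the metrics come from the features rather than from the split.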

Features: ['quantity', 'total_amount', 'average_order_value', 'return_rate', 'customer_segment']
Accuracy: 0.9451510333863276
ROC-AUC: 0.910433569979716
Classification report:
              precision    recall  f1-score   support

           0       0.96      0.97      0.97       986
           1       0.89      0.85      0.87       272

    accuracy                           0.95      1258
   macro avg       0.93      0.91      0.92      1258
weighted avg       0.94      0.95      0.94      1258

Confusion matrix:
[[958  28]
 [ 41 231]]

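This result shows that the model with the features 'quantity', 'total_amount', 'average_order_value', 'return_rate' and 'customer_segment' performed quite well. It achieved an accuracy of 94.5%, which means it correctly predicted the outcome for 94.5% of the test cases. For context, 986 of the 1,258 test orders (about 78%) were not returned, so a model that always predicts 'not returned' would already reach roughly 78% accuracy; the per-class figures are therefore worth reading too, and the model reaches a recall of 0.85 on returned orders. The reported ROC-AUC score of 0.91 was computed from hard class predictions rather than predicted probabilities, so it is best read as balanced accuracy; it still indicates that the model distinguishes well between returned and non-returned products.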
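The headline numbers in the example output can be re-derived from the confusion matrix alone, which is a useful sanity check. The sketch below uses only the four counts printed above; the variable names are ours.

```python
import numpy as np

# Confusion matrix from the example output: rows are true classes (0, 1),
# columns are predicted classes (0, 1).
cm = np.array([[958, 28],
               [41, 231]])
tn, fp, fn, tp = cm.ravel()
total = cm.sum()

accuracy = (tn + tp) / total

# Accuracy of always predicting the majority class ("not returned").
majority_baseline = max(tn + fp, fn + tp) / total

# Recall per class, and their mean (balanced accuracy).
recall_not_returned = tn / (tn + fp)
recall_returned = tp / (tp + fn)
balanced_accuracy = (recall_not_returned + recall_returned) / 2

print(f"accuracy:          {accuracy:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
```

The balanced accuracy matches the value printed as ROC-AUC in the example output, which is what roc_auc_score returns when it is given hard 0/1 predictions instead of probabilities.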

To improve the dbt model, we can now refine the 'customer_orders' model to include only the relevant features. This will streamline the model and make it more efficient, especially when it comes to integrating with machine learning pipelines.

In conclusion, this blog post demonstrated how to use fal and dbt to preprocess data, perform feature selection, and train machine learning models. By analyzing the correlation matrix, we identified the features most strongly related to product returns and compared how different feature sets affect a logistic regression model. Integrating dbt with your machine learning pipeline can simplify feature engineering and help ensure that your models are built on reliable and well-organized data.

In our upcoming blog post, we will delve into the world of feature stores. Feature stores are a powerful tool for managing, sharing, and serving features to machine learning models. They facilitate collaboration between data scientists and engineers, enabling them to work on features independently and reducing duplicated efforts. Stay tuned to learn more about how you can build a feature store with fal and dbt.