Accelerate Hyperparameter Tuning with fal

We demonstrate how to speed up hyperparameter tuning using fal-serverless, so you can quickly explore the search space and find the best model configuration.

Accelerate Hyperparameter Tuning with fal

Hyperparameter tuning is an essential step in the machine learning pipeline. It involves optimizing the configuration settings of a model. These settings, known as hyperparameters, control the model's learning process and can greatly impact its performance. Finding the best combination of hyperparameters is a challenging task, often involving extensive exploration and evaluation of numerous configurations. This search can be computationally expensive and time-consuming, especially for complex models and large datasets.

In this blog post, we'll demonstrate how to speed up hyperparameter tuning using fal-serverless, allowing you to efficiently explore the search space and find the best model configuration. fal-serverless is our new Python library that helps you execute functions in an isolated, serverless environment. It allows you to offload computationally expensive tasks to cloud environments. Additionally, fal-serverless provides a way to concurrently run computations in separate environments, which enables faster processing and parallelization.

Step 0: Install dependencies and authenticate

We only need three dependencies: fal-serverless, scikit-learn and pandas.

pip install fal-serverless scikit-learn pandas
fal-serverless auth login

More info on authentication.

Step 1: Import modules and prepare data

For this example, we'll use a synthetic dataset generated using scikit-learn's make_classification function. The dataset contains 1,000 samples, each with 10 features, and a binary target variable. It can be downloaded from here.

import pandas as pd
from sklearn.model_selection import ParameterGrid
from concurrent.futures import as_completed
from fal_serverless import isolated

data = pd.read_csv("data.csv")
X, y = data.drop("target", axis=1), data["target"]

Step 2: Define the hyperparameter grid

We'll use scikit-learn's RandomForestClassifier and search for the best values of n_estimators, max_depth, and min_samples_split. We'll set up a parameter grid using ParameterGrid:

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
grid = ParameterGrid(param_grid)

This grid defines 18 different parameter combinations. Using fal-serverless, we will be able to quickly evaluate all of them.

Step 3: Create the isolated function

Next, we'll define an isolated function using the @isolated decorator from the fal_serverless library. This function will take a set of hyperparameters, the feature matrix X, and the target vector y, train a RandomForestClassifier model, and return the accuracy score. We also specify the required packages and machine type for the isolated environment:

@isolated(requirements=["pandas", "scikit-learn"], machine_type="L")
def evaluate_params(params, X, y):
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    model = RandomForestClassifier(**params)
    model.fit(X, y)
    preds = model.predict(X)
    return accuracy_score(y, preds)

Step 4: Parallelize the hyperparameter tuning process using submit

To accelerate the hyperparameter tuning process, we'll use the submit method provided by fal_serverless to start multiple instances of the evaluate_params function in parallel. Then, we'll use the as_completed function from the concurrent.futures module to wait for all the tasks to complete:

futures = [evaluate_params.submit(params, X, y) for params in grid]
results = []

for future in as_completed(futures):
    results.append((future.result(), grid[len(results)]))

Step 5: Find the best hyperparameters

Finally, we can find the best hyperparameters by selecting the configuration with the highest accuracy score:

print("Best parameters:", max(results, key=lambda x: x[0])[1])

Summary

By using the fal-serverless library and the submit method, we've accelerated the hyperparameter tuning process, allowing for efficient exploration of the search space and finding the best model configuration. This approach can be easily adapted to other machine learning models and search spaces. It can also be incorporated into a dbt project, either by using the FalDbt class or fal hooks.

For more information on fal-serverless, check out our documentation and our examples. Feel free to reach out to our team on Discord and in dbt Slack community. 🚀