Building language model powered pipelines with dbt

Yes, we know… Generative AI is all the hype right now. Whether you are an Analytics Engineer or a Data Engineer, you might be wondering whether this stuff is only good for AI-generated profile pics, or whether you can actually use it in your day-to-day work. It turns out there are plenty of genuinely useful things you can do with large language models (LLMs), including simplifying some complex data pipelines.

In this post, we will walk through an example of extracting physical addresses from free-form text fields using a combination of dbt and dbt-fal. We will build a dbt Python model using dbt-fal, a dbt adapter that lets us run Python models against any database (including BigQuery and PostgreSQL). Our Python model will load a deep learning model for Named Entity Recognition (NER), which will extract addresses (along with other useful entities such as persons and organizations) from free-form text. This approach works well whenever you need to pull named entities out of unstructured text. There are many more exciting use cases for deep learning models, and we are excited to explore them over the coming weeks.

Why use an LLM instead of regex?

LLMs can be remarkably good at turning unstructured text into meaningful data, especially when the model is trained for a specific task. The skeptic in you might claim that extracting data from unstructured text is a job for regular expressions. But we have all written regular expressions that worked 80% of the time and produced unexpected bugs, so maybe there is a better way. Among the benefits of LLMs: they can capture complex patterns, generalize to unseen data, handle multiple data types, and be fine-tuned to suit your needs.
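To make this concrete, here is a minimal sketch of the difference. The regex and the reworded sentence are our own illustrative examples; the model is the same dslim/bert-large-NER checkpoint we use later in this post.

import re

sentence = "My name is Jane and I live in San Francisco"

# A naive regex captures everything after "live in", but it breaks as soon
# as the sentence is phrased differently.
match = re.search(r"live in (.+)$", sentence)
print(match.group(1) if match else None)  # "San Francisco"
print(re.search(r"live in (.+)$", "San Francisco is where I live"))  # None

# The NER pipeline tags tokens by entity type instead of relying on the
# surrounding phrasing, so the reworded sentence still works.
from transformers import pipeline

nlp = pipeline("ner", model="dslim/bert-large-NER")
print(nlp("San Francisco is where I live"))  # "San" -> B-LOC, "Francisco" -> I-LOC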

Getting Started

We start by installing dbt-fal together with the extra for our data warehouse:

pip install "dbt-fal[bigquery]" # You can use snowflake or postgres here

We then start a new dbt project by running dbt init, and configure our profiles.yml file so that dbt uses the dbt-fal adapter. You can find the instructions here.
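For reference, here is a minimal sketch of what that profiles.yml might look like with BigQuery. The dbt-fal output points at a regular warehouse profile via db_profile; every name and path below is a placeholder:

default:
  target: dev
  outputs:
    dev:
      type: fal
      db_profile: warehouse
    warehouse:
      type: bigquery
      method: service-account
      project: my-gcp-project
      dataset: my_dataset
      keyfile: /path/to/keyfile.json
      threads: 4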

Create raw data and seed it

We now create a CSV file with some free-form text. This is for illustrative purposes; in a real-world use case, this data would most likely come from another upstream table.

free_form_text.csv looks like this:

sentences
My name is Jane and I live in San Francisco
My name is Burkay and I live in New York
My name is Alicia and I live in London
My name is John and I live in Berlin

To load the data into your data warehouse, run dbt seed.
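If your warehouse infers the wrong type for the column, you can optionally pin it in dbt_project.yml (the project name below is a placeholder):

seeds:
  my_project:
    free_form_text:
      +column_types:
        sentences: string # use "text" on Postgres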

Define a new Python environment

The Python model that we will create needs some dependencies to run successfully. One way to get them is to install the dependencies locally (assuming dbt is running locally), but for production settings we recommend using fal environments.

To use fal environments, we create a fal_project.yml file to define a new environment called named_entity_recognition. This is a very useful feature from dbt-fal that allows us to create reusable, isolated Python environments. Our fal_project.yml file looks like this:

environments:
  - name: named_entity_recognition
    type: venv
    requirements:
      - transformers
      - torch
      - accelerate
      - datasets
      - pandas

Here we include all the dependencies that our ML model will need: transformers, torch, accelerate, datasets, and pandas.

Create a new Python model that will extract addresses

Now we get to the crux of this tutorial: we create a Python model called extracted_addresses.py that extracts addresses from the free-form text. The model loads a Named Entity Recognition model hosted on Hugging Face, processes the upstream table of example sentences, and outputs the results into a table called extracted_addresses.

def model(dbt, fal):
    from transformers import AutoTokenizer, AutoModelForTokenClassification
    from transformers import pipeline
    from pathlib import Path
    from datasets import Dataset
    import pandas as pd

    # Run this model in the isolated environment defined in fal_project.yml
    dbt.config(fal_environment="named_entity_recognition")

    # Cache the model weights locally so subsequent runs skip the download
    cache_dir = Path("~/data/huggingface/large-ner").expanduser()
    cache_dir.mkdir(parents=True, exist_ok=True)

    tokenizer = AutoTokenizer.from_pretrained(
        "dslim/bert-large-NER", cache_dir=cache_dir
    )
    model = AutoModelForTokenClassification.from_pretrained(
        "dslim/bert-large-NER", cache_dir=cache_dir
    )

    nlp = pipeline("ner", model=model, tokenizer=tokenizer)

    # Pull the upstream seed table and run NER over every sentence
    df = dbt.ref("free_form_text")
    dataset = Dataset.from_pandas(df)

    ner_results = nlp(dataset["sentences"])
    return pd.DataFrame({"results": ner_results})
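While iterating on this model, it can be handy to build it in isolation rather than running the whole project; standard dbt node selection works as usual:

dbt run --select extracted_addresses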

Run dbt!

Now that everything is ready, we can run our pipeline with dbt run. The first run might take some time, since dbt-fal installs all the required packages; subsequent runs will be very fast because the environment is cached. You will get a table with one row of NER results per input sentence.

These results are a bit cryptic, but let’s dive in and see whether we can make sense of them. Each row represents the entities extracted from a given sentence. Let’s look at the first row.

For the input sentence “My name is Jane and I live in San Francisco”, the model has identified three entities: “Jane”, “San”, and “Francisco”.

“Jane” (as given by the results.word column) is a B-PER entity (the results.entity column), which means she is a “Person”. results.start and results.end tell us where the string starts and ends in the original sentence; in the case of “Jane”, the word spans characters 11 through 15. Similarly, “San” and “Francisco” are parts of a “Location” entity, where B-LOC marks the beginning of the location and I-LOC marks a continuation that is also part of the “Location”.
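For reference, here is what the raw pipeline output for the first row looks like. The keys come straight from the Hugging Face token-classification pipeline; the score and index values below are illustrative rather than exact:

[
    {"entity": "B-PER", "score": 0.99, "index": 4, "word": "Jane", "start": 11, "end": 15},
    {"entity": "B-LOC", "score": 0.99, "index": 9, "word": "San", "start": 30, "end": 33},
    {"entity": "I-LOC", "score": 0.99, "index": 10, "word": "Francisco", "start": 34, "end": 43},
]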

Let’s make things faster: dbt using GPUs!

If you got this far and ran the example, you may have noticed that it is a little slow, or that it did not run at all on a less powerful machine. That is because dbt-fal runs all Python models locally by default, and these LLMs typically require more compute resources to run efficiently. Now comes the cool part: with a tiny change to our dbt-fal configuration, we can run this model much faster on GPUs in the cloud.

To do this, change profiles.yml to add your fal-serverless keys (reach out to us if you want access):

default:
  outputs:
    production:
      db_profile: warehouse
      type: fal
      host: REACH_OUT_TO_FAL_TEAM_TO_ACCESS
      key_id: REACH_OUT_TO_FAL_TEAM_TO_ACCESS
      key_secret: REACH_OUT_TO_FAL_TEAM_TO_ACCESS

Then add a machine configuration to your model, and don’t forget to run the Hugging Face pipeline on the GPU device:

def model(dbt, fal):
    ...
    dbt.config(fal_environment="named_entity_recognition", fal_machine="GPU")
    ...
    nlp = pipeline("ner", model=model, tokenizer=tokenizer, device="cuda:0")
    ...

Magical! With a few config changes, we went from running dbt-fal locally to running it in the cloud with GPUs whenever we need the performance to scale to larger datasets.

Summary

In this post, we went over a practical use case for language models in Analytics Engineering workloads: extracting addresses from free-form text. We started with a local dbt-fal project, iterated locally, and then pushed our workload to a GPU machine in the cloud with almost no effort.

To find the working code example, you can go here. If you want access to fal-serverless, don’t hesitate to reach out to us at hello@fal.ai.

We would love to hear more from you about AI! What else would you like to solve with it? Let us know in the #tools-fal channel on dbt Slack, and don’t forget to star our GitHub repo.