How to run sentiment analysis on your dbt models from Python

Sentiment analysis is the practice of determining whether a piece of text is positive, negative, or neutral, typically using natural language processing (NLP).

Applying sentiment analysis to data such as reviews, support tickets, feedback, and survey responses can help you understand how your organization is doing in the eyes of your customers.

In this post, we will use a Hugging Face sentiment-analysis pipeline on Zendesk tickets, and we will automate the process so that it runs on any new tickets coming in.

Exploratory phase

We start by installing all dependencies we will need:

$ pip install fal transformers torch

First, let's prototype this in a Jupyter Notebook to see how it would look. We import FalDbt and point it at our dbt project:

from fal import FalDbt

faldbt = FalDbt(profiles_dir="./fal_dbt_examples", project_dir="./fal_dbt_examples")

Let's see which dbt models are available for us to use here:

print(faldbt.list_model_ids())
# { ...
#   'stg_zendesk_ticket_data': <RunStatus.Success: 'success'>
#  ... }

Now, let's use faldbt.ref to pull the data we want to analyze:

ticket_data = faldbt.ref("stg_zendesk_ticket_data")
ticket_descriptions = list(ticket_data.description)
print(ticket_descriptions)
# ['nice hotel expensive parking got good deal stay hotel anniversary, arrived late evening ', 
#  'Huge disappointment', ...]
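
Real ticket data is rarely this tidy: if some descriptions are missing or blank, feeding them to the classifier will either fail or waste work. A minimal, optional clean-up sketch (the column names match the model above; everything else is an assumption about your data):

# Optional: drop tickets with missing or blank descriptions before classification
clean = ticket_data.dropna(subset=["description"])
clean = clean[clean["description"].str.strip() != ""]
clean_descriptions = list(clean["description"])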

We can already see two very different opinions in the preview. Let's use Hugging Face's transformers library to run sentiment analysis on them:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
description_sentiment_analysis = classifier(ticket_descriptions)
print(description_sentiment_analysis)
# [{'label': 'POSITIVE', 'score': 0.9987179040908813}, 
#  {'label': 'NEGATIVE', 'score': 0.9998005032539368}, ...]
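
Note that we did not tell the pipeline which model to use, so transformers picks a default checkpoint for us (you can see it fall back to distilbert-base-uncased-finetuned-sst-2-english in the fal run output later). For reproducible results you may want to pin it explicitly; a small sketch:

# Pin the exact checkpoint so results don't silently change when the
# library's default sentiment model changes between versions.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)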

This looks great: we now know how to process this data programmatically, and it is something that could bring real value to our organization.

Automating the process

Now that we know how to process the data, we would like to automate it so that it also happens for future tickets. Let's include it in our dbt project.

We attach the script to the appropriate model in the schema file. This instructs fal to run the script only when the stg_zendesk_ticket_data dbt model is run.

# models/schema.yml
models:
  - name: stg_zendesk_ticket_data
    meta:
      fal:
        scripts:
          - scripts/zendesk_sentiment_analysis.py

To turn this into a proper fal script, we need to remove references to the faldbt object and instead use the functions it offers, which fal makes available as globals. So faldbt.ref("stg_zendesk_ticket_data") becomes ref("stg_zendesk_ticket_data"), and so on.

# scripts/zendesk_sentiment_analysis.py
from transformers import pipeline
import pandas as pd
import numpy as np

ticket_data = ref("stg_zendesk_ticket_data")
ticket_descriptions = list(ticket_data.description)
classifier = pipeline("sentiment-analysis")
description_sentiment_analysis = classifier(ticket_descriptions)

# Prepare rows for pandas df
rows = []
for ticket_id, sentiment in zip(ticket_data.id, description_sentiment_analysis):
    rows.append((int(ticket_id), sentiment["label"], sentiment["score"]))

# Set correct types for write_to_source and create df
records = np.array(rows, dtype=[("id", int), ("label", "U8"), ("score", float)])
sentiment_df = pd.DataFrame.from_records(records)

print("Uploading\n", sentiment_df)
write_to_source(sentiment_df, "results", "ticket_desc_sentiment_analysis")

You can see that at the end we added a section that prepares a pandas DataFrame and writes it to a dbt source. For this to work, the source has to be defined in the dbt schema.yml:

# models/schema.yml
sources:
  - name: results
    tables:
      - name: ticket_desc_sentiment_analysis
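
With the source defined, the uploaded predictions can be referenced by downstream dbt models or read back from Python much like we read the ticket data. A rough sketch, assuming fal exposes a source() helper alongside ref() (check the fal docs for the exact signature in your version):

# In a fal script: read the predictions back from the source we just defined.
# (Assumes a source() global analogous to ref(); adjust to your fal version.)
sentiment = source("results", "ticket_desc_sentiment_analysis")
negative_share = (sentiment["label"] == "NEGATIVE").mean()
print(f"{negative_share:.0%} of analyzed tickets look negative")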

Put it to the test

First, let's seed the test data:

$ dbt seed
Running with dbt=0.21.1
Found 6 models, 0 tests, 0 snapshots, 0 analyses, 184 macros, 0 operations, 2 seed files, 1 source, 0 exposures

11:46:59 | Concurrency: 1 threads (target='dev')
11:46:59 |
...
11:47:06 | 2 of 2 START seed file dbt_matteo.zendesk_ticket_data................ [RUN]
11:47:11 | 2 of 2 OK loaded seed file dbt_matteo.zendesk_ticket_data............ [INSERT 10 in 4.25s]
11:47:11 |
11:47:11 | Finished running 2 seeds in 14.27s.

Completed successfully

Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

Then, we run dbt.

$ dbt run
Running with dbt=0.21.1
Found 6 models, 0 tests, 0 snapshots, 0 analyses, 184 macros, 0 operations, 2 seed files, 1 source, 0 exposures

11:47:24 | Concurrency: 1 threads (target='dev')
11:47:24 |
...
11:47:28 | 3 of 6 START table model dbt_matteo.stg_zendesk_ticket_data.......... [RUN]
11:47:31 | 3 of 6 OK created table model dbt_matteo.stg_zendesk_ticket_data..... [CREATE TABLE (10.0 rows, 2.6 KB processed) in 3.03s]
11:47:31 | 4 of 6 START table model dbt_matteo.zendesk_ticket_metrics........... [RUN]
11:47:35 | 4 of 6 OK created table model dbt_matteo.zendesk_ticket_metrics...... [CREATE TABLE (76.0 rows, 2.2 KB processed) in 3.72s]
...
11:47:40 |
11:47:40 | Finished running 4 table models, 2 view models in 19.15s.

Completed successfully

Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6

And finally, running fal downloads the rows from the data warehouse, processes them, and writes the results right back into the warehouse for further dbt usage or easy reference from external tools.

$ fal run
11:52:44 | Starting FAL run for following models (model_name: scripts):
zendesk_ticket_metrics:
stg_zendesk_ticket_data: fal_scripts/zendesk_sentiment_analysis.py

Running script .../fal_dbt_examples/fal_scripts/zendesk_sentiment_analysis.py for model stg_zendesk_ticket_data
Running query
SELECT * FROM `<project>`.`dbt_matteo`.`stg_zendesk_ticket_data`
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
Uploading
       id     label     score
0   6605  NEGATIVE  0.999511
1   1966  NEGATIVE  0.992293
2   6171  POSITIVE  0.998718
3   1404  NEGATIVE  0.999476
4  14173  POSITIVE  0.999527
5   4721  NEGATIVE  0.999801
6  16988  POSITIVE  0.999876
7  11013  NEGATIVE  0.867994
8  11071  POSITIVE  0.994939
9   1595  POSITIVE  0.956718
Running query
CREATE TABLE `dbt_matteo`.`ticket_data_sentiment_analysis` (
	`id` INT64,
	`label` STRING,
	`score` FLOAT64
)
Running query
INSERT INTO `dbt_matteo`.`ticket_data_sentiment_analysis` (`id`, `label`, `score`) VALUES (6605, 'NEGATIVE', 0.9995110034942627), (1966, 'NEGATIVE', 0.9922933578491211), (6171, 'POSITIVE', 0.9987179040908813), (1404, 'NEGATIVE', 0.9994755387306213), (14173, 'POSITIVE', 0.9995272159576416), (4721, 'NEGATIVE', 0.9998005032539368), (16988, 'POSITIVE', 0.9998757839202881), (11013, 'NEGATIVE', 0.8679937124252319), (11071, 'POSITIVE', 0.9949386119842529), (1595, 'POSITIVE', 0.9567179083824158), ...

Conclusion

We explored how sentiment analysis on text can bring value to an organization and then automated the processing so the results are easy to access. Instead of writing the predictions back to the data warehouse, we could just as well have sent them as an image report in a Slack notification, or maybe an email. For more examples, and to explore more ways to use the tool with dbt, check out the fal repo or join our Discord. We are always open to ideas on how you would want to use your dbt project in new ways.