How to run Python with dbt Cloud using GitHub Actions and the dbt Cloud API

Image of fal and dbt logos on a cloud
We are in the clouds.

dbt is our favorite tool to build data pipelines. It allows us to skip boilerplate data engineering code, focus on SQL, and build with software engineering practices like reusability, auto-generated docs and tests. We recently introduced our open source project fal, which lets us run Python code alongside dbt, making our data stack even more powerful.

With the dbt + fal combination, we no longer had to set up a full-fledged orchestrator like Airflow to run Python code. This saved us the cost and the headache of maintaining another piece of infrastructure. Since the release of fal, we have received overwhelmingly positive reactions and contributions from the community, but one issue stood out: How do we run fal with dbt Cloud?

As dbt Cloud users ourselves, we love the simplicity of its IDE and its scheduling functionality. However, we could not run fal due to a limitation of dbt Cloud: it does not allow running any executables other than dbt commands.

Wouldn't it be great to be able to use dbt Cloud and at the same time run custom Python scripts on dbt models? Here we describe a way of doing just that using GitHub Actions as a fal runtime.

We chose GitHub Actions as a runtime for its simplicity and ubiquity. For longer-running workloads and better debugging, we envision that there could be a better runtime for fal scripts (stay tuned 🙂).

Prerequisites

  1. Data warehouse account (Snowflake, BigQuery)
  2. dbt Cloud account
  3. dbt Cloud project connected to a GitHub repository

Example Project

Our example project contains Zendesk ticket data. Zendesk is a customer support platform and Zendesk tickets contain customer messages that describe the problem they ran into. We want to do a simple transformation on this data with dbt and then send a Slack message after a dbt Cloud run.

Our example repository comes with raw data in CSV format. Before we set up any jobs, we should run dbt seed from the dbt Cloud IDE. For this to work, two environment variables need to be set: DBT_GCLOUD_PROJECT (GCP project ID) and DBT_BQ_DATASET (BigQuery dataset name for seeding). To find out more about dbt seed, see here.

Implementation overview

Below is an overview of how we are going to run fal after a dbt Cloud run, using GitHub Actions as a Python runtime.

Diagram describing our proposed approach
Proposed approach

Here, GitHub Actions stitches together our dbt Cloud job and our fal scripts. The first step in our workflow triggers a dbt Cloud job through the new dbt Cloud GitHub Action that we just published. Once the triggered job is complete, the fal run command is run. Here is a working example, and below we go into further detail on how it all comes together.

Set up a dbt Cloud job

In dbt Cloud, on the Jobs page, click the "New Job" button. A job creation form opens; here you can set up the environment, commands and configurations for the dbt job. Switch off the schedule checkbox, as scheduling will be taken care of by GitHub Actions. Once you click “Save”, you can see the new job ID as well as your account ID in the URL. Copy these IDs; they will come in handy later.
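At the time of writing, the job page URL follows a pattern roughly like https://cloud.getdbt.com/#/accounts/<account_id>/projects/<project_id>/jobs/<job_id>/ (the exact shape may differ between dbt Cloud versions), so both IDs can be read straight from the address bar.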

dbt Cloud Jobs interface

Repository secrets

In order to trigger dbt Cloud jobs and run fal scripts, GitHub needs access to a number of sensitive values; for example, the fal run command needs a profiles.yml to communicate with your data warehouse. These values should be stored as repository secrets:

  • DBT_CLOUD_API_TOKEN - dbt Cloud API token
  • PROFILES_YML - contents of dbt profiles.yml
  • SLACK_BOT_TOKEN - token for the Slack bot
  • SLACK_BOT_CHANNEL - Slack channel ID for sending messages
  • DBT_GCLOUD_PROJECT - GCP project ID
  • DBT_BQ_DATASET - BigQuery dataset name for seeding

See our dbt Slack integration article for instructions on how to get Slack variables.
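For reference, here is a minimal sketch of what the PROFILES_YML secret could contain for a BigQuery target. The profile name, authentication method and keyfile path below are placeholders rather than the example project's actual values; match them to your own connection setup, and note that a service-account key would itself need to be made available to the runner (for example as another secret):

zendesk_project:                      # hypothetical profile name; must match dbt_project.yml
  target: prod
  outputs:
    prod:
      type: bigquery
      method: service-account        # or oauth, depending on how you authenticate
      keyfile: ./keyfile.json        # placeholder path to a service-account key
      project: "{{ env_var('DBT_GCLOUD_PROJECT') }}"
      dataset: "{{ env_var('DBT_BQ_DATASET') }}"
      threads: 4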

GitHub Action Workflow

GitHub Actions run workflows specified in YAML files. Here’s our example YAML file:

name: Run fal scripts
on:
  workflow_dispatch:

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: actions/setup-python@v2
        with:
          python-version: "3.8.x"

      - name: Install dependencies
        run: |
          pip install --upgrade --upgrade-strategy eager -r requirements.txt
      
      - uses: fal-ai/dbt-cloud-action@main
        id: dbt_cloud_run
        with:
          dbt_cloud_token: ${{ secrets.DBT_CLOUD_API_TOKEN }}
          dbt_cloud_account_id: 123456 # Copied from the dbt Cloud UI
          dbt_cloud_job_id: 54732 # Copied from the dbt Cloud UI

      - name: Run fal scripts
        env:
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
          SLACK_BOT_CHANNEL: ${{ secrets.SLACK_BOT_CHANNEL }}
          DBT_GCLOUD_PROJECT: ${{ secrets.DBT_GCLOUD_PROJECT }}
          DBT_BQ_DATASET: ${{ secrets.DBT_BQ_DATASET }}
          PROFILES_YML: ${{ secrets.PROFILES_YML }}
        run: |
          echo "$PROFILES_YML" >> profiles.yml
          fal run --profiles-dir .

      # Clean up: re-checking out the repository resets the working tree
      # and removes the generated profiles.yml
      - uses: actions/checkout@v2

In the example repository, this YAML is saved at .github/workflows/zendesk_data_workflow.yml.

First, the workflow prepares the environment; this happens in the initial three steps. After that, the dbt Cloud job is triggered using the dbt Cloud GitHub Action, which needs the DBT_CLOUD_API_TOKEN secret plus the account and job IDs copied earlier. Once the run is complete, the Python scripts are executed with the fal run command, using the remaining secrets from the previous section. Finally, the environment is tidied up of side effects.
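The "Install dependencies" step assumes a requirements.txt at the repository root. The exact contents and version pins live in the example repository, but a plausible minimal set for this workflow would be:

fal            # runs Python scripts on dbt models
dbt-bigquery   # dbt adapter for the BigQuery warehouse used in this example
slack_sdk      # used by the Slack notification script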

Run fal scripts

The fal script that sends Slack messages looks like this:

import os
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

CHANNEL_ID = os.getenv("SLACK_BOT_CHANNEL")
SLACK_TOKEN = os.getenv("SLACK_BOT_TOKEN")

client = WebClient(token=SLACK_TOKEN)
# `context` is a variable injected by fal with information about the current dbt model
message_text = f"Model: {context.current_model.name}. Status: {context.current_model.status}."
try:
    response = client.chat_postMessage(
        channel=CHANNEL_ID,
        text=message_text
    )
except SlackApiError as e:
    # Surface the Slack API error if the message could not be sent
    assert e.response["error"]
This example requires slack_sdk to be installed; we use its WebClient class to send messages to our Slack app.

fal provides a magic context variable that gives you access to dbt model information, such as the model name and run status. A message created using this variable is sent to the target Slack channel. See this blog post for more details on how to set up Slack messaging from fal.
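For fal run to pick this script up, the script has to be attached to a model through the meta section of the project's schema.yml. Here is a minimal sketch; the model name and script path are hypothetical, so adjust them to the example repository's layout:

models:
  - name: zendesk_ticket_data              # hypothetical model name
    meta:
      fal:
        scripts:
          - fal_scripts/send_slack_message.py   # hypothetical path to the script above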

Start a GitHub Action workflow

With the workflow YAML saved, committed and pushed, we can now run it. Since we’ve set the workflow trigger to workflow_dispatch, we can start it manually from the repository page:

Github Actions Interface

On your repository page, click the Actions tab (1), select the desired workflow from the side menu (2), click “Run workflow” (3), and then, with the branch set to main, click the green “Run workflow” button (4). A new workflow run will be triggered and will appear in the workflow list (instead of “This workflow has no runs yet.” in the picture above). Once the workflow is complete, a green check mark ✅ will appear next to it and you should receive a Slack message in your desired channel.

Scheduled workflows

If we want to schedule our workflow to run every day at midnight, GitHub Actions lets us do that by modifying the workflow YAML file, specifically the on section:

name: Zendesk Ticket Data Workflow
on:
  workflow_dispatch:
  schedule:
    - cron: '0 0 * * *'

The schedule node allows us to set a cron-style schedule. In this example, the workflow is scheduled to run at midnight every day.

We might also want to run our workflow each time the default branch changes, for example when a commit is pushed or another branch is merged. To do this, we can add another trigger to the on section:

name: Zendesk Ticket Data Workflow
on:
  workflow_dispatch:
  schedule:
    - cron: '0 0 * * *'
  push:
    branches:
      - main

where main is the default branch name. For more information on triggering action workflows see here.

Summary

By using dbt Cloud, fal and GitHub Actions together, we were able to set up a simple workflow that runs dbt jobs in dbt Cloud and subsequently runs fal scripts on dbt models. GitHub Actions gives us the freedom to schedule and trigger the workflow as we wish, while dbt Cloud provides a familiar environment for Analytics Engineers to model data and set up jobs. fal lets you run any Python script on top of your dbt models, for example forecasts or sentiment analysis on text. Check out our repository to learn more about fal and see other examples. Join our Discord server to request features, give us feedback and resolve any technical issues.