dbt Setup

warning

We are in the process of deprecating our dbt setup in favor of sqlmesh. We have left the dbt setup in place in case there are still models we need to run manually.

We use dbt to run a BigQuery-based data pipeline. You can view every dbt model on OSO here: https://models.opensource.observer.

This guide walks you through setting up dbt (Data Build Tool) for OSO development.

Prerequisites

  • Python >=3.11 (see here)
  • Python uv >= 0.6 (see here)
  • git (see here)
  • A GitHub account
  • BigQuery access
  • gcloud CLI (see here)
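
If you are not sure whether these are already installed, a quick sanity check from your shell (version numbers on your machine may differ) is:

python3 --version   # should report 3.11 or newer
uv --version        # should report 0.6 or newer
git --version
gcloud --version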

Setup dependencies

  1. Activate the virtual environment:

    source .venv/bin/activate

  2. Verify dbt is installed:

    which dbt

  3. Authenticate with gcloud:

    gcloud auth application-default login

  4. Run the setup wizard:

    uv sync && uv run oso_lets_go
tip

The script is idempotent, so you can safely run it again if you encounter any issues. The wizard will create a GCP project and BigQuery dataset if needed, copy a subset of OSO data for development, and configure your dbt profile.
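
If you want to see exactly what the wizard configured, the generated dbt profile is a plain YAML file you can inspect directly:

cat ~/.dbt/profiles.yml

It should look similar to the profile shown in the next section.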

Configuration

dbt Profile Setup

Create or edit ~/.dbt/profiles.yml:

opensource_observer:
  outputs:
    production:
      type: bigquery
      dataset: oso
      job_execution_timeout_seconds: 300
      job_retries: 1
      location: US
      method: oauth
      project: opensource-observer
      threads: 32
    playground:
      type: bigquery
      dataset: oso_playground
      job_execution_timeout_seconds: 300
      job_retries: 1
      location: US
      method: oauth
      project: opensource-observer
      threads: 32
  target: playground
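
With the profile in place, dbt debug is a quick way to confirm that dbt can find your profile and authenticate against BigQuery before you run any models (add --target production to check the production output as well):

dbt debug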

VS Code Setup

  1. Install the Power User for dbt core extension

  2. Get your virtual environment path:

    echo 'import sys; print(sys.prefix)' | uv run -

  3. In VS Code:
    • Open the command palette
    • Select "Python: Select Interpreter"
    • Choose "Enter interpreter path..."
    • Enter the virtual environment path from the previous step

Running dbt

Basic usage:

dbt run

Target a specific model:

dbt run --select {model_name}
tip

By default, this writes to the opensource-observer.oso_playground dataset.

You can view all of our models and their documentation at https://models.opensource.observer.
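
dbt's node selection syntax supports a few other patterns that are handy day to day; the model names below are placeholders:

dbt run --select staging            # all models under the staging directory
dbt run --select +your_model_name   # your_model_name plus everything upstream of it
dbt test --select your_model_name   # run the tests defined for a single model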

Model Organization

Our dbt models are located in warehouse/dbt/models/ and follow these conventions:

  • staging/ - Extract raw source data with minimal transformation (usually materialized as views)
  • intermediate/ - Transform staging data into useful representations
  • marts/ - Final aggregations and metrics (copied to the frontend database)
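
As a rough sketch, a new metric typically touches one file per layer; the subdirectory and file names below are illustrative rather than actual paths in the repository:

warehouse/dbt/models/
  staging/github/stg_github__events.sql
  intermediate/events/int_events.sql
  marts/metrics/your_new_metric.sql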

Development Workflow

  1. Write Your Model

    • Add SQL files to the appropriate directory in warehouse/dbt/models/
    • Use existing models as references
  2. Test Locally

    dbt run --select your_model_name

    By default, this will run against the oso_playground dataset. If you want to run against the production dataset, you can use the --target flag:

    dbt run --select your_model_name --target production

    You can also run all models downstream of a given model by appending a + to the model name:

    dbt run --select your_model_name+
  3. Using BigQuery UI

    dbt compile --select your_model_name

    Find the compiled SQL in the target/ directory and run it in the BigQuery Console

    Alternatively, on macOS you can use the pbcopy command to copy the SQL to your clipboard:

    dbt compile --select your_model_name | pbcopy
  4. Submit PR

FAQ

Defining a dbt source

In order to make a new dataset available to the data pipeline, you need to add it as a dbt source. In this example, we create a source in oso/warehouse/dbt/models/ (see source) for the Ethereum mainnet public dataset.

sources:
  - name: ethereum
    database: bigquery-public-data
    schema: crypto_ethereum
    tables:
      - name: transactions
        identifier: transactions
      - name: traces
        identifier: traces

We can then reference these tables in downstream models with the source macro:

select
  block_timestamp,
  `hash` as transaction_hash,
  from_address,
  receipt_contract_address
from {{ source("ethereum", "transactions") }}

Referring to Data Sources in Intermediate and Marts Models

All intermediate and mart models should refer to data sources using the ref() macro.

For example, to refer to staged data from the github_archive source, you can use:

{{ ref('stg_github__events') }}

And to refer to data from the int_events intermediate model, you can use:

{{ ref('int_events') }}
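
Putting this together, a minimal intermediate model built on the staged GitHub events might look like the following; the column names are illustrative and may not match the actual staging schema:

select
  event_type,          -- illustrative column name
  count(*) as event_count
from {{ ref('stg_github__events') }}
group by event_type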

Referring to Data Sources in Staging Models

Staging models can refer to data sources using the source() macro.

For example, to refer to raw data from the github_archive source, you can use:

{{ source('github_archive', 'events') }}

The oso_source macro

Use oso_source() instead of source() to help manage our playground dataset:

{{ oso_source('namespace', 'table_name') }}
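
For example, a staging model over the Ethereum source defined earlier would simply swap source() for oso_source(), keeping the same arguments:

select
  block_timestamp,
  `hash` as transaction_hash,
  from_address,
  receipt_contract_address
from {{ oso_source('ethereum', 'transactions') }}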

Working with IDs

Use the oso_id() macro to generate consistent identifiers across our data models. This macro creates a URL-safe base64 encoding of a SHA256 hash of the concatenated input parameters.

For example, to generate an artifact ID:

{{ oso_id("artifact_source", "artifact_source_id") }}
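
In a model, this usually sits alongside the columns being hashed; the upstream model name below is a placeholder, not a real model:

select
  {{ oso_id("artifact_source", "artifact_source_id") }} as artifact_id,
  artifact_source,
  artifact_source_id
from {{ ref('stg_example__artifacts') }}  -- placeholder upstream model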

Reference Documentation