Dagster Playbook
This guide outlines common operations for working with Dagster in the OSO project.
SQLMesh Integration
Our main data models are materialized using SQLMesh. In most cases, you can trigger the sqlmesh_all_assets
job with its default configuration to update the models.
Restatements
DO NOT RESTATE SQLMesh models without approval!
If you need to run a restatement, you will need to edit the configuration of the sqlmesh_all_assets
job. Select the dropdown menu next to the Materialize all button and click Open launchpad.
There are two ways to specify which models to restate:
- By Entity Category: Set
restate_by_entity_category: true
and specify a list of categories to restate. You can assign categories to models using theentity_category=category_name
tag. - By Model Name: Provide a list of model names under the
restate_models
configuration. Remember to prefix the model name withoso.
, for example:oso.int_events__superchain_internal_transactions
. When using this method, all SQLMesh model selection features can be used.
Dagster jobs have a default of three retry attempts. However, retries use the same configuration. If a job fails mid-process, cancel the retry and trigger a new run with the correct configuration to avoid restating models multiple times.
An example configuration:
ops:
sqlmesh_project:
config:
restate_by_entity_category: false
restate_models:
- oso.stg_github__XYZ
skip_tests: false
use_dev_environment: false
This will restate the stg_github__XYZ
staging models and all downstream SQLMesh models in the warehouse.
Branching with Tags
We use Nessie's branching feature to ensure data consumers always have access to stable data. We maintain a consumer
tag that points to a stable version of the data for public API consumers, while the main
branch is actively updated.
Our producer, Trino, has two catalogs:
iceberg
: Points to themain
branch.iceberg_consumer
: Points to theconsumer
tag.
After a successful run of the sqlmesh_all_assets
job and data verification, run the nessie_consumer_tag_job
to update the consumer
tag to the latest main
commit. You can also specify a particular hash in the to_hash
configuration to move the tag to a specific commit.
Asset Development Workflow
When creating new Dagster assets, it's important to also write a seed file before integrating it into SQLMesh.
The workflow is as follows:
-
Write the Asset:
- Follow the cursor rules for creating new assets.
- Keep column names consistent with the original source.
- Perform minimal normalization and unnesting.
-
Run Dagster Locally:
- Confirm that you can materialize the source correctly.
-
Submit and Merge a PR:
- Submit a pull request with your changes and merge it into production.
-
Materialize in Production:
- Materialize the asset in the production Dagster environment.
-
Verify Data:
- Sample the data in BigQuery to confirm it's correct.
-
Create Seed File and Staging Model:
- Follow the cursor rules for creating seed files and staging models.
- Use a sample of 5-10 rows of real data from BigQuery that cover multiple cases.
- If there are date fields, set them to
datetime.now()
. - Test locally with SQLMesh until there are no errors.
-
Submit and Merge a PR:
- Submit a pull request with the seed file and staging model and merge it into production.