Getting Started with Dagster
Dagster is a modern data orchestrator that we use to declare all of our data assets in software. OSO uses it to schedule all data jobs, from data collectors to our transformation pipeline.
This quickstart guide will help you set up a development Dagster instance locally, with a `duckdb` backend, so you can follow along with our tutorials in the next sections.
If you want to check out what Dagster looks like in production, head over to https://dagster.opensource.observer, where admins can trigger runs.
Setting up Dagster
First, we need to clone the OSO monorepo and install the required dependencies using uv:
```bash
git clone [email protected]:opensource-observer/oso.git
cd oso/
uv sync
```
Create a directory to store Dagster state. It is safe to delete this directory, so we typically store it in `/tmp`:

```bash
mkdir /tmp/dagster-home
```
Copy `.env.example` to `.env`, and fill it in with the required environment variables:

```bash
DAGSTER_HOME=/tmp/dagster-home
```
Lastly, we need to configure `dagster.yaml` to disable run concurrency. Following our example, this file is located at `/tmp/dagster-home/dagster.yaml`:

```yaml
run_queue:
  max_concurrent_runs: 1
```
This is currently a limitation of our `duckdb` integration. Please check out this issue for more information.
Running Dagster
Now that we have everything set up, we can run the Dagster instance. Call this command from the root of the repository:
```bash
uv run dagster dev
```
You may need to run the development server in legacy mode on resource-constrained machines. See this issue for more details:

```bash
uv run dagster dev --use-legacy-code-server-behavior
```
After a little bit of time, you should see the following message:
```
2024-09-10 22:35:31 +0200 - dagster.daemon - INFO - Instance is configured with the following daemons: ['AssetDaemon', 'BackfillDaemon', 'QueuedRunCoordinatorDaemon', 'SchedulerDaemon', 'SensorDaemon']
2024-09-10 22:35:31 +0200 - dagster-webserver - INFO - Serving dagster-webserver on http://127.0.0.1:3000 in process 1095
```
Head over to http://localhost:3000 to access Dagster's UI. Et voilà! You have successfully set up Dagster locally.
Define a new Dagster asset
Now you're ready to create a new Dagster software-defined asset. You can use one of the following guides and come back here to test your asset; a minimal sketch of a bare asset follows the list below.
- 🗂️ BigQuery Public Datasets - Preferred and easiest route for sharing a dataset
- 🗄️ Database Replication - Provide access to your database for replication as an OSO dataset
- 📈 GraphQL API Crawler - Automatically crawl any GraphQL API
- 🌐 REST API Crawler - Automatically crawl any REST API
- 📁 Files into Google Cloud Storage (GCS) - Drop Parquet/CSV files in our GCS bucket for loading into OSO
- ⚙️ Custom Dagster Assets - Write a custom Dagster asset for unique data sources
- 📜 Static Files - Coordinate hand-off for high-quality data via static files. This path is predominantly used for grant funding data.
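For orientation, here is a minimal sketch of what a bare software-defined asset looks like. The asset name and the returned data are hypothetical, purely for illustration; real OSO assets live under `warehouse/oso_dagster/assets/default/`, and the guides above cover the OSO-specific factories you will usually use instead.

```python
import dagster as dg


# A minimal software-defined asset. The name and the hard-coded data are
# hypothetical; a real collector would fetch data from an external source
# and load it into the warehouse.
@dg.asset
def example_project_list() -> list[dict]:
    """Returns a tiny, hard-coded list of projects."""
    return [
        {"name": "oso", "url": "https://github.com/opensource-observer/oso"},
    ]
```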
Test your asset locally
Assets in `warehouse/oso_dagster/assets/default` should automatically show up in the Dagster assets list at http://localhost:3000/assets.
Click on "Materialize" to start the job. From here you'll be able to monitor the logs and debug any issues with data fetching.
Unless your Dagster instance is configured with a Google account that has write access to OSO BigQuery datasets, some Dagster assets will fail to materialize. Focus on debugging any issues with fetching data. When you're ready, work with a core team member to test the asset in production.
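If you prefer iterating outside the UI, you can also materialize an asset in-process from a short Python script or test. This is a minimal sketch using the hypothetical asset from earlier; by default it runs on an ephemeral Dagster instance, so it won't touch the state in `DAGSTER_HOME`.

```python
import dagster as dg

# Hypothetical import path for the example asset defined above.
from my_assets import example_project_list

# Materialize the asset in-process and inspect the result.
result = dg.materialize([example_project_list])
assert result.success
print(result.output_for_node("example_project_list"))
```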
Add your asset to production
Submit a pull request
When you are ready to deploy, submit a pull request with your changes to OSO. OSO maintainers will work with you to get the code in shape for merging. For more details on contributing to OSO, check out CONTRIBUTING.md.
Verify deployment
Our Dagster deployment should automatically recognize the asset after your pull request is merged to the main branch. You should be able to find your new asset in the global asset list.
If your asset is missing, you can check for loading errors and the date of the last code load in the Deployment tab. For example, if your code has a bug that leads to a loading error, it will surface there as a failed code location.
Run it!
If this is your first time adding an asset, we suggest reaching out to the OSO team over Discord to run deploys manually. You can monitor all Dagster runs on the production instance at https://dagster.opensource.observer.
Dagster also provides automation to run jobs on a schedule (e.g. daily), after detecting a condition using a Python-defined sensor (e.g. when a file appears in GCS), or via auto-materialization policies.
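As an illustration, here is a minimal sketch of a daily schedule for an asset job. All names are hypothetical, and any real schedule should go through the cost review described below.

```python
import dagster as dg

# Hypothetical job that materializes the example asset from earlier.
example_job = dg.define_asset_job(
    name="example_project_list_job",
    selection=["example_project_list"],
)

# Run the job every day at 06:00 UTC (cron syntax).
example_schedule = dg.ScheduleDefinition(
    job=example_job,
    cron_schedule="0 6 * * *",
)
```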
We welcome any automation that can reduce the operational burden in our continuous deployment. However, before merging any job automation, please reach out to the OSO devops team on Discord with an estimate of costs, especially if it involves large BigQuery scans. We will reject or disable any jobs that lead to increased infrastructure instability or unexpected costs.
Advanced Topics
Asset Code Locations
In order to improve the usability of our Dagster setup locally, we've split the organization of Dagster assets into multiple "code locations." A code location is a Dagster abstraction that allows you to keep different asset definitions in entirely different code paths. The code location for an asset is determined by the parent folder of the asset within `warehouse/oso_dagster/assets`, which corresponds to a definitions file in `warehouse/oso_dagster/definitions/`. Most new assets should be added to `warehouse/oso_dagster/assets/default/`.
At this time, production still uses a single code location due to an issue with how Dagster handles scheduling and jobs. See this issue for more details.
Available Code Locations
At this moment, the available code locations can be found in the repository at `warehouse/oso_dagster/definitions/`:

- `sqlmesh`: The code location for any assets related to sqlmesh. This is essentially anything that depends on the `dagster-sqlmesh` package.
- `default`: The default code location for all remaining assets that do not fit into the other categories.
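For reference, a definitions file for a code location is essentially a `dagster.Definitions` object collecting that location's assets. The sketch below is an assumption about the shape of such a module, not the actual OSO file, which also wires up resources, jobs, and schedules:

```python
# Hypothetical shape of a definitions module such as
# warehouse/oso_dagster/definitions/default.py. The real OSO modules
# include additional resources, jobs, and schedules.
import dagster as dg

import oso_dagster.assets.default as default_assets

defs = dg.Definitions(
    assets=dg.load_assets_from_package_module(default_assets),
)
```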
Running different code locations
Running a different code location is nearly identical to running the `default` code location; you just need to specify the code location's Python module path when running your Dagster jobs. For example, to run the `sqlmesh` code location:

```bash
uv run dagster dev -m oso_dagster.definitions.sqlmesh
```
Notice that the code location's module path is specified after `-m`. It is useful for newcomers to note that the `warehouse/` path in the repository is not considered a Python module: it does not contain an `__init__.py` file and does not appear as a Python module in the root `pyproject.toml`.
Running Dagster with sqlmesh locally
This is mostly for the OSO team, as most people should not need to run sqlmesh through the Dagster UI locally. For anyone looking to add models, it should be enough to run sqlmesh on its own. The only reason to run sqlmesh under Dagster locally is to ensure that the dagster-sqlmesh integration is working as expected with our particular pipeline.
Some environment variables need to be set in your `.env`:

```bash
# While not strictly necessary, you likely want the sqlmesh dagster asset
# caching enabled so restarting doesn't take so long.
DAGSTER_ASSET_CACHE_ENABLED=1
DAGSTER_ASSET_CACHE_DIR=/path/to/some/cache/dir # change this

# You can set this number to anything reasonable for your testing use case
DAGSTER_ASSET_CACHE_DEFAULT_TTL_SECONDS=3600

# `local` uses duckdb
# `local-trino` uses a locally deployed trino
# Suggestion is to use `local` as it's faster. This doc assumes duckdb.
DAGSTER_SQLMESH_GATEWAY=local

SQLMESH_TESTING_ENABLED=1
OSO_ENABLE_JSON_LOGS=0
```
Then run the sqlmesh local test setup to initialize your local sqlmesh duckdb with OSO seed data:

```bash
uv run oso local sqlmesh-test --duckdb
```
Now it should be possible to run sqlmesh and Dagster locally. When materializing sqlmesh assets, Dagster might complain about out-of-date dependencies; since we ran the local test setup, the data those assets depend on should already have been added by the `oso local` seed setup.