# Getting Started with Dagster
Dagster is a modern data orchestrator that we use to declare all of our data assets in software. OSO uses it to schedule every data job, from data collectors to our transformation pipeline. This quickstart will help you set up a local Dagster development instance with a DuckDB backend, so you can follow along with the tutorials in the next sections.
If you want to see what Dagster looks like in production, check out https://dagster.opensource.observer, where admins can trigger runs.
## Setting up Dagster
First, clone the OSO monorepo and install the required dependencies using uv:

```bash
git clone git@github.com:opensource-observer/oso.git
cd oso/
uv sync
```
Next, create a directory to store Dagster state. It is safe to delete this directory at any time, so we typically put it in `/tmp`:

```bash
mkdir /tmp/dagster-home
```
Copy `.env.example` to `.env` and fill in the required environment variables:

```
DAGSTER_HOME=/tmp/dagster-home
```
Lastly, configure `dagster.yaml` to disable run concurrency. In our example, the file is located at `/tmp/dagster-home/dagster.yaml`:

```yaml
run_queue:
  max_concurrent_runs: 1
```
This is currently a limitation of our DuckDB integration. Please see this issue for more information.
## Running Dagster
Now that everything is set up, we can run the Dagster instance:

```bash
uv run dagster dev
```
On resource-constrained machines, you may need to run the development server in legacy mode. See this issue for more details:

```bash
uv run dagster dev --use-legacy-code-server-behavior
```
After a little bit of time, you should see the following message:

```
2024-09-10 22:35:31 +0200 - dagster.daemon - INFO - Instance is configured with the following daemons: ['AssetDaemon', 'BackfillDaemon', 'QueuedRunCoordinatorDaemon', 'SchedulerDaemon', 'SensorDaemon']
2024-09-10 22:35:31 +0200 - dagster-webserver - INFO - Serving dagster-webserver on http://127.0.0.1:3000 in process 1095
```
Head over to http://localhost:3000 to access Dagster's UI. Et voilà! You have successfully set up Dagster locally.
## Define a new Dagster asset
Now you're ready to create a new Dagster software-defined asset. Follow one of the guides below, then come back here to test your asset:
- 🗂️ BigQuery Public Datasets - Preferred and easiest route for sharing a dataset
- 🗄️ Database Replication - Provide access to your database for replication as an OSO dataset
- 📈 GraphQL API Crawler - Automatically crawl any GraphQL API
- 🌐 REST API Crawler - Automatically crawl any REST API
- 📁 Files into Google Cloud Storage (GCS) - Drop Parquet/CSV files in our GCS bucket for loading into OSO
- ⚙️ Custom Dagster Assets - Write a custom Dagster asset for unique data sources
- 📜 Static Files - Coordinate hand-off for high-quality data via static files. This path is predominantly used for grant funding data.
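Whichever route you choose, a software-defined asset is, at its core, just a decorated Python function that produces data. Here is a minimal, hypothetical sketch (the asset name and returned rows are invented for illustration, and the import fallback lets the function body run even where `dagster` is not installed):

```python
# Minimal sketch of a Dagster software-defined asset.
# NOTE: the asset name and the rows below are hypothetical, for illustration only.
try:
    from dagster import asset
except ImportError:
    # Fallback no-op decorator so this sketch still runs without dagster installed.
    def asset(fn):
        return fn


@asset
def example_repo_metadata():
    """Return a small, static set of rows (a real asset would call an API)."""
    return [
        {"repo": "opensource-observer/oso", "language": "Python"},
    ]
```

Dagster discovers decorated assets loaded by the code location; in the OSO repo, assets placed under `warehouse/oso_dagster/assets` appear in the asset list automatically.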
## Test your asset locally
Assets in `warehouse/oso_dagster/assets` should automatically show up in the Dagster asset list at http://localhost:3000/assets.
Click on "Materialize" to start the job. Here you'll be able to monitor the logs to debug any issues with the data fetching.
Unless your Dagster instance is configured with a Google account that has write access to OSO BigQuery datasets, you should expect an error message when the asset tries to write. Focus on debugging any issues with fetching data. When you're ready, work with a core team member to test the asset in production.
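Beyond the UI, you can sanity-check an asset's fetching logic in plain Python: Dagster assets that take no context or resources support direct invocation, i.e. they can be called like ordinary functions. A sketch, using a hypothetical asset (the import fallback again keeps the example runnable without `dagster`):

```python
# Sketch: check an asset's logic by invoking it directly, no Dagster instance needed.
# The asset and its expected schema are hypothetical examples.
try:
    from dagster import asset
except ImportError:
    def asset(fn):  # no-op stand-in when dagster is unavailable
        return fn


@asset
def collected_events():
    # A real collector would fetch these rows from an external API.
    return [{"event": "commit", "count": 3}, {"event": "issue", "count": 1}]


def test_collected_events_schema():
    rows = collected_events()  # direct invocation returns the asset's value
    assert all({"event", "count"} <= row.keys() for row in rows)
```

This is a quick way to iterate on the data-fetching logic before materializing anything in the UI.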
## Add your asset to production
### Submit a pull request
When you are ready to deploy, submit a pull request of your changes to OSO. OSO maintainers will work with you to get the code in shape for merging. For more details on contributing to OSO, check out CONTRIBUTING.md.
### Verify deployment
Our Dagster deployment should automatically recognize the asset after merging your pull request to the main branch. You should be able to find your new asset in the global asset list.
If your asset is missing, check the Deployment tab for loading errors and the timestamp of the last code load. For example, a bug in your code will surface there as a code-location loading error.
### Run it!
If this is your first time adding an asset, we suggest reaching out to the OSO team over Discord to run deploys manually. You can monitor all Dagster runs here.
Dagster also provides automation to run jobs on a schedule (e.g. daily), after detecting a condition using a Python-defined sensor (e.g. when a file appears in GCS), and using auto-materialization policies.
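For instance, a daily schedule can be attached to an asset job in a few lines. This is a hedged sketch with invented names (the asset, the `daily_refresh` job, and the cron string are all hypothetical, not an OSO production job); the try/except simply keeps the snippet importable where `dagster` is absent:

```python
# Sketch: run an asset job every day at 06:00 UTC via a ScheduleDefinition.
# All names and the cron string here are hypothetical examples.
try:
    from dagster import Definitions, ScheduleDefinition, asset, define_asset_job

    @asset
    def daily_asset():
        return "refreshed"

    daily_job = define_asset_job("daily_refresh", selection=[daily_asset])
    daily_schedule = ScheduleDefinition(job=daily_job, cron_schedule="0 6 * * *")

    # Registering the schedule in Definitions makes it toggleable in the UI.
    defs = Definitions(
        assets=[daily_asset], jobs=[daily_job], schedules=[daily_schedule]
    )
except ImportError:
    daily_schedule = None  # dagster not installed in this environment
```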
We welcome any automation that can reduce the operational burden in our continuous deployment. However, before merging any job automation, please reach out to the OSO devops team on Discord with an estimate of costs, especially if it involves large BigQuery scans. We will reject or disable any jobs that lead to increased infrastructure instability or unexpected costs.