Run SQL Queries
As part of our open source, open data, open infrastructure initiative, we are making OSO data as widely available as possible. Use this guide to get the latest data into your own data stack.
Please refer to the getting started guide first to get your BigQuery account set up.
Subscribe to an OSO dataset
First, subscribe to an OSO dataset in your own Google Cloud account. You can see all of our available datasets in the Data Overview.
We recommend starting with the OSO production data pipeline here:
After subscribing, you can reference the dataset within your GCP project namespace, for example:
YOUR_PROJECT_NAME.oso_production
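To verify that the subscription worked, you can run a small query against the dataset. Here is a minimal sketch, assuming the dataset exposes a projects_v1 mart model (check the Data Overview for the exact model names available to you):

```sql
-- Smoke-test query against the subscribed dataset.
-- Replace YOUR_PROJECT_NAME with your GCP project ID.
-- projects_v1 and its columns are assumed names; verify them
-- in the Data Overview before running.
SELECT
  project_id,
  project_name
FROM `YOUR_PROJECT_NAME.oso_production.projects_v1`
LIMIT 10;
```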
Cost Estimation
BigQuery on-demand pricing charges based on the number of bytes scanned, with the first 1 TB free every month.
To keep track of your usage, check the estimated bytes scanned, shown in the top-right corner of the BigQuery query editor, before running your query.
BigQuery costs can rack up quickly if you are not careful about optimizing your queries. For more on how to optimize your queries, check out these guides:
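One easy win: because BigQuery's storage is columnar and you are billed for every column you scan, name only the columns you need instead of using SELECT *. A sketch, reusing the hypothetical projects_v1 model from above:

```sql
-- Each referenced column adds to the bytes scanned,
-- so prune columns rather than selecting everything.
SELECT
  project_id,    -- only these two columns are scanned...
  project_name
FROM `YOUR_PROJECT_NAME.oso_production.projects_v1`;

-- ...whereas this scans every column in the table:
-- SELECT * FROM `YOUR_PROJECT_NAME.oso_production.projects_v1`;
```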
Exploring the data
The OSO data pipeline is fully open: every model in the pipeline can be queried directly.
All model definitions can be found under warehouse/dbt/models/ in our monorepo.
We also maintain reference documentation at https://models.opensource.observer/, which includes a model lineage graph and the schema of every model to help you form your queries.
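You can also inspect a model's schema without scanning any table data by querying BigQuery's standard INFORMATION_SCHEMA views; for example, for the int_events model described below:

```sql
-- List the columns and types of a model before writing queries
-- against it. INFORMATION_SCHEMA queries read metadata, not table data.
SELECT
  column_name,
  data_type
FROM `YOUR_PROJECT_NAME.oso_production.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'int_events'
ORDER BY ordinal_position;
```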
Generally speaking, there are three types of models:
- Staging models and source data: For each data source, staging models are created to clean and normalize the necessary subset of data.
- Intermediate models: Here, we join all data sources into a master event table, int_events. Then, we produce a series of aggregations, such as int_events_daily_to_project.
- Mart models: From the intermediate models, we create the final metrics models that are served from the API.
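For example, you can pull project-level metrics straight from a mart model. This is a sketch only: code_metrics_by_project_v1 and its star_count column are assumed names, so verify the actual marts and their schemas in the reference documentation:

```sql
-- Query a final mart model directly. The model and column names
-- here are assumptions; verify them at models.opensource.observer.
SELECT
  project_name,
  star_count
FROM `YOUR_PROJECT_NAME.oso_production.code_metrics_by_project_v1`
ORDER BY star_count DESC
LIMIT 20;
```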
Cost optimization
Downstream models are typically much smaller than upstream models, like raw source data; each stage of the pipeline generally reduces the size of the data by 1-2 orders of magnitude. So, we recommend using the model furthest downstream in the lineage graph that can satisfy your query.
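To make this concrete, here is the same per-project rollup written two ways. The first form scans the full raw event table; the second reads a much smaller downstream daily aggregate. Column names (amount, project_id) are illustrative assumptions:

```sql
-- Expensive: aggregates the raw event table (far upstream, large).
-- SELECT project_id, COUNT(*) AS total_events
-- FROM `YOUR_PROJECT_NAME.oso_production.int_events`
-- GROUP BY project_id;

-- Cheaper: the same rollup from a downstream daily aggregate,
-- typically 1-2 orders of magnitude smaller.
SELECT
  project_id,
  SUM(amount) AS total_events
FROM `YOUR_PROJECT_NAME.oso_production.int_events_daily_to_project`
GROUP BY project_id;
```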
If there is an intermediate model addition (such as a new event type or aggregation) that you think could help save costs for others in the future, please consider contributing to our data models.