Run SQL Queries

As part of our open source, open data, open infrastructure initiative, we are making OSO data as widely available as possible. Use this guide to download the latest data for your own data stack.

Please refer to the getting started guide to first get your BigQuery account set up.

Subscribe to an OSO dataset

First, subscribe to an OSO dataset from within your own Google Cloud account. You can see all of our available datasets in the Data Overview.

We recommend starting with the OSO production data pipeline here:

Subscribe on BigQuery

After subscribing, you can reference the dataset within your GCP project namespace, for example: YOUR_PROJECT_NAME.oso_production
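Once the subscription is live, a minimal sanity check is to preview a few rows. The query below is a sketch: substitute any model from the Data Overview for the table name (int_events_daily_to_project is one of the intermediate models described later in this guide).

```sql
-- Sketch: confirm the subscribed dataset is queryable.
-- Replace YOUR_PROJECT_NAME with your GCP project ID.
SELECT *
FROM `YOUR_PROJECT_NAME.oso_production.int_events_daily_to_project`
LIMIT 10;
```

Note that LIMIT does not reduce the bytes scanned in BigQuery; for a truly free look at a table's contents, use the table preview in the console.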

Cost Estimation

BigQuery on-demand pricing charges based on the number of bytes scanned, with the first 1 TB free every month.

To keep track of your usage, check the bytes-scanned estimate shown in the top right corner of the query editor before running your query.

(Screenshot: the cost estimate shown in the BigQuery query editor.)
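You can also estimate costs before writing any query by checking how large each table is. The following sketch, assuming the dataset is mounted at YOUR_PROJECT_NAME.oso_production, uses BigQuery's __TABLES__ metadata view to list table sizes:

```sql
-- Sketch: list every table in the subscribed dataset with its row count
-- and storage size, to gauge the cost of a full scan before querying.
SELECT
  table_id,
  row_count,
  ROUND(size_bytes / POW(10, 9), 2) AS size_gb
FROM `YOUR_PROJECT_NAME.oso_production.__TABLES__`
ORDER BY size_bytes DESC;
```

Because this reads only metadata, it scans a negligible number of bytes itself.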

Exploring the data

Every model in the OSO data pipeline is materialized and available for you to query. All model definitions can be found under warehouse/dbt/models/ in our monorepo.

We also maintain reference documentation at https://models.opensource.observer/, which includes a model lineage graph to help you understand the schema of any model and form your queries.

Generally speaking, there are three types of models:

  1. Staging models and source data: For each data source, staging models are created to clean and normalize the necessary subset of data.
  2. Intermediate models: Here, we join all data sources into a master event table, int_events. Then, we produce a series of aggregations, such as int_events_daily_to_project (see the sketch after this list).
  3. Mart models: From the intermediate models, we create the final metrics models that are served from the API.
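As a sketch, here is what a query against one of those intermediate aggregations might look like. The column names (bucket_day, project_id, event_type, amount) are assumptions for illustration; verify them against the model's schema in the reference documentation.

```sql
-- Sketch: daily event totals for one project, grouped by event type.
-- Column names are illustrative -- check https://models.opensource.observer/
-- for the actual schema of int_events_daily_to_project.
SELECT
  bucket_day,
  event_type,
  SUM(amount) AS total_amount
FROM `YOUR_PROJECT_NAME.oso_production.int_events_daily_to_project`
WHERE project_id = 'YOUR_PROJECT_ID'
GROUP BY bucket_day, event_type
ORDER BY bucket_day;
```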

Cost optimization

Downstream models are typically much smaller than upstream models such as source data; each stage of the pipeline generally reduces the size of the data by 1-2 orders of magnitude. So, we recommend using the model that is furthest downstream in the lineage graph and can still satisfy your query.
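For example, if you only need daily event totals per project, querying the pre-aggregated intermediate model scans far fewer bytes than re-aggregating the raw event table yourself. Both queries below are sketches with assumed column names:

```sql
-- More expensive: re-aggregating the master event table from scratch.
-- SELECT to_project_id, DATE(`time`) AS day, SUM(amount) AS total
-- FROM `YOUR_PROJECT_NAME.oso_production.int_events`
-- GROUP BY to_project_id, day;

-- Cheaper: the pipeline has already materialized the same daily rollup.
SELECT project_id, bucket_day, SUM(amount) AS total
FROM `YOUR_PROJECT_NAME.oso_production.int_events_daily_to_project`
GROUP BY project_id, bucket_day;
```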

If there is an intermediate model addition (such as a new event type or aggregation) that you think could help save costs for others in the future, please consider contributing to our data models.