Technical Architecture

OSO's goal is to make it simple to contribute by providing an automatically deployed data pipeline so that the community can build this open data warehouse together. All of the code for this architecture is available to view/copy/redeploy from the OSO Monorepo.

Diagram

The following diagram illustrates Open Source Observer's technical architecture.

OSO Architecture Diagram

Major Components

The architecture has the following major components.

Data Warehouse

Currently, all data is stored and processed in Google BigQuery. All of the collected data and aggregated views used by OSO are also made publicly available there (unless they are already public datasets on BigQuery). Anyone can view, query, or build on any stage in the pipeline. In the future, we plan to explore a decentralized lakehouse.

Data Orchestration

Dagster is the central orchestration system. It manages the entire pipeline, from data ingestion, through the dbt pipeline, to copying marts to the data serving infrastructure.
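
As a rough illustration of the ordering an orchestrator has to enforce, the three stages named above can be modeled as a small dependency graph. This sketch uses Python's standard-library `graphlib` rather than Dagster's own API, and the stage names are hypothetical labels, not identifiers from the OSO monorepo.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names; each key maps to the set of stages it depends on.
PIPELINE = {
    "ingest_sources": set(),
    "run_dbt_models": {"ingest_sources"},
    "copy_marts_to_serving": {"run_dbt_models"},
}


def execution_order(graph):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for stage in execution_order(PIPELINE):
        print(stage)
```

A real orchestrator adds scheduling, retries, and observability on top of this ordering; the sketch only shows why marts cannot be copied before the dbt models have run.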

API

External developers can use the API to integrate insights from OSO. Rate limits or cost-sharing subscriptions may apply to its usage, depending on the systems used. The API also powers the OSO website.

Website

This is the OSO website at https://www.opensource.observer. It provides an easy-to-use public view into the data.

Dependent Technologies

Our infrastructure is based on many wonderful existing tools. Our major dependencies are:

  • Google BigQuery
    • As explained above, all of the data that OSO collects and materializes lives in public datasets in BigQuery.
  • Dagster
    • Dagster orchestrates all data jobs, including collecting data from external sources and moving data through the main data pipeline.
  • dbt
    • dbt handles the data transformations that turn collected data into useful materializations for the OSO API and website.
  • OLAP database
    • All dbt mart models are copied to an OLAP database for real-time queries. This database powers the OSO API, which in turn powers the OSO website.

Indexing Pipeline

OSO maintains an ETL data pipeline that is continuously deployed from our monorepo and regularly indexes all available event data about projects in the oss-directory.

  • Extract: raw event data from a variety of public data sources (e.g., GitHub, blockchains, npm, Open Collective)
  • Transform: the raw data into impact metrics and impact vectors per project (e.g., number of active developers)
  • Load: the results into various OSO data products (e.g., our API, website, widgets)
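
The three steps above can be sketched as a toy pipeline. This is a minimal illustration, not OSO's implementation: the event fields, the project names, and the definition of an "active developer" as a distinct commit author are all assumptions made for the example, and an in-memory dict stands in for a real data product.

```python
from collections import defaultdict

# Hypothetical raw events; real events would come from public data sources.
RAW_EVENTS = [
    {"project": "project-a", "source": "github", "actor": "alice", "type": "commit"},
    {"project": "project-a", "source": "github", "actor": "bob", "type": "commit"},
    {"project": "project-a", "source": "github", "actor": "alice", "type": "issue"},
    {"project": "project-b", "source": "npm", "actor": "carol", "type": "download"},
    {"project": "project-b", "source": "github", "actor": "dave", "type": "commit"},
]


def extract(events):
    """Extract: yield raw events (here, from an in-memory list)."""
    yield from events


def transform(events):
    """Transform: count distinct commit authors per project (assumed metric)."""
    developers = defaultdict(set)
    for event in events:
        if event["type"] == "commit":
            developers[event["project"]].add(event["actor"])
    return {project: len(actors) for project, actors in developers.items()}


def load(metrics, destination):
    """Load: write metrics into a destination dict standing in for a data product."""
    destination.update(metrics)
    return destination


if __name__ == "__main__":
    store = {}
    load(transform(extract(RAW_EVENTS)), store)
    print(store)
```

Keeping each step as a separate function mirrors the way the stages can be inspected or rebuilt independently.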

Open Architecture for Open Source Data

The architecture is designed to be fully open to open source collaboration. With contributions and guidance from the community, we want Open Source Observer to evolve as we better understand what impact looks like in different domains.

All code is open source in our monorepo. All data, including every stage in our pipeline, is publicly available on BigQuery. All data orchestration is visible in our public Dagster dashboard.

You can read more about our open philosophy on our blog.