Skip to main content

Core Models

info

This document provides a comprehensive reference for the core data models in the OSO data lake. Models with a v1 suffix are intended to be stable and not changed without a major version release. Models with a v0 suffix are still in development and may be changed or removed without warning.

Overview

OSO's data warehouse is built around several core models that represent different aspects of open source projects. These models are designed to be flexible and extensible, allowing for the addition of new data sources and metrics over time.

The core models can be categorized into three main groups:

  1. Directory Models: Define the structure and relationships between projects, collections, and artifacts
  2. Metrics Models: Define the metrics that can be calculated for projects
  3. Timeseries Models: Store historical data about metrics over time

Key Fields and ID Generation

Key Fields

The following fields have the most semantic meaning across the OSO data models:

FieldDescriptionExample Values
collection_sourceSource of the collection data"OSS_DIRECTORY", "OP_ATLAS"
collection_namespaceNamespace of the collection"oso", "retro-funding"
collection_nameName of the collection"ethereum-crypto-ecosystems", "optimism", "gitcoin-oss"
project_sourceSource of the project data"OSS_DIRECTORY", "OP_ATLAS"
project_namespaceNamespace of the project, where applicable"oso", "" (empty string)
project_nameName of the project"safe-global", "zora", "stackup"
artifact_sourceSource of the artifact data"BASE", "GITHUB", "DEFILLAMA", "NPM", "WWW"
artifact_namespaceNamespace of the artifact, where applicable"" (empty string), "code-423n4", "opencollective", "mozilla"
artifact_nameName of the artifact"go-ethereum", "0x76fd297e2d437cd7f76d50f01afe6160f86e9990", "@uniswap/v2-sdk", "uniswap-v3"
metric_sourceSource of the metric data"OSO"
metric_namespaceNamespace of the metric"oso"
metric_nameName of the metric"GITHUB_opened_pull_requests_monthly", "SONEIUM_transactions_weekly", "BASE_gas_fees_over_all_time", "OPEN_COLLECTIVE_funding_received_daily"

ID Generation

In OSO, IDs are generated by hashing the combination of SOURCE, namespace, and name. This approach ensures that:

  1. IDs are consistent and deterministic
  2. The same entity will always have the same ID
  3. Different entities will have different IDs

For example, a project ID might be generated by hashing:

@oso_entity_id("OSS_DIRECTORY", "oso", "safe-global")

This pattern is used for all entity IDs in the system, including:

  • collection_id
  • project_id
  • artifact_id
  • metric_id
important

Note that the OSO-generated *_id is not the same as the *_source_id. The source ID is the ID of the entity in the source system (e.g., GitHub, Defillama, etc.). The OSO ID is a unique identifier for the entity in the OSO data lake only. All OSO queries should use the OSO ID rather than the source ID.

Directory Models

collections_v1

The collections_v1 table represents collections of projects in the OSO data warehouse. Collections are used to group related projects together, such as projects that are part of the same ecosystem or organization.

FieldTypeDescription
collection_idTEXTUnique identifier for the collection
collection_sourceTEXTSource of the collection data (e.g., "OSS_DIRECTORY", "OP_ATLAS")
collection_namespaceTEXTNamespace of the collection (e.g., "oso", "retro-funding")
collection_nameTEXTName of the collection (e.g., "ethereum-crypto-ecosystems", "optimism", "gitcoin-oss")
display_nameTEXTHuman-readable name for the collection
descriptionTEXTDescription of the collection

projects_v1

The projects_v1 table represents individual projects in the OSO data warehouse. Projects are the primary entities that metrics are calculated for.

FieldTypeDescription
project_idTEXTUnique identifier for the project
project_sourceTEXTSource of the project data (e.g., "OSS_DIRECTORY", "OP_ATLAS")
project_namespaceTEXTNamespace of the project (e.g., "oso", "")
project_nameTEXTName of the project (e.g., "safe-global", "zora", "stackup")
display_nameTEXTHuman-readable name for the project
descriptionTEXTDescription of the project

artifacts_by_project_v1

The artifacts_by_project_v1 table represents the relationship between projects and their artifacts. Artifacts are specific instances of a project, such as a GitHub repository or an npm package.

FieldTypeDescription
artifact_idTEXTUnique identifier for the artifact
artifact_source_idTEXTID of the artifact in its source system
artifact_sourceTEXTSource of the artifact data (e.g., "BASE", "GITHUB", "DEFILLAMA", "NPM", "WWW")
artifact_namespaceTEXTNamespace of the artifact (e.g., "", "code-423n4", "opencollective", "mozilla")
artifact_nameTEXTName of the artifact (e.g., "go-ethereum", "0x76fd297e2d437cd7f76d50f01afe6160f86e9990", "@uniswap/v2-sdk", "uniswap-v3")
project_idTEXTID of the project this artifact belongs to
project_sourceTEXTSource of the project data
project_namespaceTEXTNamespace of the project
project_nameTEXTName of the project

projects_by_collection_v1

The projects_by_collection_v1 table represents the relationship between projects and collections. This allows for projects to be part of multiple collections.

FieldTypeDescription
project_idTEXTID of the project
project_sourceTEXTSource of the project data
project_namespaceTEXTNamespace of the project
project_nameTEXTName of the project
collection_idTEXTID of the collection
collection_sourceTEXTSource of the collection data
collection_namespaceTEXTNamespace of the collection
collection_nameTEXTName of the collection

Metrics Models

metrics_v0

The metrics_v0 table defines the metrics that can be calculated for projects. Each metric has a unique identifier and metadata about how it is calculated.

FieldTypeDescription
metric_idTEXTUnique identifier for the metric
metric_sourceTEXTSource of the metric data (e.g., "OSO")
metric_namespaceTEXTNamespace of the metric (e.g., "oso")
metric_nameTEXTName of the metric (e.g., "GITHUB_opened_pull_requests_monthly", "SONEIUM_transactions_weekly", "BASE_gas_fees_over_all_time", "OPEN_COLLECTIVE_funding_received_daily")
display_nameTEXTHuman-readable name for the metric
descriptionTEXTDescription of the metric
rendered_sqlARRAYSQL queries used to calculate the metric
sql_source_pathTEXTPath to the SQL file that defines the metric
aggregation_functionTEXTFunction used to aggregate the metric (e.g., "SUM", "AVG")

Timeseries Models

key_metrics_by_project_v0

The key_metrics_by_project_v0 table stores key metrics for projects at specific points in time. These metrics are typically calculated on a regular schedule (e.g., daily, weekly).

FieldTypeDescription
metric_idTEXTID of the metric
project_idTEXTID of the project
sample_dateDATEDate when the metric was calculated
amountDOUBLEValue of the metric
unitTEXTUnit of the metric (e.g., "stars", "commits")

timeseries_metrics_by_project_v0

The timeseries_metrics_by_project_v0 table stores historical data about metrics for projects over time. This allows for tracking how metrics change over time.

FieldTypeDescription
metric_idTEXTID of the metric
project_idTEXTID of the project
sample_dateDATEDate when the metric was calculated
amountDOUBLEValue of the metric
unitTEXTUnit of the metric (e.g., "stars", "commits")

Relationships Between Models

The core models are related to each other in the following ways:

  • Projects can have multiple artifacts (one-to-many relationship)
  • Projects can belong to multiple collections (many-to-many relationship)
  • Collections can contain multiple projects (one-to-many relationship)
  • Projects can have multiple metrics (one-to-many relationship)
  • Metrics can be calculated for multiple projects (one-to-many relationship)
  • Projects can have multiple timeseries data points (one-to-many relationship)
  • Metrics can have multiple timeseries data points (one-to-many relationship)

Examples

Finding Projects by Name

import os
import pandas as pd
from pyoso import Client

OSO_API_KEY = os.environ['OSO_API_KEY']
client = Client(api_key=OSO_API_KEY)

query = """
SELECT
project_id,
project_name,
display_name,
description
FROM projects_v1
WHERE lower(display_name) LIKE lower('%ethereum%')
"""
df = client.to_pandas(query)
print(df)

Finding Projects by Artifact

query = """
SELECT
project_id,
project_name,
artifact_namespace as github_owner,
artifact_name as github_repo
FROM artifacts_by_project_v1
WHERE
artifact_source = 'GITHUB'
AND artifact_namespace LIKE '%ethereum%'
"""
df = client.to_pandas(query)
print(df)

Getting Metrics for a Project

query = """
SELECT
tm.sample_date,
m.metric_name,
tm.amount,
tm.unit
FROM timeseries_metrics_by_project_v0 AS tm
JOIN metrics_v0 AS m
ON tm.metric_id = m.metric_id
JOIN projects_v1 AS p
ON tm.project_id = p.project_id
WHERE
p.project_name = 'ethereum'
AND m.metric_name IN (
'GITHUB_stars_daily',
'GITHUB_forks_daily',
'GITHUB_commits_daily',
'GITHUB_contributors_daily'
)
ORDER BY tm.sample_date DESC
LIMIT 10
"""
df = client.to_pandas(query)
print(df)

Finding Collections and Their Projects

query = """
SELECT
c.collection_id,
c.collection_name,
c.display_name,
p.project_id,
p.project_name,
p.display_name as project_display_name
FROM collections_v1 AS c
JOIN projects_by_collection_v1 AS pbc
ON c.collection_id = pbc.collection_id
JOIN projects_v1 AS p
ON pbc.project_id = p.project_id
WHERE
c.collection_name LIKE '%ethereum%'
ORDER BY c.collection_name, p.project_name
"""
df = client.to_pandas(query)
print(df)