Core Models
This document provides a comprehensive reference for the core data models in the OSO data lake. Models with a v1
suffix are intended to be stable and not changed without a major version release. Models with a v0
suffix are still in development and may be changed or removed without warning.
Overview
OSO's data warehouse is built around several core models that represent different aspects of open source projects. These models are designed to be flexible and extensible, allowing for the addition of new data sources and metrics over time.
The core models can be categorized into three main groups:
- Directory Models: Define the structure and relationships between projects, collections, and artifacts
- Metrics Models: Define the metrics that can be calculated for projects
- Timeseries Models: Store historical data about metrics over time
Key Fields and ID Generation
Key Fields
The following fields have the most semantic meaning across the OSO data models:
Field | Description | Example Values |
---|---|---|
collection_source | Source of the collection data | "OSS_DIRECTORY", "OP_ATLAS" |
collection_namespace | Namespace of the collection | "oso", "retro-funding" |
collection_name | Name of the collection | "ethereum-crypto-ecosystems", "optimism", "gitcoin-oss" |
project_source | Source of the project data | "OSS_DIRECTORY", "OP_ATLAS" |
project_namespace | Namespace of the project, where applicable | "oso", "" (empty string) |
project_name | Name of the project | "safe-global", "zora", "stackup" |
artifact_source | Source of the artifact data | "BASE", "GITHUB", "DEFILLAMA", "NPM", "WWW" |
artifact_namespace | Namespace of the artifact, where applicable | "" (empty string), "code-423n4", "opencollective", "mozilla" |
artifact_name | Name of the artifact | "go-ethereum", "0x76fd297e2d437cd7f76d50f01afe6160f86e9990", "@uniswap/v2-sdk", "uniswap-v3" |
metric_source | Source of the metric data | "OSO" |
metric_namespace | Namespace of the metric | "oso" |
metric_name | Name of the metric | "GITHUB_opened_pull_requests_monthly", "SONEIUM_transactions_weekly", "BASE_gas_fees_over_all_time", "OPEN_COLLECTIVE_funding_received_daily" |
ID Generation
In OSO, IDs are generated by hashing the combination of SOURCE, namespace, and name. This approach ensures that:
- IDs are consistent and deterministic
- The same entity will always have the same ID
- Different entities will have different IDs
For example, a project ID might be generated by hashing:
@oso_entity_id("OSS_DIRECTORY", "oso", "safe-global")
This pattern is used for all entity IDs in the system, including:
collection_id
project_id
artifact_id
metric_id
Note that the OSO-generated *_id
is not the same as the *_source_id
. The source ID is the ID of the entity in the source system (e.g., GitHub, Defillama, etc.). The OSO ID is a unique identifier for the entity in the OSO data lake only. All OSO queries should use the OSO ID rather than the source ID.
Directory Models
collections_v1
The collections_v1
table represents collections of projects in the OSO data warehouse. Collections are used to group related projects together, such as projects that are part of the same ecosystem or organization.
Field | Type | Description |
---|---|---|
collection_id | TEXT | Unique identifier for the collection |
collection_source | TEXT | Source of the collection data (e.g., "OSS_DIRECTORY", "OP_ATLAS") |
collection_namespace | TEXT | Namespace of the collection (e.g., "oso", "retro-funding") |
collection_name | TEXT | Name of the collection (e.g., "ethereum-crypto-ecosystems", "optimism", "gitcoin-oss") |
display_name | TEXT | Human-readable name for the collection |
description | TEXT | Description of the collection |
projects_v1
The projects_v1
table represents individual projects in the OSO data warehouse. Projects are the primary entities that metrics are calculated for.
Field | Type | Description |
---|---|---|
project_id | TEXT | Unique identifier for the project |
project_source | TEXT | Source of the project data (e.g., "OSS_DIRECTORY", "OP_ATLAS") |
project_namespace | TEXT | Namespace of the project (e.g., "oso", "") |
project_name | TEXT | Name of the project (e.g., "safe-global", "zora", "stackup") |
display_name | TEXT | Human-readable name for the project |
description | TEXT | Description of the project |
artifacts_by_project_v1
The artifacts_by_project_v1
table represents the relationship between projects and their artifacts. Artifacts are specific instances of a project, such as a GitHub repository or an npm package.
Field | Type | Description |
---|---|---|
artifact_id | TEXT | Unique identifier for the artifact |
artifact_source_id | TEXT | ID of the artifact in its source system |
artifact_source | TEXT | Source of the artifact data (e.g., "BASE", "GITHUB", "DEFILLAMA", "NPM", "WWW") |
artifact_namespace | TEXT | Namespace of the artifact (e.g., "", "code-423n4", "opencollective", "mozilla") |
artifact_name | TEXT | Name of the artifact (e.g., "go-ethereum", "0x76fd297e2d437cd7f76d50f01afe6160f86e9990", "@uniswap/v2-sdk", "uniswap-v3") |
project_id | TEXT | ID of the project this artifact belongs to |
project_source | TEXT | Source of the project data |
project_namespace | TEXT | Namespace of the project |
project_name | TEXT | Name of the project |
projects_by_collection_v1
The projects_by_collection_v1
table represents the relationship between projects and collections. This allows for projects to be part of multiple collections.
Field | Type | Description |
---|---|---|
project_id | TEXT | ID of the project |
project_source | TEXT | Source of the project data |
project_namespace | TEXT | Namespace of the project |
project_name | TEXT | Name of the project |
collection_id | TEXT | ID of the collection |
collection_source | TEXT | Source of the collection data |
collection_namespace | TEXT | Namespace of the collection |
collection_name | TEXT | Name of the collection |
Metrics Models
metrics_v0
The metrics_v0
table defines the metrics that can be calculated for projects. Each metric has a unique identifier and metadata about how it is calculated.
Field | Type | Description |
---|---|---|
metric_id | TEXT | Unique identifier for the metric |
metric_source | TEXT | Source of the metric data (e.g., "OSO") |
metric_namespace | TEXT | Namespace of the metric (e.g., "oso") |
metric_name | TEXT | Name of the metric (e.g., "GITHUB_opened_pull_requests_monthly", "SONEIUM_transactions_weekly", "BASE_gas_fees_over_all_time", "OPEN_COLLECTIVE_funding_received_daily") |
display_name | TEXT | Human-readable name for the metric |
description | TEXT | Description of the metric |
rendered_sql | ARRAY | SQL queries used to calculate the metric |
sql_source_path | TEXT | Path to the SQL file that defines the metric |
aggregation_function | TEXT | Function used to aggregate the metric (e.g., "SUM", "AVG") |
Timeseries Models
key_metrics_by_project_v0
The key_metrics_by_project_v0
table stores key metrics for projects at specific points in time. These metrics are typically calculated on a regular schedule (e.g., daily, weekly).
Field | Type | Description |
---|---|---|
metric_id | TEXT | ID of the metric |
project_id | TEXT | ID of the project |
sample_date | DATE | Date when the metric was calculated |
amount | DOUBLE | Value of the metric |
unit | TEXT | Unit of the metric (e.g., "stars", "commits") |
timeseries_metrics_by_project_v0
The timeseries_metrics_by_project_v0
table stores historical data about metrics for projects over time. This allows for tracking how metrics change over time.
Field | Type | Description |
---|---|---|
metric_id | TEXT | ID of the metric |
project_id | TEXT | ID of the project |
sample_date | DATE | Date when the metric was calculated |
amount | DOUBLE | Value of the metric |
unit | TEXT | Unit of the metric (e.g., "stars", "commits") |
Relationships Between Models
The core models are related to each other in the following ways:
- Projects can have multiple artifacts (one-to-many relationship)
- Projects can belong to multiple collections (many-to-many relationship)
- Collections can contain multiple projects (one-to-many relationship)
- Projects can have multiple metrics (one-to-many relationship)
- Metrics can be calculated for multiple projects (one-to-many relationship)
- Projects can have multiple timeseries data points (one-to-many relationship)
- Metrics can have multiple timeseries data points (one-to-many relationship)
Examples
Finding Projects by Name
- Python
- GraphQL
import os
import pandas as pd
from pyoso import Client
OSO_API_KEY = os.environ['OSO_API_KEY']
client = Client(api_key=OSO_API_KEY)
query = """
SELECT
project_id,
project_name,
display_name,
description
FROM projects_v1
WHERE lower(display_name) LIKE lower('%ethereum%')
"""
df = client.to_pandas(query)
print(df)
query FindProjects {
oso_projectsV1(where: { displayName: { _ilike: "%ethereum%" } }) {
projectId
projectName
displayName
description
}
}
Finding Projects by Artifact
- Python
- GraphQL
query = """
SELECT
project_id,
project_name,
artifact_namespace as github_owner,
artifact_name as github_repo
FROM artifacts_by_project_v1
WHERE
artifact_source = 'GITHUB'
AND artifact_namespace LIKE '%ethereum%'
"""
df = client.to_pandas(query)
print(df)
query FindProjectsByArtifact {
oso_artifactsByProjectV1(
where: {
artifactNamespace: { _ilike: "%ethereum%" }
artifactSource: { _eq: "GITHUB" }
}
) {
projectId
projectName
artifactNamespace
artifactName
}
}
Getting Metrics for a Project
- Python
- GraphQL
query = """
SELECT
tm.sample_date,
m.metric_name,
tm.amount,
tm.unit
FROM timeseries_metrics_by_project_v0 AS tm
JOIN metrics_v0 AS m
ON tm.metric_id = m.metric_id
JOIN projects_v1 AS p
ON tm.project_id = p.project_id
WHERE
p.project_name = 'ethereum'
AND m.metric_name IN (
'GITHUB_stars_daily',
'GITHUB_forks_daily',
'GITHUB_commits_daily',
'GITHUB_contributors_daily'
)
ORDER BY tm.sample_date DESC
LIMIT 10
"""
df = client.to_pandas(query)
print(df)
query GetProjectAndMetricIds($projectName: String!) {
oso_projectsV1(where: { projectName: { _eq: $projectName } }) {
projectId
}
oso_metricsV0(where: { metricName: { _in: [
"GITHUB_stars_daily",
"GITHUB_forks_daily",
"GITHUB_commits_daily",
"GITHUB_contributors_daily"
] } }) {
metricId
metricName
}
}
Variables:
{
"projectName": "ethereum"
}
And then get the timeseries data:
query GetProjectTimeseriesMetrics($projectId: String!, $metricIds: [String!]!) {
oso_timeseriesMetricsByProjectV0(
where: {
projectId: { _eq: $projectId }
metricId: { _in: $metricIds }
}
order_by: { sampleDate: Desc }
limit: 10
) {
sampleDate
metricId
amount
unit
}
}
Using the IDs from the previous query:
{
"projectId": "ethereum_project_id",
"metricIds": [
"GITHUB_stars_daily_metric_id",
"GITHUB_forks_daily_metric_id",
"GITHUB_commits_daily_metric_id",
"GITHUB_contributors_daily_metric_id"
]
}
Finding Collections and Their Projects
- Python
- GraphQL
query = """
SELECT
c.collection_id,
c.collection_name,
c.display_name,
p.project_id,
p.project_name,
p.display_name as project_display_name
FROM collections_v1 AS c
JOIN projects_by_collection_v1 AS pbc
ON c.collection_id = pbc.collection_id
JOIN projects_v1 AS p
ON pbc.project_id = p.project_id
WHERE
c.collection_name LIKE '%ethereum%'
ORDER BY c.collection_name, p.project_name
"""
df = client.to_pandas(query)
print(df)
query FindCollectionsAndProjects {
oso_collectionsV1(where: { collectionName: { _ilike: "%ethereum%" } }) {
collectionId
collectionName
displayName
}
oso_projectsByCollectionV1(
where: { collectionName: { _ilike: "%ethereum%" } }
order_by: { projectName: Asc }
) {
collectionId
collectionName
projectId
projectName
}
}