Skip to main content

Semantic Layer

info

This guide is for actually defining the semantic layer of the data warehouse. If you're just looking to make queries you should see the guide here

The semantic layer

The OSO semantic layer allows us to provide a detailed interface to the OSO data model that encapsulates the contextual meaning of tables, columns or other abstractions in the data warehouse. The hope is that this interface will allow new users an easy way to interrogate the data while also providing for convenience tools when creating new models for seasoned users.

Anatomy of the semantic layer

The semantic layer provides an interface that takes a lot of inspiration from the wonderful work at cube.dev. Many of the same abstractions are used to provide a similar vocabulary for those familiar with that tool. These abstractions and components are as such:

  • Model
    • A model is a given object abstraction in for a given data type in the OSO data warehouse. At this time, it is expected that there's a single canonical table to represent a given Model. Models, like any object can relate to each other. For instance, the OSO data model has the concept of an Artifact which is part of zero or more Projects, and Projects which are part of zero or more Collections. Each of those entities are related to their associated Models.
  • Dimension
    • A dimension is a non-aggregated attribute of a model. This could be something like an Artifact name, a Project namespace, a Collection description, an Event type, or Event time.
  • Measure
    • A measure is an aggregated attribute of a model that is usually used to summarize values related to a model. This could be something like, the count of Artfiacts or sum of the amount dimension of Events.
  • Relationship
    • A relationship defines a relationship is a special attribute that defines the relationship between one model and another. In sql, this is modelled as some kind of foreign key.
  • Interface
    • Much like an interface in an object oriented language, an interface is a generic interface that can be applied to a given model. This allows us to provide
  • Registry
    • A registry is the directory of all the models and their relationships to each other. Without the registry it's impossible to create a valid sql query.

Defining the semantic layer

The semantic layer definitions are all JSON serializable Python objects. These all represent objects detailed in the Anatomy of the Semantic Layer section above. This section will walk through the definition of the artifact, project, and collection models. We will start with a very basic definition of the artifact model and expand it to include more attributes and relationships.

OSO artifacts, projects and collections

Before we get to defining a semantic layer for these three entities, we need to understand the meaning these entities hold in the context of OSO. Inside the OSO data warehouse, artifacts, projects, and collections are generic entities to understand the events that occur in the universe of open source software. These entities roughly map to certain classes of real world or digital entities.

  • Artifact entities
    • An artifact is generally an real or digital object that we consider the smallest atom that can send or receive events. This could be an NPM package, a Github repository, an Optimism contract, an Ethereum address, or any other entity that can be interacted with that is generally indivisible.
  • Project entities
    • A project is a collection of artifacts that are related to each other in a some way. Usually, this project relates to some real world phyiscal or logical grouping of entities in the real world. In general, but not always, a project "owns" a set of artifacts rather than simply being composed of them. Examples of a project include an organization, a company, a team or even a single person.
  • Collection entities
    • A collection is an arbitrary grouping of projects. The relationship between collections and projects is not necessarily one of ownership. For instance, a collection could be a set of projects that are related to a specific topic, or perhaps they're a set of projects involved in the same event. A collection could also be a set of projects that simply relate to another project. We see collections as a flexible way to group related projects but each collection may have a different purpose.

An artifact model with dimensions

The model is the core abstraction of the semantic layer. It represents an object in the OSO data warehouse that corresponds to a specific table. This model then has a set of attributes that are one of Dimension, Measure, or Relationship. Defining a model is done by creating a Model object with the appropriate attributes.

To start, let's define a very basic model for the artifact table. This definition will be incomplete but it will give us a starting point to understand how to define a model.

from oso_semantic import Model, Dimension

artifact_model = Model(
name="artifact",
table="iceberg.oso.artifacts_v1",
dimensions=[
Dimension(
name="id",
description="Unique identifier for the artifact",
column_name="artifact_id",
),
Dimension(
name="name",
description="Name of the artifact",
column_name="artifact_name",
),
Dimension(
name="namespace",
description="Namespace of the artifact",
column_name="artifact_namespace"
),
Dimension(
name="source",
description="Source of the artifact",
column_name="artifact_source"
),
Dimension(
name="url",
description="URL of the artifact",
column_name="artifact_url"
),
]
)

At its most basic, a model must have a name, a table, and one or more attributes. Here we've defined some basic dimensions for the artifact model. Let's query this model to see what it looks like when translated to SQL. For this we must first register this model to a registry and then create a query against it.

To create the registry:

from oso_semantic import Registry

registry = Registry()
registry.register(artifact_model)

To query, we use the select method on the registry, which creates a QueryBuilder object that allows us to build a query against the registered models.

from oso_semantic import QueryBuilder

query: QueryBuilder = registry.select(
"artifacts.id",
"artifacts.name"
)

print(query.sql(pretty=True)) # Pretty print the SQL query

When querying, the select method simply takes a list of attributes that we want to query. These attributes are specified in the format <model_name>.<attribute_name>. In this case, we are querying the id and name attributes of the artifact model. In the final line, we call query.sql(pretty=True) and print the generated sql query to the console. This will output as follows:

note

The sql generated by the semantic layer will actually be a bit more verbose. This example is simplified to show the basic structure of the query.

SELECT
artifact.artifact_id as artifact__id,
artifact.artifact_name as artifact__name
FROM iceberg.oso.artifacts_v1 as artifact
GROUP BY 1, 2

From this query you will notice that the tool is translating dimension names to actual columns names in the underlying table. The artifact.artifact_id is the column name in the iceberg.oso.artifacts_v1 table that corresponds to the id dimension of the artifact model.

Dimensions can also be filtered on. Let's filter only for artifacts that are from the GITHUB source. We can do this by adding a filter to the query:

query_with_filter: QueryBuilder = registry.select(
"artifact.id",
"artifact.name",
).where(
"artifact.source = 'GITHUB'"
)
print(query_with_filter.sql(pretty=True))

The where method allows us to add a filter to the query using sql expressions. The expressions look very similar to standard sql but instead of columns being referenced, we use the model's attributes for filtering. The output of this query will look like this:

SELECT
artifact.artifact_id as artifact__id,
artifact.artifact_name as artifact__name
FROM iceberg.oso.artifacts_v1 as artifact
WHERE artifact.artifact_source = 'GITHUB'
GROUP BY 1, 2

Adding measures to the artifact model

Dimensions are useful for querying specific attributes of a model, but usually we also want to perform some kind of aggregation on the data. This is where measures come in. Measures are attributes that represent some kind of aggregation of the data in the model. For instance, we might want to know the count of artifacts or the distinct count of artifacts.

Here's an updated definition of the artifact model that includes measures and a more complete set of dimensions:

from oso_semantic import Model, Dimension, Measure

artifact_model = Model(
name="artifact",
table="iceberg.oso.artifacts_v1",
primary_key="artifact_id",
dimensions=[
Dimension(
name="id",
description="Unique identifier for the artifact",
column_name="artifact_id",
),
Dimension(
name="name",
description="Name of the artifact",
column_name="artifact_name",
),
Dimension(
name="namespace",
description="Namespace of the artifact",
column_name="artifact_namespace"
),
Dimension(
name="source",
description="Source of the artifact",
column_name="artifact_source"
),
Dimension(
name="url",
description="URL of the artifact",
column_name="artifact_url"
),
],
measures=[
Measure(
name="count",
description="Count of artifacts",
query="count(self.id)",
)
],
)

Now we have a more complete model that also includes the measure count of artifacts. You will notice, that unlike a dimension, a measure does not have a column_name attribute. Instead, it has a query attribute that defines how the measure is calculated. In this case, the count measure is defined as count(self.id). Measures may only reference attributes of models. Related models can also be referenced (covered later) but to prevent any ambiguity, the semantic layer provides a special self keyword that resolves to the current model instance.

If we query this model again, we can see how the measures are included in the SQL query:

# Recreate the registry with the updated model
registry = Registry()
registry.register(artifact_model)

# Query the model with the measure.
# This query will select the count of artifacts in a given source
count_query: QueryBuilder = registry.select(
"artifact.source",
"artifact.count"
)

print(count_query.sql(pretty=True))

The output of this query will look like this:

SELECT
artifact.artifact_source as artifact__source,
count(artifact.artifact_id) as artifact__count
FROM iceberg.oso.artifacts_v1 as artifact
GROUP BY 1

Much like, dimensions, measures can also be filtered on. For instance, if we wanted to only return artifact sources that have more than 100 artifacts, we could do the following:

count_query_with_filter: QueryBuilder = registry.select(
"artifact.source",
"artifact.count"
).where(
"artifact.count > 100"
)
print(count_query_with_filter.sql(pretty=True))

The generated SQL query will look like this:

SELECT
artifact.artifact_source as artifact__source,
count(artifact.artifact_id) as artifact__count
FROM iceberg.oso.artifacts_v1 as artifact
GROUP BY 1
HAVING count(artifact.artifact_id) > 100

Adding relationships to the artifact model

Relationships are used to define how models relate to each other. This allows the semantic layer to provide automatic joins between models when querying. In order for this to work, we need to define the relationships between models and classify them as either one_to_one, one_to_many or many_to_one. Many to many relationships are not explicitly modeled in the semantic layer but can be represented by a join table that is defined as a model.

One potentially unintuitive aspect of the semantic layer is that relationships are defined in a single direction and this is strictly enforced by the model registry. This allows us to treat the entire semantic layer as a directed acyclic graph. References within a model can happen in any direction but we force the definition to define a direction of the relationship. This is currently a limitation of the semantic layer and could change in the future.

Let's add a model to represent the project entity called project and add the relationships between the artifact and project models. Despite roughly having a constraint that an artifact may only belong to a single project, the data model is flexible enough to allow an artifact to be related to many projects. For this reason, we model the artifact and project relationship as a many to many relationship. This means we will also model the join table as a model in the semantic layer.

from oso_semantic import Model, Dimension, Measure, Relationship

# Let's redefine the registry and include the models without using
# intermediate variables
registry = Registry()


registry.register(Model(
name="projects",
description="A project is a collection of artifacts",
primary_key="project_id",
dimensions=[
Dimension(
name="project_id",
description="Unique identifier for the project",
column_name="project_id",
),
Dimension(
name="project_name",
description="Name of the project",
column_name="project_name",
),
Dimension(
name="project_description",
description="Description of the project",
column_name="project_description",
),
],
))

register.register(Model(
name="artifacts_by_project",
table="iceberg.oso.artifacts_by_project_v1",
primary_key="artifact_id",
description="Join table between artifacts and projects",
dimensions=[
Dimension(
name="artifact_id",
description="Unique identifier for the artifact",
column_name="artifact_id",
),
Dimension(
name="project_id",
description="Unique identifier for the project",
column_name="project_id",
),
],
relationships=[
Relationship(
name="project",
ref_model="project",
type="many_to_one",
source_foreign_key="project_id",
ref_key="project_id",
)
]
))

registry.register(Model(
name="artifacts",
table="iceberg.oso.artifacts_v1",
primary_key="artifact_id",
dimensions=[
Dimension(
name="artifact_id",
description="Unique identifier for the artifact",
column_name="artifact_id",
),
Dimension(
name="artifact_name",
description="Name of the artifact",
column_name="artifact_name",
),
Dimension(
name="artifact_namespace",
description="Namespace of the artifact",
column_name="artifact_namespace"
),
Dimension(
name="artifact_source",
description="Source of the artifact",
column_name="artifact_source"
),
Dimension(
name="artifact_url",
description="URL of the artifact",
column_name="artifact_url"
),
],
measures=[
Measure(
name="count",
description="Count of artifacts",
query="count(self.id)",
),
],
relationships=[
Relationship(
name="by_project",
description="Relationship to the artifacts_by_project model",
ref_model="artifacts_by_project",
type="many_to_one",
source_foreign_key="artifact_id",
ref_key="artifact_id",
)
]
))

In this we have now defined the relationships between the artifacts and the projects by mapping how the reference in one model translates to the referenced model. The relationships themselves are not defined using model objects explicity but rather reference models by names. This allows a more flexible way to define each of the models. Once you start querying the models, however, the registry is treated as immutable and the relationships are validated to ensure that the model and any relationships are valid and do not have cycles.

Now if we query for both artifacts model attributes and projects model attributes, the semantic layer will automatically join the two models together:

query = registry.select(
"artifacts.name",
"projects.name",
)
print(query.sql(pretty=True))

The output of this query will look like this:

SELECT
artifacts_12345678.artifact_name as artifact__name,
projects_12345678.project_name as project__name
FROM iceberg.oso.artifacts_v1 as artifacts_12345678
LEFT JOIN iceberg.oso.artifacts_by_project_v1 as artifacts_by_project_1b23c4d
ON artifacts_12345678.artifact_id = artifacts_by_project_1b23c4d.artifact_id
LEFT JOIN iceberg.oso.projects_v1 as projects_12345678
ON artifacts_by_project_1b23c4d.project_id = projects_12345678.project_id
GROUP BY 1, 2