Write a Data Model in SQLMesh
This guide explains how to create and contribute data models to OSO using SQLMesh. It covers model structure (staging, intermediate, mart), naming conventions, materialization strategies, metrics logic, and testing workflows. Before proceeding, make sure you've completed the setup process and have a local SQLMesh environment running.
TL;DR
The OSO data model is complex, so you should read everything if you want to contribute! But here are the most important points:
- Always refer to entities (projects, users, etc.) using their unique entity IDs
- Raw data flows into staging models, which feed intermediate models, which feed mart models
- Folders don't matter but names do: use prefixes and descriptive names for models
- Event models always have a time component and a to/from entity component
- Event models are massive: it's important to partition, aggregate, and filter them efficiently
- Metrics are generated using the `metrics_tools` package so they can be applied consistently across different entity types and time periods
- We like long, skinny tables (many rows, few columns) to keep them performant
- SQL must be written in Trino; we have macros to help with operations that can be tricky in Trino's SQL
- Test locally before submitting a pull request
Core Concepts
OSO unifies data from different sources into consistent data models that can be queried and analyzed at scale. In order to contribute models, it's important to understand the core concepts and structures used in OSO.
Entities
Entities represent people, organizations, repositories, packages, and much more. Every entity has a unique ID comprised of a source, namespace, and name. These IDs are used to link entities across different data sources and models.
Some examples:
- A GitHub repo, e.g. "https://github.com/opensource-observer/oso", is assigned `@oso_entity_id('GITHUB', 'opensource-observer', 'oso')`
- An address on Base, e.g. "0x1234...", is assigned `@oso_entity_id('BASE', '', '0x1234...')`
- A project in (our version of) OSS Directory: `@oso_entity_id('OSS_DIRECTORY', 'oso', 'my-project')`

It's extremely important to use these IDs consistently across models to ensure data integrity and consistency. The `@oso_entity_id` macro is used to generate these IDs deterministically.
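For illustration, here is a minimal sketch of how a model might generate artifact IDs with the macro. The upstream model (`stg_github__repositories`) and its columns (`owner`, `name`) are placeholders, not the actual OSO staging schema:

```sql
SELECT
  -- Deterministic ID built from source, namespace, and name
  @oso_entity_id('GITHUB', repos.owner, repos.name) AS artifact_id,
  'GITHUB' AS artifact_source,
  repos.owner AS artifact_namespace,
  repos.name AS artifact_name
FROM oso.stg_github__repositories AS repos  -- placeholder upstream model
```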
Events
Events represent actions or interactions that occur between entities. These can include things like commits, contract interactions, package imports, funding events, and more. Events are typically brought in from public databases via staging models.
We currently do not have a deterministic way to generate event IDs, but we are working on a solution to this problem. In the meantime, events are differentiated by their source, type, timestamp, and the to/from entities involved.
Some examples:
- A GitHub commit from a user to a repository
- A contract interaction from a user to a smart contract
- A dependency from a repo to an NPM package
Given that there are often billions of events associated with many of our sources, we typically aggregate these events into daily or monthly buckets for performance reasons.
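As a rough sketch, the columns that differentiate events look like this in a query over an intermediate event model (the column names mirror the `oso.int_events__github` model used later in this guide; the date range and example values are arbitrary):

```sql
SELECT
  time,              -- when the event occurred
  event_source,      -- e.g. 'GITHUB'
  event_type,        -- e.g. a commit event type
  from_artifact_id,  -- entity ID of the actor (e.g. a GitHub user)
  to_artifact_id,    -- entity ID of the target (e.g. a repository)
  amount             -- magnitude of the event
FROM oso.int_events__github
WHERE time BETWEEN DATE '2024-01-01' AND DATE '2024-01-31'
```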
Metrics
Metrics are essentially aggregations of events over time for a specific entity type. These can include things like unique users, forks, funding amounts, and more. Metrics are generated using the `metrics_tools` package in SQLMesh so they can be applied consistently across different entity types and time periods. (Fun fact: this capability was one of the primary reasons we migrated our data pipeline to SQLMesh!)
In the future, we expect there to be many, many metrics. Therefore, we use a similar ID system to entities to ensure consistency and integrity, which is comprised of a source, namespace, and name. Metric names currently concatenate the event source, event type, and time period.
Some examples:
- `GITHUB_releases_weekly`, `GITHUB_releases_daily`, `GITHUB_releases_over_all_time`
- `OPTIMISM_active_addresses_aggregation_daily`, `OPTIMISM_active_addresses_aggregation_over_90_day_period`
- `OPEN_COLLECTIVE_funding_received_monthly`, `OPEN_COLLECTIVE_funding_received_over_180_day_period`
The power of this approach is that it allows us to easily compare metrics across different entity types and time periods, and to generate consistent cohorts and data visualizations. The (current) disadvantage is that you need to be precise when querying metrics to ensure you're getting the right data.
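For example, pulling a single metric requires matching the full concatenated name. This sketch assumes mart names like `metrics_v0` and `timeseries_metrics_by_project_v0`; check the actual marts in the repository, as these may differ:

```sql
SELECT
  ts.sample_date,
  ts.amount
FROM oso.timeseries_metrics_by_project_v0 AS ts  -- assumed mart name
JOIN oso.metrics_v0 AS m                         -- assumed mart name
  ON m.metric_id = ts.metric_id
WHERE m.metric_name = 'GITHUB_releases_weekly'   -- event source + event type + time period
```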
Model Structure
Hierarchy
Data models in OSO are organized in the `warehouse/oso_sqlmesh/models` directory, following these categories:
- Staging (stg): Initial transformations that clean and standardize source data. There should be no joins or complex logic in staging models. A staging model should be able to run independently and feed everything downstream from it that requires the same source data.
- Intermediate (int): Models that join and transform staging models into more complex structures. Intermediate models may contain aggregations or other complex logic. If you are in doubt about where to put a model, it should probably be an intermediate model.
- Mart: Final models that are exposed to end users, typically containing registries, metrics, or aggregated event data. We aim to have as few columns as possible in mart models to keep them performant. Mart models have versions postfixed with `_v0`, `_v1`, etc. Anything with a `_v0` is considered a development version and may be unstable.
In summary, staging models feed intermediate models, which feed mart models. When contributing new models, it's important to follow this structure to maintain consistency across the codebase.
Naming Conventions
Note that SQLMesh ignores folders, so the model name must be unique and should be descriptive enough to indicate its category. In addition, when naming your models, try to follow these conventions:
- Use the Correct Prefix: Prefix your model name with `stg_` or `int_` to indicate the category. Mart models should not have prefixes but should end with a version number (e.g., `_v0`, `_v1`).
- Use Descriptive Names: Choose names that clearly indicate the model's purpose and source data.
- Use Underscores: Separate words in the model name with underscores for readability. Use two underscores to separate the category from the model name (e.g., `stg_github__commits.sql`).
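Putting these conventions together, a staging model skeleton might look like the following. The source reference and columns are placeholders, a sketch rather than a real OSO model:

```sql
-- stg_github__commits.sql: `stg_` prefix, double underscore between source and subject
MODEL (
  name oso.stg_github__commits,
  kind FULL
);

SELECT
  commits.sha::VARCHAR AS sha,                        -- placeholder columns
  commits.created_at::TIMESTAMP AS created_at,
  commits.repository_name::VARCHAR AS repository_name
FROM github_raw.commits AS commits                    -- placeholder raw source
```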
Materialization Strategies
SQLMesh supports several types of models with different materialization strategies:
- FULL: A complete table that is rebuilt from scratch on each run. This is the simplest type but can be inefficient for large datasets.
- INCREMENTAL_BY_TIME_RANGE: Models that process data in time-based chunks. These are ideal for event data where you want to process new time periods incrementally.
- INCREMENTAL_BY_UNIQUE_KEY: Models that perform upserts based on a unique key. These are useful for dimension tables or slowly changing data.
- INCREMENTAL_BY_PARTITION: Models that process data in discrete partitions. These are useful when data must be processed in static groupings like `event_source` or ID ranges.
Most models in OSO use either the `FULL` or `INCREMENTAL_BY_TIME_RANGE` strategy.
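For reference, here is a hedged sketch of an `INCREMENTAL_BY_UNIQUE_KEY` model for slowly changing data; the model name, key, and upstream table are illustrative rather than taken from the OSO codebase:

```sql
MODEL (
  name oso.int_example_artifacts,   -- illustrative name
  kind INCREMENTAL_BY_UNIQUE_KEY (
    unique_key artifact_id          -- rows are upserted by this key
  ),
  cron '@daily'
);

SELECT
  artifacts.artifact_id,
  artifacts.artifact_source,
  artifacts.artifact_namespace,
  artifacts.artifact_name
FROM oso.stg_example__artifacts AS artifacts  -- illustrative upstream model
```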
Incrementing and Partitioning
Large event models should be partitioned and processed incrementally in order to keep compute costs manageable. When doing this, it is important to ensure that any upstream models are also partitioned and incremental. If an incremental model depends on a non-incremental model that is changing frequently, then your model may have some data integrity issues (or will need to be rebuilt frequently, which defeats the purpose of making it incremental).
Here's an example of an intermediate model from the OSO codebase:
```sql
MODEL (
  name oso.int_events_daily__github,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column bucket_day,
    batch_size 365,
    batch_concurrency 1
  ),
  start @github_incremental_start,
  cron '@daily',
  partitioned_by (DAY("bucket_day"), "event_type"),
  grain (bucket_day, event_type, event_source, from_artifact_id, to_artifact_id)
);

SELECT
  DATE_TRUNC('DAY', time::DATE) AS bucket_day,
  from_artifact_id::VARCHAR AS from_artifact_id,
  to_artifact_id::VARCHAR AS to_artifact_id,
  event_source::VARCHAR,
  event_type::VARCHAR,
  SUM(amount::DOUBLE)::DOUBLE AS amount
FROM oso.int_events__github AS events
WHERE time BETWEEN @start_dt AND @end_dt
GROUP BY
  DATE_TRUNC('DAY', time::DATE),
  from_artifact_id,
  to_artifact_id,
  event_source,
  event_type
```
Incremental Settings
In the above example, there are several key settings to note:
- `kind INCREMENTAL_BY_TIME_RANGE` specifies that this is an incremental model based on time
- `time_column bucket_day` indicates which column contains the timestamp for incremental processing
- `@github_incremental_start` is a global variable that defines the start date for incremental processing for this source
- `@start_dt` and `@end_dt` are automatically set by SQLMesh to the appropriate time range for each run
Partition Settings
The example above also includes a `partitioned_by` tuple:

```sql
partitioned_by (DAY("bucket_day"), "event_type"),
```

The `DAY` in `DAY("bucket_day")` is Trino's date truncation syntax, ensuring the table is partitioned on the day-truncated value of the `bucket_day` column.
It's important to be thoughtful about how you partition your tables. In general, you should partition by the columns you most frequently filter on in your queries. If you make your partitions too granular, you may end up with too many small files, which can slow down queries. If you make them too large, you may not get the performance benefits of partition pruning.
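For instance, if most downstream queries filter on event source rather than event type, and daily partitions would produce too many small files, a coarser scheme might look like this (illustrative only; pick columns to match your actual query patterns):

```sql
partitioned_by (MONTH("bucket_day"), "event_source"),
```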
Grain Settings
The `grain` setting is used by SQLMesh to define the unique key(s) for the model. It is basically a composite key that ensures the model is unique at the grain level. This is important for ensuring data integrity and consistency across models. In the example above, the grain is defined as:

```sql
grain (bucket_day, event_type, event_source, from_artifact_id, to_artifact_id)
```
If you configure your grain right, then SQLMesh can intelligently determine what data needs to be updated in the model. If you get it wrong (e.g., you forget to include a column that is part of the unique key), then you may end up with data integrity issues. If you ignore the grain, then SQLMesh will assume you want to rebuild the entire model each time, which can be very inefficient for large models.