OSO Service
The OSO team aims to provide reliable infrastructure as a public good for the community to build upon. This document should not be treated as a guarantee of service or service-level agreement (SLA). Rather, this document should serve as a reference and shared target that the community aims to achieve together.
Service Status
You can find our status page here:
https://www.opensource.observer/status
Service Definition
OSO is constantly experimenting with new technology and services to improve our infrastructure. We only aim to provide service availability for the following mature services:
OSO Data
In addition to the services that we maintain, OSO provides data that is not directly covered by the above services. These data assets are available for querying and use within the warehouse and are materialized using dagster or sqlmesh. We categorize this data into three main types, listed below.
Note: at this time the dagster assets are not yet properly grouped. Once they are, the global asset lineage should be used as the source of truth for asset health.
- Core Source Data
- Project related data (oss-directory, op-atlas)
- GitHub data
- Superchain data
- Anything labeled core
- Unstable Source Data
- External third-party data that is not guaranteed to be reliable and may change without notice.
- This is generally experimental data or data that does not need to be kept up to date.
- SQLMesh
- SQLMesh assets are data assets derived from executing sqlmesh models that depend on source data.
A note about data provenance and unstable source data
When querying the OSO data warehouse, it is possible to trace the lineage of data assets back to their source. This is important for understanding the context and reliability of the data being used. If the data source being used is a known "unstable" source, it will be marked as such in the lineage information.
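The lineage check described above can be pictured as a walk up a dependency graph. The following is a minimal sketch only, not OSO's actual lineage API: the asset names, the UPSTREAM mapping, and the UNSTABLE set are all hypothetical and exist purely for illustration.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to its direct upstream assets.
UPSTREAM = {
    "project_summary": ["metrics_by_project"],
    "metrics_by_project": ["github_events", "third_party_prices"],
    "github_events": [],
    "third_party_prices": [],
}

# Hypothetical set of sources marked as unstable in the lineage information.
UNSTABLE = {"third_party_prices"}


def unstable_sources(asset, upstream, unstable):
    """Return the sorted unstable assets reachable upstream of `asset`."""
    seen = set()
    found = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent in seen:
                continue  # already visited through another path
            seen.add(parent)
            if parent in unstable:
                found.add(parent)
            queue.append(parent)
    return sorted(found)


print(unstable_sources("project_summary", UPSTREAM, UNSTABLE))
```

An empty result would suggest that the asset depends only on sources not marked as unstable, under the assumptions of this toy graph.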
Service Level Objectives
The following are the internal service level objectives (SLOs) for OSO. These are not guarantees of service, but rather targets for the team to strive towards.
- Core Source Data
- Description
- Core data are assets that are necessary for the OSO warehouse to function effectively.
- Scope
- All assets labeled core and not including sqlmesh assets.
- Metrics
- Materialization Reliability:
- Value
- 99.0% success rate for 30 days
- Measurement interval
- 30 days
- Description
- After retries, we should not experience more than a 1.0% failure rate for processes that materialize data.
- Data freshness:
- Value
- Data should be fresh within some time interval at all times. For partitioned data, 99.0% of partitions should be materialized.
- Measurement interval
- Point in time
- Description
- At any given time, each core data asset should be fresh within a specified time interval. Freshness is determined by the time since the last materialization, or by the success of data audits for the asset. Each asset should define its interval in its own configuration, expressed as a time delta.
- Unstable Source Data
- Description
- Unstable source data are assets that are not critical to the function of the OSO warehouse. They are considered potentially unreliable and have a significantly lower service level objective. At this time the only SLOs for this data type are related to the infrastructure that supports it. The data itself has no guarantees. If any given data asset in this category of data falls below the established SLOs, it may be subject to removal from the platform.
- Scope
- All assets without a core label and not including sqlmesh assets
- Metrics
- Materialization Reliability:
- Value
- 75.0% success rate for 30 days
- Measurement interval
- 30 days
- Description
- After retries, we should not experience more than a 25.0% failure rate for processes that materialize data.
- SQLMesh
- Scope
- The sqlmesh assets
- Metrics
- Materialization Reliability:
- Value
- 99.0% success rate
- Measurement interval
- 30 days
- Description
- Regardless of retries, we should not experience more than a 1.0% failure rate for sqlmesh.
- Other notes
- SQLMesh data audits are all-or-nothing: every audit must pass, or the sqlmesh materialization fails.
- Public Applications
- Scope
- Metrics
- Availability
- Value
- 99.0% availability
- Measurement interval
- 30 days
- Description
- Services should be available 99.0% of the time.
- Requests
- Value
- 99.0% request success rate
- Measurement interval
- 30 days
- Description
- 99.0% of all requests should not result in an error.
- Consumer Trino
- Scope
- Public facing trino (opensource.observer/api/v1/sql)
- Metrics
- Availability
- Value
- 99.0% availability
- Measurement interval
- 30 days
- Description
- Services should be available 99.0% of the time.
- Queries
- Value
- 90.0% query success rate
- Measurement interval
- 30 days
- Description
- 90.0% of all queries should not result in an unexpected error. User errors are ignored in this measurement.
- Kubernetes Infrastructure
- Scope
- Kubernetes clusters and resources. This is an internal only resource, but it is critical for the operation of our services. The availability of this infrastructure directly impacts the reliability of our services.
- Metrics
- Availability
- Value
- 99.0% availability
- Measurement interval
- 30 days
- Description
- Kubernetes infrastructure should be available 99.0% of the time.
- Flux
- Value
- 99.0% availability
- Measurement interval
- 30 days
- Description
- Flux should be available 99.0% of the time.
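The materialization reliability and data freshness objectives can be read as simple checks over run history. The following is a minimal sketch, not OSO's monitoring code: the Run record, the function names, and the decision to treat an empty measurement window as non-compliant are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Run:
    finished_at: datetime
    succeeded: bool  # final outcome of the materialization, after retries


def success_rate(runs, now, window=timedelta(days=30)):
    """Fraction of runs inside the measurement window that succeeded."""
    recent = [r for r in runs if now - window <= r.finished_at <= now]
    if not recent:
        return None  # nothing to measure in this window
    return sum(r.succeeded for r in recent) / len(recent)


def meets_reliability(runs, now, target):
    """True if the windowed success rate reaches the target (e.g. 0.99 or 0.75)."""
    rate = success_rate(runs, now)
    # Assumption: an asset with no runs in the window is treated as non-compliant.
    return rate is not None and rate >= target


def is_fresh(last_materialized, now, max_age):
    """True if the asset was materialized within its configured time delta."""
    return now - last_materialized <= max_age


# One run per day for 30 days, with a single failure three days ago.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
runs = [Run(now - timedelta(days=d), succeeded=(d != 3)) for d in range(30)]

print(meets_reliability(runs, now, target=0.99))  # core target
print(meets_reliability(runs, now, target=0.75))  # unstable source target
print(is_fresh(now - timedelta(hours=5), now, max_age=timedelta(hours=24)))
```

With one failure in 30 runs, the success rate is 29/30, which clears the 75.0% target but falls short of 99.0%. This shows how the same history can be compliant for unstable source data and non-compliant for core data.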
SLO Policy
Any service or data asset that does not meet its SLOs will trigger a remedial sprint to address the issues. Remedial sprints will continue until the service or data is back in compliance with its SLOs.
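Deciding whether a service is out of compliance comes down to comparing a measurement against its target. The sketch below is illustrative only: the function names and the outcome labels ("ok", "user_error", "system_error") are hypothetical, and it assumes a breach is any measurement strictly below the target. It converts an availability target into a downtime budget and computes a query success rate that ignores user errors, as the Consumer Trino objective describes.

```python
from datetime import timedelta

WINDOW = timedelta(days=30)  # measurement interval used by the SLOs


def downtime_budget(target, window=WINDOW):
    """Maximum downtime allowed in the window for an availability target in [0, 1]."""
    return window * (1.0 - target)


def query_success_rate(outcomes):
    """Success rate over queries, ignoring those that failed due to user error."""
    counted = [o for o in outcomes if o != "user_error"]
    if not counted:
        return None  # no queries count towards the measurement
    return sum(o == "ok" for o in counted) / len(counted)


def needs_remedial_sprint(measured, target):
    """Assumption: any measurement strictly below the target counts as a breach."""
    return measured < target


# A 99.0% availability target over 30 days leaves roughly 7 hours 12 minutes.
print(downtime_budget(0.99))

outcomes = ["ok"] * 9 + ["system_error"] + ["user_error"] * 5
rate = query_success_rate(outcomes)
print(rate, needs_remedial_sprint(rate, 0.90))
```

Expressing a percentage as a time budget makes the target easier to reason about during an outage: every minute of downtime is spent against a fixed allowance for the 30-day window.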
Outage Escalation
The best way to get in touch with the team is via Discord.