Getting Started with SQLMesh

This guide will help you set up SQLMesh to contribute data models to OSO. SQLMesh is a powerful open-source SQL transformation framework that enables you to develop, test, and deploy SQL-based data transformations with confidence.

SQLMesh Overview

SQLMesh brings a version-controlled, plan-based approach to building data pipelines. It allows data teams to define transformations as SQL or Python models with a lightweight DSL, and manages the deployment of those models to your data warehouse or query engine in a consistent, reproducible way.
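
To make that concrete, here is a minimal sketch of what a SQL model looks like in SQLMesh's DSL. The model and source table names here are hypothetical for illustration, not actual OSO models:

MODEL (
  name example_schema.daily_events,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_date
  )
);

-- Aggregate raw events into a daily count; SQLMesh substitutes
-- @start_date / @end_date with the interval being processed.
SELECT
  event_date,
  COUNT(*) AS event_count
FROM example_schema.raw_events
WHERE event_date BETWEEN @start_date AND @end_date
GROUP BY event_date

SQLMesh parses this definition, tracks which event_date intervals have already been processed, and only backfills the missing ranges on subsequent runs.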

Key features of SQLMesh include:

  • Version-controlled snapshots: Each model run produces a versioned table (snapshot) in a physical schema, and SQLMesh exposes a virtual layer of views for each environment (like dev or prod) pointing to the correct snapshot.
  • Isolated development environments: Test changes in "zero-cost" virtual environments without copying data, and promote to production with confidence.
  • Semantic SQL understanding: SQLMesh parses your SQL with the SQLGlot library, providing a semantic understanding of your queries (not just treating them as strings). This means it can catch errors like syntax issues or missing columns at compile time.
  • True incremental processing: SQLMesh is incremental-first, keeping track of already-processed time ranges or keys and only processing new or changed data.
  • Built-in data consistency guards: SQLMesh tracks the intervals of data each model has and prevents data gaps or overlaps, ensuring data integrity.

Prerequisites

Before you begin, you'll need to have the following installed on your machine:

  • DuckDB - A high-performance analytical database system
  • Python 3.8 or higher
  • Git (to clone the OSO repository)

Installation

Install DuckDB

DuckDB is required for local development and testing of SQLMesh models.

macOS/Linux (using Homebrew):

brew install duckdb

Linux (using DuckDB's official install script; DuckDB is not in the standard APT repositories):

curl https://install.duckdb.org | sh

Clone the OSO Repository

If you haven't already, clone the OSO repository:

git clone https://github.com/opensource-observer/oso.git
cd oso

Set Up Environment Variables

Create or update your .env file at the root of the OSO repository with the following variables:

GOOGLE_PROJECT_ID=opensource-observer
SQLMESH_DUCKDB_LOCAL_PATH=/tmp/oso.duckdb

Authenticate with Google Cloud

SQLMesh needs access to Google Cloud to fetch data. Authenticate using:

gcloud auth application-default login

Install Dependencies

Install the required Python dependencies:

uv sync
source .venv/bin/activate

Initialize Local Data

To work with SQLMesh locally, you need to download playground data into your local DuckDB instance:

oso local initialize

This command will download the full dataset, which might be large. For a smaller sample, you can limit the data:

oso local initialize --max-results-per-query 10000 --max-days 7

Or using shortcuts:

oso local initialize -m 10000 -d 7

This will download 7 days of time series data, with a maximum of approximately 10,000 rows in each table that is not time-series based.

Running SQLMesh

Test Your Setup

To verify that everything is set up correctly, run SQLMesh for a sample date range:

oso local sqlmesh plan dev --start 2025-03-01 --end 2025-03-31

This command uses the oso local sqlmesh wrapper, which automatically runs SQLMesh from the correct directory (warehouse/oso_sqlmesh).
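
Given the wrapper's behavior described above, the command is roughly equivalent to running SQLMesh from that directory yourself (assuming your virtual environment is activated):

cd warehouse/oso_sqlmesh
sqlmesh plan dev --start 2025-03-01 --end 2025-03-31

This is also why later examples in this guide invoke sqlmesh directly.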

Explore the Data

You can explore the data in DuckDB from the command line. Open the database file you configured in SQLMESH_DUCKDB_LOCAL_PATH (running duckdb with no arguments opens an empty in-memory database instead):

duckdb /tmp/oso.duckdb

Once in the DuckDB shell, you can see the available tables:

SHOW ALL TABLES;

And execute sample queries:

SELECT * FROM metrics__dev.metrics_v0 LIMIT 5;
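
To inspect a table's schema, DuckDB also supports DESCRIBE:

DESCRIBE metrics__dev.metrics_v0;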

Testing Your Models

When developing new models, you can test them against your local DuckDB instance:

sqlmesh plan dev

For faster testing, you can specify a shorter date range:

sqlmesh plan dev --start 2024-12-01 --end 2024-12-31

Adding New Data Sources

If a source that's in BigQuery is missing from DuckDB, you can add it to the initialize_local_duckdb function in warehouse/metrics_tools/local/utils.py.

Add new sources as TABLE_MAPPING entries, for example:

"opensource-observer.farcaster.profiles": "bigquery.farcaster.profiles",

Then reference the source in your SQLMesh code via @oso_source('YOUR_MODEL'), as in the sketch below.
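
For example, a model reading from the source mapped above might select from it like this (the column names here are hypothetical; check the actual table schema before using them):

SELECT
  fid,
  username
FROM @oso_source('bigquery.farcaster.profiles')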

Important: Whenever you add a new source, you will need to re-initialize your local database.

Advanced Testing with Trino

While DuckDB is recommended for most development work, you can also test your models against a local Trino instance for more production-like validation.

The setup described below is the recommended way to test on Trino; it requires more resources than DuckDB but provides a more accurate representation of the production environment.

Prerequisites

Follow the instructions in the OSO repository to set up and run Trino with Docker Compose.

Using Kubernetes with Kind

For advanced users who need to test with Kubernetes, you can set up a local Trino cluster using Kind.

Prerequisites

You'll need Docker, kubectl, and Kind installed on your machine.

Setup

Initialize the local Trino cluster on Kubernetes:

oso ops cluster-setup

This can take approximately 10 minutes to complete. Once set up, you can check if Kind has started all your pods:

kubectl get pods --all-namespaces
kubectl get kustomizations --all-namespaces

Wait for all pods to come online, particularly local-trino-psql.

Initialize Trino Data

Initialize the local Trino with a smaller dataset:

oso local initialize --local-trino -m 1000 -d 3

Running SQLMesh with Trino

To run SQLMesh with your local Trino instance:

oso local sqlmesh --local-trino plan

The --local-trino option should be passed before any SQLMesh arguments.
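
For example, combining the flags shown earlier to plan a dev environment over a limited date range:

oso local sqlmesh --local-trino plan dev --start 2025-03-01 --end 2025-03-31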

Debugging Trino

You can expose your local Trino instance for debugging:

kubectl port-forward --namespace=local-trino service/local-trino-trino 8080:8080 --address 0.0.0.0

This forwards the Trino web UI to http://127.0.0.1:8080, where you can interact with Trino directly.

Next Steps

Now that you have SQLMesh set up, you're ready to start contributing data models to OSO. Check out the Data Models guide to learn how to create and contribute models.