Skip to main content

Access Public Datasets

OSO shares every dataset on BigQuery Analytics Hub. To get started, set up your BigQuery account.

OSO Data Exchange on Analytics Hub

To explore all the OSO datasets available, check out our BigQuery data exchange.

OSO Production Data Pipeline

Subscribe on BigQuery

Every stage of the OSO data pipeline is queryable and downloadable. Like most dbt-based pipelines, we split the pipeline stages into staging, intermediate, and mart models.

You can find the reference documentation on every data model on https://models.opensource.observer/

OSO Mart Models

These are the final product from the data pipeline, which is served from our API.

For example, you can get a list of oss-directory projects

select
project_id,
project_name,
display_name,
description
from `YOUR_PROJECT_NAME.oso_production.projects_v1` LIMIT 10

or code metrics by project.

select *
from `YOUR_PROJECT_NAME.oso_production.code_metrics_by_project_v1`
where project_name = 'uniswap'

Remember to replace 'YOUR_PROJECT_NAME' with the name of your project in the query.

Note: Unless the model name is versioned, expect that the model is unstable and should not be depended on in a live production application.

OSO Staging / Intermediate Models

From source data, we produce a "universal event table", currently stored at int_events. Each event consists of an event_type (e.g. a git commit or contract invocation), to/from artifacts, a timestamp, and an amount.

From this event table, we aggregate events in downstream models to produce our metrics. For example, you may find it cheaper to run queries against int_events_daily_to_project.

SELECT event_source, SUM(amount)
FROM `YOUR_PROJECT_NAME.oso_production.int_events_daily_to_project`
WHERE project_id = 'XSDgPwFuQVcj57ARcKTGrm2w80KKlqJxaBWF6jZqe7w=' AND event_type = 'CONTRACT_INVOCATION_DAILY_COUNT'
GROUP BY project_id, event_source

OSO Playground

Subscribe on BigQuery

We maintain a subset of projects and events in a playground dataset for testing and development. All of the production models are mirrored in this environment.

Source Data

GitHub Data

View on BigQuery

GitHub data is predominantly provided by the incredible GH Archive project, which publishes a public archive of historical events to GitHub.

For example, to count the number of issues opened, closed, and reopened on 2020/01/01:

SELECT event as issue_status, COUNT(*) as cnt FROM (
SELECT type, repo.name, actor.login,
JSON_EXTRACT(payload, '$.action') as event,
FROM `githubarchive.day.20200101`
WHERE type = 'IssuesEvent'
)
GROUP by issue_status;

Open Source Insights

View on BigQuery

Open Source Insights (aka deps.dev) is a service developed and hosted by Google to help developers better understand the structure, construction, and security of open source software packages. The service examines each package, constructs a full, detailed graph of its dependencies and their properties, and makes the results available to anyone who could benefit from them. The goal is to provide developers with a picture of how their software is put together, how that changes as dependencies change, and what the consequences might be.

For example, to determine which packages have the most dependents:

DECLARE
Sys STRING DEFAULT 'CARGO';

SELECT
Name,
Version,
FROM (
SELECT
Name,
Version,
ROW_NUMBER()
OVER (PARTITION BY
Name
ORDER BY
VersionInfo.Ordinal DESC) AS RowNumber
FROM
`bigquery-public-data.deps_dev_v1.PackageVersionsLatest`
WHERE
System = Sys
AND VersionInfo.IsRelease)
WHERE RowNumber = 1;

Ethereum Data

View on BigQuery

The Google Cloud team maintains a public Ethereum dataset. This is backed by the ethereum-etl project.

For example, to get 10 transactions from the latest block

select
`hash`,
block_number,
from_address,
to_address,
value,
gas,
gas_price
from `bigquery-public-data.crypto_ethereum.transactions` as transactions
order by block_number desc
limit 10

Optimism Data

Subscribe on BigQuery

All tables are partitioned by date (dt) and clustered by chain name (chain). Data is updated daily with an approximate latency of 24 hrs. Data is sourced from Goldsky using the Mirror platform, which fetches and delivers onchain data from RPC nodes to ClickHouse. From there, the OP Labs team runs an incremental ingestion process based on block number.

For example, to get all transactions to the 4337 EntryPoint contracts for all OP Stack chains:

  select
dt,
chain,
count(*) as txns
from `optimism_superchain_raw_onchain_data.transactions`
where
dt >= '2024-10-01'
and to_address in (
"0x5ff137d4b0fdcd49dca30c7cf57e578a026d2789",
"0x0000000071727de22e5e9d8baf0edac6f37da032"
)
group by dt, chain
order by dt, chain

Arbitrum Data

Subscribe on BigQuery

Arbitrum is a Layer 2 scaling solution for Ethereum. We currently provide blocks, transactions, and traces for Arbitrum's flagship network, Arbitrum One. This public dataset is maintained by our partners at Goldsky.

For example, to get the average gas per transaction for each block on Arbitrum One on September 1st, 2024:

select
number,
gas_used / transaction_count as gas_per_txn
from `YOUR_PROJECT_NAME.arbitrum_one.arbitrum_blocks`
where timestamp between '2024-09-01' and '2024-09-02'
order by number

Remember to replace 'YOUR_PROJECT_NAME' with the name of your project in the query.

Filecoin Data

Subscribe on BigQuery

Filecoin is a decentralized storage network designed to store humanity's most important information. This dataset mirrors the dataset offered by Lily for use in the OSO data pipeline. It includes storage deals, miners, FVM transactions, and much more.

For example, this is how to get how many messages were sent to the network in the last month:

select
count(cid) as total_messages
from `YOUR_PROJECT_NAME.filecoin_lily.parsed_messages`
where
cast(timestamp_seconds((height * 30) + 1598306400) as date) > current_date() - interval 1 month

Farcaster Data

Subscribe on BigQuery

Farcaster is a decentralized social network built on Ethereum. This dataset mirrors the dataset offered by Indexing for use in the OSO data pipeline. It includes casts, links, reactions, verifications, and profiles.

For example, to get the users with the most lifetime reactions:

SELECT
r.target_cast_fid as fid,
json_value(p.data, "$.display") as display_name,
COUNT(*) as reaction_count
FROM `YOUR_PROJECT_NAME.farcaster.reactions` as r
LEFT JOIN `YOUR_PROJECT_NAME.farcaster.profiles` as p ON r.target_cast_fid = p.fid
GROUP BY fid, display_name
ORDER BY reaction_count DESC

Here's another use case, showing how to derive all the verified Ethereum addresses owned by a Farcaster user:

WITH
profiles AS (
SELECT
v.fid,
v.address,
p.custody_address,
JSON_VALUE(p.data, "$.username") AS username,
FROM `YOUR_PROJECT_NAME.farcaster.verifications` v
JOIN `YOUR_PROJECT_NAME.farcaster.profiles` p ON v.fid = p.fid
WHERE v.deleted_at IS NULL
),
eth_addresses AS (
SELECT
fid,
username,
address
FROM profiles
WHERE LENGTH(address) = 42
UNION ALL
SELECT
fid,
username,
custody_address AS address
FROM profiles
)
SELECT DISTINCT
fid,
username,
address
FROM eth_addresses

Remember to replace 'YOUR_PROJECT_NAME' with the name of your project in the query.

Lens Data

Subscribe on BigQuery

Lens Protocol is an open social network. This dataset mirrors the dataset offered by Lens for use in the OSO data pipeline. It includes data from the Polygon network.

Gitcoin and Passport Data

Subscribe on BigQuery

Gitcoin is the hub of grantmaking in the Ethereum ecosystem. All project, round, and donation data from regendata.xyz is available in this dataset and updated daily.

You can also access the raw Gitcoin data directly on BigQuery:

select * from `YOUR_PROJECT_NAME.gitcoin.all_donations`

and:

select * from `YOUR_PROJECT_NAME.gitcoin.all_matching`

You can also get the total amount of donations mapped to projects in OSO directly from one of our data marts:

select
round_number,
oso_display_name,
sum(amount_in_usd) as total_funding_usd
from `YOUR_PROJECT_NAME.oso_production.gitcoin_funding_events_by_project_v0`
where oso_project_name = 'opensource-observer'
group by 1,2
order by round_number

Passport is a web3 identity verification protocol. OSO and Gitcoin have collaborated to make this dataset of address scores available for use in understanding user reputations.

For example, you can can vitalik.eth's passport score:

select
passport_address,
last_score_timestamp,
evidence_rawScore,
evidence_threshold,
from YOUR_PROJECT_NAME.gitcoin.passport_scores
where passport_address = '0xd8da6bf26964af9d7eed9e03e53415d37aa96045'

Remember to replace 'YOUR_PROJECT_NAME' with the name of your project in the query.

OpenRank Data

Subscribe on BigQuery

OpenRank is a reputation protocol that enables verifiable compute for a large class of reputation algorithms, in particular those that (a) operate on a graph, (b) are iterative in nature and (c) have a tendency toward convergence. Such algorithms include EigenTrust, Collaborative Filtering, Hubs and Authorities, Latent Semantic Analysis, etc.

In this dataset, Farcaster users reputations are scored in 2 ways:

  • With globaltrust, we calculate global reputation scores, seeded by the trust of Optimism badgeholders.
  • With localtrust, you can get reputation scores of other users relative to specified user.

For example, you can get the globaltrust reputation score of vitalik.eth

select
strategy_id,
i,
v,
date
from YOUR_PROJECT_NAME.karma3.globaltrust
where i = 5650

Remember to replace 'YOUR_PROJECT_NAME' with the name of your project in the query.

Ethereum Attestation Service

Subscribe on BigQuery

Ethereum Attestation Service (EAS) is an infrastructure public good for making attestations onchain or offchain about anything.

This dataset mirrors the dataset offered by EAS for use in the OSO data pipeline, currenty showing all attestations on the Optimism network.

For example, you can get a list of all * official * RetroPGF badgeholders:

select
recipient as badgeholder,
json_value(decoded_data_json, "$[0].value.value") as rpgfRound
from YOUR_PROJECT_NAME.ethereum_attestation_service_optimism.attestations
where
revoked = False
and schema_id = '0xfdcfdad2dbe7489e0ce56b260348b7f14e8365a8a325aef9834818c00d46b31b'
and attester = '0x621477dBA416E12df7FF0d48E14c4D20DC85D7D9'

Remember to replace 'YOUR_PROJECT_NAME' with the name of your project in the query.

Open Collective Data

Subscribe on BigQuery

Open Collective is a platform for transparent finances and governance for open source projects.

The Open Collective datasets contains all transactions realized on the platform since its inception. Separate datasets are available for expenses and deposits.

For example, you can get the total amount of donations in USD made to the pandas project:

select
SUM(CAST(JSON_VALUE(amount, "$.value") as FLOAT64)) as total_amount,
from
YOUR_PROJECT_NAME.open_collective.deposits
where
JSON_VALUE(amount, "$.currency") = "USD"
and JSON_VALUE(to_account, "$.id") = "ov349mrw-gz75lpy9-975qa08d-jeybknox"

Remember to replace 'YOUR_PROJECT_NAME' with the name of your project in the query.

Subscribe to a dataset

1. Data exchange listings

For datasets listed on the OSO public data exchange, click on the "Subscribe on BigQuery" button to create a new dataset that is linked to OSO.

subscribe

This has a few benefits:

  • Data is automatically kept live and real-time with OSO
  • You keep a reference to the data in your own GCP project
  • This gives OSO the ability to track public usage of models

2. Direct access to datasets

For datasets without a listing on the OSO public data exchange, we make the dataset open to public queries for direct queries. Click on the "View on BigQuery" button to go straight to the dataset.

You can star the dataset to keep it in your project.

star

Next steps

Once you've subscribed our datasets to your own Google project, you can start to run queries and analyses with a variety of tools.

Here are a few to start: