Subscribe to Source Datasets on BigQuery
OSO curates a variety of public datasets from the open source ecosystem. Check out our BigQuery data exchange to explore available data. If you don't have a BigQuery account, start here.
Note: If you want to explore data and metrics computed in the OSO data pipeline, we recommend that you use pyoso instead.
Here are some of our featured public datasets:
OSO Production Pipeline
GitHub Archive

Open Source Insights

Ethereum ETL

Superchain

Superchain 4337

Arbitrum One

Filecoin Lily
Farcaster

Lens

Gitcoin

Open Collective
OP Atlas

OpenRank

EAS
Open Labels Initiative
OSO Production Pipeline
These are the final product from the data pipeline, which is served from our API.
For example, you can get a list of oss-directory projects
SELECT
project_id,
project_name,
display_name,
description
FROM `YOUR_PROJECT_NAME.oso_production.projects_v1` LIMIT 10
or all contracts on Base deployed on 2025-01-01.
SELECT count(*) as cnt
FROM `YOUR_PROJECT_NAME.oso_production.contracts_v0`
WHERE contract_namespace = 'BASE' AND deployment_date = '2025-01-01'
Note: Unless the model name is versioned with a v1
or higher, expect that the model is unstable and should not be depended on
in a live production application.
Some of our most popular v1 models include:
projects_v1
: a list of all projects in the OSO data pipelineartifacts_by_project_v1
: a list of all artifacts owned by a given projectprojects_by_collection_v1
: a list of all projects in a given collection
Some of the v0 (WIP) models that we include in analysis but may be a bit unstable:
contracts_v0
: a list of all discovered contracts, downstream from their deployers and/or factoriesmetrics_v0
: a list of all metrics available in OSOtimeseries_metrics_by_project_v0
: a list of all metrics for a given project
External Datasets
GitHub Archive
- Code License: MIT.
- Data governed by the GitHub terms of service.
- Updated hourly
GitHub data is predominantly provided by the incredible GH Archive project, which publishes a public archive of historical events to GitHub.
For example, to count the number of issues opened, closed, and reopened on 2020/01/01:
SELECT event as issue_status, COUNT(*) as cnt FROM (
SELECT type, repo.name, actor.login,
JSON_EXTRACT(payload, '$.action') as event,
FROM `githubarchive.day.20200101`
WHERE type = 'IssuesEvent'
)
GROUP by issue_status;
Open Source Insights

- Reference documentation
- Data License: CC-BY 4.0
Open Source Insights (aka deps.dev) is a service developed and hosted by Google to help developers better understand the structure, construction, and security of open source software packages. The service examines each package, constructs a full, detailed graph of its dependencies and their properties, and makes the results available to anyone who could benefit from them. The goal is to provide developers with a picture of how their software is put together, how that changes as dependencies change, and what the consequences might be.
For example, to determine which packages have the most dependents:
DECLARE
Sys STRING DEFAULT 'CARGO';
SELECT
Name,
Version,
FROM (
SELECT
Name,
Version,
ROW_NUMBER()
OVER (PARTITION BY
Name
ORDER BY
VersionInfo.Ordinal DESC) AS RowNumber
FROM
`bigquery-public-data.deps_dev_v1.PackageVersionsLatest`
WHERE
System = Sys
AND VersionInfo.IsRelease)
WHERE RowNumber = 1;
Ethereum

- Code License: MIT.
The Google Cloud team maintains a public Ethereum dataset. This is backed by the ethereum-etl project.
For example, to get 10 transactions from the latest block
SELECT
`hash`,
block_number,
from_address,
to_address,
value,
gas,
gas_price
FROM `bigquery-public-data.crypto_ethereum.transactions` AS transactions
ORDER BY block_number DESC
LIMIT 10
Superchain

- Data on every OP Stack chain, maintained by OP Labs PBC code of conduct
- For more details on the ETL process please visit the op-analytics documentation site.
- Visit the Optimism docs to learn more about the Superchain.
All tables are partitioned by date (dt
) and clustered by chain name (chain
). Data is updated daily with an approximate latency of 24 hrs.
Data is sourced from Goldsky using the Mirror platform, which fetches and delivers onchain data from RPC nodes to ClickHouse.
From there, the OP Labs team runs an incremental ingestion process based on block number.
For example, to get all transactions to the 4337 EntryPoint contracts for all OP Stack chains:
SELECT
dt,
chain,
COUNT(*) AS txns
FROM `optimism_superchain_raw_onchain_data.transactions`
WHERE
dt >= '2024-10-01'
AND to_address IN (
"0x5ff137d4b0fdcd49dca30c7cf57e578a026d2789",
"0x0000000071727de22e5e9d8baf0edac6f37da032"
)
GROUP BY dt, chain
ORDER BY dt, chain
Superchain 4337

The Superchain 4337 dataset provides comprehensive data about ERC-4337 (Account Abstraction) activity across all OP Stack chains. This dataset is crucial for understanding the adoption and usage patterns of smart contract wallets, bundlers, and paymasters in the Superchain ecosystem. The data helps track which applications are leading the way in improving onchain UX through account abstraction.
All tables are partitioned by date (dt
) and clustered by chain name (chain
). The dataset includes decoded UserOps and Paymaster IDs to trace events from unique user smart contract wallet addresses to their wallet providers, paymasters, bundlers, and the applications they interact with.
For example, to see which paymasters are handling the most UserOperations across all chains:
SELECT
chain,
paymaster,
APPROX_COUNT_DISTINCT(userophash) AS user_ops
FROM `YOUR_PROJECT_NAME.optimism_superchain_4337_account_abstraction_data.useroperationevent_logs_v2`
WHERE dt > '2025-02-01'
GROUP BY 1,2
ORDER BY 3 DESC
Arbitrum One

Arbitrum is a Layer 2 scaling solution for Ethereum. We currently provide blocks, transactions, and traces for Arbitrum's flagship network, Arbitrum One. This public dataset is maintained by our partners at Goldsky.
For example, to get the average gas per transaction for each block on Arbitrum One on September 1st, 2024:
SELECT
block_number,
receipt_gas_used / 1 AS gas_per_txn
FROM `YOUR_PROJECT_NAME.arbitrum_one.arbitrum_transactions`
WHERE block_timestamp BETWEEN 1693526400 AND 1693612800
ORDER BY block_number
Filecoin Lily

Filecoin is a decentralized storage network designed to store humanity's most important information. This dataset mirrors the dataset offered by Lily for use in the OSO data pipeline. It includes storage deals, miners, FVM transactions, and much more.
For example, this is how to get how many messages were sent to the network in the last month:
SELECT
COUNT(*) AS total_messages
FROM `YOUR_PROJECT_NAME.filecoin_lily.parsed_messages`
WHERE
CAST(timestamp_seconds((height * 30) + 1598306400) AS DATE) > CURRENT_DATE() - INTERVAL 1 MONTH
Farcaster
Farcaster is a decentralized social network built on Ethereum. This dataset mirrors the dataset offered by Indexing for use in the OSO data pipeline. It includes casts, links, reactions, verifications, and profiles.
For example, to get the users with the most lifetime reactions:
SELECT
r.target_cast_fid as fid,
json_value(p.data, "$.display") as display_name,
COUNT(*) as reaction_count
FROM `YOUR_PROJECT_NAME.farcaster.reactions` as r
LEFT JOIN `YOUR_PROJECT_NAME.farcaster.profiles` as p ON r.target_cast_fid = p.fid
GROUP BY fid, display_name
ORDER BY reaction_count DESC
Here's another use case, showing how to derive all the verified Ethereum addresses owned by a Farcaster user:
WITH
profiles AS (
SELECT
v.fid,
v.address,
p.custody_address,
JSON_VALUE(p.data, "$.username") AS username
FROM `YOUR_PROJECT_NAME.farcaster.verifications` v
JOIN `YOUR_PROJECT_NAME.farcaster.profiles` p ON v.fid = p.fid
WHERE v.deleted_at IS NULL
),
eth_addresses AS (
SELECT
fid,
username,
address
FROM profiles
WHERE LENGTH(address) = 42
UNION ALL
SELECT
fid,
username,
custody_address AS address
FROM profiles
)
SELECT DISTINCT
fid,
username,
address
FROM eth_addresses
Lens

Lens Protocol is an open social network. This dataset is well-documented by Lens. It includes data from Lens Chain Mainnet.
SELECT
address
FROM `lens-chain-mainnet.public.addresses`
LIMIT 5
Note: the Lens Polygon dataset that was previously available on BigQuery has been deprecated.
Gitcoin

Gitcoin is the hub of grantmaking in the Ethereum ecosystem. All project, round, and donation data from regendata.xyz is available in this dataset and updated daily.
You can also access the raw Gitcoin data directly on BigQuery:
SELECT
round_num,
round_name,
donor_address,
amount_in_usd,
recipient_address,
project_name,
project_id
FROM `YOUR_PROJECT_NAME.gitcoin.all_donations`
and:
SELECT
round_num,
round_name,
donor_address,
amount_in_usd,
recipient_address,
project_name,
project_id
FROM `YOUR_PROJECT_NAME.gitcoin.all_matching`
Open Collective

Open Collective is a platform for transparent finances and governance for open source projects.
The Open Collective datasets contains all transactions realized on the platform since its inception. Separate datasets are available for expenses and deposits.
For example, you can get the total amount of donations in USD
made to the pandas project:
SELECT
SUM(CAST(JSON_VALUE(amount, "$.value") AS FLOAT64)) AS total_amount
FROM
`YOUR_PROJECT_NAME.open_collective.deposits`
WHERE
JSON_VALUE(amount, "$.currency") = "USD"
AND JSON_VALUE(to_account, "$.id") = "ov349mrw-gz75lpy9-975qa08d-jeybknox"
OP Atlas
OP Atlas contains data about projects contributing to the Optimism Collective and applying for Retro Funding, shepherded by the fine folks of Agora.
For example, you can get a list of all projects and their associated repository URLs:
SELECT
p.name,
p.description,
r.url AS repo_url
FROM `YOUR_PROJECT_NAME.op_atlas.project` AS p
JOIN `YOUR_PROJECT_NAME.op_atlas.project_repository` AS r
ON p.id = r.project_id
For more information, go to https://atlas.optimism.io/ or check out the OP Atlas GitHub.
OpenRank

OpenRank is a reputation protocol that enables verifiable compute for a large class of reputation algorithms, in particular those that (a) operate on a graph, (b) are iterative in nature and (c) have a tendency toward convergence. Such algorithms include EigenTrust, Collaborative Filtering, Hubs and Authorities, Latent Semantic Analysis, etc.
In this dataset, Farcaster users reputations are scored in 2 ways:
- With
globaltrust
, we calculate global reputation scores, seeded by the trust of Optimism badgeholders. - With
localtrust
, you can get reputation scores of other users relative to specified user.
For example, you can get the globaltrust reputation score of vitalik.eth
SELECT
strategy_id,
i,
v,
date
FROM `YOUR_PROJECT_NAME.karma3.globaltrust`
WHERE i = 5650
Open Labels Initiative
A standardized framework and data model for EVM address labeling, maintained by growthepie.xyz.
For example, you can get project ownership labels for addresses:
SELECT
address,
chain_id,
tag_value AS project_name
FROM `YOUR_PROJECT_NAME.openlabelsinitiative.labels_decoded`
WHERE tag_id = 'owner_project'
Ethereum Attestation Service

Ethereum Attestation Service (EAS) is an infrastructure public good for making attestations onchain or offchain about anything.
This dataset mirrors the dataset offered by EAS for use in the OSO data pipeline, currenty showing all attestations on the Optimism network.
For example, you can get a list of all * official * RetroPGF badgeholders:
SELECT
recipient AS badgeholder,
JSON_VALUE(decoded_data_json, "$[0].value.value") AS rpgfRound
FROM `YOUR_PROJECT_NAME.ethereum_attestation_service_optimism.attestations`
WHERE
revoked = False
AND schema_id = '0xfdcfdad2dbe7489e0ce56b260348b7f14e8365a8a325aef9834818c00d46b31b'
AND attester = '0x621477dBA416E12df7FF0d48E14c4D20DC85D7D9'
Subscribe to a dataset
1. Data exchange listings
For datasets listed on the OSO public data exchange, click on the "Subscribe on BigQuery" button to create a new dataset that is linked to OSO.
This has a few benefits:
- Data is automatically kept live and real-time with OSO
- You keep a reference to the data in your own GCP project
- This gives OSO the ability to track public usage of models
2. Direct access to datasets
For datasets without a listing on the OSO public data exchange, we make the dataset open to public queries for direct queries. Click on the "View on BigQuery" button to go straight to the dataset.
You can star the dataset to keep it in your project.
Cost Estimation
BigQuery on-demand pricing charges based on the number of bytes scanned, with the first 1 TB free every month.
To keep track of your usage, check the bytes scanned in the top right corner before running your query.
BigQuery costs can rack up quickly if you are not careful optimizing your queries. For more on how you can optimize your queries, check out these guide:
Next steps
Once you've subscribed our datasets to your own Google project, you can start to run queries and analyses with a variety of tools.