
Vector Normalization


An impact vector is a list of normalized impact metrics for a project at a discrete point in time. This specification outlines how to create, define, and submit impact vectors.

Whereas an impact metric is a single, absolute measurement of a project's impact (e.g., a project's fork count), an impact vector is a series of related metrics about a project normalized to a 0-1 scale and captured at a discrete point in time. Thus, a project's impact at a given point in time may be derived from a set of metrics such as fork_count, star_count, contributor_count, etc. This approach is useful for estimating a project's overall impact in an ecosystem.

An impact vector can also be constructed from the perspective of the metric itself. In this case, the fork_count vector holds the normalized fork_count of every project in the set at a given point in time. This construction is useful for analyzing specific types of impact or for ranking projects.
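
The two perspectives can be pictured as the rows and columns of one project-by-metric table. The sketch below is illustrative only: the project names and raw values are made up, and min-max scaling is just one assumed way to reach the 0-1 scale, not a method this specification prescribes.

```python
import pandas as pd

# Hypothetical raw metrics for three projects, all captured at the same point in time.
raw = pd.DataFrame(
    {
        "fork_count": [10, 60, 110],
        "star_count": [200, 100, 400],
        "contributor_count": [5, 25, 15],
    },
    index=pd.Index(["project_a", "project_b", "project_c"], name="project_id"),
)

# Min-max scale each metric (column) to the 0-1 range across the set of projects.
normalized = (raw - raw.min()) / (raw.max() - raw.min())

# Project perspective: one row holds every normalized metric for a single project.
project_vector = normalized.loc["project_b"]

# Metric perspective: one column holds one normalized metric for every project.
fork_count_vector = normalized["fork_count"]

print(project_vector)
print(fork_count_vector)
```

Reading across a row describes one project's overall profile, while reading down a column supports ranking projects on a single type of impact.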

Requirements


An impact vector must be:

  • Derived from a quantitative impact metric that is applied consistently to a set of open source projects.
  • Measured at a specific point in time (the same time for all projects in the set).
  • Normalized to a common scale (0 to 1) across the set of projects.
  • Accompanied by reproducible code that describes how the vector is calculated and normalized.
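
Some of these requirements can be checked mechanically before a vector is submitted. The following is a minimal sketch, assuming the vector is held in a pandas DataFrame with hypothetical columns project_id, measured_at, and normalized_value; neither the column names nor the check_impact_vector helper come from this specification. Whether the metric is applied consistently and whether the accompanying code is reproducible still need human review.

```python
import pandas as pd


def check_impact_vector(df: pd.DataFrame, value_col: str = "normalized_value") -> list:
    """Return a list of requirement violations (empty if none are found)."""
    problems = []
    # Every project must be measured at the same point in time.
    if df["measured_at"].nunique() != 1:
        problems.append("projects were measured at different times")
    # Each project should appear exactly once in the vector.
    if df["project_id"].duplicated().any():
        problems.append("duplicate project_id entries")
    # Normalized values must be present and lie on the common 0-1 scale.
    values = df[value_col]
    if values.isna().any() or not values.between(0, 1).all():
        problems.append("normalized values missing or outside [0, 1]")
    return problems


# Hypothetical vector used only to exercise the checks.
vector = pd.DataFrame(
    {
        "project_id": ["project_a", "project_b"],
        "measured_at": ["2024-01-01", "2024-01-01"],
        "normalized_value": [0.0, 1.0],
    }
)
print(check_impact_vector(vector))
```

An empty list means none of the three automated checks failed.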

Sample Vectors


Below are several examples of impact vectors that can be applied to assess the influence of open source software projects within a crypto ecosystem. These examples illustrate the breadth of metrics that can be leveraged to gauge various dimensions of impact.

  • Name: Grow Full-Time OSS Developers
  • Tags: OSS, developers
  • Metric: Number of Full-Time Developer Months
  • Time Period: Last 6 months
  • Selection Filter: Projects must have activity from at least two contributors in the last 90 days; projects must have a permissive open source license (e.g., MIT, Apache 2.0); projects must have a codebase that is at least 6 months old.
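  • Query Logic: The following SQL query fetches the metric for each project in the selection set.
  • -- Pseudocode: table and column names are placeholders; replace with a real query
    SELECT
      project_id,
      SUM(full_time_developer_months) AS full_time_developer_months
    FROM
      `oso.contributor_activity`
      JOIN `oso.projects` USING (project_id)
    WHERE
      date >= DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH)
      AND license IN ('MIT', 'Apache 2.0')
      AND codebase_age >= 180  -- codebase age in days (about 6 months)
    GROUP BY
      project_id
    -- Selection filter: at least two contributors active in the last 90 days
    -- (contributor_id is an assumed column name)
    HAVING
      COUNT(DISTINCT IF(date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY), contributor_id, NULL)) >= 2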
  • Normalization Method: The following Python code normalizes the metric data to a 0-1 scale using the cumulative distribution function (CDF) of a Gaussian distribution.
  • # This is pseudocode and should be replaced with a real script
    import pandas as pd
    from scipy.stats import norm

    # Fetch the metric data using the SQL logic above.
    # fetch_metric_data and query are placeholders; the result is assumed to be
    # a pandas Series of metric values indexed by project_id.
    metric_data = fetch_metric_data(query)

    # Standardize the metric, then map each z-score through the Gaussian CDF
    # so that every normalized value falls between 0 and 1.
    z_scores = (metric_data - metric_data.mean()) / metric_data.std(ddof=0)
    normalized_data = norm.cdf(z_scores)

    # Create a table with the project, metric, and normalized metric
    normalized_table = pd.DataFrame({
        'project_id': metric_data.index,
        'full_time_developer_months': metric_data.values,
        'normalized_full_time_developer_months': normalized_data,
    })
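
Since the requirements call for a common 0 to 1 scale, it is worth checking the range that a Gaussian-based normalization actually produces. The sketch below uses made-up values for five hypothetical projects and compares two options: rank-based normal scores from norm.ppf, which are z-scores centered on 0, and the Gaussian CDF of standardized values, which stays between 0 and 1. The variable names are assumptions for illustration.

```python
import pandas as pd
from scipy.stats import norm

# Hypothetical metric values for five projects (made up for illustration).
metric = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0], index=list("abcde"))

# Rank-based quantiles already lie strictly between 0 and 1.
quantiles = (metric.rank() - 0.5) / len(metric)

# Passing them through the Gaussian quantile function (ppf) yields z-scores,
# which are centered on 0 and are not confined to the 0-1 range.
normal_scores = pd.Series(norm.ppf(quantiles), index=metric.index)

# Gaussian CDF of the standardized values: always between 0 and 1.
z = (metric - metric.mean()) / metric.std(ddof=0)
cdf_scores = pd.Series(norm.cdf(z), index=metric.index)

print(pd.DataFrame({
    "metric": metric,
    "normal_scores": normal_scores,
    "cdf_scores": cdf_scores,
}))
```

Either the rank-based quantiles or the CDF scores satisfy the 0-1 requirement; the normal scores would need a further rescaling step before they could be submitted as a vector.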