tips

Who contributed the most to open source in 2017? Let’s analyze GitHub’s data and find out.

Yashu Mittal

Top contributor on Github in 2017

For this analysis we’ll look at all the PushEvents published by GitHub during 2017. For each GitHub user we’ll have to make our best guess to determine to which organization they belong. We’ll only look at repositories that have received at least 20 stars this year.

Here are the results I got, which you can tinker with in my the interactive Data Studio report.

Comparing the top cloud providers

Looking at GitHub during 2017:

Google, microsoft, amazon repo

RedHat, IBM, Pivotal, Intel, and Facebook

If Amazon seems so far behind Microsoft and Google — what are the companies in between? According to this ranking RedHat, Pivotal, and Intel are pushing great contributions to GitHub:

Note that the following table combines all of IBM regional domains — while the individual regions still show up in the subsequent tables.

redhat, ibm

pivotal, intel, fb

Facebook and IBM (US) have a similar number of GitHub users than Amazon, but the projects they contribute to have collected more stars (especially Facebook):

fb, ibm, amazon

Followed by Alibaba, Uber, and Wix:

alibaba, uber, wix

GitHub itself, Apache, Tencent:

GitHub itself, Apache, Tencent

Baidu, Apple, Mozilla:

Baidu, Apple, Mozilla

Oracle, Stanford, Mit, Shopify, MongoDb, Berkeley, VmWare, Netflix, Salesforce, Gsa.gov:

Oracle, Stanford, Mit, Shopify, MongoDb, Berkeley, VmWare, Netflix, Salesforce, Gsa.gov

LinkedIn, Broad Institute, Palantir, Yahoo, MapBox, Unity3d, Automattic, Sandia, Travis-ci, Spotify:

LinkedIn, Broad Institute, Palantir, Yahoo, MapBox, Unity3d, Automattic, Sandia, Travis-ci, Spotify

Chromium, UMich, Zalando, Esri, IBM (UK), SAP, EPAM, Telerik, UK Cabinet Office, Stripe:

Chromium, UMich, Zalando, Esri, IBM (UK), SAP, EPAM, Telerik, UK Cabinet Office, Stripe

Cern, Odoo, Kitware, Suse, Yandex, IBM (Canada), Adobe, AirBnB, Chef, The Guardian:

Cern, Odoo, Kitware, Suse, Yandex, IBM (Canada), Adobe, AirBnB, Chef, The Guardian

Arm, Macports, Docker, Nuxeo, NVidia, Yelp, Elastic, NYU, Wso2, Mesosphere, Inria:

Arm, Macports, Docker, Nuxeo, NVidia, Yelp, Elastic, NYU, Wso2, Mesosphere, Inria

Puppet, Stanford (CS), DatadogHQ, Epfl, NTT Data, Lawrence Livermore Lab:

Puppet, Stanford (CS), DatadogHQ, Epfl, NTT Data, Lawrence Livermore Lab

My Methodology

How I linked GitHub users to companies

Determining the organization to which each GitHub user belongs it’s not easy — but we can use the email domains that show up in each commit message contained on PushEvents:

My query

#standardSQL
WITH
period AS (
  SELECT *
  FROM `githubarchive.month.2017*` a
),
repo_stars AS (
  SELECT repo.id, COUNT(DISTINCT actor.login) stars, APPROX_TOP_COUNT(repo.name, 1)[OFFSET(0)].value repo_name
  FROM period
  WHERE type='WatchEvent'
  GROUP BY 1
  HAVING stars>20
),
pushers_guess_emails_and_top_projects AS (
  SELECT *
    # , REGEXP_EXTRACT(email, r'@(.*)') domain
    , REGEXP_REPLACE(REGEXP_EXTRACT(email, r'@(.*)'), r'.*.ibm.com', 'ibm.com') domain
  FROM (
    SELECT actor.id
      , APPROX_TOP_COUNT(actor.login,1)[OFFSET(0)].value login
      , APPROX_TOP_COUNT(JSON_EXTRACT_SCALAR(payload, '$.commits[0].author.email'),1)[OFFSET(0)].value email
      , COUNT(*) c
      , ARRAY_AGG(DISTINCT TO_JSON_STRING(STRUCT(b.repo_name,stars))) repos
    FROM period a
    JOIN repo_stars b
    ON a.repo.id=b.id
    WHERE type='PushEvent'
    GROUP BY  1
    HAVING c>3
  )
)
SELECT * FROM (
  SELECT domain
    , githubers
    , (SELECT COUNT(DISTINCT repo) FROM UNNEST(repos) repo) repos_contributed_to
    , ARRAY(
        SELECT AS STRUCT JSON_EXTRACT_SCALAR(repo, '$.repo_name') repo_name
        , CAST(JSON_EXTRACT_SCALAR(repo, '$.stars') AS INT64) stars
        , COUNT(*) githubers_from_domain FROM UNNEST(repos) repo
        GROUP BY 1, 2
        HAVING githubers_from_domain>1
        ORDER BY stars DESC LIMIT 3
      ) top
    , (SELECT SUM(CAST(JSON_EXTRACT_SCALAR(repo, '$.stars') AS INT64)) FROM (SELECT DISTINCT repo FROM UNNEST(repos) repo)) sum_stars_projects_contributed_to
  FROM (
    SELECT domain, COUNT(*) githubers, ARRAY_CONCAT_AGG(ARRAY(SELECT * FROM UNNEST(repos) repo)) repos
    FROM pushers_guess_emails_and_top_projects
    #WHERE domain IN UNNEST(SPLIT('google.com|microsoft.com|amazon.com', '|'))
    WHERE domain NOT IN UNNEST(SPLIT('gmail.com|users.noreply.github.com|qq.com|hotmail.com|163.com|me.com|googlemail.com|outlook.com|yahoo.com|web.de|iki.fi|foxmail.com|yandex.ru', '|')) # email hosters
    GROUP BY 1
    HAVING githubers > 30
  )
  WHERE (SELECT MAX(githubers_from_domain) FROM (SELECT repo, COUNT(*) githubers_from_domain FROM UNNEST(repos) repo  GROUP BY repo))>4 # second filter email hosters
)
ORDER BY githubers DESC

FAQ

If an organization has 1,500 repositories, why do you only count 200? If a repository has 7,000 stars, why do you only show 1,500?

I’m filtering for relevancy. I’m only counting stars given during 2017. For example, Apache has >1,500 repositories on GitHub, but only 205 have received more than 20 stars this year.

How I linked GitHub users to companies

Is this the state of open source?

Note that analyzing GitHub doesn’t include top communities like Android, Chromium, GNU, Mozilla, nor the the Apache or Eclipse Foundation, and other projects that choose to run most of their activities outside of GitHub.

You were unfair to my organization.

I can only count what I can see. Please challenge my assumptions and tell me how you would measure things in a better way. Working queries would be the best way.

For example, see how their ranking changes when I combine IBM’s region-based domains into their top one with one SQL transformation:

SELECT *, REGEXP_REPLACE(REGEXP_EXTRACT(email, r'@(.*)'), r'.*.ibm.com', 'ibm.com') domain

IBM’s relative position moves significantly when you combine their regional email domains.

IBM’s relative position moves significantly when you combine their regional email domains.

IBM’s relative position moves significantly when you combine their regional email domains.

This blog is originally published by Felipe Hoffa

Response to “Who contributed the most to open source in 2017? Let’s analyze GitHub’s data and find out.”

Stay current

Sign up for our newsletter, and we'll send you news and tutorials on business, growth, web design, coding and more!