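The underlying data comes from www.gharchive.org.
We take GitHub's 2019 PushEvent records and analyze the email addresses in users' commits to determine which organization each user belongs to.
For the detailed method, see www.freecodecamp.org/news/the-to…
Because Google BigQuery only allows 1 TB of free data processing per month, we restrict the query to a limited date range (20190301-20191001) to make the most of that quota, keeping the amount of data processed close to, but not over, 1 TB.
Data from this date range should roughly reflect how much each organization contributed to open source on GitHub across the whole of 2019.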
SELECT *
FROM `githubarchive.month.2019*` a
WHERE _TABLE_SUFFIX BETWEEN '0301' AND '1001'
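`_TABLE_SUFFIX` is compared as a string, so which tables the `BETWEEN '0301' AND '1001'` filter keeps depends on how the tables are named. The Python sketch below only mimics that lexicographic comparison. The helper `matching_suffixes` and the sample suffix lists are hypothetical, and the naming schemes (two-character `MM` suffixes for monthly tables, four-character `MMDD` suffixes for daily tables) are assumptions that the source does not state.

```python
def matching_suffixes(suffixes, low, high):
    """Keep the suffixes that an inclusive, lexicographic BETWEEN would keep."""
    return [s for s in suffixes if low <= s <= high]


# Assumption: monthly tables leave a two-character suffix after the "2019" prefix.
monthly = [f"{month:02d}" for month in range(1, 13)]

# Assumption: daily tables leave a four-character MMDD suffix (sample values only).
daily = ["0228", "0301", "0615", "1001", "1002"]

print(matching_suffixes(monthly, "0301", "1001"))
print(matching_suffixes(daily, "0301", "1001"))
```

Under these assumptions the monthly list comes back as `04` through `10`, because the string `'03'` sorts before `'0301'`, while the daily list keeps `0301`, `0615` and `1001`. It may be worth checking the actual table names in the dataset before relying on the date bounds.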
The complete SQL statement is as follows:
#standardSQL
WITH
period AS (
SELECT *
FROM `githubarchive.month.2019*` a
WHERE _TABLE_SUFFIX BETWEEN '0301' AND '1001'
),
repo_stars AS (
SELECT repo.id, COUNT(DISTINCT actor.login) stars, APPROX_TOP_COUNT(repo.name, 1)[OFFSET(0)].value repo_name
FROM period
WHERE type='WatchEvent'
GROUP BY 1
HAVING stars>20
),
pushers_guess_emails_and_top_projects AS (
SELECT *, REGEXP_EXTRACT(email, r'@(.*)') domain
FROM (
SELECT actor.id
, APPROX_TOP_COUNT(actor.login,1)[OFFSET(0)].value login
, APPROX_TOP_COUNT(JSON_EXTRACT_SCALAR(payload, '$.commits[0].author.email'),1)[OFFSET(0)].value email
, COUNT(*) c
, ARRAY_AGG(DISTINCT TO_JSON_STRING(STRUCT(b.repo_name,stars))) repos
FROM period a
JOIN repo_stars b
ON a.repo.id=b.id
WHERE type='PushEvent'
GROUP BY 1
HAVING c>3
)
)
SELECT * FROM (
SELECT domain
, githubers
, (SELECT COUNT(DISTINCT repo) FROM UNNEST(repos) repo) repos_contributed_to
, ARRAY(
SELECT AS STRUCT JSON_EXTRACT_SCALAR(repo, '$.repo_name') repo_name
, CAST(JSON_EXTRACT_SCALAR(repo, '$.stars') AS INT64) stars
, COUNT(*) githubers_from_domain FROM UNNEST(repos) repo
GROUP BY 1, 2
HAVING githubers_from_domain>1
ORDER BY stars DESC LIMIT 3
) top
, (SELECT SUM(CAST(JSON_EXTRACT_SCALAR(repo, '$.stars') AS INT64)) FROM (SELECT DISTINCT repo FROM UNNEST(repos) repo)) sum_stars_projects_contributed_to
FROM (
SELECT domain, COUNT(*) githubers, ARRAY_CONCAT_AGG(ARRAY(SELECT * FROM UNNEST(repos) repo)) repos
FROM pushers_guess_emails_and_top_projects
#WHERE domain IN UNNEST(SPLIT('google.com|microsoft.com|amazon.com', '|'))
WHERE domain NOT IN UNNEST(SPLIT('gmail.com|users.noreply.github.com|qq.com|hotmail.com|163.com|me.com|googlemail.com|outlook.com|yahoo.com|web.de|iki.fi|foxmail.com|yandex.ru', '|')) # email hosters
GROUP BY 1
HAVING githubers > 30
)
WHERE (SELECT MAX(githubers_from_domain) FROM (SELECT repo, COUNT(*) githubers_from_domain FROM UNNEST(repos) repo GROUP BY repo))>4 # second filter email hosters
)
ORDER BY githubers DESC
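The core idea of the query, guessing an organization from the domain of a committer's email and discarding personal email hosts, can be sketched outside BigQuery. The following Python is only an illustration of that logic, not the query itself: `extract_domain` and `count_githubers` are hypothetical helper names, the sample emails are made up, and the star-count and per-repository filters of the SQL are left out.

```python
import re
from collections import Counter

# Personal email hosts excluded by the query's first filter (copied from the SQL).
EMAIL_HOSTERS = set(
    "gmail.com|users.noreply.github.com|qq.com|hotmail.com|163.com|me.com|"
    "googlemail.com|outlook.com|yahoo.com|web.de|iki.fi|foxmail.com|yandex.ru".split("|")
)


def extract_domain(email):
    """Mirror REGEXP_EXTRACT(email, r'@(.*)'): everything after the first '@'."""
    if email is None:
        return None
    match = re.search(r"@(.*)", email)
    return match.group(1) if match else None


def count_githubers(emails, min_githubers=30):
    """Count pushers per domain, skipping email hosters and small domains.

    `emails` holds one guessed email per pusher; like HAVING githubers > 30,
    only domains with strictly more than `min_githubers` pushers are kept.
    """
    domains = (extract_domain(email) for email in emails)
    counts = Counter(d for d in domains if d is not None and d not in EMAIL_HOSTERS)
    return {domain: n for domain, n in counts.items() if n > min_githubers}


# Made-up sample: two pushers from one organization, one personal address,
# one missing email and one malformed value.
sample = ["a@example.org", "b@example.org", "c@gmail.com", None, "no-at-sign"]
print(count_githubers(sample, min_githubers=1))
```

With the threshold lowered to 1 for this tiny sample, only `example.org` survives with a count of 2. The personal address is dropped by the hoster list, and the missing and malformed values yield no domain, much as a NULL domain fails the `NOT IN` test in the SQL.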
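As the figure below shows, this query will process 918.4 GB of data.
Click Run, and after 17.8 s the query results appear.
Ranks 6 to 10 go to Pivotal, Facebook, Apache, SAP, and Shopify.
Among the major Chinese tech companies, Alibaba employees contribute the most to open source, ranking 12th. Their top 3 repositories are flutter-go, nacos, and sqlflow, and the projects they contributed to have earned more than 90,000 stars in total.
Baidu and Tencent rank 21st and 23rd, respectively.
The top 38 organizations by open-source contribution are listed below: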
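If you have any thoughts, feel free to leave a comment, and you are welcome to follow my WeChat official account "Doocs开源社区", where original technical articles are published first!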