Get license link from GitHub with a specific commit hash

【Question title】: Get license link from GitHub with specific commit hash
【Posted】: 2022-01-03 12:18:45
【Question】:

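I have a table (as a Pandas DataFrame) of (mostly) GitHub repos, for which I need to automatically extract the LICENSE link. However, the link must not simply go to /blob/master/ but must point to a specific commit, since the master link might be updated at some point. I put together a Python script that does this through the GitHub API, but with the API I can only retrieve the link that contains the master tag.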

I.e. instead of
https://github.com/jsdom/abab/blob/master/LICENSE.md
I want
https://github.com/jsdom/abab/blob/8abc2aa5b1378e59d61dee1face7341a155d5805/LICENSE.md

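Is there a way to automatically get the link to the latest commit for a file (in this case the LICENSE file)?

This is the code I have written so far: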

import json
import re

import requests

def githubcrawl(repo_url, session, headers):
    # Keep only the "owner/repo" part that follows the host name
    parts = repo_url.split("/")[3:5]
    url_tmpl = "https://api.github.com/repos/{}/license"
    url = url_tmpl.format("/".join(parts))
    try:
        response = session.get(url, headers=headers)
        if response.status_code == 404:
            return f"404: {repo_url}"
        else:
            data = json.loads(response.text)
            return data["html_url"]  # Returns the html URL to LICENSE file
    except requests.exceptions.RequestException as e:
        # requests raises its own exception types, not urllib.error.HTTPError
        print(repo_url, "-", e)
        return f"http_error: {repo_url}"

token="mytoken" # Token for github authentication to get more requests per hour
headers={"Authorization": "token %s" % token}

session = requests.Session()
lizlinks = [] # List to store the links of the LICENSE files in

# Iterate over the DataFrame of applications/deps
for idx, row in df.iterrows():
    if isinstance(row["Homepage"], str):
        repo_url = re.sub(r"#readme", "", row["Homepage"])
        # No auth headers here: the homepage may be a non-GitHub site,
        # and the token should not be sent to it
        response = session.get(repo_url)
        repo_url = response.url  # Some URLs are just redirects, so I get the actual repo url here
        # "https://github.com/owner/repo" splits into at least 5 parts
        if "github" in repo_url and len(repo_url.split("/")) >= 5:
            link = githubcrawl(repo_url, session, headers)
            print(link)
            lizlinks.append(link)
        else:
            print(row["Homepage"], "Not a github Repo")
            lizlinks.append("Not a github repo")
    else:
        print(row["Homepage"], "Not a github Repo")
        lizlinks.append("Not a github repo")

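Bonus question: Does parallelizing this task work with the GitHub API? I.e. can I send multiple requests at once without getting locked out (DoS), or is the for loop a good way to avoid that? Going through the 1000+ repositories in my list takes quite a while.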

【Comments】:

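What do you mean by "latest" here? Normally that is what the master/main/stable branch is for. You could enumerate all branches and tags of the repository and sort them by time, but that may produce false positives, since those commits might never even make it into the real product.
I mean the latest commit of a specific file in a specific branch (master is fine). For example, if you follow this link on GitHub: github.com/jsdom/abab/commits/master/LICENSE.md, the top commit of the master branch (8abc2aa5b1378e59d61dee1face7341a155d5805) is essentially the current master version. But if someone makes another commit, the LICENSE at github.com/jsdom/abab/blob/master/LICENSE.md will change, whereas 8abc2aa5b1378e59d61dee1face7341a155d5805 will stay the same.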

Tags: python pandas github github-api

【Solution 1】:

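OK, I found a way to get the unique SHA hash of the current commit. I think a link built from it should always point to the license file as it was at that point in time.

Using the Python git library, I simply run the ls_remote git command and return the HEAD SHA: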

def lsremote_HEAD(url):
    g = git.cmd.Git()
    HEAD_sha = g.ls_remote(url).split()[0]
    return HEAD_sha
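
For anyone wondering why taking the first token works: `git ls-remote` prints one `<sha><TAB><ref>` pair per line, and HEAD normally comes first. Below is a minimal sketch that looks HEAD up by name instead of by position. The helper name `parse_ls_remote` and the sample text are made up for illustration (the SHA is the one from the question), so treat it as an assumption-laden example rather than part of the original answer.

```python
def parse_ls_remote(output):
    """Map each ref name to its SHA, given the text printed by `git ls-remote`."""
    refs = {}
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        sha, ref = line.split(None, 1)  # "<sha>\t<ref>"
        refs[ref.strip()] = sha
    return refs


# Hypothetical sample output, reusing the SHA quoted in the question
sample = (
    "8abc2aa5b1378e59d61dee1face7341a155d5805\tHEAD\n"
    "8abc2aa5b1378e59d61dee1face7341a155d5805\trefs/heads/master\n"
)
refs = parse_ls_remote(sample)
head_sha = refs["HEAD"]
```

With GitPython, the text to parse would be whatever `g.ls_remote(url)` returns.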

Then I can replace "master", "main", or whatever tag appears in the link inside my githubcrawl function:

token = "token_string"
headers = {"Authorization": "token %s" % token}
session = requests.Session()

def githubcrawl(repo_url, session, headers):
    parts = repo_url.split("/")[3:]
    api_url_tmpl = "https://api.github.com/repos/{}/license"
    api_url = api_url_tmpl.format("/".join(parts))
    try:
        print(api_url)
        response = session.get(api_url, headers=headers)
        if response.status_code == 404:
            return f"404: {repo_url}"
        else:
            data = json.loads(response.text)
            # Swap the branch segment of the blob URL for the HEAD commit SHA
            commit_link = re.sub(r"/blob/.+?/", rf"/blob/{lsremote_HEAD(repo_url)}/", data["html_url"])
            return commit_link
    except requests.exceptions.RequestException as e:
        # requests raises its own exception types, not urllib.error.HTTPError
        print(repo_url, "-", e)
        return f"http_error: {repo_url}"
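
Note that the HEAD SHA pins the whole repository, not the last commit that touched the LICENSE file, which is what the question literally asked for. As far as I know, the "list commits" endpoint of the GitHub REST API accepts a `path` query parameter, so the sketch below might get closer to that. Both function names are hypothetical, `session` is assumed to be a `requests.Session`, and the substitution assumes the branch name contains no slash.

```python
import re


def pin_blob_url(html_url, sha):
    """Swap the branch segment of a /blob/<branch>/ URL for a commit SHA."""
    # Assumption: the branch name has no "/" in it (true for master/main).
    return re.sub(r"/blob/[^/]+/", lambda m: f"/blob/{sha}/", html_url, count=1)


def latest_commit_for_path(session, owner, repo, path, headers=None):
    """Return the SHA of the newest commit touching `path`, or None if there is none."""
    url = f"https://api.github.com/repos/{owner}/{repo}/commits"
    # per_page=1 asks only for the most recent matching commit
    response = session.get(
        url,
        params={"path": path, "per_page": 1},
        headers=headers,
        timeout=30,
    )
    response.raise_for_status()
    commits = response.json()
    return commits[0]["sha"] if commits else None
```

Usage would look roughly like `pin_blob_url(data["html_url"], latest_commit_for_path(session, "jsdom", "abab", data["path"], headers))`, assuming the license response carries a `path` field next to `html_url`.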

Maybe this helps someone, so I am posting this answer here.

This answer uses the following libraries:

import re
import git  # provided by the GitPython package
import urllib
import json
import requests
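
On the bonus question, which this answer does not really cover: as far as I understand GitHub's guidance, authenticated requests are rate limited per hour, and firing many requests concurrently can trip secondary limits, so a plain serial loop that watches the rate-limit headers is probably the safer choice. A minimal sketch of that check follows. The helper name `seconds_until_reset` is made up, and it assumes the response exposes the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many seconds to pause before sending the next request."""
    now = time.time() if now is None else now
    # Assumption: a missing header means we are not close to the limit
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    # The reset header is a UTC epoch timestamp in seconds
    reset_at = int(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)
```

Inside the loop this could be used as `time.sleep(seconds_until_reset(response.headers))` after each API call.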

【Comments】: