【Question title】: Python get list of csv files in public GitHub repository
【Posted】: 2020-07-10 11:39:54
【Question description】:

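I'm trying to use Python to pull some csv files from a public repository. I already have code to process the data once I have the files' URLs. Does GitHub have some equivalent of ls? I don't see anything in GitHub's API, and it looks like I could use PyCurl, but then I would have to parse the HTML. Is there a pre-built way to do this?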

【Question comments】:

Tags: python python-3.x pycurl pygithub


【Solution 1】:

A BeautifulSoup solution (hacky and probably very inefficient):

# Import the required packages: 
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re 

# Store the url as a string scalar: url => str
url = "https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports"

# Issue request: r => requests.models.Response
r = requests.get(url)

# Extract text: html_doc => str
html_doc = r.text

# Parse the HTML with an explicit parser (avoids bs4's "no parser specified" warning): soup => bs4.BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

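# Find all 'a' tags that actually carry an href (skips anchors where href is None): a_tags => bs4.element.ResultSet
a_tags = soup.find_all('a', href=True)

# Store a list of raw-file urls for links ending in .csv: urls => list
# (only the first '/blob' is dropped, so a file path containing '/blob' survives)
urls = ['https://raw.githubusercontent.com' + re.sub('/blob', '', link['href'], count=1)
        for link in a_tags if link['href'].endswith('.csv')]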

# Store a list of Data Frame names (file name without the .csv suffix): df_list_names => list
df_list_names = [csv_url.split('/')[-1].split('.csv')[0] for csv_url in urls]

# Read each csv into a Data Frame: df_list => list
df_list = [pd.read_csv(csv_url, sep=',') for csv_url in urls]

# Name the dataframes in the list, coerce to a dictionary: df_dict => dict
df_dict = dict(zip(df_list_names, df_list))
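
For what it's worth, GitHub's REST API does have something close to ls: the "repository contents" endpoint, `GET /repos/{owner}/{repo}/contents/{path}`. For a directory it returns a JSON array whose entries include `name`, `type` and `download_url`. Below is a minimal sketch of using it instead of scraping. The helper names `list_directory` and `csv_download_urls` are my own for illustration, and it assumes unauthenticated access, which is rate limited. GitHub's documentation also says this endpoint returns at most 1,000 files per directory and points to the Git Trees API for larger ones, so a very large directory may come back truncated.

```python
import json
from urllib.request import Request, urlopen

API_ROOT = "https://api.github.com"


def csv_download_urls(entries):
    """Return {file name: raw download URL} for the .csv files in a listing.

    `entries` is the parsed JSON array returned by the contents endpoint.
    """
    return {
        entry["name"]: entry["download_url"]
        for entry in entries
        if entry.get("type") == "file" and entry["name"].lower().endswith(".csv")
    }


def list_directory(owner, repo, path, ref="master"):
    """Fetch the directory listing for `path` at branch/tag/commit `ref`.

    Hypothetical helper: performs one unauthenticated HTTP request.
    """
    url = f"{API_ROOT}/repos/{owner}/{repo}/contents/{path}?ref={ref}"
    request = Request(url, headers={"Accept": "application/vnd.github+json"})
    with urlopen(request) as response:
        return json.load(response)
```

To use it with the repository from the question, call `list_directory("CSSEGISandData", "COVID-19", "csse_covid_19_data/csse_covid_19_daily_reports")` and pass the result to `csv_download_urls`. Each URL in the resulting dict can be handed straight to `pd.read_csv`, as in the code above. Keeping the filtering separate from the request means the filter can be checked without network access.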

【Comments】:
