【Question title】: How to read a link from a cell in Google Spreadsheet if it's inside href tag (gspread)
【Posted】: 2021-02-22 06:57:26
【Question description】:

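I'm new to Stack Overflow, so I apologize in advance if I'm doing something wrong.

I have a spreadsheet in Google Sheets, for example, this one.

One of its cells contains a link inside an href tag. I want to get both the link and the text of the cell using the Google Sheets API or gspread.

I've already tried this solution, but the access token I get is None.

I've also tried web scraping with BeautifulSoup, but it didn't work.

For the bs4 approach, I tried this code, which I found here: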

from bs4 import BeautifulSoup
import requests

html = requests.get('https://docs.google.com/spreadsheets/d/1v8vM7yQ-27SFemt8_3IRiZr-ZauE29edin-azKpigws/edit#gid=0').text
soup = BeautifulSoup(html, "lxml")
tables = soup.find_all("table")

content = []

for table in tables:
    content.append([[td.text for td in row.find_all("td")] for row in table.find_all("tr")])

print(content)

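【Comments】:

  • Could you try =REGEXEXTRACT(FORMULATEXT(A2),"""(.+)"",")? Change A2 to the correct cell. What did you do with BeautifulSoup? Could you post the code?
  • @manakin I edited the question and added the code.
  • As far as I know, a Google spreadsheet is not a plain HTML table, so it probably makes no sense to parse it as an HTML table and try to get the data between tags.
  • @YuriKhristich So what would be a better way to do it?
  • I'm sure it should be done through the Google Drive API. Something like this: developers.google.com/sheets/api/quickstart/python, twilio.com/blog/2017/02/…, gspread.readthedocs.io/en/latest. But I haven't tried it yet, so I can't help you.

Tags: python google-sheets gspread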


【Solution 1】:

I figured it out. Here is the full code in case anyone needs it:

import requests
import gspread
import urllib.parse
import pickle



spreadsheetId = "###"  # Please set the Spreadsheet ID.
cellRange = "Yoursheetname!A1:A100"  # Please set the range in A1 notation. Only the hyperlink of the first cell in this range is read in step 3 below.


with open('token_sheets_v4.pickle', 'rb') as token:
    # get this file here
    # https://developers.google.com/identity/sign-in/web/sign-in
    credentials = pickle.load(token)

client = gspread.authorize(credentials)

# 1. Retrieve the access token.
access_token = client.auth.token

# 2. Request to the method of spreadsheets.get in Sheets API using `requests` module.
fields = "sheets(data(rowData(values(hyperlink))))"
url = "https://sheets.googleapis.com/v4/spreadsheets/" + spreadsheetId + "?ranges=" + urllib.parse.quote(cellRange) + "&fields=" + urllib.parse.quote(fields)
res = requests.get(url, headers={"Authorization": "Bearer " + access_token})
print(res)

# 3. Retrieve the hyperlink.
obj = res.json()
print(obj)
link = obj["sheets"][0]['data'][0]['rowData'][0]['values'][0]['hyperlink']
print(link)
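
The code above reads only the first cell of the range, and only its link. Since the question asks for both the link and the text, here is a rough sketch of how the whole range could be walked. It assumes the request is changed to fields = "sheets(data(rowData(values(formattedValue,hyperlink))))" (formattedValue is the displayed text of a cell in the Sheets API), and that cells or rows without data may come back as empty objects. The function name and the sample data are made up for illustration.

```python
def first_column_text_and_links(obj):
    """For every returned row, return (text, link) of its first cell.

    Either element is None when the cell has no text or no hyperlink.
    """
    pairs = []
    for sheet in obj.get("sheets", []):
        for grid in sheet.get("data", []):
            for row in grid.get("rowData", []):
                cells = row.get("values", [])
                # An empty row has no "values" key, so fall back to an empty cell.
                cell = cells[0] if cells else {}
                pairs.append((cell.get("formattedValue"), cell.get("hyperlink")))
    return pairs


# Hypothetical response, shaped like the result of res.json() above.
sample = {"sheets": [{"data": [{"rowData": [
    {"values": [{"formattedValue": "Docs", "hyperlink": "https://example.com/a"}]},
    {},
    {"values": [{"formattedValue": "plain text"}]},
]}]}]}

for text, link in first_column_text_and_links(sample):
    print(text, link)
```

Using .get() everywhere avoids the KeyError that the direct indexing in step 3 would raise on a cell without a link.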

Update!

A more elegant solution is the following. Create the service:

import os
import pickle

from google.auth.transport.requests import Request
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

CLIENT_SECRET_FILE = 'secret/secret.json'
API_SERVICE_NAME = 'sheets'
API_VERSION = 'v4'
SCOPES = ['https://www.googleapis.com/auth/spreadsheets.readonly']


def Create_Service():
    cred = None

    # Reuse cached credentials from a previous run, if any.
    pickle_file = f'secret/token_{API_SERVICE_NAME}_{API_VERSION}.pickle'
    if os.path.exists(pickle_file):
        with open(pickle_file, 'rb') as token:
            cred = pickle.load(token)

    # Refresh expired credentials, or run the OAuth flow when none are cached.
    if not cred or not cred.valid:
        if cred and cred.expired and cred.refresh_token:
            cred.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(CLIENT_SECRET_FILE, SCOPES)
            cred = flow.run_local_server()

        # Cache the new or refreshed credentials for the next run.
        with open(pickle_file, 'wb') as token:
            pickle.dump(cred, token)

    try:
        service = build(API_SERVICE_NAME, API_VERSION, credentials=cred)
        print(API_SERVICE_NAME, 'service created successfully')
        return service
    except Exception as e:
        print('Unable to connect.')
        print(e)
        return None


service = Create_Service()

Then extract the links from every sheet in the spreadsheet as a convenient dictionary:

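fields = "sheets(properties(title),data(startColumn,rowData(values(hyperlink))))"

# spreadsheetId is the spreadsheet ID set earlier (the original snippet read it from a class attribute).
response = service.spreadsheets().get(spreadsheetId=spreadsheetId,
                                      fields=fields).execute()
print(response)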
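
The value returned by service.spreadsheets().get(...).execute() is still a deeply nested dictionary. As a minimal sketch, it could be flattened into a mapping from sheet title to the links found on that sheet. The helper name and the sample response are hypothetical, and the sketch assumes cells without a link come back as empty objects.

```python
def links_by_sheet(response):
    """Map each sheet title to the hyperlinks found on it, row by row."""
    result = {}
    for sheet in response.get("sheets", []):
        title = sheet.get("properties", {}).get("title", "")
        links = []
        for grid in sheet.get("data", []):
            for row in grid.get("rowData", []):
                for cell in row.get("values", []):
                    # Cells without a link simply lack the "hyperlink" key.
                    if "hyperlink" in cell:
                        links.append(cell["hyperlink"])
        result[title] = links
    return result


# Hypothetical response for a spreadsheet with two sheets.
sample_response = {"sheets": [
    {"properties": {"title": "Sheet1"},
     "data": [{"rowData": [
         {"values": [{"hyperlink": "https://example.com/a"}, {}]},
         {},
         {"values": [{}, {"hyperlink": "https://example.com/b"}]},
     ]}]},
    {"properties": {"title": "Sheet2"}, "data": [{}]},
]}

print(links_by_sheet(sample_response))
```

This drops the cell positions. If they matter, the row and column indexes from the loops could be stored next to each link.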

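So, how do fields work? Go to the Spreadsheet object description and look at its JSON representation. If we want to return, for example, the sheets object from that JSON representation, we simply use fields = "sheets", because Spreadsheet has a "sheets" field in its JSON representation.

OK, cool. We've got the sheets object. How do we access the fields of a sheet object? Just click through to it in the documentation and look at its fields.

So, how do we combine fields? It's easy. For example, to return the "properties" and "data" fields of the sheet object, I write the fields string like this: fields = "sheets(properties,data)". We simply list them like arguments in an ordinary function call, but without spaces.

The same applies to the objects nested inside the data field, and so on.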
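
To make the nesting rule concrete, here is a small sketch that builds such a fields string from a nested dictionary. The build_fields helper is my own illustration and not part of any Google library. A value of None means "return the whole field", and a dict lists the sub-fields to keep.

```python
def build_fields(spec):
    """Turn a nested dict into a fields string such as "sheets(properties,data)"."""
    parts = []
    for name, sub in spec.items():
        # Leaf fields are listed by name; nested fields get parentheses.
        parts.append(name if not sub else f"{name}({build_fields(sub)})")
    # Fields on the same level are separated by commas, without spaces.
    return ",".join(parts)


spec = {
    "sheets": {
        "properties": {"title": None},
        "data": {
            "startColumn": None,
            "rowData": {"values": {"hyperlink": None}},
        },
    }
}

print(build_fields(spec))
# sheets(properties(title),data(startColumn,rowData(values(hyperlink))))
```

The printed string is the same one passed as fields in the request above.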
