Create empty dataframes in Google Sheets (answers)

【Question title】: Create empty dataframes in Google Sheets
【Posted】: 2021-02-19 06:33:51
【Question】:

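I am building a web scraper with Python, BeautifulSoup, pandas and Google Sheets. So far I have managed to scrape data tables from URLs that I read from a list in a Google Sheet, and I have created a dataframe for each dataset. Some cells in the URL column are empty, and when I try to import the dataframes into another worksheet this gives me the following error: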

MissingSchema: Invalid URL '': No schema supplied. Perhaps you meant http://?

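What I want to achieve is this: for every empty cell in the sheet with the URLs, I would like to create an empty dataframe, just as for the cells that do contain data. Is this possible?

My code so far looks like this: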

import gspread
from df2gspread import df2gspread as d2g
from gspread_dataframe import get_as_dataframe, set_with_dataframe
from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession
from bs4 import BeautifulSoup
import pandas as pd
import requests

credentials = service_account.Credentials.from_service_account_file(
    'credentials.json')

scoped_credentials = credentials.with_scopes(
        ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
        )

gc = gspread.Client(auth=scoped_credentials)
gc.session = AuthorizedSession(scoped_credentials)
spreadsheet_key = gc.open_by_key('api_key')


# Data import
data_worksheet = spreadsheet_key.worksheet("Data")

# Url's
url_worksheet = spreadsheet_key.worksheet("Urls")

link_list = url_worksheet.col_values(2)


def get_info(linkIndex) :

    page = requests.get(link_list[linkIndex])
    soup = BeautifulSoup(page.content, 'html.parser')

    try :
        tbl = soup.find('table')

        labels = [] 
        results = []

        for tr in tbl.findAll('tr'):
            headers = [th.text.strip()  for th in tr.findAll('th')]
            data = [td.text.strip() for td in tr.findAll('td')]
            labels.append(headers)
            results.append(data) 

        final_results = []

        for final_labels, final_data in zip(labels, results):
            final_results.append({'Labels': final_labels, 'Data': final_data})

        df = pd.DataFrame(final_results)

        df['Labels'] = df['Labels'].str[0]
        df['Data'] = df['Data'].str[0]

        indexNames = df[df['Labels'] == 'Links'].index
        df.drop(indexNames , inplace=True)

        set_with_dataframe(data_worksheet, df, col=(linkIndex*6)+1, row=2, 
include_column_header=False)[1:]

    except Exception as e:
        print(e)

for linkInd in range(len(link_list))[1:] :
    get_info(linkInd)

【Comments on the question】:

Tags: python pandas dataframe web-scraping google-sheets

【Solution 1】:

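It depends on what you mean by an empty dataframe. If the dataframe should contain no data at all, you can create it with pd.DataFrame(). If it should have the same columns as the other dataframes, filled with NaN / None values, you can build it from a dictionary: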

import numpy as np
import pandas as pd

# x is the number of rows in the dataframe (3 is only a placeholder value)
x = 3
d = {
    'column1': [np.nan] * x,
    'column2': [np.nan] * x,
    'column3': [np.nan] * x
}

df = pd.DataFrame(d)
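
For comparison, the two kinds of "empty" dataframe can be built side by side. This is only a sketch: it assumes the placeholder should reuse the 'Labels' and 'Data' column names from the question's code, and the row count of 3 is arbitrary.

```python
import numpy as np
import pandas as pd

# Column names taken from the question's dataframe (assumed to be the target layout)
columns = ["Labels", "Data"]

# Variant 1: the columns exist, but there are no rows at all
no_rows = pd.DataFrame(columns=columns)

# Variant 2: a fixed number of rows, every cell filled with NaN
nan_rows = pd.DataFrame(np.nan, index=range(3), columns=columns)

print(no_rows.shape)   # (0, 2)
print(nan_rows.shape)  # (3, 2)
```

Which variant fits depends on whether the target worksheet should keep a blank block of a fixed height for each missing URL.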

At the start of the get_info() function there should be a check:

if link_list[linkIndex] is not None: # or if link_list[linkIndex] != '' (depending on format of an empty cell)

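The existing logic should go in the if branch, and an empty dataframe should be created in the else branch. The set_with_dataframe() function should be called after the if / else statement, since it runs in both cases.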
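
The if / else structure described above might look roughly like the sketch below. The names frame_for_url, empty_frame and fake_scrape are hypothetical helpers introduced for illustration. fake_scrape stands in for the question's requests / BeautifulSoup logic so that the example runs without network access, and the sample link_list is made up. Since the error message shows the URL as '', the check treats an empty or whitespace-only string as a missing link.

```python
import numpy as np
import pandas as pd

COLUMNS = ["Labels", "Data"]  # column names used in the question's dataframe


def empty_frame(n_rows=0):
    """Placeholder frame with the same columns, filled with NaN (hypothetical helper)."""
    return pd.DataFrame(np.nan, index=range(n_rows), columns=COLUMNS)


def frame_for_url(url, scrape):
    """Return scrape(url) for a non-empty URL, otherwise a placeholder frame.

    `scrape` is any callable that maps a URL to a DataFrame, for example
    the scraping part of the question's get_info().
    """
    if url and url.strip():
        return scrape(url)
    return empty_frame()


def fake_scrape(url):
    """Stand-in for the real scraper so the sketch runs offline."""
    return pd.DataFrame({"Labels": ["Name"], "Data": ["example"]})


# Made-up sample: header cell first, then one empty cell between two URLs
link_list = ["Urls", "https://example.com/a", "", "https://example.com/b"]

frames = [frame_for_url(url, fake_scrape) for url in link_list[1:]]
```

In the real program, the single set_with_dataframe() call would follow frame_for_url() inside the loop, so that the write happens for scraped and placeholder frames alike.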

【Comments】: