Text 1D array into a 2D pandas DF from a webscrape: answers

【Question title】: Text 1D array into a 2D pandas DF from a webscrape
【Posted】: 2020-05-13 11:27:22
【Question description】:

Hello, I have web-scraped a data table using the following code:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

df = pd.DataFrame()


for row in links2get:
    url = row
    response = requests.get(url)
    html_page = response.content
    soup = BeautifulSoup(html_page, 'html.parser')
    text = soup.find_all(text=True)
    for a in soup.select('.trackM'):
        b = a.get_text()
        array = np.array(b)
        print(array)
        #reshape = ????
        #df = df.append(reshape)

The output of the array I get is:

print(array):

Table Title


Heading 1
Heading 2
Heading 3
Heading 4
Heading 5


1084
316
No
72
Yes

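Edit: sometimes values are missing from the table, so there may be an odd number of elements (e.g. 5 heading columns but only 4 values).

I would like to reshape this into a DataFrame so that it looks like: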

print(df):

Heading 1   Heading 2   Heading 3   Heading 4   Heading 5
1084           316         No          72           Yes

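I am having trouble with the reshaping, so any suggestions would be great. Thanks!

【Question discussion】:

Tags: python pandas beautifulsoup

【Solution 1】:

If you know the URL and the title of the table, you can simply do this.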

import pandas as pd
df = pd.read_html(url, match='Table Title')[0]

If you have the table as text, extracted with Beautiful Soup, you can simply do this.

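import pandas as pd
from io import StringIO

# Sample markup: four header cells but five data cells
table_string = '''<table>
  <tr>
    <th>heading 1</th>
    <th>heading 2</th>
    <th>heading 3</th>
    <th>heading 4</th>
  </tr>
  <tr>
    <td>1084</td>
    <td>316</td>
    <td>No</td>
    <td>72</td>
    <td>Yes</td>
  </tr>
</table>'''

# Newer pandas versions deprecate passing literal HTML, so wrap the string in StringIO
df = pd.read_html(StringIO(table_string))[0]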

Output:

   heading 1  heading 2 heading 3  heading 4 Unnamed: 4
0       1084        316        No         72        Yes
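
The question's edit mentions tables where a value is missing. As a minimal sketch, assuming the page keeps an empty `<td>` for the missing value, read_html keeps that cell in its column and reports it as NaN, so the remaining values do not shift. The markup below is made up for illustration and is not taken from the asker's page.

```python
from io import StringIO

import pandas as pd

# Hypothetical table in which the middle value is missing (empty <td>).
table_string = """<table>
  <tr><th>Heading 1</th><th>Heading 2</th><th>Heading 3</th></tr>
  <tr><td>1084</td><td></td><td>Yes</td></tr>
</table>"""

# read_html returns a list of DataFrames, one per <table> found.
df = pd.read_html(StringIO(table_string))[0]

# The empty cell stays under "Heading 2" as a missing value.
print(df)
print(df.isna())
```

Because each value stays attached to its own heading, this avoids the odd-element-count problem that appears when the table is flattened to plain text first.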

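【Discussion】:

Hi @visibleman. Yes, I can do that, but when I run df = pd.read_html(table_string) I get a TypeError: 'NoneType' object is not callable error. The beginning of my table_string looks like: <div class="InfoTrack trackM"> <h3 style="margin-bottom:15px;">Data</h3> <table> <tr class="fullHeader"> <th>heading 1</th> ...
I added an index [0] to the read_html statement, because read_html returns a list. Also, I think you need to use soup to extract only the table element (and its children), without the div and so on. Even with all that, I am not sure why you are getting that error. Some basic googling seems to suggest that adding the argument flavor='bs4' to the read_html call may help.
Don't get me wrong, there is absolutely nothing wrong with trying to parse it with your own code; the solution above is offered as a simple time saver. If it doesn't save you any time, then by all means go with what you have and the suggestions Dave provided.
Thanks @visibleman. I am trying both your approach and Dave's. I am now extracting only the table, which works fine, but I will try to find out why the object is not callable. Thanks for your help!
I think the not-callable object may be related to an old pandas version, or possibly to a missing dependency such as the lxml package. By adding the flavor argument to the read_html call you may be able to work around the missing package. I have not seen this myself, though; this is just my basic understanding from a quick Google search earlier.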
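
One likely explanation for the TypeError, offered here as an assumption because the thread never confirms it, is that table_string was a Beautiful Soup Tag rather than a plain string. A Tag answers unknown attribute lookups with None, so code that probes it for a file-like read method can end up calling None. The sketch below pulls only the `<table>` element out of each `.trackM` div and converts it with str() before handing it to read_html. The markup is hypothetical and only modeled on the fragment quoted above.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical page content, modeled on the fragment quoted in the comments.
html_page = """
<div class="InfoTrack trackM">
  <h3 style="margin-bottom:15px;">Data</h3>
  <table>
    <tr class="fullHeader"><th>heading 1</th><th>heading 2</th><th>heading 3</th></tr>
    <tr><td>1084</td><td>316</td><td>No</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html_page, "html.parser")

frames = []
for div in soup.select(".trackM"):
    table = div.find("table")  # keep only the <table> element, not the div or h3
    if table is None:
        continue
    # str(table) turns the bs4 Tag into plain HTML text that read_html can parse.
    frames.append(pd.read_html(StringIO(str(table)))[0])

# Stack the per-table frames into one DataFrame.
df = pd.concat(frames, ignore_index=True)
print(df)
```

In the asker's loop, html_page would be response.content for each URL; the rest stays the same.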

【Solution 2】:

You are on the right track. Start from a clean sequence of tokens, though. That will make everything downstream simpler.

[e for e in b.splitlines()[1:] if len(e)]

# b.splitlines()     -> Split the text output into a list at the linebreaks.
# b.splitlines()[1:] -> Drop the first element of the list ("Table Title")
# ... if len(e)      -> Only keep the token if it has length greater than zero (i.e. not the empty string.)
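
As a quick illustration of the tokenizing step, here it is applied to a sample string modeled on the print(array) output in the question. The sample text is an assumption about what get_text() returns. This variant also strips each line, which the original one-liner does not do, so that whitespace-only lines are dropped as well.

```python
# Hypothetical get_text() result, modeled on the output shown in the question.
b = (
    "Table Title\n\n\n"
    "Heading 1\nHeading 2\nHeading 3\nHeading 4\nHeading 5\n\n\n"
    "1084\n316\nNo\n72\nYes\n"
)

# Drop the title line, then keep only lines that still contain text after stripping.
tokens = [e.strip() for e in b.splitlines()[1:] if e.strip()]

print(tokens)
print(len(tokens))
```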

In this part, it is critical that you always end up with a Series that has an even number of elements:

...
b = a.get_text()
s = pd.Series([e for e in b.splitlines()[1:] if len(e)])

Out[54]: 
0    Heading 1
1    Heading 2
2    Heading 3
3    Heading 4
4    Heading 5
5         1084
6          316
7           No
8           72
9          Yes

Now reshape directly into a DataFrame. Because we know the Series has an even number of elements, we can reshape it into two rows of length int(len(s) / 2):

df = pd.DataFrame(s.values.reshape((2, int(len(s) / 2))))

           0          1          2          3          4
0  Heading 1  Heading 2  Heading 3  Heading 4  Heading 5
1       1084        316         No         72        Yes

Now we assign the column names from the first row:

df.columns = df.iloc[0]

0  Heading 1  Heading 2  Heading 3  Heading 4  Heading 5
0  Heading 1  Heading 2  Heading 3  Heading 4  Heading 5
1       1084        316         No         72        Yes

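Finally, drop the row we used for the columns. Note that drop returns a new DataFrame, so assign the result back:

df = df.drop(df.index[0])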

0 Heading 1 Heading 2 Heading 3 Heading 4 Heading 5
1      1084       316        No        72       Yes
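
The reshape, column assignment, and drop steps can also be folded into one small function. This is only a sketch: tokens_to_frame is a hypothetical helper name, and it assumes the token list is exactly the headings followed by the same number of values.

```python
import pandas as pd


def tokens_to_frame(tokens):
    """Build a one-row DataFrame from [headings..., values...] (hypothetical helper)."""
    if len(tokens) % 2:
        # An odd count means a heading or a value is missing; fail loudly.
        raise ValueError(f"expected an even number of tokens, got {len(tokens)}")
    half = len(tokens) // 2
    # First half becomes the column names, second half the single data row.
    return pd.DataFrame([tokens[half:]], columns=tokens[:half])


df = tokens_to_frame([
    "Heading 1", "Heading 2", "Heading 3", "Heading 4", "Heading 5",
    "1084", "316", "No", "72", "Yes",
])
print(df)
```

Passing the headings as columns up front avoids the intermediate frame whose first row duplicates the column names.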

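【Discussion】:

This is great. Thank you very much @Dave. The problem I now need to solve when looping over multiple tables is that not every table contains an even number of elements. There are always 5 Heading rows, but sometimes data is missing from the table or one column has two elements. To work around this I wrote if len(s) < 10: s.append('NA','NA','NA','NA') t = pd.Series(s[1:11]). That works well, but after a few loops I get TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
I have now solved this by adding ('NA','NA','NA','NA') as its own pd.Series()
I now realize the problem is that sometimes a cell in the table data has no value, so that cell gets skipped as an element. Do you know of any way to deal with this?
Thanks for your help @Dave. In the end I was able to solve the missing-value problem using the pd.read_html solution
Yes, that is the best approach. I upvoted that answer myself!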
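
For readers who still want the text-based route, the padding idea from the comments can be done on a plain Python list before any Series is built. That sidesteps Series.append, which only accepted Series objects and was removed in pandas 2.0. This is a minimal sketch: pad_tokens is a hypothetical helper, and the five-column count comes from the comment that there are always 5 headings.

```python
import pandas as pd

N_COLUMNS = 5  # per the thread: every table has five headings


def pad_tokens(tokens, n_columns=N_COLUMNS, fill="NA"):
    """Split tokens into headings and values, padding short value lists (hypothetical helper)."""
    headers = tokens[:n_columns]
    values = tokens[n_columns:2 * n_columns]       # ignore any surplus tokens
    values = values + [fill] * (n_columns - len(values))  # pad at the end
    return headers, values


# Sample tokens with one value missing (nine tokens instead of ten).
headers, values = pad_tokens([
    "Heading 1", "Heading 2", "Heading 3", "Heading 4", "Heading 5",
    "1084", "316", "72", "Yes",
])
df = pd.DataFrame([values], columns=headers)
print(df)
```

Padding at the end cannot tell which column was actually empty, so the values after the gap land one column too far to the left. That limitation is consistent with the thread's conclusion that read_html is the better fit for tables with missing cells.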