How does .find(text=True) work in BeautifulSoup4? Answers

【Question title】: How does .find(text=True) work in BeautifulSoup4?
【Posted】: 2021-10-30 19:02:49
【Question】:

I am trying to extract a Wikipedia list from https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes using BeautifulSoup.

Here is my code:

import urllib.request
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page)
table=soup.find('table', class_="wikitable sortable") # The class of the list in wikipedia

Data = [[] for _ in range(9)] # I intend to turn this into a DataFrame
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells)==9: # The start and end don't include a <td> tag
        for i in range(9):
            Data[i].append(cells[i].find(text=True))

This works well except for a single value in the Name column, the "New England" hurricane. Here is the HTML containing that element:

<td><span data-sort-value="New England !"> <a href="/wiki/1938_New_England_hurricane" title="1938 New England hurricane">"New England"</a></span></td>

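The Name entry for that hurricane comes out as ' ' (whitespace only). I believe the whitespace between <span> and <a> is causing the problem. Is there a way to fix this within .find? Is there a smarter way to access lists in Wikipedia? How can I avoid this in the future?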

【Question comments】:

Tags: python beautifulsoup wikipedia

【Solution 1】:

The simplest way to read the table into a DataFrame is pandas.read_html():

import pandas as pd
pd.read_html(wiki)[1]

Output:

    Name    Dates as aCategory 5    Duration as aCategory 5 Sustainedwind speeds    Pressure    Areas affected  Deaths  Damage(USD) Refs
0   "Cuba"  October 19, 1924    12 hours    165 mph (270 km/h)  910 hPa (26.87 inHg)    Central America, Mexico, CubaFlorida, The Bahamas   90  NaN [12]
1   "San Felipe IIOkeechobee"   September 13–14, 1928   12 hours    160 mph (260 km/h)  929 hPa (27.43 inHg)    Lesser Antilles, The BahamasUnited States East...   4000    NaN NaN

...

To improve your example, you could do the following:

import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes"
page = requests.get(wiki).content
soup = BeautifulSoup(page,'lxml')
table=soup.find('table', class_="wikitable sortable") # The class of the list in wikipedia

data = []
for row in table.select('tr')[1:-1]:
    cells = []
    for cell in row.select('td'):
        cells.append(cell.get_text('',strip=True))
    data.append(cells)

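get_text('', strip=True) collects all the text inside the td, strips leading and trailing whitespace from each piece, and joins the pieces with the empty string as separator.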
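
To see why the original approach returned only whitespace for that cell, here is a minimal sketch run against the HTML fragment from the question. It uses the built-in html.parser instead of lxml purely to avoid an extra dependency, and the variable names are illustrative. Note that string=True is the current name of the text=True argument; recent Beautiful Soup versions may emit a deprecation warning for text=.

```python
from bs4 import BeautifulSoup

# The problematic cell, copied from the question.
html = (
    '<td><span data-sort-value="New England !"> '
    '<a href="/wiki/1938_New_England_hurricane" '
    'title="1938 New England hurricane">"New England"</a></span></td>'
)

cell = BeautifulSoup(html, "html.parser").td

# find(text=True) returns only the FIRST text node below the tag.
# Here that node is the single space between <span ...> and <a ...>.
first_text = cell.find(text=True)

# Same search using the current argument name.
first_string = cell.find(string=True)

# get_text('', strip=True) visits every text node, strips each one,
# skips the ones that become empty, and joins the rest with ''.
full_text = cell.get_text("", strip=True)

print(repr(str(first_text)))
print(repr(full_text))
```

The first print shows ' ' and the second shows '"New England"', so the whitespace node is what .find(text=True) picks up, while get_text() skips past it.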

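【Comments】:

Thanks :). I couldn't find any documentation on this issue. Where can I go in the future if I run into a BS4 problem?
Also, an earlier comment (which now seems to have been deleted) suggested using pandas, and it does save a lot of trouble. But the Damage values are all NaN, which doesn't happen with BS4. Is there a quick fix?
You can use the Beautiful Soup documentation.
I tried using it, but couldn't find anything about the options for .find. You didn't use it at all; BS4 seems to have some redundancy.
In your example you are working with the older version/syntax. To learn more about find_all()/find(), start with those sections of the documentation.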

【Solution 2】:

This normalizes the text and will hopefully give you what you need:

import urllib.request
from bs4 import BeautifulSoup
wiki = "https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page, 'lxml')
# The class of the list in wikipedia
table = soup.find('table', class_="wikitable sortable")

Data = [[] for _ in range(9)]  # I intend to turn this into a DataFrame
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 9:  # The start and end don't include a <td> tag
        for i, cell in enumerate(cells):
            Data[i].append(cell.text.strip().replace('"', ''))
print(Data)
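
The same normalization can be tried offline on a small hand-written table, without fetching the page. The sketch below is a stand-in under stated assumptions: the miniature table is made up for illustration (only the "New England" cell is taken from the question, and "n/a" is a placeholder value), it has 2 columns rather than 9, it uses html.parser to avoid the lxml dependency, and it uses find_all, the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table; only the "New England" cell comes from the question.
html = """
<table class="wikitable sortable">
  <tr><th>Name</th><th>Duration</th></tr>
  <tr><td>"Cuba"</td><td>12 hours</td></tr>
  <tr><td><span data-sort-value="New England !"> <a href="/wiki/1938_New_England_hurricane" title="1938 New England hurricane">"New England"</a></span></td><td>n/a</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable sortable")

n_cols = 2  # the real table in the question has 9 columns
columns = [[] for _ in range(n_cols)]
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == n_cols:  # the header row holds <th> cells, so it has no <td>
        for i, cell in enumerate(cells):
            # .text joins every text node; strip() removes the stray
            # leading space and replace() drops the quotation marks.
            columns[i].append(cell.text.strip().replace('"', ""))

print(columns)
```

Because .text concatenates all text nodes in the cell before stripping, the leading space in the "New England" cell no longer hides the name.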

【Comments】: