How to target a specific Wikipedia table element for bs4 scrape? (Answers)

【Question title】: How to target a specific Wikipedia table element for bs4 scrape?
【Posted】: 2020-01-27 20:07:10
【Question description】:

Here is my code so far:

from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')
my_table = soup.find('table',{'class':'wikitable sortable'})

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://en.wikipedia.org/wiki/2019%E2%80%9320_Wuhan_coronavirus_outbreak'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
page_soup.tbody.tr?

I am trying to target this table element, but it is not unique. How do I capture the one named "

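I can do page_soup.h1 to get all the h1 tag stuff, but there are a lot of repeated tags here and I could use some help. I did UTFSE, but I am still confused. Thank you for your time.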

【Question comments】:

Tags: python python-3.x web-scraping beautifulsoup

【Solution 1】:

If I understand your question correctly, you can try the following:

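import requests
from bs4 import BeautifulSoup as bs

url = 'https://en.wikipedia.org/wiki/2019%E2%80%9320_Wuhan_coronavirus_outbreak'
resp = requests.get(url)
soup = bs(resp.text, 'lxml')

# First table whose class attribute is exactly "wikitable sortable"
tabs = soup.find('table', {'class': 'wikitable sortable'})
# Rows styled with vertical-align:top
tot = tabs.find_all('tr', {'style': 'vertical-align:top'})
for t in tot:
    # Cells in this row that have no style attribute
    rows = t.find_all('td', style=None)
    for r in rows:
        if r.text.strip() == "Total":
            # The number is in the cell right after the "Total" cell
            print(r.next_sibling.text)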

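The idea behind it is that the target number, 2,903, sits right after a cell whose (stripped) text is Total. The word Total is in a td tag that has no style attribute. We find that tag, and the target number is in the text of its immediate sibling.

Output: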

2,903
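
The sibling lookup described in this answer can be tried offline. The following is only a minimal sketch under stated assumptions: SAMPLE_HTML is a hypothetical, simplified stand-in for the Wikipedia table (its first row is a placeholder, not real data), and value_next_to_label is a helper name made up for this illustration. It uses find_next_sibling instead of next_sibling so that whitespace between cells is skipped, and class_="wikitable" so that the table matches whenever wikitable is one of its classes.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the real table (not the actual markup).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr style="vertical-align:top">
    <td style="text-align:left">Placeholder region</td>
    <td>123</td>
  </tr>
  <tr style="vertical-align:top">
    <td><b>Total</b></td>
    <td><b>2,903</b></td>
  </tr>
</table>
"""


def value_next_to_label(html, label):
    """Return the text of the cell following the cell whose stripped text equals `label`."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches when "wikitable" is any one of the table's classes.
    table = soup.find("table", class_="wikitable")
    if table is None:
        return None
    for cell in table.find_all(["td", "th"]):
        if cell.get_text(strip=True) == label:
            # find_next_sibling skips the whitespace text nodes between cells.
            sibling = cell.find_next_sibling(["td", "th"])
            return sibling.get_text(strip=True) if sibling else None
    return None


print(value_next_to_label(SAMPLE_HTML, "Total"))
```

Returning None when the table or label is missing avoids the AttributeError that a chained lookup would raise if the page layout changes.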

【Comments】:

Wow, this is awesome. How did you learn to do this? Great stuff.
@MattNewtonian - It took a while... Glad it worked for you!

【Solution 2】:

You can use a regular expression to find the text Total, and then use find_next('b') to get the bold element that follows it.

import re
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://en.wikipedia.org/wiki/2019%E2%80%9320_Wuhan_coronavirus_outbreak'

# Download the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "lxml")
my_table = page_soup.find('table', {'class': 'wikitable sortable'})
# Find the <b> whose text matches "Total", then take the next <b> after it
# (string= is the current name of the older text= argument)
item = my_table.find('b', string=re.compile('Total')).find_next('b').text
print(item)

Output:

2,903
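
The regular-expression approach can also be checked without a network request. What follows is a sketch rather than the answer's own code: SAMPLE_HTML is a hypothetical stand-in table with a placeholder row, and bold_value_after is an illustrative helper name. It shows that an unanchored pattern such as Total matches the first bold label that merely contains that word, while an anchored pattern selects the exact label.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in markup; the first row is a placeholder, not real data.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><td><b>Total (placeholder)</b></td><td><b>0</b></td></tr>
  <tr><td><b>Total</b></td><td><b>2,903</b></td></tr>
</table>
"""


def bold_value_after(html, pattern):
    """Return the text of the <b> that follows the first <b> matching `pattern`."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:
        return None
    # The regex is applied with search(), so it may match anywhere in the text.
    label = table.find("b", string=re.compile(pattern))
    if label is None:
        return None
    # find_next walks forward in document order and is not limited to the table.
    following = label.find_next("b")
    return following.get_text(strip=True) if following else None


print(bold_value_after(SAMPLE_HTML, "Total"))     # unanchored: first label containing "Total"
print(bold_value_after(SAMPLE_HTML, r"^Total$"))  # anchored: exact label only
```

Anchoring the pattern is a small safeguard in case the table gains other bold labels that contain the word Total.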

【Comments】: