Iterating html through tag classes with BeautifulSoup

【Question title】: Iterating html through tag classes with BeautifulSoup
【Posted】: 2017-11-20 08:07:29
【Question description】:

I'm saving some specific tags from a web page to an Excel file, so I have the following code:

`import requests
from bs4 import BeautifulSoup
import openpyxl

url = "http://www.euro.com.pl/telewizory-led-lcd-plazmowe,strona-1.bhtml"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")

wb = openpyxl.Workbook()
ws = wb.active

tagiterator = soup.h2

row, col = 1, 1
ws.cell(row=row, column=col, value=tagiterator.getText())
tagiterator = tagiterator.find_next()

while tagiterator.find_next():
    if tagiterator.name == 'h2':
        row += 1
        col = 1
        ws.cell(row=row, column=col, value=tagiterator.getText(strip=True))
    elif tagiterator.name == 'span':
        col += 1
        ws.cell(row=row, column=col, value=tagiterator.getText(strip=True))
    tagiterator = tagiterator.find_next()

wb.save('DG3test.xlsx')`

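It works, but I want to exclude some tags. I only want the h2 tags with the class "product-name" and the span tags with the class "attribute-value". I tried to do that with: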

tagiterator['class'] == 'product-name'

tagiterator.hasClass('product-name')

tagiterator.get

and a few others, none of which worked.
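A likely reason the attempts above fail: with BeautifulSoup's HTML parsers, `tag['class']` is a list of class names rather than a string, so comparing it to `'product-name'` is always false, and indexing raises `KeyError` on tags that have no class. `hasClass` is a jQuery method and does not appear to be part of the bs4 `Tag` API. The following is a minimal sketch of a membership check that can be used inside a tag-by-tag loop; the markup and the helper name `has_class` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only loosely modelled on the page in the question.
html = """
<h2 class="product-name big">TV A</h2>
<h2>Unrelated heading</h2>
<span class="attribute-value">49 cali</span>
<span class="other">skip me</span>
"""

soup = BeautifulSoup(html, "html.parser")


def has_class(tag, name, class_name):
    """True if tag is a <name> element whose class list contains class_name."""
    # tag.get("class") is a list such as ["product-name", "big"], or None.
    return tag.name == name and class_name in (tag.get("class") or [])


wanted = []
for tag in soup.find_all(True):  # True matches every tag in document order
    if has_class(tag, "h2", "product-name") or has_class(tag, "span", "attribute-value"):
        wanted.append((tag.name, tag.get_text(strip=True)))

print(wanted)
```

The same test could replace the `tagiterator.name == 'h2'` and `tagiterator.name == 'span'` conditions in the original loop.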

The values I want can be seen in this terrible picture I made: https://ibb.co/eWLsoQ and the url is in the code.

【Question comments】:

Tags: html python-2.7 beautifulsoup

【Solution 1】:

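What I've done doesn't include writing to an Excel file. Hopefully that's something you can do yourself, but just leave a comment and I'll include the code for it. The same logic applies: write the product info, then do row += 1 and reset the column (why? so that each product stays on its own row :)), which is what you've already done.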

from bs4 import BeautifulSoup

import requests

header = {'User-agent' : 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}


url = requests.get("http://www.euro.com.pl/telewizory-led-lcd-plazmowe,strona-1.bhtml", headers=header).text
soup = BeautifulSoup(url, 'lxml')

find_products = soup.findAll('div',{'class':'product-row'})

for item in find_products:
    title_text = item.find('div',{'class':'product-header'}).h2.a.text.strip() # Title / name of the product
    # print(title_text)
    display = item.find('span',{'class':'attribute-value'}).text.strip() # First attribute, e.g. the text "49 cali, Full HD, 1920 x 1080"
    # print(display)
    functions_item = item.findAll('span',{'class':'attribute-value'})[1] # Second attribute: the functions ('Funkcje')
    list_of_funcs = functions_item.findAll('a') # Links listing the individual functions, e.g. wifi
    # Now you can store them or do something else with them...

    for funcs in list_of_funcs:
        print(funcs.text.strip())
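For the Excel part that the code above leaves out, here is a minimal sketch of the row and column bookkeeping using openpyxl. The `products` list, the helper name `write_products`, and the file name are hypothetical stand-ins for the scraped values.

```python
import openpyxl

# Hypothetical records standing in for the scraped products.
products = [
    ("TV A", ["49 cali, Full HD", "Wi-Fi, USB"]),
    ("TV B", ["55 cali, 4K"]),
]


def write_products(ws, products):
    """Write one product per row: name in column 1, attributes after it."""
    for row, (name, attributes) in enumerate(products, start=1):
        ws.cell(row=row, column=1, value=name)
        # Restarting the column count for each product keeps it on one row.
        for col, value in enumerate(attributes, start=2):
            ws.cell(row=row, column=col, value=value)


wb = openpyxl.Workbook()
ws = wb.active
write_products(ws, products)
# wb.save("products.xlsx")  # hypothetical file name; uncomment to write the file
```

Using `enumerate` for both counters avoids the manual `row += 1` and `col = 1` resets while producing the same layout.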

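Algorithm:

We find each product
Within each product we find the tags and extract the relevant information
We use .text to extract only the text part
We use a for loop to go through each product, and then through its Functions, i.e. the tags that contain the product's features.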
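The steps above can be sketched on a small inline document so they can be tried without hitting the site. The markup below is hypothetical: it reuses the class names mentioned in this thread, but the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the class names from this thread.
html = """
<div class="product-row">
  <div class="product-header"><h2 class="product-name"><a> TV A </a></h2></div>
  <span class="attribute-value">49 cali, Full HD</span>
  <span class="attribute-value">Smart TV</span>
</div>
<div class="product-row">
  <div class="product-header"><h2 class="product-name"><a>TV B</a></h2></div>
  <span class="attribute-value">55 cali, 4K</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
# Find each product container.
for product in soup.find_all("div", class_="product-row"):
    # Within the product, find the tags of interest by class.
    name = product.find("h2", class_="product-name")
    values = product.find_all("span", class_="attribute-value")
    # Keep only the text of each tag.
    records.append({
        "name": name.get_text(strip=True),
        "attributes": [v.get_text(strip=True) for v in values],
    })

print(records)
```

Searching inside each product container, rather than walking every tag on the page, keeps a product's attributes grouped with its name.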

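【Comments】:

Very nice, but it doesn't find the functions such as wifi, HDMI, USB for all products, only for two LG models. I appended funcs in list_of_funcs to a list and printed it; for everything except those two LGs it gives "Picture Mastering (or Performance) Index" instead, which is the line just above on the website.
Did you run it, and did it print those... WIFI and so on? Maybe the structure is different, but you can apply the find function to do the same thing. Give me a url and I'll fix it..
I did run it on http://www.euro.com.pl/telewizory-led-lcd-plazmowe,strona-1.bhtml and I see what you mean. It's just down to the page structure, but I'll fix it later.
I tried a few things today, but it still doesn't work.
I don't know whether I'll be able to fix this tonight, but what you need to do, instead of functions_item = item.findAll('span',{'class':'attribute-value'})[1], is to iterate over all of them rather than taking [1], and add an IF that checks each attribute-value element and picks the one containing items such as USB.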
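The suggestion in the last comment might look roughly like the sketch below: loop over every `attribute-value` span in a product and keep the first one whose links mention a known function, instead of assuming it is always at index 1. The markup, the `KNOWN_FUNCTIONS` set, and the helper name `find_functions` are assumptions for illustration, not taken from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the functions span is second in one product
# and third in the other, so a fixed [1] index would pick the wrong span.
html = """
<div class="product-row">
  <span class="attribute-value">49 cali</span>
  <span class="attribute-value"><a>Wi-Fi</a><a>USB</a></span>
</div>
<div class="product-row">
  <span class="attribute-value">55 cali</span>
  <span class="attribute-value">2400 PMI</span>
  <span class="attribute-value"><a>HDMI</a><a>USB</a></span>
</div>
"""

KNOWN_FUNCTIONS = {"USB", "HDMI", "Wi-Fi"}  # assumed marker words, adjust as needed


def find_functions(product):
    """Return link texts from the first attribute-value span listing a known function."""
    for span in product.find_all("span", class_="attribute-value"):
        texts = [a.get_text(strip=True) for a in span.find_all("a")]
        if KNOWN_FUNCTIONS.intersection(texts):
            return texts
    return []  # no span in this product listed a known function


soup = BeautifulSoup(html, "html.parser")
functions = [find_functions(p) for p in soup.find_all("div", class_="product-row")]
print(functions)
```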