How to extract original prices using BeautifulSoup? Answers

【Question title】: How to extract original prices using BeautifulSoup?
【Posted】: 2020-06-03 20:30:00
【Question】:

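I'm trying to learn BeautifulSoup, but so far I can't extract prices, especially when an item is discounted and the old price is struck through. I'm only interested in discounted items (itemprop = "offers"), and for this exercise I only want to extract the original price.

The full HTML can be obtained by inspecting this page: https://www.patagonia.ca/shop/mens-hard-shell-jackets-vests

The desired target is highlighted in the HTML below:

Here is what I tried: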

from bs4 import BeautifulSoup
import requests
import pandas as pd
import json

page = requests.get("https://www.patagonia.ca/shop/mens-hard-shell-jackets-vests", verify = False)
soup = BeautifulSoup(page.content, 'html.parser')

div_price = []

for section_tag in soup.find_all('div', class_='product-tile__meta-primary'):
    for div_prices in section_tag.find_all('div', class_='price'):
        if div_prices.get('itemprop') == 'offers':
            for x in div_prices.find_all('span', {'class':'strike-through list'}):        
                for y in x.find_all('span', class_='value'):
                    div_price.append(y.get('content'))
        else:
            continue

The code above gives me the prices I want (only the original price, $499, and not the discounted price, $349.30), but each one is repeated many times :(

['499.00', '435.00', '879.00', '999.00', '799.00', '499.00', '435.00', '879.00', '999.00', '799.00', '499.00', '435.00', '879.00', '999.00', '799.00', '499.00', '435.00', '879.00', '999.00', . . .

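Also, I'm not proud of the nested loops, and I'm hoping the community can help with both problems (it feels like I'm missing something simple here, but I can't work out what):

If there is a better approach that avoids all the loops, I'm all ears.
Is there a better way to extract the required information than repeated find_all calls (while still using BeautifulSoup)?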

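【Question comments】:

Tags: python loops parsing web-scraping beautifulsoup

【Solution 1】:

You can find the price directly on the target span through its itemprop attribute. For jackets that have two prices, I use the find method, which returns only the first matching span, so you get the price without the discount.

You can do it like this: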

from bs4 import BeautifulSoup
import requests
import pandas as pd
import json

page = requests.get("https://www.patagonia.ca/shop/mens-hard-shell-jackets-vests")
soup = BeautifulSoup(page.content, 'html.parser')

div_price = []
# Loop on elements
for jacket in soup.find_all('div', {'class':'product-tile__content'}):
    span_price = jacket.find('span', {'itemprop': 'price'})
    if span_price:
        div_price.append(span_price.get('content'))
print(div_price)

Result:

['189', '189', '189', '189', '189', '189', '189', '189', '189', '189', '435', '499.00', '435.00', '189', '879.00', '249', '499', '999.00', '799.00', '249', '749', '499', '159', '879', '685', '499', '315', '625', '169', '625', '475', '435', '599', '375', '315', '625', '499', '315']
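
To make the role of find easier to see without hitting the live site, here is a minimal offline sketch. The HTML fragment is hand-written and only modeled on the class and attribute names mentioned in this thread, so treat the markup, the `SAMPLE_HTML` name, and the `first_prices` helper as assumptions rather than the page's real structure.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page may be structured differently.
SAMPLE_HTML = """
<div class="product-tile__content">
  <div class="price" itemprop="offers">
    <span class="strike-through list">
      <span class="value" itemprop="price" content="499.00">$499.00</span>
    </span>
    <span class="sales">
      <span class="value" itemprop="price" content="349.30">$349.30</span>
    </span>
  </div>
</div>
<div class="product-tile__content">
  <div class="price">
    <span class="sales">
      <span class="value" itemprop="price" content="189">$189</span>
    </span>
  </div>
</div>
"""


def first_prices(html):
    """Return the first itemprop="price" value found in each product tile."""
    soup = BeautifulSoup(html, "html.parser")
    prices = []
    for tile in soup.find_all("div", {"class": "product-tile__content"}):
        # find() stops at the first match in document order, so for a
        # discounted tile it returns the struck-through list price.
        span = tile.find("span", {"itemprop": "price"})
        if span is not None:
            prices.append(span.get("content"))
    return prices


print(first_prices(SAMPLE_HTML))  # ['499.00', '189']
```

In this fragment the discounted tile contributes its list price and the non-discounted tile contributes its only price, which matches the mixed list shown in the result above.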

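【Comments】:

Exactly what I needed! Thanks @Maaz, this is also a useful framework, since the site actually lists other prices too (besides the discounted ones).

【Solution 2】: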

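# Assumes `soup` is the BeautifulSoup object built in the question.
for price in soup.select('.price'):
    # Prefer the struck-through list price when the item is discounted.
    list_price = price.select('.strike-through.list')
    if list_price:
        print(list_price)
    else:
        # Otherwise fall back to the regular "sales" price.
        print(price.select('.sales'))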

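This code should give you the price from the "strike-through list" class when it is available; otherwise it falls back to the regular "sales" class price.

The span logic from your own code should still work on the result.

Hope this helps.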
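
If you want plain values instead of printed tag lists, one possible variation is sketched below. It runs against a hand-written fragment, so the markup, the `SAMPLE_HTML` name, and the `list_or_sales_prices` helper are assumptions for illustration and not taken from the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modeled on the class names used in this thread.
SAMPLE_HTML = """
<div class="price">
  <span class="strike-through list"><span class="value" content="499.00">$499.00</span></span>
  <span class="sales"><span class="value" content="349.30">$349.30</span></span>
</div>
<div class="price">
  <span class="sales"><span class="value" content="189.00">$189.00</span></span>
</div>
"""


def list_or_sales_prices(html):
    """For each .price block, return the list price if present, else the sales price."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for price in soup.select(".price"):
        # select_one() returns the first match or None.
        node = price.select_one(".strike-through.list .value")
        if node is None:
            node = price.select_one(".sales .value")
        if node is not None:
            results.append(node.get("content"))
    return results


print(list_or_sales_prices(SAMPLE_HTML))  # ['499.00', '189.00']
```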

【Comments】:

What is the select function? How does it know which class to iterate over?
select is the CSS selector method in bs4; for more information see crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors.
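
As a rough illustration of how the selector string decides what gets matched, here is a small sketch on made-up HTML. The `id` attributes and the `ids` helper exist only for this demo. The point is that `.strike-through.list` (no space) requires both classes on the same element, while `.strike-through .list` (with a space) means a `.list` element nested somewhere inside a `.strike-through` element.

```python
from bs4 import BeautifulSoup

# Made-up elements; the ids are only there to show which ones matched.
HTML = """
<span class="strike-through list" id="a">both classes on one element</span>
<span class="strike-through" id="b"><span class="list" id="c">nested</span></span>
<span class="list" id="d">list only</span>
"""

soup = BeautifulSoup(HTML, "html.parser")


def ids(selector):
    """Return the id of every element matched by a CSS selector."""
    return [tag["id"] for tag in soup.select(selector)]


compound = ids(".strike-through.list")     # both classes on the same element
descendant = ids(".strike-through .list")  # .list inside a .strike-through
single = ids(".list")                      # any element carrying class "list"

print(compound)    # ['a']
print(descendant)  # ['c']
print(single)      # ['a', 'c', 'd']
```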

【Solution 3】:

Code:

from bs4 import BeautifulSoup
import requests

page = requests.get("https://www.patagonia.ca/shop/mens-hard-shell-jackets-vests", verify = False)
soup = BeautifulSoup(page.content, 'html.parser')

div_price = []
for price in soup.find_all('span', {'class': 'strike-through list'}):
    div_price.append(str(price.text).strip()[3:])

print(div_price)

Output:

['499', '435', '879', '999', '799']
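
The `[3:]` slice above depends on the price text always starting with a prefix of exactly three characters. If that ever changes, a regular expression is one possible alternative. The sketch below is only an illustration: the `extract_amount` helper and the sample strings are assumptions, not values taken from the page.

```python
import re

# Matches a run of digits with optional thousands separators and decimals.
AMOUNT_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_amount(text):
    """Return the first numeric amount in text as a string, or None if absent."""
    match = AMOUNT_RE.search(text)
    if match is None:
        return None
    # Drop thousands separators so the result can be passed to float() or Decimal().
    return match.group(0).replace(",", "")


# Hypothetical price strings with different prefixes.
print(extract_amount("CA$499"))        # 499
print(extract_amount("  $1,299.00 "))  # 1299.00
print(extract_amount("n/a"))           # None
```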

【Comments】: