【Question Title】: Scraping specific tag and keyword, printing info associated with it using BeautifulSoup
【Posted】: 2016-11-23 03:52:01
【Question】:

I am trying to scrape https://store.fabspy.com/collections/new-arrivals-beauty for the sapphire eyeliner product and return the information associated with its product ID. So far I have:

from bs4 import BeautifulSoup
import urllib2
url = 'https://store.fabspy.com/collections/new-arrivals-beauty'
page = BeautifulSoup(url.read())
soup = BeautifulSoup((page))
tag = 'div class="product-content"'
if row in soup.html.body.findAll(tag):
    data = row.findAll('id')
    if data and 'sapphire' in data[0].text:
        print data[4].text

The information I am trying to retrieve is the following:

<div class="product-content">
    <div class="pc-inner"> 
      <div data-handle="clematis-dewdrop-sparkling-eye-pencil-g7454c-sapphire" 
           data-target="#quick-shop-popup"
           class="quick_shop quick-shop-button"
           data-toggle="modal"
           title="Quick View">
        <span>+ Quick View</span>
        <span class="json hide">
          {
            "id":8779050374,
            "title":"Clematis - Dewdrop Sparkling Gel Eye Liner Pencil # G7454C**Sapphire**",
            "handle":"clematis-dewdrop-sparkling-eye-pencil-g7454c-sapphire",
            "description":"\u003cdiv\u003e\r\n\r\nGel Formula, Rich Colour, Matte Finish, Long-Wearing, Safe for Waterline\r\n\r\n\u003cbr\u003e\n\u003c\/div\u003e\u003cdiv\u003e\u003cbr\u003e\u003c\/div\u003e \u003cimg alt=\"\" src=\"\/\/i.imgur.com\/adW5MKl.jpg\"\u003e",
            "published_at":"2016-10-17T20:15:40+08:00",
            "created_at":"2016-10-17T20:15:40+08:00",
            "vendor":"Clematis",
            "type":"Latest,Beauty,New,Makeup,Best, Clematis, Eyes",
            "tags":["Beauty","Best","Clematis","Eyes","Latest","Makeup","New"],
            "price":4900,
            "price_min":4900,
            "price_max":4900,
            "available":true,
            "price_varies":false,
            "compare_at_price":7900,
            "compare_at_price_min":7900,
            "compare_at_price_max":7900,
            "compare_at_price_varies":false,
            "variants":[{"id":31447937030", "title":"N\/A"]
          }

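Specifically, the id at the end. Please tell me which tag my script should target to retrieve this information, and how I can keyword-search for the sapphire colour and get its id in my script. Thanks!

【Comments】:

  • You need to get the text inside the span with class="json hide" and parse that text as JSON.

Tags: python html css web-scraping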


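【Solution 1】:

There are several errors in your existing code. I suggest using requests instead of urllib2. I am also using the re and json libraries. Here is what I would do in your case (read the code comments for the explanation).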

from bs4 import BeautifulSoup
import requests
import re
import json
# URL to scrape
url = 'https://store.fabspy.com/collections/new-arrivals-beauty'

# HTML data of the page
# You can add checks for 404 errors
soup = BeautifulSoup(requests.get(url).text, "lxml")

# Get a list of all elements having `sapphire` in the `data-handle` attribute
sapphire = soup.findAll(attrs={'data-handle': re.compile(r".*sapphire.*")})
# Take first element of this list (I checked, there is just one element)
sapphire = sapphire[0]

# Find class inside this element having JSON data. Taking just first element's text
json_text = sapphire.findAll(attrs={'class': "json"})[0].text

# Converting it to a dictionary
data = json.loads(json_text)
print(data["id"])
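
The "checks for 404 errors" mentioned in the code comment, plus a guard for the case where nothing matches, could be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch_html`, `extract_product_id`, and `SAMPLE_HTML` are hypothetical names that are not part of the answer, the 10-second timeout is arbitrary, and the built-in "html.parser" is swapped in for "lxml" so that no extra parser needs to be installed. The sample markup is a trimmed-down stand-in for the real page so the parsing step can be tried offline.

```python
import json
import re

import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and raise on HTTP errors such as 404 (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


def extract_product_id(html, keyword):
    """Return the "id" from the JSON span of the first element whose
    data-handle attribute contains `keyword`, or None if nothing matches."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(attrs={"data-handle": re.compile(re.escape(keyword))})
    if element is None:
        return None
    json_span = element.find(class_="json")
    if json_span is None:
        return None
    return json.loads(json_span.get_text())["id"]


# Hypothetical, trimmed-down sample standing in for the real page
SAMPLE_HTML = """
<div class="product-content">
  <div data-handle="clematis-dewdrop-sparkling-eye-pencil-g7454c-sapphire"
       class="quick_shop quick-shop-button">
    <span>+ Quick View</span>
    <span class="json hide">{"id": 8779050374, "vendor": "Clematis"}</span>
  </div>
</div>
"""

print(extract_product_id(SAMPLE_HTML, "sapphire"))
```

Against the live site, the call would presumably look like `extract_product_id(fetch_html(url), "sapphire")`, assuming the page still embeds the product JSON in the same way.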

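【Comments】:

  • How do I grab the "data-handle" elements without knowing how many there are? For example, where you set sapphire = sapphire[0]: suppose I never know how many elements the page will have. The same goes for the json_text line. Thanks!
  • You can always use the len() function to check the length of the list returned by findAll() (it is just a plain old list). In this case, I printed the list and saw that it had only one element.
  • I will just be searching for a string rather than an "id" attribute, because the particular page I am scraping has no ID attribute. I would simply use: if (keyword) in sapphire, print sapphire[0]. Nonetheless, here is the code, which raises an IndexError: from bs4 import BeautifulSoup; import requests; import re; import json; import lxml; url = 'https://packershoes.com/'; soup = BeautifulSoup(requests.get(url).text, "lxml"); sapphire = re.findall(r'<script type="text/javascript>"+.*</script>+', str(soup), re.I|re.M); print(sapphire[0])
  • ^^ I did not tag you earlier.
  • Are you sure you have read and understood the code provided in the answer? I ran the code on your page and it gets the id you asked for in the question. It does not look for any "id" attribute; it extracts the id from the JSON. If you want to get something else, please tell me explicitly. Your comment makes no sense.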
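
On the question of not knowing how many elements will match: one way to avoid indexing with [0] altogether is to loop over whatever findAll() (find_all() in current BeautifulSoup) returns, so that zero, one, or many matches are all handled without an IndexError. The sketch below is a hedged illustration only: `find_product_ids` and the sample markup are hypothetical, the case-insensitive match is an added assumption, and "html.parser" is used instead of "lxml" to keep it dependency-light.

```python
import json
import re

from bs4 import BeautifulSoup


def find_product_ids(html, keyword):
    """Return (data-handle, id) pairs for every element whose
    data-handle attribute contains `keyword` (case-insensitive)."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    results = []
    # find_all returns a list-like ResultSet; looping works for any length
    for element in soup.find_all(attrs={"data-handle": pattern}):
        json_span = element.find(class_="json")
        if json_span is None:
            continue  # skip matches that carry no embedded JSON
        data = json.loads(json_span.get_text())
        results.append((element["data-handle"], data["id"]))
    return results


# Hypothetical sample with several products, two of which match
SAMPLE_HTML = """
<div data-handle="pencil-sapphire"><span class="json hide">{"id": 1}</span></div>
<div data-handle="pencil-ruby"><span class="json hide">{"id": 2}</span></div>
<div data-handle="liner-SAPPHIRE-mini"><span class="json hide">{"id": 3}</span></div>
"""

matches = find_product_ids(SAMPLE_HTML, "sapphire")
print(len(matches))
for handle, product_id in matches:
    print(handle, product_id)
```

An empty result list simply means no product matched the keyword, which the caller can test with `if not matches:` before printing anything.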