【Question title】: Extracting data from Script tags in HTML using Python
【Posted】: 2020-12-07 05:56:12
【Question】:

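I'm trying to get the data inside these script tags, but I can't seem to convert it to JSON so that I can parse it after reading it. The data I'm interested in are the name, image, sku, and price.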

HTML:

<script type="application/ld+json">
        {
          "@context": "http://schema.org/",
          "@type": "Product",
          "name": "Key Pouch",
          "image": "https://us.louisvuitton.com/images/is/image/lv/1/PP_VP_L/louis-vuitton-key-pouch-monogram-gifts-for-women--M62650_PM2_Front view.jpg",
          "description": "The Key Pouch in iconic Monogram canvas is a playful yet practical accessory that can carry coins, cards, folded bills and other small items, in addition to keys. Secured with an LV-engraved zip, it can be hooked onto the D-ring inside most Louis Vuitton bags, or used as a bag or belt charm.",
          "sku": "M62650",
          "brand": {
            "@type": "Thing",
            "name": "LOUIS VUITTON"
          },
          "offers": {
            "@type": "Offer",
            "url" : "https://us.louisvuitton.com/eng-us/products/key-pouch-monogram-000941",
            "priceCurrency": "USD",
            "price": "215.00",
            "availability": "http://schema.org/OutOfStock",
            "seller": {
              "@type": "Organization",
              "name": "LOUIS VUITTON"
            }
          }
        }
</script>

Code:

from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen  # Request and urlopen are used below
import json

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}

req = Request("https://us.louisvuitton.com/eng-us/products/key-pouch-monogram-000941", headers= HEADERS)
webpage = urlopen(req).read()

page_soup = soup(webpage, "html.parser")
data = json.loads(page_soup.find('script', type='application/ld+json').text)

print(data)

Output:

raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Any help would be greatly appreciated.

Tags: python html json beautifulsoup python-requests


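【Solution 1】:

From the documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text:

As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be 'text', since those tags are not part of the human-visible content of the page.

That is why .text returns an empty string here and json.loads fails at char 0. So use html5lib instead. A working solution: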

from bs4 import BeautifulSoup as soup
import requests
import json

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}
req = requests.get("https://us.louisvuitton.com/eng-us/products/key-pouch-monogram-000941", headers= HEADERS)
page_soup = soup(req.text, "html5lib")
data = json.loads(page_soup.find('script', type='application/ld+json').text)
print(data)

Output:

{'@context': 'http://schema.org/', '@type': 'Product', 'name': 'Key Pouch', 'image': 'https://us.louisvuitton.com/images/is/image/lv/1/PP_VP_L/louis-vuitton-key-pouch-monogram-gifts-for-women--M62650_PM2_Front view.jpg', 'description': 'The Key Pouch in iconic Monogram canvas is a playful yet practical accessory that can carry coins, cards, folded bills and other small items, in addition to keys. Secured with an LV-engraved zip, it can be hooked onto the D-ring inside most Louis Vuitton bags, or used as a bag or belt charm.', 'sku': 'M62650', 'brand': {'@type': 'Thing', 'name': 'LOUIS VUITTON'}, 'offers': {'@type': 'Offer', 'url': 'https://us.louisvuitton.com/eng-us/products/key-pouch-monogram-000941', 'priceCurrency': 'USD', 'price': '215.00', 'availability': 'http://schema.org/OutOfStock', 'seller': {'@type': 'Organization', 'name': 'LOUIS VUITTON'}}}
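
If installing html5lib is not an option, one possible alternative is to read the tag's `.string` attribute instead of `.text`. `.string` returns the tag's single child string directly rather than going through `get_text()`, so it should also work with html.parser. The sketch below is only an illustration: it parses a trimmed copy of the JSON-LD block from the question instead of fetching the live page, and `extract_product` is a hypothetical helper name, not part of Beautiful Soup. It pulls out the four fields the question asks for.

```python
import json
from bs4 import BeautifulSoup

# Trimmed copy of the JSON-LD block from the question. Assumption: the live
# page serves an equivalent block; no network request is made here.
HTML = """
<html><head>
<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@type": "Product",
  "name": "Key Pouch",
  "image": "https://us.louisvuitton.com/images/is/image/lv/1/PP_VP_L/louis-vuitton-key-pouch-monogram-gifts-for-women--M62650_PM2_Front view.jpg",
  "sku": "M62650",
  "offers": {"@type": "Offer", "priceCurrency": "USD", "price": "215.00"}
}
</script>
</head><body></body></html>
"""


def extract_product(html):
    """Return name, image, sku and price from the first JSON-LD script tag.

    Hypothetical helper for illustration; returns None when no usable
    script tag is found.
    """
    page_soup = BeautifulSoup(html, "html.parser")
    tag = page_soup.find("script", type="application/ld+json")
    # Guard against a missing or empty tag, which would otherwise raise
    # "Expecting value: line 1 column 1 (char 0)" in json.loads.
    if tag is None or not tag.string:
        return None
    data = json.loads(tag.string)
    return {
        "name": data.get("name"),
        "image": data.get("image"),
        "sku": data.get("sku"),
        # The price is nested under "offers" in this schema.org Product block.
        "price": data.get("offers", {}).get("price"),
    }


product = extract_product(HTML)
print(product)
```

The price stays a string ("215.00"), exactly as it appears in the JSON; convert it with `decimal.Decimal` or `float` if you need arithmetic.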
