【Question title】: Parsing JSON out of HTML with BeautifulSoup
【Posted】: 2020-11-22 18:13:43
【Question】:
import json
import re

from bs4 import BeautifulSoup

data = """
<script data-hid="ld-json-ld.1551860" data-n-head="ssr" preserve="preserve" type="application/ld+json">{"@context":"http://schema.org","@type":"NewsArticle","mainEntityOfPage":{"@type":"WebPage","@id":"https://www.nzz.ch/schweiz/ploetzlich-ist-das-klimaziel-in-reichweite-ld.1551860"},"headline":"Plötzlich ist  das Klimaziel in Griffweite | NZZ","datePublished":"2020-04-15T12:33:47.623Z","dateModified":"2020-04-15T12:35:01.841Z","publisher":{"@type":"Organization","name":"Neue Zürcher Zeitung AG, Schweiz","url":"https://www.nzz.ch","logo":{"@type":"ImageObject","url":"https://www.nzz.ch/logo.png","width":413,"height":60},"contactPoint":[{"@type":"ContactPoint","telephone":"+41-44-2581000","contactType":"customer service"}],"sameAs":["https://www.facebook.com/nzz","https://www.twitter.com/nzz","https://www.youtube.com/channel/UCK1aTcR0AckQRLTlK0c4fuQ","https://www.linkedin.com/company/neue-zurcher-zeitung","https://plus.google.com/+nzz/","http://www.freebase.com/m/041b43"]},"description":"Der Ausstoss an Treibhausgasen geht nur langsam zurück. Wegen der Pandemie und der warmen Witterung könnte das Klimaziel 2020 trotzdem erfüllt werden. Der Bund aber bleibt skeptisch.","isAccessibleForFree":false,"hasPart":{"@type":"WebPageElement","isAccessibleForFree":false,"cssSelector":".regwalled"},"image":{"@type":"ImageObject","url":"https://img.nzz.ch/O=75/https://nzz-img.s3.amazonaws.com/2020/4/15/b71dc7b9-0813-4082-9bb0-a2fd28395a67.jpeg","width":"7050","height":"4705"},"author":{"@type":"Person","name":"David Vonplon"}}</script>"""

soup = BeautifulSoup(data, "html.parser")

pattern = re.compile(r"window.Rent.data\s+=\s+(\{.*?\});\n")
script = soup.find("script", text=pattern)

print(script)

I want to parse the JSON part out of this markup, but all I get is `None`.

If I try this instead:

soup.find("script").text

Output: ''

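Can someone help me find my mistake?

This is a simplified version of a larger script. It still worked 4 or 5 months ago and now it no longer does, and I just can't see what I am doing wrong.

Thanks a lot. Marco

【Comments】:

Tags: python json parsing beautifulsoup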


【Answer 1】:

Try

json.loads(soup.select_one('script').string)

and see whether that works. It works for the `data` string in your question.
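For context, the `find` call in the question returns `None` because its regular expression looks for `window.Rent.data = {...};`, which does not occur in this script tag; the tag holds bare JSON-LD. A minimal sketch of one alternative is to select the tag by its `type` attribute and read `.string`, since the question shows `.text` coming back empty. The `html` sample below is a shortened, hypothetical stand-in for the markup in the question.

```python
import json

from bs4 import BeautifulSoup

# Shortened stand-in for the markup in the question (hypothetical sample).
html = """
<script data-hid="ld-json-ld.1551860" type="application/ld+json">{"@type":"NewsArticle","headline":"Plötzlich ist das Klimaziel in Griffweite | NZZ","author":{"@type":"Person","name":"David Vonplon"}}</script>
"""

soup = BeautifulSoup(html, "html.parser")

# Match on the tag's type attribute rather than on its text content.
script = soup.find("script", attrs={"type": "application/ld+json"})

# .string is the tag's single child string, i.e. the raw JSON text.
article = json.loads(script.string)

print(article["headline"])
print(article["author"]["name"])
```

Filtering on `type="application/ld+json"` does not depend on the article-specific `data-hid` value, so the same call should carry over to other pages that embed JSON-LD this way.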

【Comments】:

    【Answer 2】:

    In `find`, pass any attribute of the `script` tag as a filter.

    import json
    
    from bs4 import BeautifulSoup
    
    data = """
    <script data-hid="ld-json-ld.1551860" data-n-head="ssr" preserve="preserve" type="application/ld+json">{"@context":"http://schema.org","@type":"NewsArticle","mainEntityOfPage":{"@type":"WebPage","@id":"https://www.nzz.ch/schweiz/ploetzlich-ist-das-klimaziel-in-reichweite-ld.1551860"},"headline":"Plötzlich ist  das Klimaziel in Griffweite | NZZ","datePublished":"2020-04-15T12:33:47.623Z","dateModified":"2020-04-15T12:35:01.841Z","publisher":{"@type":"Organization","name":"Neue Zürcher Zeitung AG, Schweiz","url":"https://www.nzz.ch","logo":{"@type":"ImageObject","url":"https://www.nzz.ch/logo.png","width":413,"height":60},"contactPoint":[{"@type":"ContactPoint","telephone":"+41-44-2581000","contactType":"customer service"}],"sameAs":["https://www.facebook.com/nzz","https://www.twitter.com/nzz","https://www.youtube.com/channel/UCK1aTcR0AckQRLTlK0c4fuQ","https://www.linkedin.com/company/neue-zurcher-zeitung","https://plus.google.com/+nzz/","http://www.freebase.com/m/041b43"]},"description":"Der Ausstoss an Treibhausgasen geht nur langsam zurück. Wegen der Pandemie und der warmen Witterung könnte das Klimaziel 2020 trotzdem erfüllt werden. Der Bund aber bleibt skeptisch.","isAccessibleForFree":false,"hasPart":{"@type":"WebPageElement","isAccessibleForFree":false,"cssSelector":".regwalled"},"image":{"@type":"ImageObject","url":"https://img.nzz.ch/O=75/https://nzz-img.s3.amazonaws.com/2020/4/15/b71dc7b9-0813-4082-9bb0-a2fd28395a67.jpeg","width":"7050","height":"4705"},"author":{"@type":"Person","name":"David Vonplon"}}</script>"""
    
    soup = BeautifulSoup(data, "html.parser")
    
    print(json.loads(soup.find("script", {"preserve":"preserve"}).get_text(strip=True)))
    

    Output:

    {'@context': 'http://schema.org', '@type': 'NewsArticle', 'mainEntityOfPage': {'@type': 'WebPage', '@id': 'https://www.nzz.ch/schweiz/ploetzlich-ist-das-klimaziel-in-reichweite-ld.1551860'}, 'headline': 'Plötzlich ist  das Klimaziel in Griffweite | NZZ', 'datePublished': '2020-04-15T12:33:47.623Z', 'dateModified': '2020-04-15T12:35:01.841Z', 'publisher': {'@type': 'Organization', 'name': 'Neue Zürcher Zeitung AG, Schweiz', 'url': 'https://www.nzz.ch', 'logo': {'@type': 'ImageObject', 'url': 'https://www.nzz.ch/logo.png', 'width': 413, 'height': 60}, 'contactPoint': [{'@type': 'ContactPoint', 'telephone': '+41-44-2581000', 'contactType': 'customer service'}], 'sameAs': ['https://www.facebook.com/nzz', 'https://www.twitter.com/nzz', 'https://www.youtube.com/channel/UCK1aTcR0AckQRLTlK0c4fuQ', 'https://www.linkedin.com/company/neue-zurcher-zeitung', 'https://plus.google.com/+nzz/', 'http://www.freebase.com/m/041b43']}, 'description': 'Der Ausstoss an Treibhausgasen geht nur langsam zurück. Wegen der Pandemie und der warmen Witterung könnte das Klimaziel 2020 trotzdem erfüllt werden. Der Bund aber bleibt skeptisch.', 'isAccessibleForFree': False, 'hasPart': {'@type': 'WebPageElement', 'isAccessibleForFree': False, 'cssSelector': '.regwalled'}, 'image': {'@type': 'ImageObject', 'url': 'https://img.nzz.ch/O=75/https://nzz-img.s3.amazonaws.com/2020/4/15/b71dc7b9-0813-4082-9bb0-a2fd28395a67.jpeg', 'width': '7050', 'height': '4705'}, 'author': {'@type': 'Person', 'name': 'David Vonplon'}}
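Once `json.loads` has run, the result is an ordinary Python dict, so individual fields can be read by key. The sketch below uses a shortened copy of the JSON from the question; the variable names and the choice of fields are illustrative assumptions, not part of the original answer.

```python
import json
from datetime import datetime

# Shortened copy of the JSON from the question (most fields omitted).
raw = '{"@type":"NewsArticle","headline":"Plötzlich ist  das Klimaziel in Griffweite | NZZ","datePublished":"2020-04-15T12:33:47.623Z","author":{"@type":"Person","name":"David Vonplon"}}'

article = json.loads(raw)

headline = article["headline"]
# .get() avoids a KeyError if a page omits the author block.
author = article.get("author", {}).get("name")
# %z accepts the trailing "Z" on Python 3.7 and later.
published = datetime.strptime(article["datePublished"], "%Y-%m-%dT%H:%M:%S.%f%z")

print(headline, author, published.year)
```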
    

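    Update:

    Apparently the live page has many script tags with the `preserve` attribute, so you can narrow the match by filtering on additional attributes.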

    import requests, json, re
    from bs4 import BeautifulSoup
    
    res = requests.get("https://www.nzz.ch/schweiz/ploetzlich-ist-das-klimaziel-in-reichweite-ld.1551860?reduced=true")
    soup = BeautifulSoup(res.text, "html.parser")
    data = json.loads(soup.find("script", attrs={"preserve": "preserve", "data-hid": re.compile(r"^ld-json-ld")}).get_text(strip=True))
    
    print(data)
    

    Output:

    {'@context': 'http://schema.org', '@type': 'NewsArticle', 'mainEntityOfPage': {'@type': 'WebPage', '@id': 'https://www.nzz.ch/schweiz/ploetzlich-ist-das-klimaziel-in-reichweite-ld.1551860'}, 'headline': 'Plötzlich ist  das Klimaziel in Griffweite | NZZ', 'datePublished': '2020-04-15T12:33:47.623Z', 'dateModified': '2020-04-15T13:49:04.823Z', 'publisher': {'@type': 'Organization', 'name': 'Neue Zürcher Zeitung AG, Schweiz', 'url': 'https://www.nzz.ch', 'logo': {'@type': 'ImageObject', 'url': 'https://www.nzz.ch/logo.png', 'width': 413, 'height': 60}, 'contactPoint': [{'@type': 'ContactPoint', 'telephone': '+41-44-2581000', 'contactType': 'customer service'}], 'sameAs': ['https://www.facebook.com/nzz', 'https://www.twitter.com/nzz', 'https://www.youtube.com/channel/UCK1aTcR0AckQRLTlK0c4fuQ', 'https://www.linkedin.com/company/neue-zurcher-zeitung', 'https://plus.google.com/+nzz/', 'http://www.freebase.com/m/041b43']}, 'description': 'Der Ausstoss an Treibhausgasen geht nur langsam zurück. Wegen der Pandemie und der warmen Witterung könnte das Klimaziel 2020 trotzdem erfüllt werden. Der Bund aber bleibt skeptisch.', 'isAccessibleForFree': False, 'hasPart': {'@type': 'WebPageElement', 'isAccessibleForFree': False, 'cssSelector': '.regwalled'}, 'image': {'@type': 'ImageObject', 'url': 'https://img.nzz.ch/O=75/https://nzz-img.s3.amazonaws.com/2020/4/15/b71dc7b9-0813-4082-9bb0-a2fd28395a67.jpeg', 'width': '7050', 'height': '4705'}, 'author': {'@type': 'Person', 'name': 'David Vonplon'}}
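Because a page can carry several script tags, and `json.loads` raises `JSONDecodeError: Expecting value: line 1 column 1 (char 0)` when handed an empty string, a more defensive variant is to walk all JSON-LD tags and keep the first one that parses and has the wanted `@type`. This is only a sketch: the helper name `find_json_ld` and the sample `html` are hypothetical, and the real page may be structured differently.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical page with several script tags, only one of which is the article.
html = """
<script preserve="preserve" type="application/ld+json"></script>
<script preserve="preserve" type="application/ld+json">{"@type":"Organization","name":"NZZ"}</script>
<script preserve="preserve" type="application/ld+json">{"@type":"NewsArticle","headline":"Example headline"}</script>
<script>window.example = 1;</script>
"""


def find_json_ld(soup, wanted_type):
    """Return the first JSON-LD object whose @type equals wanted_type, else None."""
    for tag in soup.find_all("script", attrs={"type": "application/ld+json"}):
        raw = tag.string
        if not raw or not raw.strip():
            continue  # empty tag: json.loads would raise "Expecting value"
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not valid JSON, try the next tag
        if isinstance(obj, dict) and obj.get("@type") == wanted_type:
            return obj
    return None


soup = BeautifulSoup(html, "html.parser")
article = find_json_ld(soup, "NewsArticle")
print(article)
```

Returning `None` instead of raising lets a batch job that processes many saved pages log the miss and move on to the next article.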
    

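    【Comments】:

    • Thanks a lot. With this code I get: JSONDecodeError: Expecting value: line 1 column 1 (char 0). Could something be wrong with my packages?
    • Is this the page you are trying to scrape - nzz.ch/schweiz/…?
    • The answer only handles the HTML string attached to the question.
    • Yes. I run a mining script that saves all new articles as HTML in a CSV every 5 minutes. A second script then parses all the information out of the HTML. In April it worked; now it gives me this JSONDecodeError.
    • @Marco_CH I updated my answer. It now scrapes the page and extracts the information.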