【Question title】: Scraping script tags where Content-Type is application/ld+json
【Posted】: 2021-10-15 23:03:36
【Question】:

The error is raised at jsn = json.loads(data.string). I want to scrape the reviewers and ratings, but I get an attribute error on string. Can you help?

Code:

from bs4 import BeautifulSoup
import json
import requests
import pandas as pd

r = requests.get('https://www.zomato.com/beirut/divvy-ashrafieh/reviews')
soup = BeautifulSoup(r.text, "lxml")


data = soup.find('script', {"type": "application/ld+json"})
jsn = json.loads(data.string)

print(jsn)

【Comments】:

  • Try str(data.string). data.string is still a bs4 object: a NavigableString, provided data is not None.
  • json.loads(data.text.strip())?
  • Neither approach works.
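
An attribute error on string usually means soup.find() returned None, so no matching script tag was in the HTML that came back. Here is a minimal sketch of a guard for that case. The helper name load_first_json_ld and both HTML snippets are made up for illustration (the "Access Denied" page is only a stand-in, not what Zomato actually returned), and html.parser is used so the sketch does not depend on lxml:

```python
import json

from bs4 import BeautifulSoup


def load_first_json_ld(html):
    """Return the first JSON-LD block parsed as Python data, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", {"type": "application/ld+json"})
    # find() returns None when nothing matches; calling .string on that None
    # is what raises the AttributeError.
    if tag is None or tag.string is None:
        return None
    return json.loads(tag.string)


# Hypothetical page without any JSON-LD tag.
blocked_html = "<html><body><h1>Access Denied</h1></body></html>"

# Hypothetical page with one JSON-LD tag.
ok_html = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Restaurant", "name": "DIVVY"}'
    "</script></head></html>"
)

print(load_first_json_ld(blocked_html))
print(load_first_json_ld(ok_html))
```

If the helper returns None for the real page, printing r.status_code and the first few hundred characters of r.text should show what the server actually sent.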

Tags: python json web-scraping beautifulsoup


【Solution 1】:

Try setting the User-Agent HTTP header:

from bs4 import BeautifulSoup
import json
import requests

# Send a browser-like User-Agent; with the default requests one, the
# response does not seem to contain the JSON-LD script tags.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0"
}

r = requests.get(
    "https://www.zomato.com/beirut/divvy-ashrafieh/reviews", headers=headers
)
soup = BeautifulSoup(r.text, "lxml")

# The page contains more than one JSON-LD block, so collect them all.
all_data = soup.find_all("script", {"type": "application/ld+json"})

for data in all_data:
    jsn = json.loads(data.string)
    print(json.dumps(jsn, indent=4))

Prints:

{
    "@context": "http://schema.org",
    "@type": "WebSite",
    "name": "Zomato",
    "url": "https://www.zomato.com"
}
{
    "@context": "https://schema.org",
    "@type": "Restaurant",
    "name": "DIVVY",
    "url": "/beirut/divvy-ashrafieh/reviews",
    "openingHours": "12noon \u2013 11:30pm (Today)",
    "hasmap": "https://maps.zomato.com/php/staticmap?center=33.8882180000,35.5199140000&maptype=zomato&markers=33.8882180000,35.5199140000,pin_res32&sensor=false&scale=2&zoom=16&language=en&size=240x150&size=400x240",
    "menu": "/beirut/divvy-ashrafieh/reviews/menu",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "ABC Ashrafieh, Level 3, Furn el Hayek Street, Ashrafieh, Beirut District",
        "addressLocality": "ABC Ashrafieh, Beirut District",
        "addressRegion": "Beirut District",
        "postalCode": "",
        "addressCountry": "Lebanon"
    },

...and so on.
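
The output above is cut off before any review data, so the exact field names are not visible here. As a rough sketch of how the reviewers and ratings might then be pulled out, assume the Restaurant object follows the schema.org Review shape: a "review" (or "reviews") list whose items carry "author" and "reviewRating.ratingValue". The helper extract_reviews and the sample dictionary below are made up for illustration and are not real Zomato data:

```python
def extract_reviews(json_ld):
    """Collect reviewer/rating pairs from one parsed JSON-LD object."""
    # Assumption: reviews live under "review" or "reviews" (schema.org style).
    reviews = json_ld.get("review") or json_ld.get("reviews") or []
    if isinstance(reviews, dict):
        # A single review may appear as an object instead of a list.
        reviews = [reviews]

    rows = []
    for review in reviews:
        author = review.get("author")
        # "author" may be a plain string or a nested Person object.
        if isinstance(author, dict):
            author = author.get("name")
        rating = review.get("reviewRating") or {}
        rows.append({"reviewer": author, "rating": rating.get("ratingValue")})
    return rows


# Invented sample in the assumed shape.
sample = {
    "@type": "Restaurant",
    "name": "DIVVY",
    "reviews": [
        {"author": "Reviewer A", "reviewRating": {"ratingValue": 5}},
        {
            "author": {"@type": "Person", "name": "Reviewer B"},
            "reviewRating": {"ratingValue": 3},
        },
    ],
}

rows = extract_reviews(sample)
print(rows)
```

In the loop above you would call extract_reviews(jsn) only for the block whose "@type" is "Restaurant"; the resulting list of dicts can be passed straight to pd.DataFrame(rows). Check the real key names against the full printed JSON first.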
