Unable to parse rating information from a webpage using requests

【Question title】: Unable to parse rating information from a webpage using requests
【Posted】: 2020-07-22 19:20:39
【Question description】:

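I'm trying to scrape a certain piece of information from a webpage, but I'm failing. The text I want is available in the page source, yet I still can't fetch it. This is the site address. The portion I'm after, visible in the image, is Not Rated.

Relevant HTML: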

<div class="subtext">
                    Not Rated
    <span class="ghost">|</span>                    <time datetime="PT188M">
                        3h 8min
                    </time>
    <span class="ghost">|</span>
<a href="/search/title?genres=drama&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Drama</a>, 
<a href="/search/title?genres=musical&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Musical</a>, 
<a href="/search/title?genres=romance&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Romance</a>
    <span class="ghost">|</span>
<a href="/title/tt0150992/releaseinfo?ref_=tt_ov_inf" title="See more release dates">18 June 1999 (India)
</a>            </div>

I've tried with:

import requests
from bs4 import BeautifulSoup

link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    rating = soup.select_one(".titleBar .subtext").next_element
    print(rating)

The script above does not give me any result.

Expected output:

Not Rated

How can I get the rating from that webpage?

【Question comments】:

Tags: python python-3.x web-scraping beautifulsoup python-requests

【Solution 1】:

To get the correct version of the HTML page, also specify the Accept-Language HTTP header:

import requests
from bs4 import BeautifulSoup

link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    s.headers['Accept-Language'] = 'en-US,en;q=0.5'  # <-- specify also this!
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    rating = soup.select_one(".titleBar .subtext").next_element
    print(rating)

Prints:

            Not Rated
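
The printed value keeps the indentation that surrounds the text node in the page source. A minimal sketch for tidying it up, assuming the raw text has been obtained as in the script above; `clean_text` is a hypothetical helper name, not part of requests or BeautifulSoup:

```python
def clean_text(raw):
    """Collapse runs of whitespace and trim both ends; return None for empty input."""
    if raw is None:
        return None
    cleaned = " ".join(str(raw).split())
    return cleaned or None


# Text node as it appears in the HTML shown in the question
raw_rating = "\n                    Not Rated\n    "
print(clean_text(raw_rating))
```

Passing `None` through unchanged makes it easier to notice when the selector did not match anything.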

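【Comments】:

【Solution 2】:

There is a better way to get the information on this page. It becomes apparent if you dump the HTML content returned by the request: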

import requests
from bs4 import BeautifulSoup

link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    with open("response.html", "w", encoding=r.encoding) as file:
        file.write(r.text)

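In the dump you will find a <script type="application/ld+json"> element that contains all the information about the movie.
Then you only need to take the element's text, parse it as JSON, and extract the information you want from the result.
Here is a working example: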

import json
import requests
from bs4 import BeautifulSoup

link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    # Find the <script type="application/ld+json"> element and read its content
    movie_data = soup.find("script", attrs={"type": "application/ld+json"}).string
    movie_data = json.loads(movie_data)  # parse the JSON text into a dict
    content_rating = movie_data["contentRating"]  # get the rating
    print(content_rating)
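
The parsing step can be tried offline, without hitting the site. The following is only a sketch that uses the standard library's `html.parser` instead of BeautifulSoup; `JsonLdExtractor`, `extract_content_rating`, and `SAMPLE_HTML` are hypothetical names, and the sample markup is made up to mimic the structure described above. It also returns `None` instead of raising when the script tag or the `contentRating` key is missing.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._chunks))
            self._in_json_ld = False


def extract_content_rating(html):
    """Return the contentRating of the first JSON-LD block that has one, else None."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if isinstance(data, dict) and "contentRating" in data:
            return data["contentRating"]
    return None


# Made-up sample page, not real IMDb output
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Movie", "name": "Example Title", "contentRating": "Not Rated"}
</script>
</head><body></body></html>
"""

print(extract_content_rating(SAMPLE_HTML))
```

In a real script, `r.text` from the request above would take the place of `SAMPLE_HTML`.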

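【Comments】:

I noticed that right at the start, but what is wrong with the way I tried, given that the text is available in the page source?

【Solution 3】:

IMDB is one of those sites that make web scraping remarkably easy, and I love it. What they do that helps web scrapers is place a script near the top of the HTML containing the whole movie object in JSON format.

So, to get all the relevant information in an organized form, you only need to take the content of that single script tag and parse it as JSON. After that you can ask for specific pieces of information just as you would with a dictionary.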

import requests
import json
from bs4 import BeautifulSoup

# This part is basically the same as yours
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
r = requests.get(link)
soup = BeautifulSoup(r.content, "lxml")

# Why not get the whole JSON element of the movie?
script = soup.find("script", {"type": "application/ld+json"})
element = json.loads(script.string)  # .string is the raw text inside the <script> tag

print(element["contentRating"])
# Outputs "Not Rated"

# You can also inspect the rest of the JSON; it has all the relevant information inside.
# Just -> print(json.dumps(element, indent=2))

Note: in this example, the headers and the session are not required.
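
Once the JSON has been parsed into a dictionary, several fields can be pulled out in one go. This is a rough sketch rather than anything from the answer itself: `summarize` is a hypothetical helper, the field names are standard schema.org Movie properties that the page may or may not include, and `sample_element` is made-up data standing in for the parsed `element`.

```python
def summarize(element, fields=("name", "contentRating", "genre", "datePublished")):
    """Return the requested fields, using None for any key that is absent."""
    return {field: element.get(field) for field in fields}


# Made-up sample standing in for the dict produced by json.loads(...)
sample_element = {
    "@type": "Movie",
    "name": "Example Title",
    "contentRating": "Not Rated",
    "genre": ["Drama", "Musical", "Romance"],
}

summary = summarize(sample_element)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Using `dict.get` means a missing key shows up as `None` rather than stopping the script with a `KeyError`.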

【Comments】: