【Question title】: Unable to parse rating information from a webpage using requests
【Posted】: 2020-07-22 19:20:39
【Question description】:

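I'm trying to scrape a certain piece of information from a webpage, but failing. The text I want to scrape is available in the page source, yet I still can't fetch it. This is the site address. The portion I'm after, visible in the image, is Not Rated.

Relevant HTML: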

<div class="subtext">
                    Not Rated
    <span class="ghost">|</span>                    <time datetime="PT188M">
                        3h 8min
                    </time>
    <span class="ghost">|</span>
<a href="/search/title?genres=drama&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Drama</a>, 
<a href="/search/title?genres=musical&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Musical</a>, 
<a href="/search/title?genres=romance&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Romance</a>
    <span class="ghost">|</span>
<a href="/title/tt0150992/releaseinfo?ref_=tt_ov_inf" title="See more release dates">18 June 1999 (India)
</a>            </div>

I've tried:

import requests
from bs4 import BeautifulSoup

link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    rating = soup.select_one(".titleBar .subtext").next_element
    print(rating)

I get no result with the script above.

Expected output:

Not Rated

How can I get the rating from that webpage?

【Discussion】:

    Tags: python python-3.x web-scraping beautifulsoup python-requests


    【Solution 1】:

    To get the correct (English) version of the HTML page, specify the Accept-Language HTTP header:

    import requests
    from bs4 import BeautifulSoup
    
    link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
    
    with requests.Session() as s:
        s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
        s.headers['Accept-Language'] = 'en-US,en;q=0.5'  # <-- specify also this!
        r = s.get(link)
        soup = BeautifulSoup(r.text,"lxml")
        rating = soup.select_one(".titleBar .subtext").next_element
        print(rating)
    

    Prints:

                Not Rated
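
The printed value keeps the leading whitespace of the text node. Below is a minimal offline sketch of how `.strip()` cleans it up, and how to guard against the selector matching nothing. It assumes a trimmed copy of the HTML from the question, wrapped in a hypothetical `titleBar` container (the question's snippet shows only the `subtext` div), and it uses the built-in `html.parser` backend so no network access or `lxml` is needed:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup; the outer "titleBar" div is an
# assumption made so that the original selector can be reused unchanged.
html = """<div class="titleBar"><div class="subtext">
                    Not Rated
    <span class="ghost">|</span> <time datetime="PT188M">3h 8min</time>
</div></div>"""

soup = BeautifulSoup(html, "html.parser")
subtext = soup.select_one(".titleBar .subtext")

# next_element is the first text node inside the div; strip() removes the
# surrounding newlines and indentation. Fall back to None if nothing matched.
rating = subtext.next_element.strip() if subtext is not None else None
print(rating)
```

Checking `subtext` for `None` before touching `next_element` avoids an `AttributeError` when the page layout differs from what the selector expects.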
    

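    【Discussion】:

      【Solution 2】:

      There is a better way to get the information on this page. Start by dumping the HTML content returned by the request to a file: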

      import requests
      from bs4 import BeautifulSoup
      
      link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
      
      with requests.Session() as s:
          s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
          r = s.get(link)
          soup = BeautifulSoup(r.text,"lxml")
          with open("response.html", "w", encoding=r.encoding) as file:
              file.write(r.text)
      

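      In the dumped HTML you will find an element <script type="application/ld+json"> that contains all the information about the movie.
      Then you only need to get the element's text, parse it as JSON, and pull the information you want out of the parsed data.
      Here is a working example: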

      import json
      import requests
      from bs4 import BeautifulSoup
      
      link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
      
      with requests.Session() as s:
          s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
          r = s.get(link)
          soup = BeautifulSoup(r.text,"lxml")
          movie_data = soup.find("script", attrs={"type": "application/ld+json"}).string  # find the <script type="application/ld+json"> element and get its content
          movie_data = json.loads(movie_data)  # parse the JSON text into a Python dict
          content_rating = movie_data["contentRating"]  # get the rating
          print(content_rating)
      

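      【Discussion】:

      • I noticed that at the very beginning, but what is wrong with the way I tried, given that the text is available in the page source?
      【Solution 3】:

      IMDB is one of those sites that make web scraping remarkably easy, and I love it. What they do to make life easier for scrapers is put a script at the top of the HTML that contains the whole movie object in JSON format.

      So, to get all the relevant information in an organized form, you only need to grab the content of that single script tag and parse it as JSON; after that you can look up specific fields just as you would with a dictionary.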

      import requests
      import json
      from bs4 import BeautifulSoup
      
      #This part is basically the same as yours
      link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
      r = requests.get(link)
      soup = BeautifulSoup(r.content,"lxml")
      
      #Why not get the whole json element of the movie?
      script = soup.find('script', {"type" : "application/ld+json"})
      element = json.loads(script.text)
      
      print(element['contentRating'])
      #Outputs "Not Rated"
      
      
      # You can also inspect the rest of the JSON; it has all the relevant information inside
      # Just -> print(json.dumps(element, indent=2))
      

      Note: headers and a session are not necessary in this example.

      【Discussion】:
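
The dictionary-style lookup described above can also be done defensively, so the script does not crash if the script tag or the key is missing. This is a minimal offline sketch, not IMDB's actual payload: the inline HTML and the helper name `extract_content_rating` are hypothetical, and the real JSON has many more fields.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down page used only for illustration.
html = """<html><head>
<script type="application/ld+json">{"@type": "Movie", "name": "Example Movie", "contentRating": "Not Rated"}</script>
</head><body></body></html>"""


def extract_content_rating(page_html):
    """Return the contentRating from the page's JSON-LD block, or None."""
    soup = BeautifulSoup(page_html, "html.parser")
    script = soup.find("script", attrs={"type": "application/ld+json"})
    if script is None or not script.string:
        return None  # no JSON-LD block on the page
    try:
        data = json.loads(script.string)
    except json.JSONDecodeError:
        return None  # block present but not valid JSON
    if not isinstance(data, dict):
        return None  # e.g. a top-level JSON array
    return data.get("contentRating")  # None when the key is absent


rating = extract_content_rating(html)
missing = extract_content_rating("<html><body>no data here</body></html>")
print(rating)
print(missing)
```

Using `dict.get` instead of `element['contentRating']` returns `None` for titles that carry no rating field, rather than raising a `KeyError`.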
