【Question Title】: Adding values for items that do not exist when web crawling
【Posted】: 2019-10-19 00:54:54
【Question】:

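I am extracting review data from IMDB.

However, some reviews have no rank (rating).

I want to treat the Rank of such reviews as 0 and add it to the array.

I don't know how to do this.

Could you help me?

Thank you very much!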

web image

When I extract the data like this, I end up with fewer Rank values than titles and contents.

for star in soup.select('span:has(~ .point-scale)'):
    Star.append(star.text.strip())

for title in soup.find_all('a', {'class': 'title'}):
    Title.append(title.text.strip())

for content in soup.find_all(True, {'class': ['text show-more__control',
                                              'text show-more__control clickable']}):
    Content.append(content.text.strip())
    print(range(len(Content)))

Length of each list (rank, title, content)

How the elements on the site fit into the lists.

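【Comments】:

  • Please include any necessary information in the text of your question. Images of your code are not appropriate. See Why not upload images of code?
  • Hi Cyp, apart from the "web image", the other two snippets can easily be included as text in your question. Please do so, format them properly, and then (again, in your question) explain what you mean by "[rating] part is missing" and what you expect. Only then can we try to answer your question.
  • @khelwood, @minsago Sorry, I have added more explanation.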

Tags: python beautifulsoup web-crawler


【Solution 1】:

Not all reviews have a rating, so you need to take that into account:

$ python3 test.py https://www.imdb.com/title/tt5113040/reviews
Got response: 200
Title: The Secret Life of Pets 2
# (8/10) Not as bad as some reviews on here
Let's get this straight it a film made for childre...
-----
ddriver385, 26 May 2019

# (7/10) A Good Film for the kids
This film is a good film to watch with the kids. C...
-----
xxharriet_hobbsxx, 27 May 2019

# (7/10) Worth a watch
Admittedly, it probably wasn't necessary to follow...
-----
MythoGenesis, 24 May 2019

# (No rating) Intense and entertaining
Narratively, the film is not without fault. In par...
-----
TheBigSick, 26 May 2019
...

test.py

import requests
import sys
import time

from bs4 import BeautifulSoup


def fetch(url):
    with requests.Session() as s:
        r = s.get(url, timeout=5)
        return r


def main(url):
    start_t = time.time()
    resp = fetch(url)
    print(f'Got response: {resp.status_code}')
    html = resp.content
    bs = BeautifulSoup(html, 'html.parser')
    title = bs.find('h3', attrs={'itemprop': 'name'})
    print(f'Title: {title.a.text}')
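    # Each review lives in its own container, so every field is looked up
    # inside that container to keep title, rating and content aligned.
    reviews = bs.find_all('div', class_='review-container')
    for review in reviews:
        title = review.find('a', class_='title').text.strip()
        # find() returns None when the review has no rating.
        rating = review.find('span', class_='rating-other-user-rating')
        if rating:
            # Join the inner spans, e.g. '8' + '/10' -> '8/10'.
            rating = ''.join(i.text for i in rating.find_all('span'))
        rating = rating if rating else 'No rating'
        user = review.find('span', class_='display-name-link').text
        date = review.find('span', class_='review-date').text
        content = review.find('div', class_='content').div.text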
        print(
            f'# ({rating}) {title}\n'
            f'{content[:50]}...\n'
            f'{"-" * 5}\n'
            f'{user}, {date}\n'
        )
    end_t = time.time()
    elapsed_t = end_t - start_t
    r_time = resp.elapsed.total_seconds()
    print(f'Total: {elapsed_t:.2f}s, request: {r_time:.2f}s')


if __name__ == '__main__':
    if len(sys.argv) > 1:
        url = sys.argv[1]
        main(url)
    else:
        print('URL is required.')
        sys.exit(1)
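
To get the lists the question asks for (with 0 for a missing rating), the same per-review loop can append to three parallel lists instead of printing. The following is only a minimal sketch: `SAMPLE_HTML` and `collect_reviews` are hypothetical names introduced for illustration, and the sample markup is a trimmed-down assumption that reuses the class names from test.py; the real IMDB page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup that mimics the class names used in
# test.py; it is an assumption, not a copy of the real IMDB page.
SAMPLE_HTML = """
<div class="review-container">
  <span class="rating-other-user-rating"><span>8</span><span class="point-scale">/10</span></span>
  <a class="title"> Not as bad as some reviews on here </a>
  <div class="content"><div class="text show-more__control">Let's get this straight.</div></div>
</div>
<div class="review-container">
  <a class="title"> Intense and entertaining </a>
  <div class="content"><div class="text show-more__control">Narratively, the film is not without fault.</div></div>
</div>
"""


def collect_reviews(html):
    """Return three parallel lists: ranks, titles, contents."""
    soup = BeautifulSoup(html, 'html.parser')
    ranks, titles, contents = [], [], []
    for review in soup.find_all('div', class_='review-container'):
        rating = review.find('span', class_='rating-other-user-rating')
        # The first inner <span> holds the score; use 0 when there is no rating.
        score = rating.find('span').text.strip() if rating else ''
        ranks.append(int(score) if score.isdigit() else 0)
        titles.append(review.find('a', class_='title').text.strip())
        contents.append(review.find('div', class_='content').div.text.strip())
    return ranks, titles, contents


ranks, titles, contents = collect_reviews(SAMPLE_HTML)
print(ranks)                                    # [8, 0]
print(len(ranks), len(titles), len(contents))   # 2 2 2
```

Because every field is read from the same `review-container`, the three lists always have the same length, which the separate page-wide loops in the question cannot guarantee.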

【Comments】: