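【Question title】: Beautiful Soup not producing a proper CSV file of scraped data
【Posted】: 2021-12-06 22:09:11
【Question description】:

I am fairly new to web scraping, so I apologize if the answer to my question is obvious. I wrote a web scraper that goes through the reviews of a Steam game (Civilization VI) and collects information such as the time spent in the game, whether the reviewer recommends it, how many products they own, and so on.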

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

url = "https://steamcommunity.com/app/289070/reviews/?browsefilter=toprated&snr=1_5_100010_"

review_dict = {
    "found_helpful": [],
    "title": [], #recommended or not
    "hours": [],
    "prods_in_account": [],
    "words_in_review": []
}

def data_scrapper():
    """
    get's the reviews from the steam page.
    """
    response = requests.get(url)
    soup = bs(response.content, "html.parser")
    card_div = soup.findAll("div",attrs={"class","apphub_Card modalContentLink interactable"})

    for cards in card_div:
        found_helpful = cards.find("div", attrs={"class": "found_helpful"})
        vote_header = cards.find("div", attrs={"class": "vote_header"})
        hours = cards.find("div", attrs={"class": "hours"})
        products = cards.find("div", attrs={"class": "apphub_CardContentMoreLink ellipsis"})
        words_in_review = cards.find("div", attrs={"class": "apphub_CardTextContent"})

    review_dict["found_helpful"].append(found_helpful)
    review_dict["title"].append(vote_header)
    review_dict["hours"].append(hours)
    review_dict["prods_in_account"].append(products)
    review_dict["words_in_review"].append(len(words_in_review))

data_scrapper()

review_df = pd.DataFrame.from_dict(review_dict)
review_df.to_csv("review.csv", sep=",")

My problem is that when I run my code I expect an organized CSV file, but instead I get this:

,found_helpful,title,hours,prods_in_account,words_in_review
0,"<div class=""found_helpful"">
                3,398 people found this review helpful<br/>159 people found this review funny               <div class=""review_award_aggregated tooltip"" data-tooltip-class=""review_reward_tooltip"" data-tooltip-html='&lt;div class=""review_award_ctn_hover""&gt;             &lt;div class=""review_award"" data-reaction=""6"" data-reactioncount=""5""&gt;
                    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/6.png?v=5""/&gt;
                    &lt;span class=""review_award_count ""&gt;5&lt;/span&gt;
                &lt;/div&gt;
                                &lt;div class=""review_award"" data-reaction=""3"" data-reactioncount=""3""&gt;
                    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/3.png?v=5""/&gt;
                    &lt;span class=""review_award_count ""&gt;3&lt;/span&gt;
                &lt;/div&gt;
                                &lt;div class=""review_award"" data-reaction=""5"" data-reactioncount=""2""&gt;
                    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/5.png?v=5""/&gt;
                    &lt;span class=""review_award_count ""&gt;2&lt;/span&gt;
                &lt;/div&gt;
                                &lt;div class=""review_award"" data-reaction=""1"" data-reactioncount=""1""&gt;
                    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/1.png?v=5""/&gt;
                    &lt;span class=""review_award_count hidden""&gt;1&lt;/span&gt;
                &lt;/div&gt;
                                &lt;div class=""review_award"" data-reaction=""9"" data-reactioncount=""1""&gt;
                    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/9.png?v=5""/&gt;
                    &lt;span class=""review_award_count hidden""&gt;1&lt;/span&gt;
                &lt;/div&gt;
                                &lt;div class=""review_award"" data-reaction=""18"" data-reactioncount=""1""&gt;
                    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/18.png?v=5""/&gt;
                    &lt;span class=""review_award_count hidden""&gt;1&lt;/span&gt;
                &lt;/div&gt;
                                &lt;div class=""review_award"" data-reaction=""19"" data-reactioncount=""1""&gt;
                    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/19.png?v=5""/&gt;
                    &lt;span class=""review_award_count hidden""&gt;1&lt;/span&gt;
                &lt;/div&gt;
                &lt;/div&gt;'><img class=""reward_btn_icon"" src=""https://community.akamai.steamstatic.com/public/shared/images//award_icon_blue.svg""/>14</div>
</div>","<div class=""vote_header"">
<div class=""reviewInfo"">
<div class=""thumb"">
<img height=""44"" src=""https://community.akamai.steamstatic.com/public/shared/images/userreviews/icon_thumbsDown.png?v=1"" width=""44""/>
</div>
<div class=""title"">Not Recommended</div>
<div class=""hours"">8,028.3 hrs on record</div>
</div>
<div style=""clear: left""></div>
</div>","<div class=""hours"">8,028.3 hrs on record</div>","<div class=""apphub_CardContentMoreLink ellipsis"">167 products in account</div>",38

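I have reworked the function that extracts and appends the data, but I still get this strange file. Any clue as to what I am doing wrong?

【Comments】:

  • As you can see, found_helpful contains the entire <div> tag. You want to extract the text from that tag, which is available as found_helpful.text

Tags: python pandas csv web-scraping beautifulsoup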
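
To illustrate the comment above, here is a minimal sketch of the difference between storing a Tag and storing its text. It assumes the `hours` markup looks like the fragment in the CSV output shown in the question; the inline `html` string and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for one element of a review card
# (copied from the CSV output in the question; the live page may differ).
html = '<div class="hours">8,028.3 hrs on record</div>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("div", attrs={"class": "hours"})

# Converting the Tag itself to a string keeps all of the markup.
as_markup = str(tag)

# get_text() returns only the text inside the tag.
as_text = tag.get_text(strip=True)

print(as_markup)
print(as_text)
```

Storing `as_text` rather than the Tag is what keeps the markup out of the CSV cells.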


【Solution 1】:

Make the following changes to the existing code:

for cards in card_div:
    # .get_text() returns the text inside each tag instead of the Tag object itself
    found_helpful = cards.find("div", attrs={"class": "found_helpful"}).get_text()
    vote_header = cards.find("div", attrs={"class": "vote_header"}).get_text()
    hours = cards.find("div", attrs={"class": "hours"}).get_text()
    products = cards.find("div", attrs={"class": "apphub_CardContentMoreLink ellipsis"}).get_text()
    words_in_review = cards.find("div", attrs={"class": "apphub_CardTextContent"}).get_text()

    # Append inside the loop so that every card adds a row, not only the last one
    review_dict["found_helpful"].append(found_helpful)
    review_dict["title"].append(vote_header)
    review_dict["hours"].append(hours)
    review_dict["prods_in_account"].append(products)
    # Note: len() of a string counts characters, not words
    review_dict["words_in_review"].append(len(words_in_review))

review_df = pd.DataFrame.from_dict(review_dict)
# Strip leading and trailing whitespace from every text (object) column
cols = review_df.select_dtypes(['object']).columns
review_df[cols] = review_df[cols].apply(lambda x: x.str.strip())

Output:

                                       found_helpful                                   title                  hours         prods_in_account  words_in_review
0  1,266 people found this review helpful20 peopl...        Recommended\n456.9 hrs on record    456.9 hrs on record  536 products in account              770
1  1,127 people found this review helpful14 peopl...         Recommended\n92.1 hrs on record     92.1 hrs on record  135 products in account              574
2  853 people found this review helpful49 people ...      Recommended\n1,360.8 hrs on record  1,360.8 hrs on record   18 products in account              181
3  1,832 people found this review helpful18 peopl...        Recommended\n520.5 hrs on record    520.5 hrs on record  281 products in account             7114
4  3,370 people found this review helpful40 peopl...    Not Recommended\n415.7 hrs on record    415.7 hrs on record  102 products in account              853
5  5,724 people found this review helpful172 peop...    Not Recommended\n256.7 hrs on record    256.7 hrs on record  180 products in account             2072
6  393 people found this review helpful10 people ...         Recommended\n22.8 hrs on record     22.8 hrs on record   85 products in account              278
7  3,229 people found this review helpful62 peopl...     Not Recommended\n58.6 hrs on record     58.6 hrs on record  264 products in account              894
8  1,373 people found this review helpful22 peopl...    Not Recommended\n195.3 hrs on record    195.3 hrs on record   75 products in account              556
9  3,398 people found this review helpful159 peop...  Not Recommended\n8,028.8 hrs on record  8,028.8 hrs on record  167 products in account             8007
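
The columns in the output above are still free text ("456.9 hrs on record", "536 products in account"). If numeric columns are wanted, one possible follow-up is to pull out the first number from each string. This is only a sketch using the standard library; `first_number` and the sample strings are hypothetical names and values chosen to match the format shown above, not part of the original answer. It also shows a word count with `str.split()`, since `len()` on a string counts characters.

```python
import re

def first_number(text):
    """Return the first number in `text` as a float, or None if there is none."""
    # Matches digits with optional thousands separators and an optional decimal part.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))

hours = first_number("1,360.8 hrs on record")
products = first_number("18 products in account")
helpful = first_number("853 people found this review helpful")

# Counting words rather than characters in a (made-up) review text.
review_text = "Great game, lost many weekends to it"
word_count = len(review_text.split())

print(hours, products, helpful, word_count)
```

With pandas, the same helper could be applied per column, for example `review_df["hours"].map(first_number)`.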

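【Comments】:

  • Thanks for your help. It solved half of my problem (the CSV file is now formatted correctly and contains some data); I still need to find the right HTML elements to extract the "title" and "found helpful" data.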
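
For the two remaining columns, one possible approach is sketched below. It assumes the card markup matches the fragment visible in the question's CSV output, where the recommendation label sits in a nested `<div class="title">` and the helpful and funny counts are separated by a `<br/>`. The inline `html` string is a trimmed stand-in, and the live Steam page may be structured differently.

```python
from bs4 import BeautifulSoup

# Trimmed markup based on the CSV output in the question (an assumption
# about the page structure, not a guarantee).
html = """
<div class="apphub_Card">
  <div class="found_helpful">
    3,398 people found this review helpful<br/>159 people found this review funny
  </div>
  <div class="vote_header">
    <div class="title">Not Recommended</div>
    <div class="hours">8,028.3 hrs on record</div>
  </div>
</div>
"""

card = BeautifulSoup(html, "html.parser")

# Target the nested "title" div directly instead of the whole vote_header,
# which would also include the hours text.
title = card.find("div", attrs={"class": "title"}).get_text(strip=True)

# Passing a separator to get_text() splits the text at the <br/>,
# and strip=True drops the surrounding whitespace.
helpful_block = card.find("div", attrs={"class": "found_helpful"})
lines = helpful_block.get_text(separator="\n", strip=True).split("\n")
helpful = lines[0]
funny = lines[1] if len(lines) > 1 else ""

print(title)
print(helpful)
print(funny)
```

On the real page the `found_helpful` block also contains an award counter, so `lines` may hold extra entries after the first two.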