【Question title】: Using BS4 I have an issue when scraping Trustpilot review dates
【Posted】: 2020-01-28 09:38:30
【Question】:

With the code below, I cannot get the rating and the corresponding date.

I can get the rating, but not with .text. Instead I get the whole element:

</div>, <div class="star-rating star-rating--medium">
<img alt="5 stars: Excellent" src="//cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-5.svg"/>

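This means I have some cleaning to do, but I am sure it is possible to get just "5 stars: Excellent". I just don't know how.

As for the date, my line "date = star.find("div", attrs={"class":"tooltip-container-1"})" only returns None, and I don't know why.

Please see my code, and the HTML for the rating and the date, below.

My code: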

import time

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}
#def get_total_items(url):

#soup = BeautifulSoup(requests.get(url, format(0),headers).text, 'lxml')
stars = []
dates = []
with requests.Session() as s: 
    for num in range(1,2):
        url = "https://www.trustpilot.com/review/www.boozt.com?page={}".format(num)
        r = s.get(url, headers = headers)
        soup = BeautifulSoup(r.content, 'lxml')

        for star in soup.find_all("section", attrs={"class":"review__content"}):
            rating = star.find("div", attrs={"class":"star-rating star-rating--medium"}) 
            date = star.find("div", attrs={"class":"tooltip-container-1"})
            #print(rating)
            stars.append(rating)
            dates.append(date)
        #data = {"Rating": stars, "Dates": dates}
        time.sleep(2)
print(dates) 

Rating HTML from Trustpilot:

<div class="star-rating star-rating--medium">
    <img src="//cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-5.svg" alt="5 stars: Excellent">
</div>

Date HTML from Trustpilot:

<div class="v-popover">
    <span aria-describedby="popover_o7e1fd7whi" class="trigger" style="display: inline-block;">
        <time datetime="2020-01-20T10:09:54.000Z" title="Monday, January 20, 2020, 11:09:54 AM" class="review-date--tooltip-target">Jan 20, 2020</time> 
        <div class="tooltip-container-1"></div> <!----></span> </div>

【Comments】:

    Tags: python-3.x web-scraping beautifulsoup


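    【Answer 1】:

    First, to get the rating value, e.g. "5 stars: Excellent", you only need to read the alt attribute of the img inside the div with class star-rating star-rating--medium.

    Then, getting the date value is a bit trickier, because the date you are targeting is loaded by JavaScript. However, you can get it from the script tag just above it, like this: star.find('script')

    I made a few updates to your code snippet; here it is:

    Code: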

    import requests
    from bs4 import BeautifulSoup
    import time
    import json
    
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}
    #def get_total_items(url):
    
    #soup = BeautifulSoup(requests.get(url, format(0),headers).text, 'lxml')
    stars = []
    dates = []
    results = []
    with requests.Session() as s:
        for num in range(1,2):
            url = "https://www.trustpilot.com/review/www.boozt.com?page={}".format(num)
            r = s.get(url, headers = headers)
            soup = BeautifulSoup(r.content, 'lxml')
    
            for star in soup.find_all("section", {"class":"review__content"}):
    
                # Get rating value
                rating = star.find("div", {"class":"star-rating star-rating--medium"}).find('img').get('alt')
    
                # Get date value
                date_json = json.loads(star.find('script').text)
                date = date_json['publishedDate']
    
                stars.append(rating)
                dates.append(date)
    
                data = {"Rating": rating, "Date": date}
                results.append(data)
    
            time.sleep(2)
    
    
    print(results)
    

    Result:

    [{'Date': '2020-01-28T05:37:13Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-28T00:00:48Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-27T23:22:58Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-27T21:20:32Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-27T21:06:42Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-27T19:37:16Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-27T19:27:38Z', 'Rating': '2 stars: Poor'},
     {'Date': '2020-01-27T18:20:48Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-27T17:18:42Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-27T16:15:17Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-27T15:58:49Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-27T15:46:29Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-27T15:39:23Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-27T15:32:43Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-27T15:29:21Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-27T15:27:30Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-27T14:35:29Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-27T13:43:40Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-27T13:37:53Z', 'Rating': '5 stars: Excellent'},
     {'Date': '2020-01-27T12:58:58Z', 'Rating': '5 stars: Excellent'}]
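
The two lookups described in this answer (the alt attribute of the img, and the JSON inside the script tag) can be tried offline without hitting the site. The following is only a minimal sketch: SAMPLE_HTML is a hypothetical, simplified stand-in for one review block, the helper name extract_review is made up for illustration, and the built-in "html.parser" is used instead of lxml purely to avoid an extra dependency. It also returns None rather than raising when an element is missing.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for one review block.
# The real page markup may differ and may change over time.
SAMPLE_HTML = """
<section class="review__content">
  <div class="star-rating star-rating--medium">
    <img alt="5 stars: Excellent" src="//cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-5.svg">
  </div>
  <div class="review-content-header__dates">
    <script type="application/json">{"publishedDate": "2020-01-28T05:37:13Z"}</script>
  </div>
</section>
"""


def extract_review(section):
    """Return (rating, date) for one review section; either may be None."""
    # The rating text lives in the img tag's alt attribute, not in its text.
    img = section.find("img")
    rating = img.get("alt") if img is not None else None

    # The date is embedded as JSON inside a script tag.
    script = section.find("script")
    date = None
    if script is not None and script.string:
        date = json.loads(script.string).get("publishedDate")
    return rating, date


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
reviews = [extract_review(s) for s in soup.find_all("section", class_="review__content")]
print(reviews)
```

With a live page, the same helper could be called on each section returned by soup.find_all in the loop above.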
    

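    【Comments】:

      【Answer 2】:

      The rating is inside the img tag and the date is inside a script tag. You need to get the script tag's text, load it as JSON, and then read the value of the relevant key.

      Use the following CSS selectors.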

      import json
      import requests
      from bs4 import BeautifulSoup
      headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}
      stars = []
      dates = []
      with requests.Session() as s:
          for num in range(1,2):
              url = "https://www.trustpilot.com/review/www.boozt.com?page={}".format(num)
              r = s.get(url, headers = headers)
              soup = BeautifulSoup(r.content, 'lxml')
              for star in soup.find_all("section", attrs={"class":"review__content"}):
                  rating = star.select_one(".star-rating.star-rating--medium >img")
                  date = star.select_one(".review-content-header__dates > script").text
                  date1=json.loads(date)
                  stars.append(rating['alt'])
                  dates.append(date1['publishedDate'])
              data = {"Rating": stars, "Dates": dates}
      
      print(data)
      

      Output

      {'Rating': ['5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '2 stars: Poor', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent'], 'Dates': ['2020-01-28T05:37:13Z', '2020-01-28T00:00:48Z', '2020-01-27T23:22:58Z', '2020-01-27T21:20:32Z', '2020-01-27T21:06:42Z', '2020-01-27T19:37:16Z', '2020-01-27T19:27:38Z', '2020-01-27T18:20:48Z', '2020-01-27T17:18:42Z', '2020-01-27T16:15:17Z', '2020-01-27T15:58:49Z', '2020-01-27T15:46:29Z', '2020-01-27T15:39:23Z', '2020-01-27T15:32:43Z', '2020-01-27T15:29:21Z', '2020-01-27T15:27:30Z', '2020-01-27T14:35:29Z', '2020-01-27T13:43:40Z', '2020-01-27T13:37:53Z', '2020-01-27T12:58:58Z']}
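
If one record per review is preferred over two parallel lists, the lists can be paired with zip. This is just a sketch using the first two values from the output above; it assumes both lists have the same length, which holds here because each loop iteration appends to both.

```python
# First two values copied from the output above
data = {
    "Rating": ["5 stars: Excellent", "5 stars: Excellent"],
    "Dates": ["2020-01-28T05:37:13Z", "2020-01-28T00:00:48Z"],
}

# zip pairs the i-th rating with the i-th date
records = [
    {"Rating": rating, "Date": date}
    for rating, date in zip(data["Rating"], data["Dates"])
]
print(records)
```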
      

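      【Comments】:

        【Answer 3】:

        Change your for loop to the following (this also needs "import json" and "from datetime import datetime" at the top of the script):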

        for star in soup.find_all("section", attrs={"class":"review__content"}):
            rating = star.select("div.star-rating > img") 
            date_tag = star.select("div.review-content-header__dates > script")    
            date = json.loads(date_tag[0].text)
            dt = datetime.strptime(date['publishedDate'], "%Y-%m-%dT%H:%M:%SZ")
            stars.append(rating[0]['alt'])
            dates.append(dt)
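
One thing worth knowing about the strptime call above: the format string matches the trailing "Z" as a literal character, so the resulting datetime is naive (it carries no timezone). A small sketch of attaching UTC explicitly, under the assumption that "Z" denotes UTC as in standard ISO 8601 usage; the variable names are illustrative only.

```python
from datetime import datetime, timezone

raw = "2020-01-28T05:37:13Z"  # a publishedDate value from the outputs shown earlier

# The "Z" in the format is matched literally, so no timezone is attached
naive = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")

# Assumption: "Z" means UTC, so mark the datetime as UTC explicitly
aware = naive.replace(tzinfo=timezone.utc)

print(naive)
print(aware.isoformat())
```

Keeping the datetimes timezone-aware avoids ambiguity if they are later compared with local times.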
        

        【Comments】:
