【Question title】: Scraping hidden data [ window.__WEB_CONTEXT__= ] ... preferably with Scrapy
【Posted】: 2020-11-26 21:28:34
【Question】:

I am scraping TripAdvisor. My current problem is scraping the hotel stars of a given hotel (not the average user rating [bubbles], but the hotel class rating), and later I will run into the problem that reviews are hidden behind "Read more". https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html Fortunately, I know where the data for both can be found. It is in the page, inside this tag:

<script>window.__WEB_CONTEXT__={pageManifest:{"assets":[....
....
</script>

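Search https://pastebin.com/Ww3ugxFR for "The view was fantastic!!" (an example of hidden text) or for '"star":' to find the hotel stars.

I would like to learn how to access this tag.

Here is an example of my attempt, which does not work. I need to learn how to tell the CSS selector (or another tool) to target this particular tag and how to extract data from it. In this example I would just load the response and do a simple pattern search. I guess it would also be possible to load it with JSON and extract from there, but I have not gotten the hang of JSON yet: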

hotel_CONTEXT = response.css("script text=window.__WEB_CONTEXT ::attr(pageManifest)).extract()

pattern_hotelstar = re.compile(r'star":\["\d')
matches_hotelstar = pattern_hotelstar.findall(hotel_CONTEXT)
Hotel_stars = str(matches_hotelstar).split('"')[2].split("'")[0]
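
Independent of the selector, one likely problem in the attempt above is that `.extract()` returns a list, while `findall` expects a string. The following is a minimal sketch of the pattern-search step on a plain string, using a capture group instead of the chained `split` calls. The `sample` text and the helper name `extract_hotel_stars` are made up for illustration, and the `"star":["4"]` shape is only an assumption derived from the pattern above, not confirmed against the live page.

```python
import re

# Hypothetical fragment; the real page data may be shaped differently.
sample = '{"name":"Some Hotel","star":["4"],"rating":4.5}'


def extract_hotel_stars(text):
    """Return the first digit that follows '"star":["', or None if absent."""
    match = re.search(r'"star":\["(\d)', text)
    return match.group(1) if match else None


hotel_stars = extract_hotel_stars(sample)
print(hotel_stars)
```

The parentheses around `\d` make the digit available as `group(1)`, so no string splitting is needed afterwards.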

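Apparently, what I want can be achieved with BeautifulSoup (see "Scraping a website with data hidden under "read more""; however, I got a JSON error when I tried to replicate it), but in general I would prefer a solution that uses Scrapy.


Andrej Kesely provided an excellent solution to my problem! His code works well, and I would like to understand it fully. Below is what I think I understand from the code and where I do not understand his wizardry ;):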

data = re.search(r'window\.__WEB_CONTEXT__=(.*?});', html_text).group(1)

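Andrej searches the whole html_text for a pattern that starts with "window.__WEB...", extends it over any character (.), any number of times (*), in a non-greedy way (?), and ends with ";". I do not understand why there is a capture group with the } inside it, and why the } is not simply placed at the end, since the script ends with }; (how did Andrej find this out? Is this a general pattern for such scripts, or did he print the whole page and look?). I also do not understand why it has to be non-greedy. group(1) selects everything inside the first pair of parentheses, which leaves window.__WEB_CONTEXT__= out. I guess this has to do with loading the result with json. The same goes for: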

data = data.replace('pageManifest', '"pageManifest"')   
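
Both steps can be tried on a tiny made-up page. The sketch below is only an illustration: the `html_text` content is invented, and the real script is far larger. It compares the non-greedy pattern with a greedy variant, then shows that `json.loads` rejects the captured text until the bare `pageManifest` key is quoted.

```python
import json
import re

# Made-up miniature page with a second "};" after the data of interest.
html_text = (
    '<script>window.__WEB_CONTEXT__={pageManifest:{"assets":["a.js"]}};'
    'window.later={"x":1};</script>'
)

# Non-greedy: stops at the FIRST "};" after the assignment.
lazy = re.search(r'window\.__WEB_CONTEXT__=(.*?});', html_text).group(1)
# Greedy: runs on to the LAST "};" in the text, swallowing unrelated code.
greedy = re.search(r'window\.__WEB_CONTEXT__=(.*});', html_text).group(1)

print(lazy)
print(greedy)

# JSON requires double-quoted keys, so the raw capture does not parse.
try:
    json.loads(lazy)
    parsed_without_fix = True
except json.JSONDecodeError:
    parsed_without_fix = False

data = json.loads(lazy.replace('pageManifest', '"pageManifest"'))
print(parsed_without_fix, data['pageManifest']['assets'])
```

Placing the `}` inside the group keeps the closing brace in the captured text, which the JSON parser needs, while the trailing `;` stays outside. Note that this pattern would stop early if the data itself ever contained `};`, so it is a pragmatic shortcut rather than a general parser.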

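Andrej then creates a function called traverse, which is later fed with data. In the if statement, Andrej checks whether the input is a dictionary. In the next step, Andrej loops over the keys (k) and values (v) of the dictionary. If k == "reviews", he yields the value. If not, he does a "yield from" on the function itself? I am also lost on the elif and the check whether val is a list... In general, what is the output v of the function? And how would I change the function to look for more dictionary keys, given that the else is already occupied by this yield from?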

def traverse(val):
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'reviews':
                yield v
            else:
                yield from traverse(v)
    elif isinstance(val, list):
        for v in val:
            yield from traverse(v)
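
One way to look for more than one key is to pass in a set of wanted keys and yield the key together with its value. The sketch below is a variation on the function above, not Andrej's code; the `wanted` parameter and the `sample` data are hypothetical, and the `"star"` value shape is an assumption.

```python
def traverse(val, wanted):
    """Recursively yield (key, value) for every dict key found in `wanted`."""
    if isinstance(val, dict):
        for k, v in val.items():
            if k in wanted:
                yield k, v
            else:
                # Re-yield everything the nested call produces.
                yield from traverse(v, wanted)
    elif isinstance(val, list):
        for item in val:
            yield from traverse(item, wanted)
    # Strings, numbers and None hold nothing to descend into.


# Made-up nested data standing in for the parsed page.
sample = {
    "page": [
        {"hotel": {"star": ["4"], "name": "Example"}},
        {"block": {"reviews": [{"title": "Nice"}, {"title": "Great"}]}},
    ]
}

found = list(traverse(sample, {"reviews", "star"}))
print(found)
```

`yield from g` passes along each item that the generator `g` yields, so matches found at any depth surface in the outermost loop. The yielded value is whatever was stored under the matching key, here a list of review dictionaries for `"reviews"`.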
 

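Here Andrej loops over traverse(data) (a dictionary, right?), since we get multiple reviews on this page. In the nested loop, Andrej names each dictionary of a single review r and retrieves the stored values via dictionary_name["key"]. Am I right?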

for reviews in traverse(data):
  for r in reviews:
    print(r['userProfile']['displayName'])
    print(r['title'])
    print(r['text'])
    print('Rating:', r['rating'])
    print('-' * 80)
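
Put differently, each value yielded by traverse is expected to be a list of review dictionaries, and the inner loop handles one review at a time. The sketch below uses made-up reviews and a hypothetical helper, `flatten_review`, that reads fields with `.get` so a review lacking a field does not raise `KeyError`. Whether fields are ever missing on the real page is an assumption.

```python
# Made-up data in the shape the loop above expects.
reviews = [
    {
        "userProfile": {"displayName": "alice"},
        "title": "Nice",
        "text": "Good stay",
        "rating": 5,
    },
    {"userProfile": None, "title": "Okay", "rating": 3},  # no text, no profile
]


def flatten_review(r):
    """Turn one review dict into a flat row, tolerating missing fields."""
    profile = r.get("userProfile") or {}
    return {
        "user": profile.get("displayName"),
        "title": r.get("title"),
        "text": r.get("text"),
        "rating": r.get("rating"),
    }


rows = [flatten_review(r) for r in reviews]
for row in rows:
    print(row)
```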

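Sorry for all these newbie questions.

【Comments】:

  • Could you provide some code for context and explain what data you need? It looks like you want the reviews and the hotel stars from this page?
  • Hi @AaronS, I hope my question is clearer now. I think it basically boils down to: how do I access a script tag that has no class? Others can apparently do it with BeautifulSoup (see the link), but I ran into problems with their code, and in general I would be happier doing everything with Scrapy.

Tags: python web-scraping beautifulsoup scrapy tripadvisor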


【Solution 1】:

This script prints all reviews and review ratings found on the page:

import re
import json
import requests


url = 'https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html'
html_text = requests.get(url).text

data = re.search(r'window\.__WEB_CONTEXT__=(.*?});', html_text).group(1)
data = data.replace('pageManifest', '"pageManifest"')
data = json.loads(data)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

def traverse(val):
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'reviews':
                yield v
            else:
                yield from traverse(v)
    elif isinstance(val, list):
        for v in val:
            yield from traverse(v)

for reviews in traverse(data):
    for r in reviews:
        print(r['userProfile']['displayName'])
        print(r['title'])
        print(r['text'])
        print('Rating:', r['rating'])
        print('-' * 80)

Prints:

BBDoll619
Just WOW!!
Okay, I didn't know this resort would be mainly couples and honeymooners as I went with 2 friends. We weren't uncomfortable though and met lots of nice people from across the globe and 1 couple from the US. This resort can only be reached by boat, so it is very secluded. We stayed in bungalow #2. It was rustic, but beautiful and right on the beach. Everyone who worked in the resort was friendly and very accommodating. We ate most meals at the resort which was pretty good. We had happy hour at the pier bar every day which was from 4-7pm. They had half off certain drinks and food specials. It was very nice relaxing, enjoying a great drink and watching the sunset. You can snorkel right in front of the resort which was so cool! We snorkeled for 2 hours!! The best is right by the floating bungalows where they did massages. Speaking of massages....OMG! It was heaven!! Very affordable and different. When you lie face down, you look into a cut out in the floor, so you can view the water and fish swimming by. I loved it!! We did an island hopping tour and it was not an issue coming from this resort. When we got into Coron town and passed by all the hotels in that area, we were so glad and thankful we chose El Rio Y Mar. Coron Town is very dirty, dusty, full of young backpackers and the hotels look subpar. It's fine if you're on a budget. I get it, but us girls/mom/friends wanted to treat ourselves. That we did! One day we went on a guided hike to the top of a closeby mountain. The view was fantastic!! I highly recommend this resort and would definitely return.
Rating: 5
--------------------------------------------------------------------------------
MaricrisAndPiotr
Amazing staff
The best customer experience we ever had! the school of fishes within the resort are amazing, very quite, very clean and well maintained rooms and outdoor surroundings. Our island trip organized by them is one of the best experience we had in our Coron trip. 
Kudos to El Rio highly recommended
Rating: 5
--------------------------------------------------------------------------------

...and so on.

【Comments】:

  • Hi Andrej, thank you very much for your answer! It works really well for me ... so well that I would like to, in the same way, from ...
【Solution 2】:

You mentioned Scrapy in the question, so I am adding a solution that uses it together with chompjs.

scrapy shell https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html

>>> import chompjs
>>> resp = response.xpath("//script[contains(.,'requests')]/text()").extract_first()
>>> data = chompjs.parse_js_object(resp)

For parsing the data, Andrej's solution works very well. You can add the def to a regular Spider class with only minor adjustments:

import chompjs
import scrapy


class SomeSpider(scrapy.Spider):
    # -------- code here --------
    # -------- code here --------

    def traverse(self, val):
        # Recursively yield every value stored under a 'reviews' key.
        if isinstance(val, dict):
            for k, v in val.items():
                if k == 'reviews':
                    yield v
                else:
                    yield from self.traverse(v)
        elif isinstance(val, list):
            for v in val:
                yield from self.traverse(v)

    def parse(self, response):
        # Select the script tag by its text content, since it has no class.
        resp = response.xpath("//script[contains(.,'requests')]/text()").extract_first()
        data = chompjs.parse_js_object(resp)
        for reviews in self.traverse(data):
            for r in reviews:
                yield {
                    # --- code here ---
                    # --- code here ---
                }

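【Comments】:

  • Hi Sujith, thanks a lot for your input! Can I simply put Andrej's function into Scrapy's general parse function? Is a def inside a def possible, or does it need to be outside? And how would I implement it in the Scrapy structure? Thanks a lot! Greetings from Copenhagen!
  • I edited my original post based on the basic spider template. Does that help?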