【Question title】: How to reverse engineer a POST request's body generation
【Posted】: 2020-10-13 16:34:11
【Question description】:

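I am trying to scrape reviews from Google Play. Google Play loads reviews dynamically once the page has been scrolled to the end. I intercepted the POST requests the browser sends to retrieve the reviews and noticed that the only thing that changes between requests is the request body. What I am struggling to understand is how the request body is generated.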

The body of the first request looks like this:

f.req: [[["UsvDTd","[null,null,[2,null,[40,null,\"CpUBCpIBKm0KOfc7ms0D_z7jKJielp7Fz8_Pz8_Pms3OzpuZyJvMnMXOxYmSxc3MyczPz8vIycjMysbHxszPysb__hAoITbZQaENmbWoMU2VCwWZPGwZOdccwQD8MmXEUABaCwlwT4zmNQBa2BADYMm1lu0EMiEKHwodYW5kcm9pZF9oZWxwZnVsbmVzc19xc2NvcmVfdjI\"],null,[]],[\"com.feelingtouch.zf3d\",7]]",null,"generic"]]]

This is the second request:

f.req: [[["UsvDTd","[null,null,[2,null,[40,null,\"CpUBCpIBKm0KOfc7msyg_28-Rpielp7Fz8_Pz8_Pm56eypyZzcycm8XOxYmSxc3MyczPz8vIycjMysbHxszPysb__hB4ITbZQaENmbWoMZI5V7V-7g3BObnBkABfM2XEUABaCwli2aizD1W9ExADYMm1lu0EMiEKHwodYW5kcm9pZF9oZWxwZnVsbmVzc19xc2NvcmVfdjI\"],null,[]],[\"com.feelingtouch.zf3d\",7]]",null,"generic"]]]

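Can I somehow reverse engineer how the request is generated?
I tried using Selenium, but after scrolling down a few dozen times RAM usage climbs and Selenium becomes unresponsive.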

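【Comments】:

  • My first approach would be to check, using your web browser's network tools, whether any part of the data in the second request can be found anywhere in the previous requests. Otherwise, I think some browsers let you find the JavaScript code that triggered the request, and you should be able to reverse engineer it.
  • Well, the requests do partially match, but it is just a bunch of random letters that do not seem to mean anything. I am not very proficient with Scrapy, as I only started learning it recently. I will try to find some information on tracing and debugging JS in the browser. Thanks for pointing me in the right direction @Gallaecio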

Tags: web-scraping scrapy http-post data-mining


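【Solution 1】:

What changes is the pagination token. However, there are a few other things to be aware of.

Here is the full URL-encoded request body, with the parameters wrapped in #{} (number_of_results, pagination_token, and product_id).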

f.req=%5B%5B%5B%22UsvDTd%22%2C%22%5Bnull%2Cnull%2C%5B2%2Cnull%2C%5B#{number_of_results}%2Cnull%2C#{pagination_token}%5D%2Cnull%2C%5B%5D%5D%2C%5B%5C%22#{product_id}%5C%22%2C7%5D%5D%22%2Cnull%2C%22generic%22%5D%5D%5D

So every time you scroll the page, pagination_token changes. It is used to retrieve the next page of results.
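As a rough illustration, the template above can be assembled in Python instead of by string substitution. This is only a sketch: the function names `build_f_req` and `build_body` are hypothetical, the nested structure is copied from the request bodies shown in the question, and passing `None` as the token for the first page is an assumption rather than something confirmed here.

```python
import json
from urllib.parse import urlencode


def build_f_req(product_id, pagination_token=None, number_of_results=40):
    """Return the decoded f.req value for one page of reviews.

    The layout mirrors the bodies captured in the question; the meaning of
    the constant fields (2, 7, "UsvDTd", "generic") is not known here.
    """
    # Inner payload: serialised to a JSON string first, because the outer
    # structure carries it as a string, not as a nested array.
    inner = [
        None,
        None,
        [2, None, [number_of_results, None, pagination_token], None, []],
        [product_id, 7],
    ]
    inner_json = json.dumps(inner, separators=(",", ":"))
    outer = [[["UsvDTd", inner_json, None, "generic"]]]
    return json.dumps(outer, separators=(",", ":"))


def build_body(product_id, pagination_token=None, number_of_results=40):
    """Return the URL-encoded form body (f.req=...) for the POST request."""
    f_req = build_f_req(product_id, pagination_token, number_of_results)
    return urlencode({"f.req": f_req})
```

Letting `json.dumps` do the escaping avoids hand-writing the `%5C%22` sequences around the token and the product id.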

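You do not need to reverse engineer the token itself. You can find the first one by inspecting the page source, and each subsequent response includes a next_page_token for the following page. So you just keep replacing the token until you reach the last page, at which point you have retrieved all the reviews.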
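The token-replacement loop described above might be sketched as follows. The `fetch_page` callable is hypothetical: it stands in for whatever code sends the POST request and extracts the reviews and the next token from the response, neither of which is shown in this answer. The `max_pages` guard is an added safety assumption.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# One page of results: the reviews on it and the token for the next page
# (None when there is no further page).
Page = Tuple[List[dict], Optional[str]]


def iter_reviews(
    fetch_page: Callable[[Optional[str]], Page],
    first_token: Optional[str] = None,
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield reviews page by page, feeding each returned token back in."""
    token = first_token
    for _ in range(max_pages):
        reviews, next_token = fetch_page(token)
        yield from reviews
        # Stop on the last page, or if the server echoes the same token.
        if not next_token or next_token == token:
            break
        token = next_token
```

Because this is a generator, reviews can be written out as they arrive rather than accumulated in memory.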


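Alternatively, you can use a third-party solution such as SerpApi. We handle proxies, solve CAPTCHAs, and parse all the rich structured data for you.

Sample Python code for retrieving reviews of the YouTube app (also available in other libraries):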

from serpapi import GoogleSearch

params = {
  "api_key": "SECRET_API_KEY",
  "engine": "google_play_product",
  "store": "apps",
  "gl": "us",
  "product_id": "com.google.android.youtube",
  "all_reviews": "true"
}

search = GoogleSearch(params)
results = search.get_dict()

Sample JSON output:

  "reviews": [
    {
      "title": "Qwerty Jones",
      "avatar": "https://play-lh.googleusercontent.com/a/AATXAJwSQC_a0OIQqkAkzuw8nAxt4vrVBgvkmwoSiEZ3=mo",
      "rating": 3,
      "snippet": "Overall a great app. Lots of videos to see, look at shorts, learn hacks, etc. However, every time I want to go on the app, it says I need to update the game and that it's \"not the current version\". I've done it about 3 times now, and it's starting to get ridiculous. It could just be my device, but try to update me if you have any clue how to fix this. Thanks :)",
      "likes": 586,
      "date": "November 26, 2021"
    },
    {
      "title": "matthew baxter",
      "avatar": "https://play-lh.googleusercontent.com/a/AATXAJy9NbOSrGscHXhJu8wmwBvR4iD-BiApImKfD2RN=mo",
      "rating": 1,
      "snippet": "App is broken, every video shows no dislikes even after I hit the button. I've tested this with multiple videos and now my recommended is all messed up because of it. The ads are longer than the videos that I'm trying to watch and there is always a second ad after the first one. This app seriously sucks. I would not recommend this app to anyone.",
      "likes": 352,
      "date": "November 28, 2021"
    },
    {
      "title": "Operation Blackout",
      "avatar": "https://play-lh.googleusercontent.com/a-/AOh14GjMRxVZafTAmwYA5xtamcfQbp0-rUWFRx_JzQML",
      "rating": 2,
      "snippet": "YouTube used to be great, but now theyve made questionable and arguably stupid decisions that have effectively ruined the platform. For instance, you now have the grand chance of getting 30 seconds of unskipable ad time before the start of a video (or even in the middle of it)! This happens so frequently that its actually a feasible option to buy an ad blocker just for YouTube itself... In correlation with this, YouTube is so sensitive twords the public they decided to remove dislikes. Why????",
      "likes": 370,
      "date": "November 24, 2021"
    },
    ...
  ],
  "serpapi_pagination": {
    "next": "https://serpapi.com/search.json?all_reviews=true&engine=google_play_product&gl=us&hl=en&next_page_token=CpEBCo4BKmgKR_8AwEEujFG0VLQA___-9zuazVT_jmsbmJ6WnsXPz8_Pz8_PxsfJx5vJns3Gxc7FiZLFxsrLysnHx8rIx87Mx8nNzsnLyv_-ECghlTCOpBLShpdQAFoLCZiJujt_EovhEANgmOjCATIiCiAKHmFuZHJvaWRfaGVscGZ1bG5lc3NfcXNjb3JlX3YyYQ&product_id=com.google.android.youtube&store=apps",
    "next_page_token": "CpEBCo4BKmgKR_8AwEEujFG0VLQA___-9zuazVT_jmsbmJ6WnsXPz8_Pz8_PxsfJx5vJns3Gxc7FiZLFxsrLysnHx8rIx87Mx8nNzsnLyv_-ECghlTCOpBLShpdQAFoLCZiJujt_EovhEANgmOjCATIiCiAKHmFuZHJvaWRfaGVscGZ1bG5lc3NfcXNjb3JlX3YyYQ"
  }
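To fetch further pages, the `next_page_token` from `serpapi_pagination` can be fed back as a request parameter, as the `next` URL in the output suggests. The helper below is a minimal sketch; the name `next_page_params` is hypothetical, and treating a missing `serpapi_pagination` block as the last page is an assumption.

```python
from typing import Optional


def next_page_params(params: dict, results: dict) -> Optional[dict]:
    """Return params for the next page, or None if no token is present."""
    token = results.get("serpapi_pagination", {}).get("next_page_token")
    if not token:
        # Assumption: no token means there are no more pages.
        return None
    # Copy rather than mutate, so the original params stay reusable.
    return {**params, "next_page_token": token}
```

In a loop, this would be called with the dictionary returned by `search.get_dict()` after each request, stopping once it returns `None`.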

See the documentation for more details.

Test the search live on the playground.

Disclaimer: I work at SerpApi.
