【Question Title】: Scraping Comments from a Reddit Post?
【Posted】: 2022-11-10 02:10:38
【Question Description】:

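I found this Reddit post: https://www.reddit.com/r/obama/comments/xgsxy7/donald_trump_and_barack_obama_are_among_the/

I would like to use an API to retrieve all of the comments on this post.

I tried looking at the documentation for this API (e.g. https://github.com/pushshift/api), and it doesn't seem to be possible? If I could somehow get the LINK_ID for this Reddit post, I think I could do it.

Is this possible?

Thanks!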

【Question Discussion】:

    Tags: json api web-scraping reddit


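    【Solution 1】:

    I suggest using WebScrapingAPI's extract_rules feature, which returns an array of the elements matched by a CSS selector. For example, I used [data-testid='comment'] as the CSS selector in the following GET request: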

    https://api.webscrapingapi.com/v1?api_key=<YOUR_API_KEY>&url=https://www.reddit.com/r/obama/comments/xgsxy7/donald_trump_and_barack_obama_are_among_the/&render_js=1&extract_rules={"comments":{"selector":"[data-testid='comment']", "output":"text"}}
    

    The response I got:

    {
       "comments":[
          "I wonder what's the most number of living ex-presidents there have been at one time?",
          "The highest number is six—occurring in four different periods in history. The most recent period was 2017-2018 before GHW Bush died.",
          "I don't understand what the first half of your title is doing there, other than to confuse and cause a person to have to read the whole title a couple of times to work out that all the living ex-presidents are invited to QEII's DC memorial service.",
          "Agreed, OP is pretty awful at writing headlines.",
          "Former disgraced president trump",
          "No, he's still disgraced.",
          "If the link is behind a paywall, or for an ad-free version:outline.comOr if you want to see the full original page:archive.org or archive.fo or  12ft.ioOr Google cache:https://www.google.com/search?q=site:https://www.townandcountrymag.com/society/politics/a41245384/donald-trump-barack-obama-george-bush-queen-elizabeth-memorial/I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns."
       ]
    }
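
The request URL above carries raw JSON in its query string, so it may need URL-encoding before some HTTP clients will accept it. Here is a minimal sketch of building the same request with encoded parameters. It only uses the parameter names that appear in the URL above (api_key, url, render_js, extract_rules); the helper name `buildExtractUrl` is my own and is not part of any WebScrapingAPI client.

```javascript
// Hypothetical helper: assembles the GET URL shown above with every
// query parameter URL-encoded. Only the parameter names come from the answer.
function buildExtractUrl(apiKey, targetUrl, selector) {
  // Same rule as in the answer: collect the text of every matching element.
  const extractRules = { comments: { selector, output: "text" } };

  const params = new URLSearchParams({
    api_key: apiKey,
    url: targetUrl,
    render_js: "1",
    extract_rules: JSON.stringify(extractRules),
  });

  return `https://api.webscrapingapi.com/v1?${params}`;
}
```

Calling it with your API key, the post URL, and "[data-testid='comment']" gives a URL that can be passed to fetch or pasted into a browser.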
    

    【Discussion】:

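      【Solution 2】:

      The post's link ID is in the URL: https://www.reddit.com/r/obama/comments/xgsxy7 <-- id

      You can even open https://www.reddit.com/xgsxy7 to get to the post.

      If you fetch the endpoint https://www.reddit.com/xgsxy7.json, you get the data as JSON, and you then navigate that object to find the comments.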
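
As a rough sketch of the two steps above, the ID can be pulled out of a full post URL and turned into the .json endpoint. The function names `postIdFromUrl` and `jsonEndpointFor` are made up for illustration, and the pattern assumes the ID is the path segment right after /comments/, as in the URL above.

```javascript
// Hypothetical helper: returns the segment after "/comments/" or null.
function postIdFromUrl(postUrl) {
  const match = new URL(postUrl).pathname.match(/\/comments\/([a-z0-9]+)/i);
  return match ? match[1] : null;
}

// Hypothetical helper: builds the short .json endpoint from a post URL.
function jsonEndpointFor(postUrl) {
  const id = postIdFromUrl(postUrl);
  return id ? `https://www.reddit.com/${id}.json` : null;
}
```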

      JS example:

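      const data = fetchedJSONObject; // parsed response from the .json endpoint

      // data[0] holds the post itself; data[1] holds the comment listing
      const replies = data[1].data.children
        .filter(reply => reply.kind === "t1") // skip "more" placeholders, which have no body
        .map(reply => reply.data.body); // to get the text body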
      

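      You can simply inspect the JSON object and pull whatever data you need from it: whether a reply has nested replies, when it was created, and so on.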
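
To get at nested replies and creation times, something like the sketch below could walk the whole comment tree. It relies on each comment's `replies` field being either an empty string or another listing, and on `created_utc` being a Unix timestamp in seconds. The function name `flattenComments` and the shape of the returned objects are my own choices.

```javascript
// Recursively collect every comment in a Reddit comment listing.
// `listing` is the second element of the array returned by the .json endpoint.
function flattenComments(listing, depth = 0, out = []) {
  for (const child of listing.data.children) {
    if (child.kind !== "t1") continue; // "more" placeholders have no body

    const { author, body, created_utc, replies } = child.data;
    out.push({
      author,
      body,
      depth, // 0 for top-level comments, 1 for their replies, and so on
      created: new Date(created_utc * 1000).toISOString(),
    });

    // `replies` is "" when there are none, otherwise a nested listing.
    if (replies && typeof replies === "object") {
      flattenComments(replies, depth + 1, out);
    }
  }
  return out;
}

// Hypothetical usage (needs network access; Node 18+ or a browser):
// const res = await fetch("https://www.reddit.com/xgsxy7.json");
// const [, commentListing] = await res.json();
// console.log(flattenComments(commentListing));
```

Comments hidden behind "more" placeholders are not fetched here, so on large threads the result may be incomplete.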

      【Discussion】:
