Scraping Comments from a Reddit Post? Answers

【Question title】: Scraping Comments from a Reddit Post?
【Posted】: 2022-11-10 02:10:38
【Question description】:

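I found this Reddit post: https://www.reddit.com/r/obama/comments/xgsxy7/donald_trump_and_barack_obama_are_among_the/

I want to use the API in a way that lets me get all of the comments from this post.

I tried looking through the documentation for this API (e.g. https://github.com/pushshift/api), and it doesn't seem possible? If I could somehow get the LINK_ID that belongs to this Reddit post, I think I could do it.

Is this possible?

Thanks!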

【Question discussion】:

Tags: json api web-scraping reddit

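【Solution 1】:

I suggest using the extract_rules feature of WebScrapingAPI, which returns an array of the elements matched by a CSS selector. For example, I used [data-testid='comment'] as the CSS selector in the following GET request: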

https://api.webscrapingapi.com/v1?api_key=<YOUR_API_KEY>&url=https://www.reddit.com/r/obama/comments/xgsxy7/donald_trump_and_barack_obama_are_among_the/&render_js=1&extract_rules={"comments":{"selector":"[data-testid='comment']", "output":"text"}}

I got:

{
   "comments":[
      "I wonder what's the most number of living ex-presidents there have been at one time?",
      "The highest number is six—occurring in four different periods in history. The most recent period was 2017-2018 before GHW Bush died.",
      "I don't understand what the first half of your title is doing there, other than to confuse and cause a person to have to read the whole title a couple of times to work out that all the living ex-presidents are invited to QEII's DC memorial service.",
      "Agreed, OP is pretty awful at writing headlines.",
      "Former disgraced president trump",
      "No, he's still disgraced.",
      "If the link is behind a paywall, or for an ad-free version:outline.comOr if you want to see the full original page:archive.org or archive.fo or  12ft.ioOr Google cache:https://www.google.com/search?q=site:https://www.townandcountrymag.com/society/politics/a41245384/donald-trump-barack-obama-george-bush-queen-elizabeth-memorial/I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns."
   ]
}
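
The GET request above puts raw JSON and a full URL directly into the query string, which usually needs URL-encoding before it is sent. A minimal sketch of building the same request with encoded parameters follows. The endpoint and the parameter names (api_key, url, render_js, extract_rules) are copied from the request above; buildScrapeUrl is a hypothetical helper name, and nothing here has been checked against the service itself.

```javascript
// Sketch: build the GET URL shown above with properly encoded query parameters.
// buildScrapeUrl is a hypothetical helper; parameter names come from the answer.
function buildScrapeUrl(apiKey, targetUrl, selector) {
  const extractRules = {
    comments: { selector: selector, output: "text" },
  };
  // URLSearchParams percent-encodes each value when serialized.
  const params = new URLSearchParams({
    api_key: apiKey,
    url: targetUrl,
    render_js: "1",
    extract_rules: JSON.stringify(extractRules),
  });
  return "https://api.webscrapingapi.com/v1?" + params.toString();
}

const requestUrl = buildScrapeUrl(
  "<YOUR_API_KEY>",
  "https://www.reddit.com/r/obama/comments/xgsxy7/donald_trump_and_barack_obama_are_among_the/",
  "[data-testid='comment']"
);
console.log(requestUrl);
```

Serializing the rules with JSON.stringify keeps the selector and output fields in one place, so changing the selector does not require editing the URL by hand.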

【Discussion】:

【Solution 2】:

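The post's link ID is in the URL: https://www.reddit.com/r/obama/comments/xgsxy7 <-- id

You can even open https://www.reddit.com/xgsxy7 to get the information.

If you fetch the endpoint https://www.reddit.com/xgsxy7.json, you get the post as JSON, and you then walk that object to find the comments.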
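
As a rough illustration of the two steps above, the ID can be pulled out of the post URL and turned into the .json endpoint. This is only a sketch: extractPostId and jsonEndpoint are hypothetical helper names, and the pattern assumes the ID is the path segment right after /comments/, as in the URL from the question.

```javascript
// Sketch: extract the post ID from a Reddit post URL and build the .json endpoint.
// extractPostId and jsonEndpoint are hypothetical helper names.
function extractPostId(postUrl) {
  // Assumption: the ID is the path segment that follows "/comments/".
  const match = new URL(postUrl).pathname.match(/\/comments\/([a-z0-9]+)/i);
  return match ? match[1] : null;
}

function jsonEndpoint(postId) {
  return "https://www.reddit.com/" + postId + ".json";
}

const postId = extractPostId(
  "https://www.reddit.com/r/obama/comments/xgsxy7/donald_trump_and_barack_obama_are_among_the/"
);
console.log(postId);               // xgsxy7
console.log(jsonEndpoint(postId)); // https://www.reddit.com/xgsxy7.json
```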

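JS example:

// fetchedJSONObject is the parsed response from https://www.reddit.com/xgsxy7.json
const data = fetchedJSONObject;

// data[1] is the comment listing; each child's data.body is the comment text
const replies = data[1].data.children.map(reply => reply.data.body);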

You can analyze the JSON object and pull out whatever data you need from it: whether a reply has nested replies, its creation time, and so on.
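
One way to reach nested replies and creation times is a small recursive walk over the comment listing. The sketch below assumes each comment looks like { kind: "t1", data: { body, created_utc, replies } }, where replies is either an empty string or another listing, and that created_utc is in seconds. collectComments is a hypothetical name, and the sample object is hand-made for illustration, not real Reddit data.

```javascript
// Sketch: flatten a comment listing, including nested replies.
// Assumed shape: { kind: "t1", data: { body, created_utc, replies } },
// where replies is "" (no replies) or another listing.
function collectComments(listing, depth = 0, out = []) {
  if (!listing || !listing.data) return out; // "" means there are no replies
  for (const child of listing.data.children) {
    if (child.kind !== "t1") continue; // skip non-comment entries such as "more"
    out.push({
      body: child.data.body,
      created: new Date(child.data.created_utc * 1000).toISOString(),
      depth: depth,
    });
    collectComments(child.data.replies, depth + 1, out);
  }
  return out;
}

// Hand-made sample in the assumed shape (not real Reddit data).
const sampleListing = {
  kind: "Listing",
  data: {
    children: [
      {
        kind: "t1",
        data: {
          body: "parent",
          created_utc: 0,
          replies: {
            kind: "Listing",
            data: {
              children: [
                { kind: "t1", data: { body: "child", created_utc: 60, replies: "" } },
              ],
            },
          },
        },
      },
      { kind: "more", data: { children: [] } },
    ],
  },
};

const flat = collectComments(sampleListing);
console.log(flat.map(c => "  ".repeat(c.depth) + c.body));
```

With a real response, the second element of the parsed array (data[1] in the example above) would be passed in place of sampleListing.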

【Discussion】: