How to scrape Google News results into a data.frame with rvest

【Question title】: How to scrape Google News results into a data.frame with rvest
【Posted】: 2020-06-18 06:38:08
【Question description】:

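From other SO questions I worked out how to get the headlines, but I can't figure out where Google's markup stores the links.

I want a two-column data.frame containing each headline and its corresponding link.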

library(rvest)
library(tidyverse)


dat <- read_html("https://news.google.com/search?q=coronavirus&hl=en-US&gl=US&ceid=US%3Aen") %>%
  html_nodes('.DY5T1d') %>% # headline elements
  html_text()

dat

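【Question comments】:

Google is a bit hard to scrape. :) All the links should be stored in the "href" attribute. If you run into difficulties, maybe you should use RSelenium; that way you can navigate the site.
I found the description references in the page source, but I still don't know what the links are stored under.
Have you tried following stackoverflow.com/questions/35247033/… ?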
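
One way to find out which elements carry the links is to list every anchor on the page together with its class and href. The sketch below does this on an inline HTML fragment instead of the live site; the markup and the class names "story-link" and "headline" are made up for illustration and are not Google's.

```r
library(rvest)

# Hypothetical markup standing in for a results page; the class names
# "story-link" and "headline" are invented for this example.
page <- read_html('
<html><body>
  <article>
    <a class="story-link" href="./articles/abc123"></a>
    <h3><a class="headline" href="./articles/abc123">First headline</a></h3>
  </article>
  <article>
    <a class="story-link" href="./articles/def456"></a>
    <h3><a class="headline" href="./articles/def456">Second headline</a></h3>
  </article>
</body></html>')

# Collect every <a> element and tabulate its class, href and text,
# so the class that holds the article links can be spotted by eye.
anchors <- page %>% html_nodes("a")
inspect <- data.frame(
  class = html_attr(anchors, "class"),
  href  = html_attr(anchors, "href"),
  text  = html_text(anchors),
  stringsAsFactors = FALSE
)
print(inspect)
```

On a real page the same table can be filtered by href pattern to narrow down the candidate classes.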

Tags: r rvest

【Solution 1】:

After a lot of digging through Google's page source, I found what I wanted. I also saw the descriptions, so I basically rebuilt the Google News RSS feed.

library(rvest)
library(tidyverse)


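news <- function(term) {

  # Fetch the search-results page once and reuse the parsed document
  html_dat <- read_html(paste0("https://news.google.com/search?q=", term, "&hl=en-US&gl=US&ceid=US%3Aen"))

  # Article links: relative hrefs such as "./articles/..." are made absolute.
  # fixed = TRUE matches the pattern literally, so "." is not a regex wildcard.
  dat <- data.frame(Link = html_dat %>%
                      html_nodes('.VDXfz') %>%
                      html_attr('href')) %>%
    mutate(Link = gsub("./articles/", "https://news.google.com/articles/", Link, fixed = TRUE))

  # Headlines, paired with the links by position
  news_dat <- data.frame(
    Title = html_dat %>%
      html_nodes('.DY5T1d') %>%
      html_text(),
    Link = dat$Link
  )

  return(news_dat)
}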

news("coronavirus")
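
The function pastes the search term straight into the URL, which works for a single word like "coronavirus" but not for terms containing spaces or characters such as "&". A minimal sketch of one way to handle that, using base R's `URLencode()`; the helper name `build_search_url` is hypothetical and not part of the answer above.

```r
# Hypothetical helper: percent-encode the search term before building
# the search URL, so spaces and reserved characters survive intact.
build_search_url <- function(term) {
  paste0("https://news.google.com/search?q=",
         utils::URLencode(term, reserved = TRUE),
         "&hl=en-US&gl=US&ceid=US%3Aen")
}

url <- build_search_url("covid vaccine & trials")
print(url)
```

The resulting string could be passed to `read_html()` in place of the `paste0()` call inside `news()`.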

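【Comments】:

Scraping tip: your code above calls read_html(url) twice inside the function. You should read the page once with page <- read_html(url), then use that "page" variable to parse the data. This improves the script's performance and reduces the number of page requests to the server. Please read the site's terms of service before using this. FYI: scraping usually violates the terms.
The only downside is that the class names can change at any time and break your program :/
Google seems to have removed the article descriptions recently. I suppose there are snippets somewhere, but who knows where they are...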
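
The two concerns raised in the comments (parse the page once, and class names that may change) can be sketched together as follows. This is only an illustration under assumptions: the helper `extract_news`, the inline HTML, and the class names "ttl" and "lnk" are all hypothetical. The selectors are passed in as arguments so they can be updated in one place, and the function stops with a readable message instead of failing inside `data.frame()`.

```r
library(rvest)

# Hypothetical helper: works on an already-parsed page and fails loudly
# when the selectors stop matching or return different counts.
extract_news <- function(page, title_css, link_css) {
  titles <- page %>% html_nodes(title_css) %>% html_text()
  links  <- page %>% html_nodes(link_css) %>% html_attr("href")

  if (length(titles) == 0 || length(links) == 0) {
    stop("No matches for the given selectors; the class names may have changed.")
  }
  if (length(titles) != length(links)) {
    stop("Found ", length(titles), " titles but ", length(links), " links.")
  }
  data.frame(Title = titles, Link = links, stringsAsFactors = FALSE)
}

# Inline stand-in for a page that was downloaded once with read_html(url)
page <- read_html('<html><body>
  <a class="lnk" href="./articles/a1"></a><h3 class="ttl">Headline one</h3>
  <a class="lnk" href="./articles/b2"></a><h3 class="ttl">Headline two</h3>
</body></html>')

ok <- extract_news(page, ".ttl", ".lnk")
print(ok)

# A selector that no longer matches is reported as an error message
stale <- tryCatch(extract_news(page, ".DY5T1d", ".lnk"),
                  error = function(e) conditionMessage(e))
print(stale)
```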