Rvest and Google News Web Scraping: Not Working

【Question title】: Rvest and Google News Web Scraping: Not Working
【Posted】: 2022-01-19 18:19:01
【Question】:

I am new to web scraping. The code below returns an empty character vector, and I don't know how to fix it:

google_url <- "https://news.google.com/topstories?hl=en-GB&gl=GB&ceid=GB:en"
google <- read_html(google_url)
articles <- google %>% html_nodes('.VDXfz') %>% html_text()
articles

【Comments】:

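You can get the article links with html_nodes('.VDXfz') %>% html_attr('href'), but that selector will not give you the headlines.
Please be sure to tell us which packages you are trying to use, e.g. library(rvest).
Please provide enough code so that others can better understand or reproduce the problem.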
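
To illustrate the first comment, here is a minimal sketch run against a small mock snippet rather than the live page. The markup is an assumption about how one article card was laid out (an empty overlay link next to a headline link); only the class names come from this thread, and `mock_page` and the `./articles/abc123` path are made up for the example.

```r
library(rvest)

# Mock markup, NOT fetched from Google: an assumed shape of one article card.
mock_page <- read_html('
  <div class="lBwEZb">
    <article>
      <a class="VDXfz" href="./articles/abc123"></a>
      <h3><a class="DY5T1d" href="./articles/abc123">Example headline</a></h3>
    </article>
  </div>')

overlay <- mock_page %>% html_nodes('.VDXfz')

# The selector matches a node, but that node has no text content.
overlay_text <- overlay %>% html_text()

# The href is relative, so resolve it against the site root.
links <- overlay %>%
  html_attr('href') %>%
  xml2::url_absolute('https://news.google.com/')

# The headline text sits in a different element.
headlines <- mock_page %>% html_nodes('.DY5T1d') %>% html_text()
```

If the real page is structured like this, it would explain why html_text() on '.VDXfz' comes back blank while html_attr('href') still works.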

Tags: r web-scraping tidyverse rvest xml2

【Solution 1】:

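The following gets every headline from the page as it is initially loaded. If you need to scroll to load and extract more data, you will need RSelenium.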

library(rvest)

url <- 'https://news.google.com/topstories?hl=en-GB&gl=GB&ceid=GB:en'

# Narrow to the '.lBwEZb' containers, then take the text of the '.DY5T1d' nodes inside them
url %>% read_html() %>% html_nodes('.lBwEZb') %>%
  html_nodes('.DY5T1d') %>%
  html_text()

 [1] "Liz Truss to hold Brexit talks with EU over NI protocol"
 [2] "Lord Frost: I didn't support PM's coercive Covid plan"
 [3] "David Frost: I never disagreed with Boris Johnson over Brexit policy – only coercive Covid rules"
 [4] "Look at the lauding of David Frost and see a government deranged by the poison of Brexit"
 [5] "What happened to the amiable, hard-working David Frost I once knew?"
 [6] "COVID-19: Omicron now dominant variant in US after making up 73% of new cases, says CDC"
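
For the scrolling case, a rough sketch with RSelenium might look like the following. It is not part of the original answer: the helper name `scrape_scrolled_headlines`, the choice of Firefox, and the `n_scrolls` and `pause` values are all assumptions, and it needs a local browser and driver setup that RSelenium can start. The selectors are the ones used above and may no longer match the live page.

```r
library(rvest)

# Hypothetical helper: load the page in a real browser, scroll a few times so
# more stories are loaded, then hand the rendered HTML to rvest.
scrape_scrolled_headlines <- function(url, n_scrolls = 3, pause = 2) {
  driver <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
  remote <- driver$client

  # Always shut down the browser session and the driver server on exit.
  on.exit({
    remote$close()
    driver$server$stop()
  }, add = TRUE)

  remote$navigate(url)

  for (i in seq_len(n_scrolls)) {
    # Jump to the bottom of the page, then wait for new content to load.
    remote$executeScript("window.scrollTo(0, document.body.scrollHeight);")
    Sys.sleep(pause)
  }

  # getPageSource() returns a list; its first element is the HTML string.
  page <- read_html(remote$getPageSource()[[1]])

  page %>%
    html_nodes('.lBwEZb') %>%
    html_nodes('.DY5T1d') %>%
    html_text()
}
```

Calling scrape_scrolled_headlines(url) with the url defined above should then return a character vector of headlines, assuming the browser starts and the class names still apply.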

【Comments】: