R - Scrape a number of URLs and save individually

【Question title】: R - Scrape a number of URLs and save individually
【Posted】: 2020-05-18 17:16:24
【Question description】:

Disclaimer: I am not a professional programmer, and my knowledge of R is limited, to say the least. I have also searched Stack Overflow for a solution, to no avail.

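Here is my situation: I need to scrape a series of web pages and save the data (I'm not sure in what format yet, but I'll work that out). Fortunately, the pages I need to scrape have a very logical naming structure (they use the date).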

The base URL is: https://www.bbc.co.uk/schedules/p00fzl6p

I need to scrape everything from 1 August 2018 (whose URL is https://www.bbc.co.uk/schedules/p00fzl6p/2018/08/01) to yesterday (whose URL is https://www.bbc.co.uk/schedules/p00fzl6p/2020/05/17).

So far, I have worked out how to create a list of dates in the format the URLs use, with the following:

dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by=1)
dates <- format(dates,"20%y/%m/%d")

I can then append these to the base URL with the following:

url <- paste0("https://www.bbc.co.uk/schedules/p00fzl6p/",dates)

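However, that is about as far as I have got (not very far, I know!). I think I need to use a for loop, but my own attempts have proved futile. Perhaps I am not approaching this in the right way?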

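In case it isn't clear, what I am trying to do is visit each URL and save the HTML as an individual HTML file (ideally labelled with the relevant date). In fact, I don't need all of the HTML (just the list of programmes and times), but I can extract that information from the saved files later.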

Any guidance on the best way to approach this would be much appreciated! If you need more information, just ask.

【Question discussion】:

Tags: html r for-loop web-scraping

【Solution 1】:

Have a look at the rvest package and related tutorials, e.g. https://www.datacamp.com/community/tutorials/r-web-scraping-rvest. The messy part is extracting the fields the way you want them.

Here is one possible solution:

library(rvest)
#> Loading required package: xml2
library(magrittr)
library(stringr)
library(data.table)
# One URL per day; %Y gives the four-digit year directly
dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by=1)
dates <- format(dates,"%Y/%m/%d")
urls <- paste0("https://www.bbc.co.uk/schedules/p00fzl6p/", dates)

get_data <- function(url){
    html <- tryCatch(read_html(url), error=function(e) NULL)
    if(is.null(html)) return(data.table(
        date=gsub("https://www.bbc.co.uk/schedules/p00fzl6p/", "", url),
        title=NA, description=NA)) else {
            time <- html %>%
                rvest::html_nodes('body') %>%
                xml2::xml_find_all("//div[contains(@class, 'broadcast__info grid 1/4 1/6@bpb2 1/6@bpw')]") %>%
                rvest::html_text() %>% gsub(".*([0-9]{2}.[0-9]{2}).*", "\\1", .)
            text <- html %>%
                rvest::html_nodes('body') %>% 
                xml2::xml_find_all("//div[contains(@class, 'programme__body')]") %>% 
                rvest::html_text() %>% 
                gsub("[ ]{2,}", " ", .) %>% gsub("[\n|\n ]{2,}", "\n", .) %>% 
                gsub("\n(R)\n", " (R)", ., fixed = TRUE) %>% 
                gsub("^\n|\n$", "", .) %>% 
                str_split_fixed(., "\n", 2) %>% 
                as.data.table() %>% setnames(.,  c("title", "description")) %>% 
                .[, `:=`(date = gsub("https://www.bbc.co.uk/schedules/p00fzl6p/", "", url),
                         time = time,
                         description = gsub("\n", " ", description))] %>% 
                setcolorder(., c("date", "time", "title", "description"))
            text
        }
}
# mclapply relies on forking; on Windows mc.cores must be 1 (or use lapply instead)
res <- rbindlist(parallel::mclapply(urls, get_data, mc.cores = 6L))
res
#>              date  time
#>     1: 2018/08/01 06:00
#>     2: 2018/08/01 09:15
#>     3: 2018/08/01 10:00
#>     4: 2018/08/01 11:00
#>     5: 2018/08/01 11:45
#>    ---                 
#> 16760: 2020/05/17 22:20
#> 16761: 2020/05/17 22:30
#> 16762: 2020/05/17 00:20
#> 16763: 2020/05/17 01:20
#> 16764: 2020/05/17 01:25
#>                                                                       title
#>     1:                                                 Breakfast—01/08/2018
#>     2:                           Wanted Down Under—Series 11, Hanson Family
#>     3:                          Homes Under the Hammer—Series 21, Episode 6
#>     4:                                     Fake Britain—Series 7, Episode 7
#>     5: The Farmers' Country Showdown—Series 2 30-Minute Versions, Ploughing
#>    ---                                                                     
#> 16760:                                     BBC London—Late News, 17/05/2020
#> 16761:                                                       Educating Rita
#> 16762:                          The Real Marigold Hotel—Series 4, Episode 2
#> 16763:                                Weather for the Week Ahead—18/05/2020
#> 16764:                                            Joins BBC News—18/05/2020
#>                                                                                       description
#>     1:                The latest news, sport, business and weather from the BBC's Breakfast team.
#>     2: 22/24 Will a week in Melbourne help Keith persuade his wife Mary to move to Australia? (R)
#>     3:               Properties in Hertfordshire, Croydon and Derbyshire are sold at auction. (R)
#>     4:                       7/10 The fake sports memorabilia that cost collectors thousands. (R)
#>     5: 13/20 Farmers show the skill and passion needed to do well in a top ploughing competition.
#>    ---                                                                                           
#> 16760:                                            The latest news, sport and weather from London.
#> 16761:  Comedy drama about a hairdresser who dreams of rising above her drab urban existence. (R)
#> 16762:   2/4 The group take a night train to Madurai to attend the famous Chithirai festival. (R)
#> 16763:                                                                 Detailed weather forecast.
#> 16764:                          BBC One joins the BBC's rolling news channel for a night of news.

Created on 2020-05-18 by the reprex package (v0.3.0)
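
The solution above parses each page directly instead of saving it. If the raw HTML files are still wanted, one file per date as the question describes, a minimal base-R sketch could look like the following. The names `save_schedule_pages`, `out_dir`, `fetch` and `pause` are illustrative and not part of the original answer; `fetch` defaults to `download.file` and exists only so the loop can be tried without hitting the network.

```r
# Sketch: save each daily schedule page as its own HTML file, named by date.
# save_schedule_pages(), out_dir, fetch and pause are hypothetical names.
base_url <- "https://www.bbc.co.uk/schedules/p00fzl6p"

save_schedule_pages <- function(dates, out_dir = "schedules",
                                fetch = utils::download.file, pause = 1) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  urls  <- paste0(base_url, "/", format(dates, "%Y/%m/%d"))
  files <- file.path(out_dir, paste0(format(dates, "%Y-%m-%d"), ".html"))
  ok <- logical(length(urls))
  for (i in seq_along(urls)) {
    # A failed download is recorded as FALSE instead of stopping the loop.
    ok[i] <- tryCatch({
      status <- fetch(urls[i], files[i], quiet = TRUE)
      identical(as.integer(status), 0L)  # download.file returns 0 on success
    }, error = function(e) FALSE)
    Sys.sleep(pause)  # short pause between requests
  }
  data.frame(date = dates, url = urls, file = files, saved = ok,
             stringsAsFactors = FALSE)
}

# Usage (not run here, since it downloads one page per day):
# dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by = "day")
# log <- save_schedule_pages(dates)
# log[!log$saved, ]  # dates that need a retry
```

The returned data frame doubles as a log, so failed dates can be retried later, and each saved file can be parsed afterwards by passing its path to `read_html()`.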

【Discussion】:

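This works perfectly, thank you very much. After some more digging, I had realised that rvest was probably the way to solve this. However, I think it is still beyond my skill level. So thank you for going above and beyond. I really appreciate it!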
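OK, I have read up some more and am starting to understand rvest. However, I was wondering whether there is a way to also scrape the underlying URL of each title? I know the class of the URL and how it should be constructed, but I'm not sure how to integrate it into the code above. This is what I came up with: html_nodes(".br-blocklink__link") %>% html_attr("href"). Also, I was wondering whether there is a way to get the programme title and subtitle as separate fields? Again, I have found the relevant classes for each: ("programme__title") and ("programme__subtitle").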
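
One possible way to approach the follow-up above, sketched under assumptions: select each programme block first, then look up the title, subtitle and link inside each block with `html_node()` (singular), which returns one result per block and `NA` where an element is missing, so the columns stay aligned. The class names come from the comment and from the answer's code; that the link and both title elements sit inside each `programme__body` block is an assumption about the page markup that has not been checked here. `extract_programmes` is an illustrative name, and the inline HTML is made-up sample data, not real BBC markup.

```r
library(rvest)

# Sketch: one row per programme block, with title, subtitle and link as separate fields.
# Assumes every field lives inside a ".programme__body" element (unverified).
extract_programmes <- function(html) {
  items <- html_nodes(html, ".programme__body")
  data.frame(
    title    = html_text(html_node(items, ".programme__title"), trim = TRUE),
    subtitle = html_text(html_node(items, ".programme__subtitle"), trim = TRUE),
    link     = html_attr(html_node(items, ".br-blocklink__link"), "href"),
    stringsAsFactors = FALSE
  )
}

# Made-up sample markup; the second block deliberately has no subtitle.
sample_page <- read_html('<body>
  <div class="programme__body">
    <a class="br-blocklink__link" href="/programmes/x1">
      <span class="programme__title">Show A</span>
      <span class="programme__subtitle">Episode 1</span>
    </a>
  </div>
  <div class="programme__body">
    <a class="br-blocklink__link" href="/programmes/x2">
      <span class="programme__title">Show B</span>
    </a>
  </div>
</body>')

programmes <- extract_programmes(sample_page)
print(programmes)
```

Inside `get_data`, the same function could be called on the parsed `html` object and its columns combined with the `date` and `time` fields, provided the number of blocks matches the number of times found on the page.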