【发布时间】:2021-03-29 01:11:46
【问题描述】:
我正在从网站上抓取内容。为此,我遍历链接。如果发生错误,purrr 的 possibly 副词应继续进行,并在结果中放置一个“缺失”(或“NA_character”)。
当链接到的站点不存在时,以下代码将按预期工作,即输出“缺失”;
但是,如果链接到的站点存在,但我尝试从站点中提取的元素不存在,则尽管为“otherwise”定义了值,该函数仍会引发错误。
对我来说这很令人惊讶,因为文档指出
'
possibly: 包装函数在发生错误时使用默认值 (otherwise)。'
知道为什么会这样吗?我知道我可以相应地修改函数(例如检查返回对象的长度)。但我不明白为什么没有使用 'otherwise' 值。
library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.0.4
#> Warning: package 'tidyr' was built under R version 4.0.4
#> Warning: package 'dplyr' was built under R version 4.0.4
library(rvest)
#> Warning: package 'rvest' was built under R version 4.0.4
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
# possibly with wrong links when scraping site ----------------------------
#see https://github.com/tidyverse/purrr/issues/409
sample_data <- tibble::tibble(
link = c(
#link ok, selected item exists
"https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00068/index.shtml#tab-Sten.Protokoll",
#link not ok
"https://www.wrong-url.foobar",
#link ok, selected item does not exist on site
"https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00094/index.shtml#tab-Sten.Protokoll"
)
)
fn_get_link_to_records <- function(link_to_overview_sessions) {
print(link_to_overview_sessions)
link_to_overview_sessions %>%
rvest::read_html() %>%
rvest::html_elements("a") %>%
rvest::html_attr("href") %>%
enframe(name = NULL,
value = "link_to_text") %>%
filter(str_detect(link_to_text, regex("\\/NRSITZ_\\d+\\/fnameorig_\\d+\\.html$"))) %>%
mutate(link_to_text=glue::glue("https://www.parlament.gv.at/{link_to_text}")) %>%
pull()
}
sample_data %>%
mutate(link_to_text=map_chr(link,
possibly(fn_get_link_to_records,
otherwise=NA_character_)))
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00068/index.shtml#tab-Sten.Protokoll"
#> [1] "https://www.wrong-url.foobar"
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00094/index.shtml#tab-Sten.Protokoll"
#> Error: Problem with `mutate()` input `link_to_text`.
#> x Result 3 must be a single string, not a vector of class `glue/character` and of length 0
#> i Input `link_to_text` is `map_chr(link, possibly(fn_get_link_to_records, otherwise = NA_character_))`.
sample_data %>%
mutate(link_to_text=map_chr(link,
possibly(fn_get_link_to_records,
otherwise="missing")))
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00068/index.shtml#tab-Sten.Protokoll"
#> [1] "https://www.wrong-url.foobar"
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00094/index.shtml#tab-Sten.Protokoll"
#> Error: Problem with `mutate()` input `link_to_text`.
#> x Result 3 must be a single string, not a vector of class `glue/character` and of length 0
#> i Input `link_to_text` is `map_chr(link, possibly(fn_get_link_to_records, otherwise = "missing"))`.
由reprex package (v1.0.0) 于 2021-03-28 创建
更新:我添加了下面的输出以使意外结果(最后一个块)更清晰。
sample_data[1:2,] %>%
mutate(link_to_text=map_chr(link,
possibly(fn_get_link_to_records,
otherwise="missing")))
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00068/index.shtml#tab-Sten.Protokoll"
#> [1] "https://www.wrong-url.foobar"
#> # A tibble: 2 x 2
#> link link_to_text
#> <chr> <chr>
#> 1 https://www.parlament.gv.at/PAKT/VHG~ https://www.parlament.gv.at//PAKT/VHG/X~
#> 2 https://www.wrong-url.foobar missing
sample_data[3, ] %>%
mutate(link_to_text=map_chr(link,
possibly(fn_get_link_to_records,
otherwise="missing")))
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00094/index.shtml#tab-Sten.Protokoll"
#> Error: Problem with `mutate()` input `link_to_text`.
#> x Result 1 must be a single string, not a vector of class `glue/character` and of length 0
#> i Input `link_to_text` is `map_chr(link, possibly(fn_get_link_to_records, otherwise = "missing"))`.
由reprex package (v1.0.0) 于 2021 年 3 月 29 日创建
【问题讨论】:
-
对我来说,它按预期工作,3 个值为“缺失”,没有任何错误。你的
packageVersion('purrr')是什么?我在[1] ‘0.3.4’。 -
很容易测试,运行这个:
possibly( function(){stop("FOO!")}, otherwise=NA, quiet=TRUE )()。 -
@RonakShah 我在 0.3.4。也是。我更新了上面的问题,以表明前 2 个链接会产生预期的输出;然而,最后一个让我感到惊讶。
-
我刚刚在
fn_get_link_to_records函数中发现rvest中没有html_elements函数,而且read_html不是来自rvest而是xml2。我不确定它是如何为您工作的,但即使对于前 2 个值,由于这些错误,我也将它们都设为missing。 -
html_elements 是在 rvest 1.0 中引入的