【Question title】: Error despite purrr's 'otherwise' - Why is purrr/possibly's 'otherwise' not triggered?
【Posted】: 2021-03-29 01:11:46
【Question】:

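I am scraping content from a website by iterating over a set of links. If an error occurs, purrr's possibly adverb should let the iteration continue and put "missing" (or NA_character_) into the result.

The code below works as expected when the linked site does not exist, i.e. it outputs "missing". However, if the linked site exists but the element I try to extract from it does not, the function throws an error even though a value is defined for 'otherwise'.

This surprises me, because the documentation states

'possibly: wrapped function uses a default value (otherwise) whenever an error occurs.'

Any idea why this happens? I know I could modify the function accordingly (e.g. check the length of the returned object), but I do not understand why the 'otherwise' value is not used.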

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.0.4
#> Warning: package 'tidyr' was built under R version 4.0.4
#> Warning: package 'dplyr' was built under R version 4.0.4
library(rvest)
#> Warning: package 'rvest' was built under R version 4.0.4
#> 
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#> 
#>     guess_encoding

# possibly with wrong links when scraping site ----------------------------
#see https://github.com/tidyverse/purrr/issues/409

sample_data <- tibble::tibble(
  link = c(
    #link ok, selected item exists
    "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00068/index.shtml#tab-Sten.Protokoll",
    #link not ok
    "https://www.wrong-url.foobar",
    #link ok, selected item does not exist on site
    "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00094/index.shtml#tab-Sten.Protokoll"
    
           )
)


fn_get_link_to_records <- function(link_to_overview_sessions) {

  print(link_to_overview_sessions)

  link_to_overview_sessions %>% 
    rvest::read_html() %>% 
    rvest::html_elements("a") %>% 
    rvest::html_attr("href") %>% 
    enframe(name = NULL,
            value = "link_to_text") %>% 
    filter(str_detect(link_to_text, regex("\\/NRSITZ_\\d+\\/fnameorig_\\d+\\.html$"))) %>% 
    mutate(link_to_text=glue::glue("https://www.parlament.gv.at/{link_to_text}")) %>% 
    pull()
}


sample_data %>% 
  mutate(link_to_text=map_chr(link, 
                              possibly(fn_get_link_to_records,
                                       otherwise=NA_character_)))
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00068/index.shtml#tab-Sten.Protokoll"
#> [1] "https://www.wrong-url.foobar"
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00094/index.shtml#tab-Sten.Protokoll"
#> Error: Problem with `mutate()` input `link_to_text`.
#> x Result 3 must be a single string, not a vector of class `glue/character` and of length 0
#> i Input `link_to_text` is `map_chr(link, possibly(fn_get_link_to_records, otherwise = NA_character_))`.

sample_data %>% 
  mutate(link_to_text=map_chr(link, 
                              possibly(fn_get_link_to_records,
                                       otherwise="missing")))
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00068/index.shtml#tab-Sten.Protokoll"
#> [1] "https://www.wrong-url.foobar"
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00094/index.shtml#tab-Sten.Protokoll"
#> Error: Problem with `mutate()` input `link_to_text`.
#> x Result 3 must be a single string, not a vector of class `glue/character` and of length 0
#> i Input `link_to_text` is `map_chr(link, possibly(fn_get_link_to_records, otherwise = "missing"))`.

Created on 2021-03-28 by the reprex package (v1.0.0)

Update: I added the output below to make the unexpected result (the last chunk) clearer.

sample_data[1:2,] %>% 
  mutate(link_to_text=map_chr(link, 
                              possibly(fn_get_link_to_records,
                                       otherwise="missing")))
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00068/index.shtml#tab-Sten.Protokoll"
#> [1] "https://www.wrong-url.foobar"
#> # A tibble: 2 x 2
#>   link                                  link_to_text                            
#>   <chr>                                 <chr>                                   
#> 1 https://www.parlament.gv.at/PAKT/VHG~ https://www.parlament.gv.at//PAKT/VHG/X~
#> 2 https://www.wrong-url.foobar          missing
sample_data[3, ] %>% 
  mutate(link_to_text=map_chr(link, 
                              possibly(fn_get_link_to_records,
                                       otherwise="missing")))
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00094/index.shtml#tab-Sten.Protokoll"
#> Error: Problem with `mutate()` input `link_to_text`.
#> x Result 1 must be a single string, not a vector of class `glue/character` and of length 0
#> i Input `link_to_text` is `map_chr(link, possibly(fn_get_link_to_records, otherwise = "missing"))`.

Created on 2021-03-29 by the reprex package (v1.0.0)

【Comments】:

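  • For me it works as expected: all 3 values are "missing" and there is no error. What is your packageVersion('purrr')? I am on [1] ‘0.3.4’
  • It is easy to test, run this: possibly( function(){stop("FOO!")}, otherwise=NA, quiet=TRUE )()
  • @RonakShah I am on 0.3.4 too. I updated the question above to show that the first 2 links produce the expected output; the last one, however, surprises me.
  • I just noticed that in fn_get_link_to_records there is no html_elements function in rvest, and read_html is not from rvest but from xml2. I am not sure how this works for you, but because of these errors I get "missing" even for the first 2 values.
  • html_elements was introduced in rvest 1.0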

Tags: r purrr


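【Solution 1】:

The error comes from map_chr, but you wrapped possibly around the fn_get_link_to_records function. If you run fn_get_link_to_records(sample_data$link[3]), you will see that the URL gets printed, an empty (length-0) result is returned, and no error is raised. Because the wrapped function never fails, possibly has nothing to catch. map_chr, however, cannot turn this empty output into a single character value, so the error is raised there, outside the wrapped function. If you use map instead of map_chr, you will see that it works.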

sample_data[3,] %>% 
  mutate(link_to_text= map(link, fn_get_link_to_records))

#[1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00094/index.shtml#tab-Sten.Protokoll"
# A tibble: 1 x 2
#  link                                                                                     link_to_text
#  <chr>                                                                                    <list>      
#1 https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00094/index.shtml#tab-Sten.Pro… <glue [0]> 

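But link_to_text is empty. The solution, as you already know, is to check the length of the output value and either return NA or raise an error inside the fn_get_link_to_records function; an error raised there will be handled by possibly.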
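
To illustrate the second option without hitting the network, here is a minimal sketch. The helpers find_match and find_match_strict are hypothetical stand-ins for fn_get_link_to_records (they are not part of the original code); the only assumption is that the real function returns a length-0 vector when nothing matches.

```r
library(purrr)

# Hypothetical stand-in for fn_get_link_to_records(): returns a zero-length
# character vector when nothing matches, without raising an error.
find_match <- function(x) {
  x[grepl("^ok", x)]
}

# Variant that turns "no single match" into an error raised INSIDE the
# function, which is the only place possibly() can catch it.
find_match_strict <- function(x) {
  out <- find_match(x)
  if (length(out) != 1) {
    stop("expected exactly one match, got ", length(out))
  }
  out
}

inputs <- c("ok-1", "bad", "ok-2")

# Without the check, map_chr() itself fails on the length-0 result;
# possibly() never sees an error, so 'otherwise' is not used.
unchecked <- tryCatch(
  map_chr(inputs, possibly(find_match, otherwise = NA_character_)),
  error = function(e) "map_chr failed"
)

# With the check, the error is caught and replaced by 'otherwise'.
result <- map_chr(inputs, possibly(find_match_strict, otherwise = NA_character_))
result
# "ok-1" NA "ok-2"
```

The same check could be added at the end of fn_get_link_to_records. Returning NA_character_ directly for a length-0 result would work as well; raising an error instead keeps all the fallback handling in possibly.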
