Tracking which group fails in a dplyr chain

【Question title】: Tracking which group fails in a dplyr chain
【Posted】: 2017-02-16 18:02:54
【Question】:

When using group_by in a dplyr-style chain, how can I find out which group failed? For example:

library(dplyr)

data(iris)

iris %>%
  group_by(Species) %>%
  do(mod=lm(Petal.Length ~ Petal.Width, data = .)) %>%
  mutate(Slope = summary(mod)$coeff[2])

works fine. Now suppose I add some problematic data to iris:

iris$Petal.Width[iris$Species=="versicolor"]= NA

With that change, fitting the linear model to this species fails:

iris_sub <- iris[iris$Species=="versicolor",]
lm(Petal.Length ~ Petal.Width, data = iris_sub)

However, if I were approaching this blind, running the chain over a large dataset:

iris %>%
  group_by(Species) %>%
  do(mod=lm(Petal.Length ~ Petal.Width, data = .)) %>%
  mutate(Slope = summary(mod)$coeff[2])

the error message does not help me find which level the model failed on:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases

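I could use a loop like the one below. That at least tells me which level of Species the function failed on. However, I would prefer to stay within a dplyr setup: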

lmdf <- c()
for (i in unique(iris$Species)) {
  cat(i, "\n")
  u <- iris %>%
    filter(Species==i) %>%
    do(mod=lm(Petal.Length ~ Petal.Width, data = .))
  lmdf = rbind(lmdf, u)
}

Any suggestions for a better way to do this? To summarize, I am trying to use a dplyr-style framework to determine at which group level a function fails.

The tryCatch solution referenced here no longer seems to work. I get this error:

Error in tryCatch(lm(v3 ~ v4, df), error = if (e$message == all_na_msg) default else stop(e)) : object 'e' not found
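
The "object 'e' not found" message is consistent with the error argument being a bare expression instead of a function: tryCatch expects each handler to be a function that receives the condition object. A minimal base R sketch of that pattern, assuming the modified iris data from above; the names groups and mods and the wording of the message are only illustrative:

```r
# Sketch only: `groups`, `mods` and the message text are illustrative names.
data(iris)
iris$Petal.Width[iris$Species == "versicolor"] <- NA

groups <- split(iris, iris$Species)

mods <- lapply(names(groups), function(g) {
  tryCatch(
    lm(Petal.Length ~ Petal.Width, data = groups[[g]]),
    # The handler must be a function; `e` is the condition that was raised.
    error = function(e) {
      message("Group '", g, "' failed: ", conditionMessage(e))
      NULL  # stand-in value for a failed group
    }
  )
})
names(mods) <- names(groups)
```

Here the failing group is reported by name while the other groups are still fitted.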

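【Comments】:

Very close. Possible duplicate of catch NAs using linear model with dplyr
I think I may have misrepresented the question a little. lm() could be any function that fails. What I am trying to do is determine in which group (or combination of groups) the function called by do() fails.
@Axeman - I'm interested in the purrr example, but I'm not sure how to handle the output of safe_lm. For example, with safe_lm <- safely(lm), safe_lm(Petal.Length ~ Petal.Width, data = iris_sub) returns a list. How do I incorporate that into my chain, iris %>% group_by(Species) %>% do(safe_lm(.))?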

Tags: r dplyr

【Solution 1】:

A complete example using purrr::safely:

Preparation

library(tidyverse)
data(iris)
iris$Petal.Width[iris$Species == "versicolor"] <-  NA

Running the model safely

If you are not interested in the actual error (i.e. that the cause is 0 (non-NA) cases), you can do this:

iris %>%
  group_by(Species) %>%
  do(mod = safely(lm)(Petal.Length ~ Petal.Width, data = .)$result) %>% 
  mutate(Slope = ifelse(!is.null(mod), summary(mod)$coeff[2], NA))

And we are done!

Source: local data frame [3 x 3]
Groups: <by row>

# A tibble: 3 × 3
     Species      mod     Slope
      <fctr>   <list>     <dbl>
1     setosa <S3: lm> 0.5464903
2 versicolor   <NULL>        NA
3  virginica <S3: lm> 0.6472593

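We can clearly see which group failed (it has NULL instead of a model, and its Slope is NA). Moreover, we still get the correct models and slopes for the other groups, so no computation time is wasted (which is very welcome when running complex models on large datasets).

Tracking models and errors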

step1 <- iris %>%
  group_by(Species) %>%
  do(res = safely(lm)(Petal.Length ~ Petal.Width, data = .)) %>%
  mutate(err = map(list(res), 'error'),
         mod = map(list(res), 'result'))

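Unfortunately, we had to use an extra list call there; I am not entirely sure why. Alternatively, you can ungroup first.

To see which groups (if any) had errors, we can use: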

filter(step1, !is.null(err))

To salvage the results of the groups that did not error, just filter first:

step1 %>% 
  filter(is.null(err)) %>% 
  mutate(Slope = summary(mod)$coeff[2])

If you want to get the model coefficients in a tidy chain, also have a look at the broom package.
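
Building on the res column above, here is a minimal sketch that lists the failing groups together with their error messages. It takes the ungroup route mentioned earlier so that res can be treated as an ordinary list column; the names safe_lm, fits, failed and err_msg are illustrative, and it assumes dplyr and purrr are installed:

```r
library(dplyr)
library(purrr)

data(iris)
iris$Petal.Width[iris$Species == "versicolor"] <- NA

# safely() wraps lm so it returns list(result = , error = ) instead of throwing
safe_lm <- safely(lm)

fits <- iris %>%
  group_by(Species) %>%
  do(res = safe_lm(Petal.Length ~ Petal.Width, data = .)) %>%
  ungroup()  # drop the row-wise grouping so `res` is a plain list column

# One row per failed group, with the text of the error that was caught
failed <- fits %>%
  mutate(err_msg = map_chr(res, function(r) {
    if (is.null(r$error)) NA_character_ else conditionMessage(r$error)
  })) %>%
  filter(!is.na(err_msg)) %>%
  select(Species, err_msg)

failed
```

For this data, failed should contain a single row for versicolor.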

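【Comments】:

That's it! Quick follow-up: if I have a tibble like this: 6 <list [2]> <NULL> <data.frame [1 × 4]> - how would I extract the data frame? I see how you handled the model results, but how do I extract the data frame?
Use unnest, I guess, but you may have to filter out the NULL rows first.

【Solution 2】:

If you are not set on dplyr, you can use a split-apply approach with try from base R. Here is one way: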

# use split to make a list of data sets by group (here, species)
iris.split <- split(iris, iris$Species)

# iterate your modeling function over that list, using 'try' to let the
# process keep running when an error is thrown and logging an object of
# class "try-error" in that slot on the resulting list
iris.mods <- lapply(iris.split, function(i) try(lm(Petal.Length ~ Petal.Width, data = i)))

# get a vector of slopes from those models with NA where any errors killed
# the modeling process
slopes <- sapply(iris.mods, function(x) ifelse(is(x, "try-error"), NA, x$coefficients[2]))

Result:

> slopes
    setosa versicolor  virginica 
 0.5464903         NA  0.6472593
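
To get the names of the failing groups and their messages directly, instead of reading them off the NA slopes, a minimal sketch along the same lines (failed and msgs are illustrative names; silent = TRUE only keeps try from printing the error as it goes):

```r
data(iris)
iris$Petal.Width[iris$Species == "versicolor"] <- NA

iris.split <- split(iris, iris$Species)
iris.mods <- lapply(iris.split, function(i) {
  try(lm(Petal.Length ~ Petal.Width, data = i), silent = TRUE)
})

# TRUE for every group whose model fit threw an error
failed <- vapply(iris.mods, inherits, logical(1), what = "try-error")
names(failed)[failed]

# A "try-error" object carries the original condition as an attribute
msgs <- vapply(iris.mods[failed],
               function(x) conditionMessage(attr(x, "condition")),
               character(1))
msgs
```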

【Comments】:

This is a really great answer. Unfortunately, in this case I am married to dplyr.