Identify the column name of the last occurrence of a value in an R data frame: answers

【Question title】: Identify the column name of the last occurrence of a value in R data frame
【Posted】: 2021-06-11 03:19:40
【Question description】:

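I have a dataset like the one below, made up of columns of 1s and 0s. I want to add a final column that gives, for each row, the name of the column in which the last 0 occurs.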

have = data.frame(a = c(1,0,1,1,0,0,1,1,1,0),
                  b = c(1,0,1,1,1,0,1,1,0,0),
                  c = c(0,0,0,1,0,1,1,1,1,0),
                  d = c(1,0,1,1,0,0,0,1,0,1),
                  e = c(1,1,1,1,1,1,1,1,1,1))
> have
   a b c d e
1  1 1 0 1 1
2  0 0 0 0 1
3  1 1 0 1 1
4  1 1 1 1 1
5  0 1 0 0 1
6  0 0 1 0 1
7  1 1 1 0 1
8  1 1 1 1 1
9  1 0 1 0 1
10 0 0 0 1 1

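I would like the output to look like this, where the last column gives the name of the column holding the last 0, or NA if the row contains no 0.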

> want
   a b c d e last_0
1  1 1 0 1 1      c
2  0 0 0 0 1      d
3  1 1 0 1 1      c
4  1 1 1 1 1   <NA>
5  0 1 0 0 1      d
6  0 0 1 0 1      d
7  1 1 1 0 1      d
8  1 1 1 1 1   <NA>
9  1 0 1 0 1      d
10 0 0 0 1 1      c

I tried max.col, but when a row contains no zero it returns the last column name. Are there other solutions? A dplyr solution is preferred.

> have$last_0 = names(have)[max.col(have == 0, ties.method = "last")]
> have
   a b c d e last_0
1  1 1 0 1 1      c
2  0 0 0 0 1      d
3  1 1 0 1 1      c
4  1 1 1 1 1      e
5  0 1 0 0 1      d
6  0 0 1 0 1      d
7  1 1 1 0 1      d
8  1 1 1 1 1      e
9  1 0 1 0 1      d
10 0 0 0 1 1      c

【Question discussion】:

Tags: r dplyr

【Solution 1】:

Here is an approach using purrr::pmap:

library(dplyr);library(purrr)
have %>% 
   mutate(want = pmap_chr(cur_data(), 
                          ~ tail(c(NA,names(which(c(...)==0))),1)))
   a b c d e want
1  1 1 0 1 1    c
2  0 0 0 0 1    d
3  1 1 0 1 1    c
4  1 1 1 1 1 <NA>
5  0 1 0 0 1    d
6  0 0 1 0 1    d
7  1 1 1 0 1    d
8  1 1 1 1 1 <NA>
9  1 0 1 0 1    d
10 0 0 0 1 1    c

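purrr::pmap is a very useful function because it processes the data row by row, and it comes in several variants (such as pmap_chr), so you can control the type of what is returned. The whole row can be referenced with c(...).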
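To make the role of c(...) concrete, here is a minimal sketch on a small made-up data frame. The names toy and last_zero are illustrative only and not part of the answer. pmap_chr passes each row's values as named arguments, so c(...) collects them into a named vector:

```r
library(purrr)

# Hypothetical toy data, not the data from the question
toy <- data.frame(a = c(1, 0, 1),
                  b = c(0, 0, 1),
                  c = c(1, 0, 1))

# Illustrative helper: receives one row as named arguments
last_zero <- function(...) {
  row <- c(...)                    # named numeric vector, one element per column
  zeros <- names(row)[row == 0]    # names of the columns holding a 0
  if (length(zeros) == 0) NA_character_ else zeros[length(zeros)]
}

result <- pmap_chr(toy, last_zero)
result
```

The three rows give "b", "c" and NA respectively, since the third row has no 0.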

If you only want to apply the procedure to a subset of the columns, you can use dplyr::select:

have %>% 
    mutate(want = pmap_chr(cur_data() %>% select(a,b,c), 
                           ~ tail(c(NA,names(which(c(...)==0))),1)))
   a b c d e want
1  1 1 0 1 1    c
2  0 0 0 0 1    c
3  1 1 0 1 1    c
4  1 1 1 1 1 <NA>
5  0 1 0 0 1    c
6  0 0 1 0 1    b
7  1 1 1 0 1 <NA>
8  1 1 1 1 1 <NA>
9  1 0 1 0 1    b
10 0 0 0 1 1    c
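Note that cur_data() was deprecated in dplyr 1.1.0 in favour of pick(). Assuming dplyr 1.1.0 or later is installed, the same column-subset idea could be sketched as follows. The helper last_zero is an illustrative name, and this is only one possible way to write it:

```r
library(dplyr)
library(purrr)

have <- data.frame(a = c(1,0,1,1,0,0,1,1,1,0),
                   b = c(1,0,1,1,1,0,1,1,0,0),
                   c = c(0,0,0,1,0,1,1,1,1,0),
                   d = c(1,0,1,1,0,0,0,1,0,1),
                   e = c(1,1,1,1,1,1,1,1,1,1))

# Illustrative helper: last column name holding a 0, or NA if there is none
last_zero <- function(...) {
  row <- c(...)
  tail(c(NA_character_, names(row)[row == 0]), 1)
}

# pick(a, b, c) selects the columns to scan (assumes dplyr >= 1.1.0)
res <- have %>%
  mutate(want = pmap_chr(pick(a, b, c), last_zero))
res
```

The want column should match the output shown above for the select(a,b,c) version.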

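【Discussion】:

Thanks! This is very helpful. Is there a way to select specific columns to apply it to?

【Solution 2】:

We can use max.col and then replace the entries for rows that contain no 0 with NA: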

have$last_0 <- names(have)[(NA^!rowSums(have == 0)) * max.col(have == 0, 'last')]

Output:

have
   a b c d e last_0
1  1 1 0 1 1      c
2  0 0 0 0 1      d
3  1 1 0 1 1      c
4  1 1 1 1 1   <NA>
5  0 1 0 0 1      d
6  0 0 1 0 1      d
7  1 1 1 0 1      d
8  1 1 1 1 1   <NA>
9  1 0 1 0 1      d
10 0 0 0 1 1      c
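The one-liner relies on two facts: max.col with ties.method = "last" falls back to the last column when a row is all FALSE, and NA^0 is 1 in R while NA^1 is NA. A small sketch on a made-up matrix (m and the intermediate names are illustrative, not from the answer) shows each step:

```r
# Hypothetical 3 x 3 example; row 2 contains no zero
m <- matrix(c(1, 0, 1,
              1, 1, 1,
              0, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("a", "b", "c")))

is_zero <- m == 0

# Row 2 is all FALSE, so every column ties and "last" picks column 3
idx <- max.col(is_zero, ties.method = "last")

# !rowSums(is_zero) is TRUE only for rows without a zero;
# NA^TRUE is NA and NA^FALSE is 1, so this is a multiplicative NA mask
mask <- NA^(!rowSums(is_zero))

last_zero <- colnames(m)[mask * idx]
last_zero
```

Here idx is 2, 3, 2 and mask is 1, NA, 1, so the result is "b", NA, "b".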

【Discussion】:

Brilliant strategy! +1

【Solution 3】:

Here is my approach:

library(dplyr)

have %>%
  purrr::pmap_dfr(\(...) tibble(...,
                                last_0 = which(c(...) == 0) %>%
                                  names %>%
                                  last))

Returns:

# A tibble: 10 x 6
       a     b     c     d     e last_0
   <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
 1     1     1     0     1     1 c
 2     0     0     0     0     1 d
 3     1     1     0     1     1 c
 4     1     1     1     1     1 NA
 5     0     1     0     0     1 d
 6     0     0     1     0     1 d
 7     1     1     1     0     1 d
 8     1     1     1     1     1 NA
 9     1     0     1     0     1 d
10     0     0     0     1     1 c

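【Discussion】:

Interesting choice to use the base \() instead of the tidyverse ~. I like the use of dplyr::last, since it saves some typing compared with the tail(..., 1) I came up with.
Thanks! I really like the pmap_chr(cur_data(), part of your solution. Definitely one to remember! Do you know whether there is a way to use (...) in purrr style?
Oh, you did it! Never mind, very cool! :)
When you use ~, Tidyeval automatically assigns all arguments to .... You also get .x, .y, ..1, ..2, and so on.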
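For reference, the placeholders mentioned in the last comment can be tried on made-up inputs. This is only a small sketch of purrr's formula shorthand, and the variable names are illustrative:

```r
library(purrr)

# .x and .y refer to the first and second argument
sums_xy <- map2_dbl(1:3, 4:6, ~ .x + .y)

# ..1, ..2, ..3 refer to arguments by position
sums_pos <- pmap_dbl(list(1:3, 4:6, 7:9), ~ ..1 + ..2 + ..3)

# ... holds all arguments at once
sums_dots <- pmap_dbl(list(1:3, 4:6, 7:9), ~ sum(...))
```

All three placeholder styles see the same arguments, so sums_pos and sums_dots are equal.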

【Solution 4】:

We can use pmax:

have$last_0 <- names(have)[replace(do.call(pmax, data.frame((have == 0) * col(have))), rowSums(have) == ncol(have), NA)]

Another base R option, using max.col:

have$last_0 <- replace(names(have)[max.col(1 - have, "last")], rowSums(have) == ncol(have), NA)

which gives

> have
   a b c d e last_0
1  1 1 0 1 1      c
2  0 0 0 0 1      d
3  1 1 0 1 1      c
4  1 1 1 1 1   <NA>
5  0 1 0 0 1      d
6  0 0 1 0 1      d
7  1 1 1 0 1      d
8  1 1 1 1 1   <NA>
9  1 0 1 0 1      d
10 0 0 0 1 1      c
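The pmax version works by turning every 0 into its column index and every other value into 0, then taking the row-wise maximum. A minimal sketch on a made-up matrix (m, pos and last_pos are illustrative names) walks through the intermediate values:

```r
# Hypothetical 3 x 3 example; row 2 contains no zero
m <- matrix(c(1, 0, 1,
              1, 1, 1,
              0, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("a", "b", "c")))

# col(m) holds each cell's column index; multiplying by (m == 0)
# keeps the index where the cell is 0 and gives 0 elsewhere
pos <- (m == 0) * col(m)

# Row-wise maximum = index of the last zero, or 0 if the row has none
last_pos <- do.call(pmax, as.data.frame(pos))

# A 0 index would silently drop the element, so turn it into NA first
last_pos[last_pos == 0] <- NA
last_zero <- colnames(m)[last_pos]
last_zero
```

The replacement of 0 by NA before indexing matters: indexing with 0 drops the element instead of returning NA, which would shorten the result.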

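【Discussion】:

【Solution 5】:

You can also use the following solution, though it may look a bit verbose. Since most of the possible row-wise operations have already been suggested, I thought I would try something I had not done before: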

library(dplyr)
library(tidyr)

have %>% 
  mutate(id = row_number()) %>%
  pivot_longer(-id, names_to = "Last_0", values_to = "val") %>%
  group_by(id) %>% 
  arrange(desc(val), .by_group = TRUE) %>%
  slice_tail(n = 1) %>%
  mutate(Last_0 = ifelse(val == 1, NA_character_, Last_0)) %>%  # a real NA, not the string "NA"
  select(Last_0) %>%
  bind_cols(have) %>%
  relocate(Last_0, .after = last_col())

# A tibble: 10 x 7
# Groups:   id [10]
      id     a     b     c     d     e Last_0
   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 
 1     1     1     1     0     1     1 c     
 2     2     0     0     0     0     1 d     
 3     3     1     1     0     1     1 c     
 4     4     1     1     1     1     1 NA    
 5     5     0     1     0     0     1 d     
 6     6     0     0     1     0     1 d     
 7     7     1     1     1     0     1 d     
 8     8     1     1     1     1     1 NA    
 9     9     1     0     1     0     1 d     
10    10     0     0     0     1     1 c

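【Discussion】:

【Solution 6】:

Although the strategy used by dear @akrun is brilliant, you can simply store the computed values in a temporary variable and replace them like this. Inside a dplyr pipeline this works only with cur_data().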

library(dplyr)

have %>% mutate(last_0 = {xx <- names(.)[max.col(cur_data() == 0, ties.method = 'last')];
                          replace(xx, rowSums(cur_data() == 0) == 0, NA)})
#>    a b c d e last_0
#> 1  1 1 0 1 1      c
#> 2  0 0 0 0 1      d
#> 3  1 1 0 1 1      c
#> 4  1 1 1 1 1   <NA>
#> 5  0 1 0 0 1      d
#> 6  0 0 1 0 1      d
#> 7  1 1 1 0 1      d
#> 8  1 1 1 1 1   <NA>
#> 9  1 0 1 0 1      d
#> 10 0 0 0 1 1      c

Created on 2021-06-10 by the reprex package (v2.0.0)

【Discussion】:

【Solution 7】:

A minimalist base R approach (it really only uses grep as the search condition):

data.frame(have,
     last_0 = unlist(apply(have, 1, function(x) {
                sol <- grep(0, x)             # positions whose value matches the pattern "0"
                if (length(sol) > 0) {        # the row contains at least one 0
                  colnames(have)[sol[length(sol)]]
                } else {
                  NA_character_ } })))

   a b c d e last_0
1  1 1 0 1 1      c
2  0 0 0 0 1      d
3  1 1 0 1 1      c
4  1 1 1 1 1   <NA>
5  0 1 0 0 1      d
6  0 0 1 0 1      d
7  1 1 1 0 1      d
8  1 1 1 1 1   <NA>
9  1 0 1 0 1      d
10 0 0 0 1 1      c
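One caveat worth keeping in mind: grep performs pattern matching on the values converted to text, so it is only equivalent to an equality test when the data are strictly 0/1 as in the question. A short sketch on a made-up row (x is illustrative) shows the difference from which(x == 0):

```r
# Hypothetical row with values other than 0 and 1
x <- c(a = 1, b = 10, c = 0, d = 5)

# grep matches any value whose text contains "0", so 10 matches too
via_grep <- grep(0, x)

# An exact numeric comparison only finds the true zero
via_which <- unname(which(x == 0))

via_grep
via_which
```

With data beyond 0/1, swapping grep(0, x) for which(x == 0) in the function above would be the safer choice.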

【Discussion】: