【Question title】: Create a character variable with values conditional on previous variables (both their names and their values) with dplyr
【Posted】: 2020-03-02 21:06:00
【Question】:

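I have a data frame with about 100 variables and several thousand observations. Because of the values in certain variables, some of these observations do not qualify for further analysis. Rather than just dropping the disqualified observations, I would like to create a character variable that indicates whether an observation has been disqualified and, if so, because of which variables (an observation can be disqualified by more than one variable).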

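Most of the variables are numeric and take one of the values -1, 0, or 1. In addition, the variables that can disqualify an observation can also take the value 99, which means disqualified.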

# create example data
df <- data.frame(id = c(1:6),
                 AA_B = c(1, 0, NA, 1, -1, 99),
                 A_B_C = c(0, 0, 0, -1, 1, NA),
                 A_BB = c(-1, 99, 0, 0, -1, NA),
                 B_C = c(99, NA, 1, 99, 0, 99),
                 D_AC = c(1, 1, 1, 1, -1, -1))

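If an observation is disqualified, the new variable "disqualify" should read something like "Disqualified because of A_BB" or "Disqualified because of AA_B and B_C" (depending on which variables caused the disqualification); otherwise it can be any other string or simply NA. The result should therefore look like this: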

> df
  id AA_B A_B_C A_BB B_C D_AC                           disqualify
1  1    1     0   -1  99    1          Disqualified because of B_C
2  2    0     0   99  NA    1         Disqualified because of A_BB
3  3   NA     0    0   1    1                                 <NA>
4  4    1    -1    0  99    1          Disqualified because of B_C
5  5   -1     1   -1   0   -1                                 <NA>
6  6   99    NA   NA  99   -1 Disqualified because of AA_B and B_C

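I am struggling to find a way to automatically include, in the "disqualify" string, the names of the variables that caused the disqualification. So far I have come up with the following solution, but it is a horrible piece of code and I am sure there must be a better way to do this.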

df <-
  df %>%
  mutate(disqualify = case_when(AA_B == 99 |
                                  A_BB == 99 | 
                                  B_C == 99 ~ paste("Disqualified because of",
                                                    case_when(AA_B == 99 & (is.na(A_BB) | A_BB != 99) & (is.na(B_C) | B_C != 99) ~ deparse(substitute(AA_B)),
                                                              AA_B == 99 & A_BB == 99 & (is.na(B_C) | B_C != 99) ~ paste(deparse(substitute(AA_B)), deparse(substitute(A_BB)), sep = " and "),
                                                              AA_B == 99 & A_BB == 99 & B_C == 99 ~ paste(deparse(substitute(AA_B)), deparse(substitute(A_BB)), deparse(substitute(B_C)), sep = " and "),
                                                              AA_B == 99 & (is.na(A_BB) | A_BB != 99) & B_C == 99 ~ paste(deparse(substitute(AA_B)), deparse(substitute(B_C)), sep = " and "),
                                                              (is.na(AA_B) | AA_B != 99) & A_BB == 99 & B_C == 99 ~ paste(deparse(substitute(A_BB)), deparse(substitute(B_C)), sep = " and "),
                                                              (is.na(AA_B) | AA_B != 99) & A_BB == 99 & (is.na(B_C) | B_C != 99) ~ deparse(substitute(A_BB)),
                                                              (is.na(AA_B) | AA_B != 99) & (is.na(A_BB) | A_BB != 99) & B_C == 99 ~ deparse(substitute(B_C))
                                                              ))))

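If possible, I would prefer a dplyr solution that lets me refer to the disqualifying variables by name (no indexing).

On top of that, it would be great if there were a way to replace the variable names in the output variable with another string, so that "Disqualified because of A_BB" could become "Disqualified because of Weather".

Any help is appreciated!

【Comments】: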

    Tags: r string dplyr conditional-statements


    【Solution 1】:
    library(dplyr)
    df %>%
       #Check for 99 in specific columns
       mutate(disqualify = apply(.[,c('AA_B','A_B_C','A_BB','B_C')], 1, function(x) ifelse(any(x==99), 
                                       paste0("Disqualified because of ", paste(names(x[!is.na(x) & x==99]), collapse = " and ")), 
                                       NA)))
    
      id AA_B A_B_C A_BB B_C D_AC                           disqualify
    1  1    1     0   -1  99    1          Disqualified because of B_C
    2  2    0     0   99  NA    1         Disqualified because of A_BB
    3  3   NA     0    0   1    1                                 <NA>
    4  4    1    -1    0  99    1          Disqualified because of B_C
    5  5   -1     1   -1   0   -1                                 <NA>
    6  6   99    NA   NA  99   -1 Disqualified because of AA_B and B_C
    
    #Base R
    df$disqualify <- apply(df[,c('AA_B','A_B_C','A_BB','B_C')], 1, function(x) ifelse(any(x==99), 
                                                                 paste0("Disqualified because of ", paste(names(x[!is.na(x) & x==99]), collapse = " and ")), 
                                                                 NA))
    

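    In base R, apply runs a function over the rows or the columns of a data frame, depending on whether you pass 1 or 2 as the margin. Here the function has to be applied to each row, so we use 1. See ?apply for more details.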
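To make the difference between the two margins concrete, here is a minimal sketch on toy data. The data frame `m` and the names `row_hits`, `col_hits` and `hit_names` are made up for illustration and are not part of the answer above. It also shows why `which()` is convenient here: comparisons against NA yield NA, and `which()` simply skips them.

```r
# Toy data, invented for illustration only
m <- data.frame(a = c(1, 99, 0), b = c(99, 99, NA))

# MARGIN = 1: the function receives one row at a time as a named vector
row_hits <- apply(m, 1, function(x) sum(x == 99, na.rm = TRUE))

# MARGIN = 2: the function receives one column at a time
col_hits <- apply(m, 2, function(x) sum(x == 99, na.rm = TRUE))

# Names of the columns equal to 99 in each row; which() ignores NA comparisons
hit_names <- apply(m, 1, function(x) {
  paste(names(x)[which(x == 99)], collapse = " and ")
})

row_hits   # counts per row: 1, 2, 0
col_hits   # counts per column: a = 1, b = 2
hit_names  # "b", "a and b", ""
```

Rows without any 99 produce an empty string here, which is why the answer wraps the paste in a check and returns NA instead.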

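    【Comments】:

    • Great, thank you very much! Is there a way to apply the function only to certain specified variables? In reality, only four of the roughly 100 variables can disqualify an observation, and some of the other variables may also contain the value 99. (I could of course simply change the disqualifier value to something like -999999, but to improve my understanding of R I would like to learn how to apply this function to a predefined set of variables only.)
    • @MarkusG Sorry for the late reply, please see my update.
    • That is exactly what I was looking for. Thank you very much for the code and the explanation!
    【Solution 2】: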

    A dplyr and tidyr option could be:

    library(dplyr)
    library(tidyr)

    df %>%
     left_join(df %>%
                pivot_longer(names_to = "variables", values_to = "values", -id, values_drop_na = TRUE) %>%
                group_by(id) %>%
                summarise(disqualify = if_else(all(values != 99), 
                                               NA_character_, 
                                               paste("Disqualified because of", paste0(variables[values == 99], collapse = " and ")))),
               by = c("id" = "id"))
    
      id AA_B A_B_C A_BB B_C D_AC                           disqualify
    1  1    1     0   -1  99    1          Disqualified because of B_C
    2  2    0     0   99  NA    1         Disqualified because of A_BB
    3  3   NA     0    0   1    1                                 <NA>
    4  4    1    -1    0  99    1          Disqualified because of B_C
    5  5   -1     1   -1   0   -1                                 <NA>
    6  6   99    NA   NA  99   -1 Disqualified because of AA_B and B_C
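The pivot above reshapes every column except `id`, so a 99 in an unrelated variable would also be reported. A possible variation, sketched here under the assumption that only `AA_B`, `A_BB` and `B_C` can disqualify (as in the asker's own code), restricts the reshape to a named set of columns. The names `check_cols`, `reasons` and `result` are hypothetical, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

df <- data.frame(id = 1:6,
                 AA_B = c(1, 0, NA, 1, -1, 99),
                 A_B_C = c(0, 0, 0, -1, 1, NA),
                 A_BB = c(-1, 99, 0, 0, -1, NA),
                 B_C = c(99, NA, 1, 99, 0, 99),
                 D_AC = c(1, 1, 1, 1, -1, -1))

# Assumption: only these variables can disqualify an observation
check_cols <- c("AA_B", "A_BB", "B_C")

# One row per (id, variable) pair that equals 99, collapsed to one string per id
reasons <- df %>%
  select(id, all_of(check_cols)) %>%
  pivot_longer(cols = all_of(check_cols),
               names_to = "variable",
               values_to = "value",
               values_drop_na = TRUE) %>%
  filter(value == 99) %>%
  group_by(id) %>%
  summarise(disqualify = paste("Disqualified because of",
                               paste(variable, collapse = " and ")),
            .groups = "drop")

# ids with no 99 have no row in `reasons`, so the join leaves them as NA
result <- left_join(df, reasons, by = "id")
result
```

Filtering before summarising means the NA case does not need an explicit branch: the left join supplies it.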
    

    【Comments】:

      【Solution 3】:

      A base R one-liner:

      df$disqualify <- apply(df,1,function(x)paste(names(which(x==99)),collapse = " and "))
      > df
        id AA_B A_B_C A_BB B_C D_AC      disqualify
      1  1    1     0   -1  99    1             B_C
      2  2    0     0   99  NA    1            A_BB
      3  3   NA     0    0   1    1                
      4  4    1    -1    0  99    1             B_C
      5  5   -1     1   -1   0   -1                
      6  6   99    NA   NA  99   -1   AA_B and B_C
      

      To get it exactly the way you want, you can add:

      df$disqualify <- ifelse(test = df$disqualify=="", yes = NA, no = paste('Disqualified because of ',df$disqualify))

      > df
        id AA_B A_B_C A_BB B_C D_AC                            disqualify
      1  1    1     0   -1  99    1          Disqualified because of  B_C
      2  2    0     0   99  NA    1         Disqualified because of  A_BB
      3  3   NA     0    0   1    1                                  <NA>
      4  4    1    -1    0  99    1          Disqualified because of  B_C
      5  5   -1     1   -1   0   -1                                  <NA>
      6  6   99    NA   NA  99   -1 Disqualified because of  AA_B and B_C
      

      If you want to change the column names, why not run the following before this step:
      names(df) <- c("id","Weather", "Climate","Name3"...)
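If renaming the columns themselves is not desirable, one alternative is a named lookup vector that maps variable names to display labels. The sketch below is only an illustration: "Weather" for `A_BB` comes from the question, while "Climate" and "Traffic" and their assignment to `AA_B` and `B_C` are invented here, as is the vector name `labels`. Using `names(labels)` as the column selection also limits the check to the disqualifying variables.

```r
df <- data.frame(id = 1:6,
                 AA_B = c(1, 0, NA, 1, -1, 99),
                 A_B_C = c(0, 0, 0, -1, 1, NA),
                 A_BB = c(-1, 99, 0, 0, -1, NA),
                 B_C = c(99, NA, 1, 99, 0, 99),
                 D_AC = c(1, 1, 1, 1, -1, -1))

# Hypothetical labels: variable name -> text to show in the message
labels <- c(AA_B = "Climate", A_BB = "Weather", B_C = "Traffic")

df$disqualify <- apply(df[, names(labels)], 1, function(x) {
  hit <- names(x)[which(x == 99)]          # which() skips NA comparisons
  if (length(hit) == 0) return(NA_character_)
  paste("Disqualified because of", paste(labels[hit], collapse = " and "))
})

df$disqualify
```

The original column names stay untouched, so later code that refers to `A_BB` or `B_C` keeps working.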

      【Comments】:

      • I don't think this solves the problem. The question is about creating the disqualify column based on other columns.