R - conditional labelling, but not the first one

【Question title】: R - conditional labelling, but not the first one
【Posted】: 2021-06-28 23:45:25
【Question description】:

I have a dataset with the following structure (dummy data, but similar to mine):


data <- data.frame(msg = c("this is sample 1", "another text", "cats are cute", "another text", "", "...", "another text", "missing example case", "cats are cute"), 
                   no = c(1, 15, 23, 9, 7, 5, 35, 67, 35), 
                   pat = c(0.11, 0.45, 0.3, 0.2, 0.6, 0.890, 0.66, 0.01, 0))

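I am interested in the msg column. I need to label each row TRUE or FALSE in a new column (called usable). The label must follow these conditions:

If the msg cell is empty (NA or an empty string) => FALSE
If the msg cell contains only symbols (no letters and no digits) => FALSE
If the msg value has already appeared in an earlier row (assume the rows are in ascending order) => FALSE. Note that the first occurrence is TRUE and only the repeats are FALSE. I don't care about the other columns (they play no part in the comparison), but my final result must keep all columns.

I wrote a very long-winded version with a for loop, but I'm looking for something shorter and faster, because the original dataset is very long.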

【Comments】:

You can try transform(data, usable = with(data, grepl("[A-Za-z0-9]", msg) & !duplicated(msg))).
If you add this as an answer, I'll accept it. Works like a charm.

Tags: r dataframe
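
The one-liner from the comment above can be unpacked into its two vectorised tests. This is only a sketch using the sample data from the question; the intermediate names has_content and first_seen are made up for illustration. It relies on grepl() returning FALSE for both NA and "", so the "empty cell" condition needs no separate check.

```r
# Sample data from the question
data <- data.frame(
  msg = c("this is sample 1", "another text", "cats are cute", "another text",
          "", "...", "another text", "missing example case", "cats are cute"),
  no  = c(1, 15, 23, 9, 7, 5, 35, 67, 35),
  pat = c(0.11, 0.45, 0.3, 0.2, 0.6, 0.890, 0.66, 0.01, 0)
)

# TRUE where msg contains at least one ASCII letter or digit;
# grepl() gives FALSE for NA and for "", so empty cells are covered too
has_content <- grepl("[A-Za-z0-9]", data$msg)

# TRUE for the first occurrence of each value, FALSE for later repeats
first_seen <- !duplicated(data$msg)

# A row is usable only if both tests pass; all other columns are kept
data$usable <- has_content & first_seen
data
```

Both tests run once over the whole column, so no explicit loop is needed. The character class only covers ASCII letters and digits; if messages can contain other alphabets, "[[:alnum:]]" may be a better fit.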

【Solution 1】:

A tidyverse option. Note that map2_lgl is used here for convenience rather than speed.

library(dplyr)
library(purrr)
library(stringr)

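data %>%
  mutate(id = row_number(),  # row position, so each row only looks at earlier rows
         usable = map2_lgl(msg, id,
                           ~ case_when(is.na(.x) | .x == '' ~ FALSE,          # missing or empty
                                       !str_detect(.x, '\\w') ~ FALSE,        # no letter, digit or underscore
                                       .x %in% msg[seq_len(.y - 1)] ~ FALSE,  # already seen in an earlier row
                                       TRUE ~ TRUE))) %>%
  select(-id)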

#                    msg no  pat usable
# 1     this is sample 1  1 0.11   TRUE
# 2         another text 15 0.45   TRUE
# 3        cats are cute 23 0.30   TRUE
# 4         another text  9 0.20  FALSE
# 5                       7 0.60  FALSE
# 6                  ...  5 0.89  FALSE
# 7         another text 35 0.66  FALSE
# 8 missing example case 67 0.01   TRUE
# 9        cats are cute 35 0.00  FALSE
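
Since the answer says map2_lgl was chosen for convenience rather than speed, here is a possible fully vectorised dplyr variant for long data. It is a sketch, not the answerer's code: the demo data frame is hypothetical and adds an NA row that the question's sample lacks, and the pattern "[A-Za-z0-9]" is borrowed from the comment on the question instead of '\\w'.

```r
library(dplyr)
library(stringr)

# Hypothetical demo data, including an NA to exercise the "empty" rule
demo <- data.frame(
  msg = c("a1", NA, "", "...", "a1", "b"),
  no  = c(1, 2, 3, 4, 5, 6)
)

result <- demo %>%
  mutate(usable = !is.na(msg) &                     # drop missing values
                  str_detect(msg, "[A-Za-z0-9]") &  # needs a letter or digit
                  !duplicated(msg))                 # keep first occurrence only

result
```

str_detect() returns NA for a missing msg, but FALSE & NA evaluates to FALSE, so the !is.na(msg) guard keeps the usable column free of NA values.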

【Discussion】: