Identify duplicated character strings in columns in R: answers

【Question title】: Identify duplicated character strings in columns in R
【Posted】: 2016-08-22 14:50:08
【Question】:

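I have a data frame with 4 columns and 3000 rows. My goal is to check whether each row contains four different character strings. For example:

Row 1: Greece - Russia - Spain - Netherlands
Row 2: England - Germany - Germany - Iran
Row 3: Netherlands - Netherlands - Britain - Greece

R should therefore return rows 2 and 3, because they contain duplicates. Is this possible? Thanks in advance.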

【Question comments】:

【Solution 1】:

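We can loop over the rows with apply and MARGIN = 1, and check whether the length of the unique elements in each row differs from the number of columns in the dataset. This gives a logical vector that can be used to subset the rows that contain at least one duplicate.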

df1[apply(df1, 1, FUN = function(x) length(unique(x)))!=ncol(df1),]
#       col1        col2    col3   col4
#2     England     Germany Germany   Iran
#3 Netherlands Netherlands Britain Greece
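To see why this works, it can help to look at the intermediate values. The sketch below rebuilds the sample data (the same values as in the Data section of this answer) and computes each step separately; the names n_unique and has_dup are introduced here for illustration only.

```r
# Sample data, copied from the Data section of this answer
df1 <- data.frame(
  col1 = c("Greece", "England", "Netherlands"),
  col2 = c("Russia", "Germany", "Netherlands"),
  col3 = c("Spain", "Germany", "Britain"),
  col4 = c("Netherlands", "Iran", "Greece"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each row: 4, 3, 3
n_unique <- apply(df1, 1, function(x) length(unique(x)))

# TRUE where a row has fewer distinct values than there are columns
has_dup <- n_unique != ncol(df1)

# Keep only the rows with at least one repeated value
df1[has_dup, ]
```

Row 1 has four distinct values in four columns, so it is dropped; rows 2 and 3 each have only three distinct values and are kept.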

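Another option is a regex-based approach (which should be faster): we paste the elements of each row together, then use grep with a regular expression to get the indices of the rows that contain a repeated string, and subset the rows with those indices.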

df1[grep("(\\b\\S+\\b)(?=.*\\1+)", do.call(paste, df1), perl = TRUE),]
#          col1        col2    col3   col4
# 2     England     Germany Germany   Iran
# 3 Netherlands Netherlands Britain Greece
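It may help to see what the regular expression actually receives, and where it can misfire. In the sketch below, pat is simply the pattern from the answer stored in a variable, and df_multi is a made-up one-row data frame with multi-word names. Because the pattern works on whitespace-separated tokens, two different values that share a word are reported as a duplicate, whereas the apply approach compares whole cell values.

```r
# Sample data, copied from the Data section of this answer
df1 <- data.frame(
  col1 = c("Greece", "England", "Netherlands"),
  col2 = c("Russia", "Germany", "Netherlands"),
  col3 = c("Spain", "Germany", "Britain"),
  col4 = c("Netherlands", "Iran", "Greece"),
  stringsAsFactors = FALSE
)

pat <- "(\\b\\S+\\b)(?=.*\\1+)"

# paste() collapses each row into one space-separated string
pasted <- do.call(paste, df1)
pasted

# Indices of the rows in which some token reappears later in the string
grep(pat, pasted, perl = TRUE)

# Hypothetical row with multi-word names: no cell is repeated,
# but the token "United" occurs twice in the pasted string
df_multi <- data.frame(
  col1 = "United Kingdom", col2 = "United States",
  col3 = "Spain", col4 = "Iran",
  stringsAsFactors = FALSE
)
regex_hit <- grepl(pat, do.call(paste, df_multi), perl = TRUE)
apply_hit <- apply(df_multi, 1, function(x) length(unique(x)) != length(x))
regex_hit   # TRUE: a false positive
apply_hit   # FALSE
```

So the speed advantage of the regex version seems safest when every cell is a single word with no embedded spaces, as in the sample data.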

Benchmarks

df2 <- df1[rep(1:nrow(df1), 1e6),]
system.time(df2[apply(df2, 1L, anyDuplicated),])
# user  system elapsed 
#  34.34    0.22   34.90 

system.time(df2[grep("(\\b\\S+\\b)(?=.*\\1+)", do.call(paste, df2), perl = TRUE),])
#   user  system elapsed 
#   9.53    0.05    9.61 

system.time(df2[apply(df2, 1, FUN = function(x) length(unique(x)))!=ncol(df2),])
#   user  system elapsed 
#  41.48    0.17   41.71
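One caveat about the first timing: anyDuplicated() returns an integer (0 when there is no duplicate, otherwise the position of the first duplicated element), so apply(df2, 1L, anyDuplicated) does not produce a logical vector. That is fine for measuring speed, but to actually filter rows the result would need to be compared with 0. A minimal sketch on the small sample data, where idx is a name introduced here:

```r
# Sample data, copied from the Data section of this answer
df1 <- data.frame(
  col1 = c("Greece", "England", "Netherlands"),
  col2 = c("Russia", "Germany", "Netherlands"),
  col3 = c("Spain", "Germany", "Britain"),
  col4 = c("Netherlands", "Iran", "Greece"),
  stringsAsFactors = FALSE
)

# Position of the first duplicated element in each row (0 if none)
idx <- apply(df1, 1L, anyDuplicated)
idx

# Turn the positions into a logical filter before subsetting
df1[idx > 0, ]
```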

Data

df1 <- structure(list(col1 = c("Greece", "England", "Netherlands"), 
col2 = c("Russia", "Germany", "Netherlands"), col3 = c("Spain", 
"Germany", "Britain"), col4 = c("Netherlands", "Iran", "Greece"
 )), .Names = c("col1", "col2", "col3", "col4"), row.names = c(NA, 
 -3L), class = "data.frame")

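【Comments】:

Thank you. I tried the first version (apply) and it works well. Unfortunately, I have multiple NA values, and repeated NAs should not be reported. So if a row is England - Germany - NA - NA, I do not want to see that row. I would be glad if you had a solution for this.
@Lilly We can create a logical index for the rows that contain no NA, i.e. i1 <- rowSums(is.na(df1))==0, use that index to subset the rows, and then run the apply step on the subset.
In that case, rows like "Britain - Britain - NA - NA" (duplicated entry: Britain) would not be identified, but I would like those cases to be listed... Sorry, it is a bit complicated.
@Lilly Try df1[apply(df1, 1, FUN = function(x) {x1 <- x[!is.na(x)]; length(unique(x1))!=length(x1) }), ]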
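The last suggestion can be checked on a few made-up rows. In this sketch, df_na and has_dup are hypothetical names, and the rows are modelled on the two cases described in the comments plus one complete row without duplicates.

```r
# Made-up rows to illustrate the NA handling
df_na <- data.frame(
  col1 = c("England", "Britain", "Greece"),
  col2 = c("Germany", "Britain", "Russia"),
  col3 = c(NA, NA, "Spain"),
  col4 = c(NA, NA, "Iran"),
  stringsAsFactors = FALSE
)

# Drop the NAs within each row, then compare the number of
# distinct values with the number of remaining values
has_dup <- apply(df_na, 1, function(x) {
  x1 <- x[!is.na(x)]
  length(unique(x1)) != length(x1)
})

# Only the "Britain - Britain - NA - NA" row is returned
df_na[has_dup, ]
```

Because the NAs are removed before counting, two NAs in the same row are never treated as a duplicate, while a repeated real value still is.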

【Solution 2】:

A solution with dplyr and tidyr

library(dplyr)
library(tidyr)

# df is the data frame from the question (for example, df1 from Solution 1)
df_new <- df %>% 
    mutate(row = row_number()) %>%    # remember the original row number
    gather(key, value, -row) %>%      # long format: one line per cell
    group_by(row, value) %>% 
    mutate(n = n()) %>%               # occurrences of each value within its row
    mutate(duplicate = n > 1) %>%
    # STOP HERE IF YOU WANT TO SEE DUPLICATES 
    filter(duplicate) %>% 
    ungroup() %>% 
    # RUN DISTINCT IF YOU JUST WANT TO SEE ROWS WITH DUPES
    distinct(row)
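gather() still works, but the tidyr documentation marks it as superseded by pivot_longer(). The following is a sketch of the same idea with the newer function, assuming dplyr and a tidyr version that provides pivot_longer() (1.0.0 or later) are installed. Here df is filled with the sample data from Solution 1, and dupe_rows is a name introduced for illustration.

```r
library(dplyr)
library(tidyr)

# Sample data, copied from the Data section of Solution 1
df <- data.frame(
  col1 = c("Greece", "England", "Netherlands"),
  col2 = c("Russia", "Germany", "Netherlands"),
  col3 = c("Spain", "Germany", "Britain"),
  col4 = c("Netherlands", "Iran", "Greece"),
  stringsAsFactors = FALSE
)

dupe_rows <- df %>%
  mutate(row = row_number()) %>%                              # original row number
  pivot_longer(-row, names_to = "key", values_to = "value") %>%
  count(row, value, name = "n") %>%                           # occurrences per row and value
  filter(n > 1) %>%                                           # values seen more than once
  distinct(row)

# Use the row numbers to pull the offending rows from the original data
df[dupe_rows$row, ]
```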

Benchmark on 3000 rows

dfL <- Reduce(rbind, list(df)[rep(1L, times=1000)])
system.time( ... )
#  user  system elapsed 
# 0.004   0.000   0.004

【Comments】:

@akrun You are 100% right, but with 3000 rows it is very user-friendly :)
Maybe, but the number of lines you need to write worries me a bit.