Identify values that appear a certain number of times in an R data frame

【Question title】: Identify values that appear a certain number of times in an R data frame
【Posted】: 2015-06-21 10:06:44
【Question】:

I have a data frame of strings, most of which are duplicates. I want to identify the values that occur at least x times in this data frame.

   df <- data.frame(x = c("str", "str", "str", "ing", "ing","."))
   occurs <- 3

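The data frame contains hundreds of unique strings and tens of thousands of elements. In this example, how can I determine which strings occur at least 3 times? Specifically, I want to output the names of the strings that meet this criterion, not their indices in the data frame.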

【Question comments】:

Tags: r

【Answer 1】:

Maybe table is what you need. Here is a modified example based on your code:

> df <- data.frame(x = c("str", "str", "str", "ing", "ing","."))
> df
    x
1 str
2 str
3 str
4 ing
5 ing
6   .
> table(df$x)

  . ing str 
  1   2   3 
> table(df$x) > 2

    .   ing   str 
FALSE FALSE  TRUE 
> names(which(table(df$x) > 2))
[1] "str"
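
The same idea can be wrapped in a small function so that the threshold comes from the `occurs` variable in the question. This is only a sketch: the helper name `values_at_least` and its arguments are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper: return the distinct values that occur
# at least `min_count` times in `values`.
values_at_least <- function(values, min_count) {
  counts <- table(values)                        # frequency of each distinct value
  names(counts)[as.vector(counts >= min_count)]  # names that meet the threshold
}

df <- data.frame(x = c("str", "str", "str", "ing", "ing", "."))
occurs <- 3

result <- values_at_least(df$x, occurs)
print(result)
```

Because the function returns `names(counts)`, the result is always a character vector, whether `x` is stored as a factor or as plain strings.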

【Comments】:

【Answer 2】:

You can also use count:

library(dplyr)
df %>% count(x)

This calls n() to count the number of observations for each x:

# Source: local data frame [3 x 2]
#
#     x n
# 1   . 1
# 2 ing 2
# 3 str 3

If you only want the values that occur at least 3 times, use filter():

df %>% count(x) %>% filter(n >= 3)

This gives:

# Source: local data frame [1 x 2]
# 
#     x n
# 1 str 3

Finally, if you only want to extract the factor values that meet your filter condition:

df %>% count(x) %>% filter(n >= 3) %>% .$x

# [1] str
# Levels: . ing str
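
If plain strings are wanted rather than a factor, one possible variant is to finish the pipeline with `pull()` and `as.character()`. This is a sketch that assumes a dplyr version providing `pull()`; the variable name `frequent` is chosen here for illustration.

```r
library(dplyr)

df <- data.frame(x = c("str", "str", "str", "ing", "ing", "."))
occurs <- 3

frequent <- df %>%
  count(x) %>%              # one row per distinct x, with its count in column n
  filter(n >= occurs) %>%   # keep values seen at least `occurs` times
  pull(x) %>%               # extract the x column as a vector
  as.character()            # plain strings whether x is a factor or not

print(frequent)
```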

As @David suggested in the comments, you can also use data.table:

library(data.table)
setDT(df)[, if(.N >= 3) x, by = x]$V1

or

setDT(df)[, .N, by = x][, x[N >= 3]]

# [1] str
# Levels: . ing str

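As @Frank suggested, you can also use tabulate, the "workhorse" behind table (this needs x to be a factor, which was the data.frame() default before R 4.0.0):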

levels(df[[1]])[tabulate(df[[1]])>=3]

# [1] "str"
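
On R 4.0.0 and later, data.frame() no longer turns strings into factors by default, so the one-liner above may need an explicit conversion first. A minimal sketch under that assumption, with `f`, `counts` and `frequent` as illustrative names:

```r
df <- data.frame(x = c("str", "str", "str", "ing", "ing", "."))
occurs <- 3

f <- as.factor(df$x)                        # leaves x unchanged if it is already a factor
counts <- tabulate(f, nbins = nlevels(f))   # one count per level, in level order
frequent <- levels(f)[counts >= occurs]     # levels that meet the threshold

print(frequent)
```

Passing `nbins = nlevels(f)` keeps `counts` the same length as `levels(f)` even when the last levels are unused.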

Benchmark

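library(dplyr)
library(data.table)
library(microbenchmark)

# stringsAsFactors = TRUE keeps x a factor on R >= 4.0.0, which base2 relies on
df <- data.frame(x = sample(LETTERS[1:26], 10e6, replace = TRUE),
                 stringsAsFactors = TRUE)
df2 <- copy(df)  # separate copy so that setDT() does not convert df itself

mbm <- microbenchmark(
  base = names(which(table(df$x) >= 385000)),
  base2 = levels(df[[1]])[tabulate(df[[1]]) >= 385000L],
  dplyr = count(df, x) %>% filter(n >= 385000) %>% .$x,
  DT1 = setDT(df2)[, if(.N >= 385000) x, by = x]$V1,
  DT2 = setDT(df2)[, .N, by = x][, x[N >= 385000]],
  times = 50
)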

> mbm
#Unit: milliseconds
#  expr       min        lq      mean    median        uq       max neval  cld
#  base 495.44936 523.29186 545.08199 543.56660 551.90360 652.13492    50    d
# base2  20.08123  20.09819  20.11988  20.10633  20.14137  20.20876    50 a   
# dplyr 226.75800 227.27992 231.19709 228.36296 232.71308 259.20770    50   c 
#   DT1  41.03576  41.28474  50.92456  48.40740  48.66626 168.53733    50  b  
#   DT2  41.45874  41.85510  50.76797  48.93944  49.49339  74.58234    50  b

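【Comments】:

I wonder how library(data.table) ; setDT(df)[, if(.N >= occurs) x, by = x]$V1 would perform. Or setDT(df)[, .N, by = x][, x[N >= occurs]] (not sure which is better)
It should be fast. Let me add it to the benchmark.
When you add it, don't run it on the same data set. Create df2 <- copy(df) and then run the data.table benchmarks on df2. Otherwise, setDT will convert df to a data.table during the first iteration, and *all* the other functions will then run on a data.table as well.
I don't think I would care about speed for this operation, but base wins on my computer: base2 = levels(df[[1]])[tabulate(df[[1]])>385000L]
@Frank Yes, tabulate, the "workhorse" behind table, is indeed much faster. I have updated the benchmark accordingly.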
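
The point about copy() raised in the comments can be checked on the small example from the question. This is a sketch, assuming data.table is installed; it only shows that the original data frame is left untouched when setDT() is applied to a copy.

```r
library(data.table)

df <- data.frame(x = c("str", "str", "str", "ing", "ing", "."))
df2 <- copy(df)   # independent copy reserved for the data.table code

setDT(df2)        # converts df2 to a data.table in place
frequent <- df2[, .N, by = x][N >= 3, as.character(x)]

print(is.data.table(df))    # df itself is still a plain data frame
print(is.data.table(df2))   # only the copy was converted
print(frequent)
```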