【Question title】: Calculate average for a sequence of strings, then remove anything greater than 2SD of the average in R
【Posted】: 2020-07-05 11:53:45
【Question】:

I have a large dataset of over 10,000 rows, df:

  User              duration

  amy                582         
  amy                27
  amy                592
  amy                16
  amy                250
  tom                33
  tom                10
  tom                40
  tom                100

Desired output:

User               duration

amy                 582
amy                 592
amy                 250
tom                 33
tom                 10
tom                 40

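Essentially, this removes any outlier that lies more than 2 SD from each unique user's mean. The code should compute the mean and standard deviation of duration for each unique user, then remove the values that are more than 2 SD away from that user's mean.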

dput output:

structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(582L, 
27L, 592L, 16L, 250L, 33L, 10L, 40L, 100L)), class = "data.frame", row.names = c(NA, 
-9L))

This is what I have tried:

First, define the average and standard deviation:


      ave = ave(df$duration)
      sd =  sd(df$duration)

Then set up some kind of condition for this:

     for i in df {
     remove all if > 2*sd}

I am not sure how to proceed and would appreciate some suggestions.

【Comments】:

Your formula translates to df %>% group_by(User) %>% filter(duration < (mean(duration) + 2 * sd(duration)))
Let me try this, please.
But it will not give the expected output you showed, because mean + 2 * sd is 861 for 'amy'.
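
The 861 figure in the last comment can be checked directly. Below is a minimal base R sketch, assuming the `df` from the question's `dput` (rebuilt here with `data.frame()` for brevity). The names `grp_mean`, `grp_sd`, and `bounds` are illustrative only and do not appear in the original thread.

```r
# Rebuild the question's sample data (same values as the dput output).
df <- data.frame(
  User = factor(c(rep("amy", 5), rep("tom", 4))),
  duration = c(582L, 27L, 592L, 16L, 250L, 33L, 10L, 40L, 100L)
)

# Per-user mean and sample standard deviation.
grp_mean <- tapply(df$duration, df$User, mean)
grp_sd   <- tapply(df$duration, df$User, sd)

# Lower and upper cut-offs at mean -/+ 2 SD for each user.
bounds <- data.frame(
  User  = names(grp_mean),
  lower = as.vector(grp_mean - 2 * grp_sd),
  upper = as.vector(grp_mean + 2 * grp_sd)
)
print(bounds)
```

The upper cut-offs come out at roughly 861 for amy and 122 for tom, and both lower cut-offs are negative, so every row in the sample lies inside its user's range and a 2 SD rule removes nothing.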

Tags: r dplyr lubridate stringr

【Solution 1】:

Here is a data.table approach, which may be faster when there are many rows.

library(data.table)
df <- structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(50000, 
582, 27, 592, 16, 250, 33, 10, 40, 100)), row.names = c(NA, -10L
), class = "data.frame")
df
   User duration
1   amy    50000
2   amy      582
3   amy       27
4   amy      592
5   amy       16
6   amy      250
7   tom       33
8   tom       10
9   tom       40
10  tom      100

Code

setDT(df)[,.SD[duration <= mean(duration) + (2 * sd(duration)) &
               duration >= mean(duration) - (2 * sd(duration)),]
          ,by=User]
   User duration
1:  amy      582
2:  amy       27
3:  amy      592
4:  amy       16
5:  amy      250
6:  tom       33
7:  tom       10
8:  tom       40
9:  tom      100
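
For readers without data.table, the same two-sided filter can be sketched in base R with `ave()`, the function the question's attempt started from. This is an illustrative equivalent rather than part of the original answer; `user_mean`, `user_sd`, and `kept` are made-up names.

```r
# Sample data from this answer, including the 50000 outlier for amy.
df <- data.frame(
  User = factor(c(rep("amy", 6), rep("tom", 4))),
  duration = c(50000, 582, 27, 592, 16, 250, 33, 10, 40, 100)
)

# ave() returns a per-row vector: each row gets its own group's statistic.
user_mean <- ave(df$duration, df$User, FUN = mean)
user_sd   <- ave(df$duration, df$User, FUN = sd)

# Keep rows whose distance from the user's mean is at most 2 SD.
kept <- df[abs(df$duration - user_mean) <= 2 * user_sd, ]
print(kept)
```

Only the 50000 row is dropped. Note that a user with a single row has an SD of `NA`, so such groups would need separate handling before subsetting.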

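【Comments】:

Thank you, I tried it, but I still see all of the original values.
I think the issue is that your sample data contains no value more than 2 SD away from the mean. I have edited my answer with an example that does.
Yes, you are absolutely right, I was mistaken. Your code works perfectly! Thank you!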
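
The point made in these comments has a precise form. With the sample standard deviation that `sd()` computes, no value in a group of n observations can lie more than (n - 1) / sqrt(n) standard deviations from the group mean (a result known as Samuelson's inequality). The sketch below evaluates that bound for the group sizes in this thread; the helper `max_abs_z()` is a hypothetical name introduced for illustration.

```r
# Largest possible |z-score| in a group of n values when the SD is the
# sample SD (n - 1 denominator), as returned by sd().
max_abs_z <- function(n) (n - 1) / sqrt(n)

sizes <- c(4, 5, 6)
limits <- setNames(max_abs_z(sizes), sizes)
print(round(limits, 3))

# Observed |z| of the 50000 value added to amy's group in this answer.
amy <- c(50000, 582, 27, 592, 16, 250)
z_outlier <- abs(amy[1] - mean(amy)) / sd(amy)
print(z_outlier)
```

For groups of 4 and 5 rows the bound is 1.5 and about 1.79, both below 2, so the question's original sample could never lose a row to a 2 SD rule. With 6 rows the bound is about 2.04, which is why the added 50000 value only just falls outside the range.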

【Solution 2】:

We can use dplyr; combining it with between makes the filter more concise.

library(dplyr)
df %>% 
   group_by(User) %>%
   # Keep rows within 2 SD of each user's mean, as the question asks
   filter(between(duration, mean(duration) - 2 * sd(duration), 
                           mean(duration) + 2 * sd(duration)))

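【Comments】:

@TanishaHudson As I mentioned, based on your logic those values would not be filtered out.
OK, then I must be doing something wrong. When I change it to 1 SD, it works! Thank you! I need to recheck my 2 SD calculation.
@TanishaHudson Can you try a fresh R session with only dplyr loaded and the example from that dput?
@TanishaHudson I tested it again, but could not find any error on my side.
It may be that some other package masked some of dplyr's functions.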

【Solution 3】:

You can use scale() to compute z-scores and keep the rows whose absolute z-score is less than 2:

library(dplyr)

df %>%
  group_by(User) %>%
  filter(abs(scale(duration)) < 2)

# A tibble: 9 x 2
# Groups:   User [2]
  User  duration
  <fct>    <int>
1 amy        582
2 amy         27
3 amy        592
4 amy         16
5 amy        250
6 tom         33
7 tom         10
8 tom         40
9 tom        100
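
To see what `scale()` is doing per group, here is a base R sketch that computes the same z-scores two ways and compares them. It assumes the question's sample `df`; `z` and `z_manual` are illustrative names, not part of the original answer.

```r
# Rebuild the question's sample data (same values as the dput output).
df <- data.frame(
  User = factor(c(rep("amy", 5), rep("tom", 4))),
  duration = c(582L, 27L, 592L, 16L, 250L, 33L, 10L, 40L, 100L)
)

# scale() returns a one-column matrix, so flatten it to a plain vector.
z <- ave(df$duration, df$User, FUN = function(x) as.vector(scale(x)))

# Same z-scores computed by hand: (value - group mean) / group SD.
z_manual <- (df$duration - ave(df$duration, df$User)) /
  ave(df$duration, df$User, FUN = sd)

print(round(z, 2))
print(all.equal(z, z_manual))
print(sum(abs(z) < 2))
```

All nine absolute z-scores are below 2, which matches the nine-row tibble above: `abs(scale(duration)) < 2` is the mean plus or minus 2 SD rule written in standardized units.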

【Comments】:

【Solution 4】:

We can try the mutate and filter functions from dplyr.

library(dplyr)
df %>%
  group_by(User) %>%
  mutate(ave_plus2sd = mean(duration) + 2 * sd(duration)) %>%
  filter(duration < ave_plus2sd)

This gives the following output, which lets you compare each entry against its user's mean + 2 * sd.

# Groups:   User [2]
  User  duration ave_plus2sd
  <fct>    <int>       <dbl>
1 amy        582        861.
2 amy         27        861.
3 amy        592        861.
4 amy         16        861.
5 amy        250        861.
6 tom         33        122.
7 tom         10        122.
8 tom         40        122.
9 tom        100        122.

We can further add %>% select(User, duration) to keep only the User and duration columns of interest.

【Comments】: