【Question title】: R: remove outliers from a data frame with two identifiers using ddply
【Posted】: 2018-06-08 20:14:31
【Question】:

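Let me start by saying that I don't have much experience with R. I have a large long-format data frame, illustrated by df below, with 3 columns: Group, ID and dat. I want to remove the outliers within each Group-ID combination (or rather, replace them with the mean).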

Group = c("1","1","2","2","3","3","1","1","2","2","3","3","1","1","2","2","3","3","1","1","2","2","3","3")
ID = c("Eb","Eb","Eb","Eb","Eb","Eb","Sd","Sd","Sd","Sd","Sd","Sd","Re","Re","Re","Re","Re","Re","Tf","Tf","Tf","Tf","Tf","Tf")
dat = c(2,3,4,5,6,7,8,9,1010,11,12,13,1,2,3,-10000,5,6,4,3,2,7,6666,5)
df = data.frame(Group,ID,dat)

My basic approach (which doesn't work) is as follows (I have tried many iterations of this code):

library(outliers)
library(plyr)
# Function to remove outliers
RmOurliFUN = function(x){
                rm.outlier(x$dat, fill = TRUE)
}
# splitting data based on first Group, and then ID to apply the outlier removal
GroupSplit = function(x){ddply(x,"ID",RmOurliFUN)}
df2 = ddply(df, "Group", GroupSplit)

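I get various error messages, usually that the argument is not numeric or logical. I'm pretty sure I'm not referring to the dat column correctly inside the nested functions. How can I perform an operation like this? I'm open to any suggestions.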

【Comments】:

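  • What is class(df1$dat)? It sounds like you need to convert it to numeric.
  • Agree with Esther. If Group is categorical, it makes sense to keep it as a factor or character, but it looks like you are trying to detect numeric outliers. 2 is a number and "2" is a string, so your dat column is probably a factor or a character. Convert it to numeric with df$dat = as.numeric(as.character(df$dat)) and try again.
  • Sorry, I think I made a poor example dataset. My actual data is numeric, but when I change this dataset to numeric (which I have now done in the example above) it still doesn't work. Applying as.character() and as.numeric() to x$dat doesn't solve the problem either...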
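
The as.numeric(as.character(...)) idiom suggested in the comments can be illustrated with a small sketch. The vector f below is made up for illustration and is not the asker's data; it shows why calling as.numeric() directly on a factor is usually not what is wanted.

```r
# Hypothetical example: a numeric column that ended up stored as a factor
f <- factor(c("10", "2", "1010"))

# as.numeric() on a factor returns the internal level codes, not the values
codes <- as.numeric(f)

# Going through as.character() first recovers the original numbers
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

As the asker notes, this conversion was not the cause of the error in this particular question, but it is a common source of "argument is not numeric" messages.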

Tags: r plyr outliers


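【Solution 1】:

To remove the outliers within each unique combination of Group+ID, you can pass the function directly in the call to ddply and then reshape the result.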

library(outliers)
library(plyr)
library(reshape2)

#Make some new categories to have enough values for outlier detection
Group<-rep(c("a", "b"), each=12)
ID<-rep(c("c", "d"), each=6)
dat = c(2,3,4,5,6,7,8,9,1010,11,12,13,1,2,3,-10000,5,6,4,3,2,7,6666,5)
df1 = data.frame(Group,ID,dat)

df2<-ddply(df1, c("Group", "ID"), function(x) rm.outlier(x$dat, fill=TRUE))

#reshape and order the data
res<-melt(df2, id.vars=c("Group", "ID"), value.name = "dat")  
res<-arrange(res, Group, ID)[,-3]
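
For comparison, here is a minimal base R sketch of the same idea that needs no reshaping step. It is not the answerer's method: it swaps plyr and outliers for ave() and a hypothetical helper, replace_extreme_with_mean(), which assumes that "the outlier" means the single value farthest from the group mean (the maximum on ties) and replaces it with the mean of the remaining values. Like rm.outlier(), it always changes exactly one value per group, even when nothing looks extreme.

```r
# Same example data as in the answer above
Group <- rep(c("a", "b"), each = 12)
ID    <- rep(c("c", "d"), each = 6, times = 2)
dat   <- c(2,3,4,5,6,7,8,9,1010,11,12,13,1,2,3,-10000,5,6,4,3,2,7,6666,5)
df1   <- data.frame(Group, ID, dat)

# Hypothetical helper (an assumption, not part of any package):
# replace the value farthest from the mean with the mean of the others
replace_extreme_with_mean <- function(x) {
  m <- mean(x)
  # pick the minimum only if it is strictly farther from the mean than the maximum
  i <- if ((max(x) - m) < (m - min(x))) which.min(x) else which.max(x)
  x[i] <- mean(x[-i])
  x
}

# ave() applies the helper within each Group/ID combination and
# returns the results in the original row order
df1$dat_clean <- ave(df1$dat, df1$Group, df1$ID, FUN = replace_extreme_with_mean)

print(df1)
```

Because ave() keeps the original row order, the cleaned values can sit next to the raw ones in the same data frame, which makes it easy to check which rows were changed.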

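【Comments】:

  • Thanks Esther! Not only does it do exactly what I wanted, it also gets rid of the nested functions, which makes for a cleaner script. I really appreciate it!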