【Question title】: How to replace outlier values?
【Posted】: 2021-08-17 07:44:31
【Question】:

I have the following data frame

d <- data.frame("Open" = rnorm(10,5,1) )

If I insert outliers

d$Open[4] = 100
d$Open[5] = 100

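Now I want to replace these outliers with normal values.

I tried replacing each outlier with the previous value, but that does not work when outliers occur one after another, or when the first element is an outlier.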

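library(dplyr)  # lag() below is dplyr::lag(); stats::lag() does not shift a plain vector

# Previous value; the first row has none, so fall back to its own value
d$Temp <- lag(d$Open, 1)
d$Temp <- ifelse(is.na(d$Temp), d$Open, d$Temp)

# Replace any value more than 300% above the previous one with that previous value
d$Open <- ifelse(
  ((d$Open - d$Temp) / d$Temp) > 3,
  lag(d$Open, 1),
  d$Open
)

# Drop the helper column but keep d as a data frame
d <- d[, 1, drop = FALSE]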

Is there a better way to remove them?

For now I have defined an outlier as any value that is more than 300% above the previous value.
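
One way to keep the "300% above the previous value" rule while coping with consecutive outliers is to compare each value with the last accepted value instead of the immediately preceding one. The following is only a sketch: `replace_jumps()` and its `threshold` argument are hypothetical names, and it assumes the first value can be trusted, so it does not solve the case where the first element is itself an outlier.

```r
# Hypothetical helper, not part of the original post.
replace_jumps <- function(x, threshold = 3) {
  out <- x
  # Assumption: the first value is trusted, since there is nothing before it.
  last_good <- x[1]
  for (i in seq_along(x)[-1]) {
    if ((x[i] - last_good) / last_good > threshold) {
      out[i] <- last_good   # outlier: carry the last accepted value forward
    } else {
      last_good <- x[i]     # normal value: it becomes the new reference
    }
  }
  out
}

x <- c(5, 5.2, 4.8, 100, 100, 5.1, 4.9)
replace_jumps(x)
# 5.0 5.2 4.8 4.8 4.8 5.1 4.9
```

Because the reference only advances on accepted values, both 100s are compared with 4.8 and both are replaced.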

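My definition of an outlier is wrong, which is why I am asking for any new approach. I guess "more than 100% above the median", as mentioned in the comments, is a better way to define outliers. Thanks :)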


My example was simple, but if I use it with stock prices, then @Tarjae's method of identifying outliers does not work

dput(round(d$Open,0))
c(5, 5, 5, 5, 5, 5, 6, 6, 5, 5, 5, 6, 6, 5, 6, 7, 6, 6, 7, 6, 
6, 5, 6, 7, 8, 8, 9, 8, 7, 7, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 
5, 6, 6, 6, 5, 5, 6, 6, 7, 6, 6, 6, 6, 6, 7, 6, 6, 6, 6, 6, 6, 
5, 5, 5, 5, 5, 5, 5, 6, 5, 5, 6, 6, 6, 6, 6, 7, 7, 8, 7, 7, 7, 
7, 7, 9, 8, 8, 9, 8, 8, 10, 11, 10, 10, 10, 9, 12, 10, 9, 10, 
9, 11, 11, 11, 11, 10, 10, 10, 8, 8, 9, 8, 8, 8, 8, 7, 7, 8, 
8, 8, 8, 9, 8, 6, 8, 8, 7, 7, 7, 8, 8, 9, 10, 10, 11, 11, 10, 
10, 11, 11, 11, 11, 13, 13, 14, 13, 13, 15, 16, 16, 16, 18, 17, 
17, 17, 17, 18, 16, 15, 14, 17, 16, 17, 16, 16, 17, 18, 23, 19, 
18, 18, 17, 17, 16, 17, 18, 17, 16, 16, 16, 16, 15, 16, 16, 18, 
22, 24, 24, 30, 31, 31, 31, 32, 40, 41, 39, 38, 36, 34, 33, 33, 
38, 38, 36, 34, 34, 36, 33, 35, 34, 33, 33, 34, 33, 33, 34, 33, 
38, 37, 38, 37, 40, 42, 40, 43, 45, 46, 40, 39, 41, 33, 30, 32, 
29, 32, 30, 30, 29, 32, 31, 34, 35, 34, 34, 33, 32, 31, 32, 33, 
31, 30, 31, 36, 36, 34, 35, 36, 33, 32, 33, 32, 35, 35, 40, 40, 
42, 38, 39, 39, 36, 37, 38, 36, 38, 39, 40, 40, 40, 41, 45, 49, 
50, 50, 49, 48, 50, 49, 50, 57, 58, 54, 54, 54, 49, 49, 58, 54, 
50, 52, 70, 75, 77, 78, 82, 98, 102, 121, 129, 126, 125, 124, 
118, 138, 147, 126, 107, 97, 100, 89, 103, 93, 93, 74, 72, 68, 
82, 82, 81, 95, 98, 94, 96, 95, 94, 90, 85, 91, 76, 72, 72, 78, 
81, 83, 85, 84, 80, 80, 79, 78, 65, 63, 46, 55, 41, 36, 32, 40, 
33, 24, 21, 19, 21, 30, 27, 28, 30, 25, 24, 21, 24, 25, 24, 22, 
22, 20, 23, 25, 27, 30, 34, 34, 37, 35, 41, 49, 58, 55, 51, 54, 
52, 51, 50, 41, 44, 51, 54, 48, 52, 54, 54, 57, 77, 70, 69, 68, 
68, 74, 70, 62, 74, 73, 74, 73, 72, 76, 73, 79, 82, 88, 91, 90, 
82, 81, 82, 86, 88, 94, 91, 95, 97, 103, 102, 94, 102, 93, 87, 
79, 79, 80, 77, 80, 83, 86, 81, 82, 80, 84, 85, 83, 83, 84, 85, 
92, 88, 90, 88, 91, 93, 90, 91, 87, 84, 89, 81, 75, 76, 70, 68, 
71, 74, 68, 68, 64, 54, 58, 59, 58, 52, 52, 52, 54, 61, 59, 57, 
64, 63, 64, 60, 58, 54, 55, 55, 56, 51, 54, 56, 52, 51, 49, 41, 
39, 38, 37, 37, 35, 38, 36, 34, 32, 32, 31, 33, 32, 32, 28, 25, 
25, 24, 22, 23, 23, 23, 26, 31, 31, 32, 32, 37, 36, 37, 36, 36, 
34, 32, 32, 33, 31, 30, 30, 27, 24, 25, 25, 25, 24, 25, 30, 30, 
30, 31, 29, 27, 27, 27, 27, 24, 24, 22, 24, 25, 26, 28, 30, 30, 
29, 28, 30, 31, 31, 33, 34, 34, 34, 33, 35, 32, 30, 30, 31, 29, 
27, 27, 26, 26, 23, 21, 24, 25, 26, 25, 24, 27, 25, 25, 24, 24, 
24, 23, 22, 24, 24, 25, 24, 23, 22, 22, 20, 21, 21, 22, 22, 22, 
21, 23, 24, 24, 25, 26, 26, 28, 26, 27, 27, 27, 30, 30, 33, 36, 
33, 31, 30, 32, 31, 33, 30, 30, 29, 34, 35, 37, 40, 37, 37, 37, 
37, 41, 42, 41, 44, 41, 39, 38, 43, 38, 40, 39, 42, 44, 43, 39, 
41, 40, 38, 34, 24, 24, 23, 25, 27, 28, 27, 32, 28, 27, 29, 26, 
27, 26, 29, 28, 28, 29, 31, 28, 28, 31, 29, 28, 25, 26, 23, 23, 
24, 23, 20, 21, 20, 21, 20, 18, 17, 18, 21, 19, 19, 19, 21, 19, 
19, 19, 17, 16, 15, 14, 16, 15, 14, 15, 17, 17, 18, 16, 15, 15, 
14, 13, 14, 12, 12, 12, 12, 11, 9, 8, 9, 9, 8, 8, 7, 10, 11, 
12, 11, 11, 12, 14, 14, 15, 14, 14, 12, 13, 12, 12, 14, 15, 17, 
16, 16, 16, 17, 17, 15, 15, 13, 14, 13, 13, 13, 13, 15, 14, 15, 
17, 16, 14, 12, 15, 16, 15, 16, 15, 15, 18, 21, 20, 19, 19, 18, 
19, 18, 19, 18, 18, 18, 18, 21, 20, 22, 20, 20, 20, 19, 18, 18, 
19, 20, 18, 19, 19, 20, 21, 23, 22, 20, 20, 19, 20, 22, 23, 25, 
25, 28, 26, 27, 26, 25, 25, 24, 23, 22, 23, 21, 23, 24, 26, 30, 
28, 27, 20, 23, 21, 24, 22, 19, 19, 18, 20, 25, 24, 25, 23, 24, 
23, 23, 24, 22, 32, 29, 28, 27, 27, 24, 25, 28, 28, 29, 28, 28, 
30, 34, 33, 32, 29, 26, 29, 27, 27, 41, 43, 42, 40, 41, 34, 36, 
37, 36, 36, 33, 34, 32, 32, 30, 29, 33, 34, 36, 34, 37, 41, 40, 
36, 33, 33, 32, 32, 33, 33, 32, 31, 30, 27, 30, 30, 29, 26, 31, 
26, 26, 23, 23, 27, 27, 29, 26, 27, 27, 26, 28, 29, 31, 33, 31, 
31, 29, 27, 28, 28, 27, 27, 28, 27, 26, 25, 26, 24, 25, 24, 20, 
17, 14, 16, 15, 15, 15, 14, 15, 14, 14, 14, 15, 16, 15, 21, 20, 
19, 19, 18, 18, 18, 22, 26, 26, 28, 27, 28, 24, 25, 24, 22, 22, 
22, 21, 24, 26, 27, 26, 26, 27, 27, 34, 37, 34, 32, 30, 32, 34, 
30, 2900, 31, 36, 32, 34, 33, 37, 37, 37, 42, 57, 52, 54, 52, 
51)

In the data above there is only one outlier, which is 2900.
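
For a trending series like this one, a single global rule tends to flag the whole high-priced stretch, so comparing each point with a local median may work better. The sketch below is not from the thread: it uses `stats::runmed()` for the running median, and `flag_spikes()`, the window size `k = 11`, and the factor `ratio = 3` are illustrative assumptions. It only catches upward spikes.

```r
# Hypothetical helper, not part of the original post.
flag_spikes <- function(x, k = 11, ratio = 3) {
  local_med <- stats::runmed(x, k = k)  # running median; k must be odd
  x > ratio * local_med                 # TRUE where a value towers over its neighbourhood
}

# Tail of the series posted above (the last 23 values)
prices <- c(34, 37, 34, 32, 30, 32, 34, 30, 2900, 31, 36, 32,
            34, 33, 37, 37, 37, 42, 57, 52, 54, 52, 51)

spikes <- flag_spikes(prices)
which(spikes)          # position of the flagged value(s)
prices[spikes] <- NA   # mark them as missing rather than inventing a value
```

The median is used instead of the mean because a single extreme value inside the window barely moves it.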

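【Comments】:

  • When you generate the values with rnorm, why should being an outlier depend on the previous value? Why not use something like "X% more than the median or mean"? Then you can find all the outliers, find the largest non-outlier value, and replace every outlier with that largest non-outlier value.
  • The definition of an outlier is confusing. Why compare only with the previous value? If you compare with the previous value, the first value can never be an outlier! Also, if your second value is an outlier relative to the first, how is the third value tested? I suggest defining the logic on a more complex case rather than a random one, e.g. c(1, 300, 3000, 2, 2000). Can you tell us what output you want in that case?
  • You can check the options here. - stackoverflow.com/questions/4787332/…
  • I have 2 more questions about your revised data. 1) What if your values are on a downward trend? If all values are positive, no value can be 300% lower than the previous one. 2) The value after 2900 is 31, which is much lower than 2900. Is it considered an outlier?
  • You should never simply replace "outliers". Doing so invalidates your entire data analysis.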
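
The approach suggested in the first comment (flag values more than X% above the median, then replace them with the largest non-outlier value) could be sketched as follows. `cap_outliers()` and `pct` are hypothetical names, and `pct = 1` (100% above the median) is just the figure mentioned in the question, not a recommended threshold.

```r
# Hypothetical helper, not part of the original thread.
cap_outliers <- function(x, pct = 1) {
  limit  <- median(x) * (1 + pct)   # anything above this counts as an outlier
  is_out <- x > limit
  x[is_out] <- max(x[!is_out])      # replace with the largest non-outlier value
  x
}

x <- c(5.1, 4.8, 5.3, 100, 100, 4.9, 5.0)
cap_outliers(x)
# 5.1 4.8 5.3 5.3 5.3 4.9 5.0
```

This works on stationary data such as the rnorm example; on a trending series a single global median is a poor reference.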

Tags: r outliers


【Solution 1】:
# Your dataframe
df <- data.frame("Open" = rnorm(10,5,1) )

# Adding outliers
df$Open[4] = 100
df$Open[5] = 100

# Visualize outliers
boxplot(df$Open)

# Collect the values that the boxplot rule flags as outliers
outliers <- boxplot(df$Open, plot = FALSE)$out

# Replace the outliers with NA (or whatever you want)
df[df$Open %in% outliers, "Open"] = NA
df

Output:

       Open
1  4.589664
2  4.621286
3  7.317407
4        NA
5        NA
6  3.490202
7  3.536626
8  2.825471
9  5.710270
10 5.541880

【Comments】:

  • Can I replace the NAs with the nearest non-NA value after the last step?
  • Hi, could you check the new data I posted? It doesn't work on that.
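
On the first comment's question about filling the NAs: one base-R sketch is to carry the last non-NA value forward. `fill_na_locf()` is a hypothetical helper, and reading "nearest" as "most recent preceding value" is an assumption. Leading NAs stay NA because nothing precedes them. The zoo package's `na.locf()` offers the same idea as a ready-made function.

```r
# Hypothetical helper, not part of the original answer.
fill_na_locf <- function(x) {
  # For each position, the index of the most recent non-NA value (0 if none yet)
  idx <- cummax(seq_along(x) * !is.na(x))
  idx[idx == 0] <- NA   # leading NAs have no earlier value to copy
  x[idx]
}

open <- c(4.6, 4.6, 7.3, NA, NA, 3.5)
fill_na_locf(open)
# 4.6 4.6 7.3 7.3 7.3 3.5
```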