Outlier: How do I tag an outlier in a dataset in R? [duplicate]

【Question title】: Outlier: How do I tag an outlier in a dataset in R? [duplicate]
【Posted】: 2019-06-24 10:24:58
【Question】:

I am trying to extract the outliers from my dataset and tag them accordingly.

Sample data

     Doctor Name    Hospital Assigned         Region    Claims   Illness Claimed
1    Albert         Some hospital Center      R-1       20       Sepsis
2    Simon          Another hospital Center   R-2       21       Pneumonia
3    Alvin          ...                       ...       ...       ...
4    Robert
5    Benedict
6    Cruz

So I am trying to group each Doctor by the Illness they Claimed within a given Region, and then find the outliers within each of those groups.

     Doctor Name    Hospital Assigned         Region    Claims   Illness Claimed   is_outlier
1    Albert         Some hospital Center      R-1       20       Sepsis            1
2    Simon          Another hospital Center   R-2       21       Pneumonia         0
3    Alvin          ...                       ...       ...       ...
4    Robert
5    Benedict
6    Cruz

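I can do this in Power BI, but I can't seem to do it in R. I'm guessing it involves dplyr's group_by() function, but I'm not sure.

This is what I am trying to achieve:

The algorithm is as follows: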

Read data
Group data by Illness
    Group by Region
    get IQR based on Claims Count
    if claims count > Q3 + 1.5 * IQR
        then tag it as outlier = 1
    else
        not an outlier = 0
Export data

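I have done something similar before, but that code loops over each illness and fits a linear regression to each one. Is it close to what I am trying to achieve?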

# Split the data frame by Region x Illness_Code and fit a model to each group
groups <- split(df, list(df$Region, df$Illness_Code))

# Keep only the groups that have more than one row
Ind <- sapply(groups, function(x) nrow(x) > 1)

out <- lapply(groups[Ind], function(grp) {
  m <- lm(formula = COUNT ~ YEAR, data = grp)
  coef(m)
})

Any ideas?

【Discussion】:

Tags: r loops dplyr

【Solution 1】:

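One possible solution is to combine group_by with boxplot.stats. The former splits the data into every combination of the grouping columns, and the latter returns the outliers within each group: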

library(dplyr)

df <- data.frame(doc = sample(x = letters[1:3], size = 1000, replace = TRUE),
                 illness = sample(x = LETTERS[1:3], size = 1000, replace = TRUE),
                 claims = rpois(n = 1000, lambda = 10))

df %>%
  group_by(doc, illness) %>%
  mutate(ind_out = if_else(claims %in% boxplot.stats(claims)$out, 1, 0))

# A tibble: 1,000 x 4
# Groups:   doc, illness [9]
   doc   illness claims ind_out
   <fct> <fct>    <int>   <dbl>
 1 c     A            8       0
 2 c     A           13       0
 3 b     C           18       0
 4 b     C            8       0
 5 b     C            8       0
 6 b     B           12       0
 7 a     C           10       0
 8 b     C            9       0
 9 a     B           15       0
10 c     B            8       0
# … with 990 more rows
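
To see what boxplot.stats actually returns, here is a small base R sketch on a made-up vector (the numbers are purely illustrative, not from the question's data). Note that it flags values beyond the whiskers on both sides, so unusually low counts are tagged as well as unusually high ones.

```r
# Toy vector: most values sit near 10, one is far above and one far below.
claims <- c(9, 10, 10, 11, 11, 12, 12, 13, 30, -8)

stats <- boxplot.stats(claims)  # default coef = 1.5

# Five numbers: lower whisker, lower hinge, median, upper hinge, upper whisker
print(stats$stats)

# Values lying beyond the whiskers, on either side
print(stats$out)

# Flag the points the same way the answer above does
flagged <- as.integer(claims %in% stats$out)
print(data.frame(claims, flagged))
```

If only high values should count as outliers, the `%in%` test would need to be replaced by an explicit comparison against the upper fence.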

I hope this helps.
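
If you want to follow the pseudocode in the question more literally, a rough sketch could look like the following. It assumes the rule is meant to be the usual upper fence Q3 + 1.5 * IQR, and that the columns are called illness, region and claims (all hypothetical names, as is the simulated data). Only high values are tagged here.

```r
library(dplyr)

# Simulated stand-in for the real data; every column name is an assumption.
set.seed(1)
df <- data.frame(
  doc     = sample(letters[1:5], 500, replace = TRUE),
  illness = sample(c("Sepsis", "Pneumonia"), 500, replace = TRUE),
  region  = sample(c("R-1", "R-2"), 500, replace = TRUE),
  claims  = rpois(500, lambda = 10)
)

tagged <- df %>%
  group_by(illness, region) %>%
  mutate(
    # Upper fence per group: third quartile plus 1.5 times the IQR
    upper_fence = quantile(claims, 0.75, names = FALSE, na.rm = TRUE) +
      1.5 * IQR(claims, na.rm = TRUE),
    # 1 if the claim count lies above the fence, 0 otherwise
    is_outlier = as.integer(claims > upper_fence)
  ) %>%
  ungroup()

# Export step from the pseudocode; the file name is hypothetical.
# write.csv(tagged, "claims_tagged.csv", row.names = FALSE)
```

quantile() and IQR() use R's default quantile definition (type 7), whereas boxplot.stats works from the hinges returned by fivenum, so the two approaches can disagree slightly on borderline points in small groups.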

【Comments】:

Thanks, I'll try this. My .csv file is almost 2 GB, so it may take a while. I'll post an update.
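
For a file of that size, one option that may be worth trying is the data.table package, which reads large CSVs with fread and can add the flag column by reference instead of copying the table. The sketch below assumes data.table is installed and uses made-up data and hypothetical file and column names, so adapt it to the real file.

```r
library(data.table)

# With the real file this would be something like:
# dt <- fread("claims.csv")   # hypothetical path
# A small made-up table stands in for it here.
dt <- data.table(
  illness = rep(c("Sepsis", "Pneumonia"), each = 8),
  region  = rep(c("R-1", "R-2"), each = 8),
  claims  = c(9, 10, 10, 11, 11, 12, 12, 40,
              19, 20, 20, 21, 21, 22, 22, 23)
)

# Add is_outlier by reference, computed separately for each illness/region group
dt[, is_outlier := as.integer(claims %in% boxplot.stats(claims)$out),
   by = .(illness, region)]

print(dt[is_outlier == 1L])

# Writing the result back out; the file name is hypothetical.
# fwrite(dt, "claims_tagged.csv")
```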