【Question title】: How to find the outliers for two columns in a data frame
【Posted】: 2020-02-16 09:23:47
【Question】:

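I need to get the total number of outliers in variable1 for each combination of variable2 and variable3, and then display the counts in a table. The table should include only the cases where variable4 is greater than 1.5. I got the code to run, but I think something is wrong with it, because the output for every group is 0, which is not correct.

When I run boxplot.stats(df$variable1)$out, I get a large number of outliers. But when I use the code below, every group shows 0.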

high <- mean(df$variable1) + sd(df$variable1) * 3
low <- mean(df$variable1) - sd(df$variable1) * 3

df%>%
  filter(variable4>1.5)%>%
     group_by(variable2, variable3) %>% 
       tally(variable1 < low ||variable1 > high)

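A table is displayed with every combination of variable2 and variable3... but the count is 0 for each one.

【Comments】:

  • Please take a look at this guide and provide some raw data.

Tags: r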


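【Solution 1】:

Perhaps you could use scale, which centers and standardizes the values within each group, instead of defining high and low thresholds and using tally.

Here is an implementation based on some random data: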

library(dplyr)

df = data.frame(variable1 = runif(100,1,10),
                variable2 = round(runif(100,1,3)),
                variable3 = round(runif(100,1,3)),
                variable4 = runif(100,1,5))
df$variable1[c(5,13,95)] = 1000

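df1 <- df %>% 
  filter(variable4 > 1.5) %>%
  group_by(variable2, variable3) %>% 
  # abs() has to wrap the z-score itself so that low outliers are caught too;
  # [, 1] drops the one-column matrix that scale() returns
  mutate(individual_outliers = as.integer(abs(scale(variable1)[, 1]) > 3),
         total_outliers = sum(individual_outliers))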

> df1
# A tibble: 91 x 6
# Groups:   variable2, variable3 [9]
   variable1 variable2 variable3 variable4 individual_outliers total_outliers
       <dbl>     <dbl>     <dbl>     <dbl>               <int>          <int>
 1      6.86         2         3      2.82                   0              0
 2      4.89         1         2      3.27                   0              0
 3      4.19         2         3      3.03                   0              0
 4      2.05         2         3      2.31                   0              0
 5   1000            3         2      2.08                   1              1
 6      9.36         2         2      3.85                   0              0
 7      8.40         3         3      3.81                   0              0
 8      8.33         3         2      2.32                   0              1
 9      7.92         2         1      4.58                   0              0
10      8.13         3         1      2.48                   0              0
# ... with 81 more rows
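
If you want one row per group rather than one row per observation, which is what the question asks for, a summarise call is one way to get there. The sketch below is not from the answer above: the toy data, the planted outliers and the names outlier_counts and n_outliers are made up for illustration. It keeps the global 3-sd thresholds from the question but uses the vectorized `|` instead of `||`. As far as I know, `||` only ever looked at the first element of each side in older R versions, and since R 4.3.0 it raises an error for longer vectors, which would explain why tally received a single FALSE and reported 0 everywhere.

```r
library(dplyr)

# Toy data with two planted outliers (values are made up for illustration)
set.seed(1)
df <- data.frame(variable1 = runif(200, 1, 10),
                 variable2 = rep(1:2, each = 100),
                 variable3 = rep(c("a", "b"), times = 100),
                 variable4 = 2)
df$variable1[c(5, 150)] <- c(1000, -1000)

# Global 3-sd thresholds, as in the question
high <- mean(df$variable1) + 3 * sd(df$variable1)
low  <- mean(df$variable1) - 3 * sd(df$variable1)

outlier_counts <- df %>%
  filter(variable4 > 1.5) %>%
  group_by(variable2, variable3) %>%
  # `|` compares element by element; sum() then counts the TRUE values
  summarise(n_outliers = sum(variable1 < low | variable1 > high),
            .groups = "drop")

outlier_counts
```

Here the thresholds come from the whole column. If you prefer group-wise z-scores as in the scale approach, move the comparison inside the grouped pipeline instead.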

【Comments】:

    【Solution 2】:

    Data:

    df <- data.frame(variable1 = runif(1000,1,10),
                     variable2 = round(runif(1000,1,3)),
                     variable3 = round(runif(1000,1,3)),
                     variable4 = runif(1000,1,5),
                     variable5 = rep(LETTERS[1:4], 250),
                     variable6 = rep(LETTERS[5:9], 200), stringsAsFactors = FALSE)
    
    df$variable1[c(5,13,95)] = 1000
    

    Multivariate outlier detection:

    # Grouping variables and the numeric columns to score:

    grouping_vars <- c("variable5", "variable6")
    num_vars <- names(df)[sapply(df, is.numeric)]

    # Split-apply-combine. Each piece keeps all of its columns, so the
    # group labels stay aligned with their rows after rbind():

    tmp_df <- do.call(rbind, lapply(split(df, df[, grouping_vars]), function(x) {

        num <- x[, num_vars]

        # Mahalanobis distance of each row from its group centroid:
        md <- mahalanobis(num, colMeans(num), cov(num), inverted = FALSE)

        # Quartiles and interquartile range of the distances:
        q1 <- quantile(md, .25)
        q3 <- quantile(md, .75)
        iqr <- q3 - q1

        # Mild outliers: more than 1.5 * IQR beyond the quartiles
        lwr <- ifelse(md > q3 + 1.5 * iqr | md < q1 - 1.5 * iqr, "outlier", "not outlier")

        # Extreme outliers: more than 3 * IQR beyond the quartiles
        upr <- ifelse(md > q3 + 3 * iqr | md < q1 - 3 * iqr, "outlier", "not outlier")

        # Bind the original columns and the new vectors together:
        cbind(x, md, lwr, upr)
    }))

    # Keep the rbind() row names (which start with the group label) as a
    # column, then reset the row names:

    tmp_df <- data.frame(grouping_vars = row.names(tmp_df), tmp_df, row.names = NULL)

    # Original columns first, new columns after them:
    df <- tmp_df[, c(names(df), setdiff(names(tmp_df), names(df)))]
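
To see what the distance-plus-fence rule does on something small, here is a rough sketch on toy data with one planted outlier. The helper flag_md, the data frame toy and its columns are hypothetical names of mine, not part of the answer, and the sketch only applies the upper fence (Q3 + k * IQR), since Mahalanobis distances cannot be negative.

```r
# Hypothetical helper: TRUE where the Mahalanobis distance of a row lies
# above Q3 + k * IQR of all distances in the same block of rows
flag_md <- function(num, k = 1.5) {
  md <- mahalanobis(num, colMeans(num), cov(num))
  q <- quantile(md, c(.25, .75), names = FALSE)
  md > q[2] + k * (q[2] - q[1])
}

# Toy data: two groups of 50 rows, two numeric columns
set.seed(42)
toy <- data.frame(g = rep(c("A", "B"), each = 50),
                  a = rnorm(100),
                  b = rnorm(100))
toy$a[1] <- 8    # planted outlier in group A
toy$b[1] <- -8

# Flag rows within each group, then restore the original row order
flags <- lapply(split(toy[, c("a", "b")], toy$g), flag_md)
toy$flag <- unsplit(flags, toy$g)

# Number of flagged rows per group
counts <- tapply(toy$flag, toy$g, sum)
counts
```

Using unsplit puts the flags back in the original row order, so there is no need to re-attach the grouping columns by position afterwards.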
    

    Univariate outlier detection:

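    # Use boxplot.stats(): points more than 1.5 * IQR beyond the hinges
    # (roughly the quartiles) are reported in $out

    outliers_classified <- do.call("rbind", lapply(split(df, df[, grouping_vars]), function(x) {

        # x is the data frame for one group, so test each column rather
        # than x itself (is.numeric() on a data frame is always FALSE)
        x[] <- lapply(x, function(col) {

            if (is.numeric(col)) {
                # Replace outlying values with NA:
                ifelse(col %in% boxplot.stats(col)$out, NA, col)
            } else {
                col
            }
        })

        x
    }))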
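
If all you need is the number of boxplot outliers per group, as in the original question, something along these lines may be enough. It is only a sketch on made-up data: toy, g, value and n_out are illustrative names, and the two values of 1000 are planted on purpose.

```r
# Toy data: two groups of 50 evenly spaced values between 1 and 10
toy <- data.frame(g = rep(c("A", "B"), each = 50),
                  value = rep(seq(1, 10, length.out = 50), times = 2))
toy$value[c(3, 60)] <- 1000   # one planted outlier in each group

# boxplot.stats() flags points more than coef (default 1.5) times the
# hinge spread beyond the hinges; $out holds the flagged values
n_out <- tapply(toy$value, toy$g, function(v) length(boxplot.stats(v)$out))
n_out
```

With more than one grouping column, you can pass a list of factors as the second argument of tapply, or do the same thing inside a grouped summarise.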
    

    【Comments】:
