【Question title】: Function to count NA values at each level of a factor
【Posted】: 2013-05-28 07:41:03
【Question body】:

I have this data frame:

set.seed(50)
data <- data.frame(age=c(rep("juv", 10), rep("ad", 10)),
                   sex=c(rep("m", 10), rep("f", 10)),
                   size=c(rep("large", 10), rep("small", 10)),
                   length=rnorm(20),
                   width=rnorm(20),
                   height=rnorm(20))

data$length[sample(1:20, size=8, replace=F)] <- NA
data$width[sample(1:20, size=8, replace=F)] <- NA
data$height[sample(1:20, size=8, replace=F)] <- NA

   age sex  size      length       width      height
1  juv   m large          NA -0.34992735  0.10955641
2  juv   m large -0.84160374          NA -0.41341885
3  juv   m large  0.03299794 -1.58987765          NA
4  juv   m large          NA          NA          NA
5  juv   m large -1.72760411          NA  0.09534935
6  juv   m large -0.27786453  2.66763339  0.49988990
7  juv   m large          NA          NA          NA
8  juv   m large -0.59091244 -0.36212039 -1.65840096
9  juv   m large          NA  0.56874633          NA
10 juv   m large          NA  0.02867454 -0.49068623
11  ad   f small  0.29520677  0.19902339          NA
12  ad   f small  0.55475223 -0.85142228  0.33763747
13  ad   f small          NA          NA -1.96590570
14  ad   f small  0.19573384  0.59724896 -2.32077461
15  ad   f small -0.45554055 -1.09604786          NA
16  ad   f small -0.36285547  0.01909655  1.16695158
17  ad   f small -0.15681338          NA          NA
18  ad   f small          NA          NA          NA
19  ad   f small          NA  0.40618657 -1.33263085
20  ad   f small -0.32342568          NA -0.13883976

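I am trying to write a function that counts the number of NA values in length, width, and height at each level of the three factors in the data frame. I tried this: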

 exploreMissingValues <- function(dataframe, factors, variables){
  library(plyr)
  Variables <- list(variables)

  llply(Variables, function(x) ddply(dataframe, .(factors), 
                                     summarise, 
                                     number.of.NA=length(x[is.na(x)])))  
}

exploreMissingValues(data, 
                     c("age", "sex", "size"), 
                     c("length", "width", "height"))

...but this gives an error. How can I make this function return the number of NA values at each level of the factors in the data frame?

【Comments】:

    Tags: r plyr missing-data


    【Answer 1】:

    Are you looking for something like this?

    library(doBy)
    summaryBy(length+width+height~age+sex+size,
              data=data,
              FUN=function(x) sum(is.na(x)),
              keep.names=TRUE)
      age sex  size length width height
    1  ad   f small      3     4      4
    2 juv   m large      5     4      4
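
For readers without doBy, a roughly equivalent base R call can be sketched with the formula interface of aggregate. The data frame `toy` and the helper `count_na` are made up for illustration. The one thing to watch is that the formula interface drops rows containing NA by default, so `na.action = na.pass` is needed for the counts to mean anything.

```r
# Toy data (hypothetical, not the question's data frame)
toy <- data.frame(age    = c("juv", "juv", "ad", "ad"),
                  sex    = c("m", "m", "f", "f"),
                  length = c(NA, 1.2, 0.3, NA),
                  width  = c(NA, NA, 0.5, 0.1))

# Hypothetical helper: number of missing values in a vector
count_na <- function(x) sum(is.na(x))

# na.action = na.pass keeps rows that contain NA. The default (na.omit)
# removes those rows before FUN is called, so the NAs would never be counted.
na_counts <- aggregate(cbind(length, width) ~ age + sex,
                       data = toy, FUN = count_na, na.action = na.pass)
print(na_counts)
```

With the question's data the left-hand side would be `cbind(length, width, height)` and the right-hand side `age + sex + size`.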
    

    【Comments】:

      【Answer 2】:

      A data.table approach:

      library(data.table)
      DT <- data.table(data)
      DT[, lapply(.SD, function(x) sum(is.na(x))) , by = list(age,sex,size)]
      ##    age sex  size length width height
      ## 1: juv   m large      5     4      4
      ## 2:  ad   f small      3     4      4
      

      And the plyr equivalent, using colwise with ddply:

      ddply(data, .(age,sex,size), colwise(.fun = function(x) sum(is.na(x))))
      ##   age sex  size length width height
      ## 1  ad   f small      3     4      4
      ## 2 juv   m large      5     4      4
      

      You can always pass a character vector of column names as the by component:

      by.cols <- c('age', 'sex' ,'size')
      # then the following will work....
      DT[, lapply(.SD, function(x) sum(is.na(x))), by = by.cols]
      ddply(data, by.cols, colwise(.fun = function(x) sum(is.na(x))))
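
Because `by` accepts a character vector, and `.SDcols` does too, the data.table call can be wrapped in a function with the same arguments as the one in the question. This is only a sketch: `count_na_by` and the `toy` data are names invented here, and `.SDcols` is an addition that restricts the count to the requested columns.

```r
library(data.table)

# Hypothetical wrapper: `factors` and `variables` are character vectors
# of column names, as in the question's call.
count_na_by <- function(dataframe, factors, variables) {
  DT <- as.data.table(dataframe)
  # .SDcols limits .SD to the measurement columns; by takes the factor names
  DT[, lapply(.SD, function(x) sum(is.na(x))), by = factors, .SDcols = variables]
}

# Toy data (hypothetical, not the question's data frame)
toy <- data.frame(age    = c("juv", "juv", "ad", "ad"),
                  sex    = c("m", "m", "f", "f"),
                  length = c(NA, 1.2, 0.3, NA),
                  width  = c(NA, NA, 0.5, 0.1))

res <- count_na_by(toy, c("age", "sex"), c("length", "width"))
print(res)
```

With the question's data the call would be `count_na_by(data, c("age", "sex", "size"), c("length", "width", "height"))`.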
      

      【Comments】:

        【Answer 3】:

        Using aggregate:

        nacheck <- function(var, factor)
            aggregate(var, list(factor), function(x) sum(is.na(x)))
        
        nacheck(data$length, data$age)
        nacheck(data$length, data$sex)
        nacheck(data$length, data$size)
        

        You can also apply this across your data frame to get the NA counts for every measurement column, one factor at a time.

        apply(data[,c("length","width","height")], 2, nacheck, factor=data$age)
        apply(data[,c("length","width","height")], 2, nacheck, factor=data$sex)
        apply(data[,c("length","width","height")], 2, nacheck, factor=data$size)
        

        To do all of this in one function, nest nacheck inside a wrapper and loop over the factors with lapply:

        exploreNA <- function(df, factors){
            nacheck <- function(var, factor)
                aggregate(var, list(factor), function(x) sum(is.na(x)))
            lapply(factors, function(x) apply(df, 2, nacheck, factor=x))
        }
        
        exploreNA(data[,c("length","width","height")], list(data$age, data$sex, data$size))
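
The function above needs the factor vectors and the measurement columns passed in separately. A variant that takes column names, closer to the call in the question, might look like the sketch below. It relies on aggregate accepting a data frame for both `x` and `by`; `exploreNA2` and the `toy` data are hypothetical names used only for illustration.

```r
# Hypothetical variant of exploreNA that takes column names as character vectors
exploreNA2 <- function(df, factors, variables) {
  count_na <- function(x) sum(is.na(x))
  # One table per factor. df[f] is a one-column data frame, so the
  # grouping column keeps its name in the result.
  out <- lapply(factors, function(f)
    aggregate(df[variables], by = df[f], FUN = count_na))
  setNames(out, factors)
}

# Toy data (hypothetical, not the question's data frame)
toy <- data.frame(age    = c("juv", "juv", "ad", "ad"),
                  sex    = c("m", "m", "f", "f"),
                  length = c(NA, 1.2, 0.3, NA),
                  width  = c(NA, NA, 0.5, 0.1))

res <- exploreNA2(toy, c("age", "sex"), c("length", "width"))
print(res$age)
```

With the question's data the call would be `exploreNA2(data, c("age", "sex", "size"), c("length", "width", "height"))`, which returns a named list with one table per factor.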
        

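        【Comments】:

        • This isn't what I need. I specifically need a one-line function that can handle different inputs.
        • The problem with the existing function is that it only handles a single var. Note that I have three continuous variables: length, width, and height.
        • Thanks for the help (+1). However, I would still like to do this in one line.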