【Question title】: Assigning categorical values to NAs randomly or proportionally
【Posted】: 2019-02-23 20:44:23
【Question】:

I have a dataset:

df <- structure(list(gender = c("female", "male", NA, NA, "male", "male", 
"male"), Division = c("South Atlantic", "East North Central", 
"Pacific", "East North Central", "South Atlantic", "South Atlantic", 
"Pacific"), Median = c(57036.6262, 39917, 94060.208, 89822.1538, 
107683.9118, 56149.3217, 46237.265), first_name = c("Marilyn", 
"Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")), row.names = c(NA, 
-7L), class = c("tbl_df", "tbl", "data.frame"))

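I need to run an analysis that cannot have NA values in the gender variable. The other columns are too few and have no known predictive value, so imputing the values from them isn't really possible.

I can run the analysis by dropping the incomplete observations entirely (they make up about 4% of the dataset), but I would also like to see the results with female or male randomly assigned to the missing cases.

Short of writing some fairly ugly code that filters down to the incomplete cases, splits them in two and replaces the NAs with female or male in each half, is there an elegant way to assign values to the NAs randomly or proportionally?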

【Comments】:

    Tags: r na


    【Solution 1】:

    We can use ifelse with is.na to find the missing entries, then use sample to pick female or male at random for them.

    # draw one value per row so that each NA gets its own random label
    df$gender <- ifelse(is.na(df$gender), sample(c("female", "male"), nrow(df), replace = TRUE), df$gender)
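
With ifelse, the `yes` argument is recycled to the length of the test, so a length-1 `sample()` gives every NA the same label, while a full-length draw gives each NA its own. A minimal sketch on a hypothetical stand-in vector `g` (not the asker's data), contrasting the two:

```r
set.seed(42)  # only so the illustration is reproducible

# Hypothetical stand-in for df$gender
g <- c("female", "male", NA, NA, "male", NA, NA, NA)
labels <- c("female", "male")

# One draw, recycled by ifelse: every NA receives the same label
one_draw <- ifelse(is.na(g), sample(labels, 1), g)

# One draw per element: each NA receives its own label
per_row <- ifelse(is.na(g), sample(labels, length(g), replace = TRUE), g)

length(unique(one_draw[is.na(g)]))  # always 1
anyNA(per_row)                      # FALSE
```

Which behaviour is wanted depends on the analysis; for a random fill of individual cases, the per-element draw is usually the intended one.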
    

    【Comments】:

    • From base R.
    • Works fine with base or dplyr::sample. Thanks.
    【Solution 2】:

    How about this:

    > df <- structure(list(gender = c("female", "male", NA, NA, "male", "male", 
    +                                 "male"),
    +                      Division = c("South Atlantic", "East North Central", 
    +                                   "Pacific", "East North Central", "South Atlantic", "South Atlantic", 
    +                                   "Pacific"),
    +                      Median = c(57036.6262, 39917, 94060.208, 89822.1538,
    +                                 107683.9118, 56149.3217, 46237.265),
    +                      first_name = c("Marilyn", "Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")),
    +                 row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))
    > 
    > Gender <- rbinom(length(df$gender), 1, 0.52)
    > Gender <- factor(Gender, levels = 0:1, labels = c("female", "male"))
    > 
    > df$gender[is.na(df$gender)] <- as.character(Gender[is.na(df$gender)])
    > 
    > df$gender
    [1] "female" "male"   "female" "female" "male"   "male"   "male"  
    > 
    

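    This assigns values at random with a given probability (here 0.52 for male, because the 1s from rbinom map to the second label). You could also consider imputing the values with nearest neighbours, hot deck or similar methods.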
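
For the "proportional" variant asked about in the question, one simple option is to draw replacements using the observed proportions as weights, which amounts to a basic random hot deck. A minimal sketch, using only the gender column from the question's example data; the helper names `missing`, `observed` and `p` are illustrative, and the approach assumes the values are missing completely at random, which the question does not establish:

```r
set.seed(1)  # only so the illustration is reproducible

# gender column copied from the question's example data
gender <- c("female", "male", NA, NA, "male", "male", "male")

missing  <- is.na(gender)
observed <- gender[!missing]

# Observed proportions of each category, used as sampling weights
p <- prop.table(table(observed))

# Draw one label per missing entry, weighted by the observed proportions
gender[missing] <- sample(names(p), sum(missing), replace = TRUE,
                          prob = as.numeric(p))

table(gender)
```

Sampling directly from the observed values with `sample(observed, sum(missing), replace = TRUE)` would give the same distribution; the explicit `prob` form just makes the weights visible and easy to override.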

    Hope this helps.

    【Comments】:

      【Solution 3】:

      Just assign:

      df$gender[is.na(df$gender)] <- sample(c("female", "male"), nrow(df), replace = TRUE)[is.na(df$gender)]
      

      【Comments】:
