【Question title】: Calculate the average for each category, and fill in missing values with this average
【Posted】: 2021-03-22 17:13:55
【Question】:

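I am new to R and am trying to solve a basic problem.

I have a tibble called books. One column is total_purchased (the total number of copies purchased) and another is title (the book title).

The total_purchased column has many missing values. I want to replace each of them with the mean number purchased for that book. However, I can't get this to work in an efficient way; in the code below I have simply hard-coded the book titles.

For example, I

  1. Filter the tibble down to the rows where total_purchased is not NA, and by book title.

  2. Calculate the mean.

  3. Repeat these steps separately for each book.

  4. Use mutate to add a new column that is a copy of total_purchased, except that each NA value is assigned the relevant mean.

I basically need to understand how to simplify this so that I'm not hard-coding the titles and can cut down the amount of code. I'm a bit too new to R to work it out myself. In another language I would use a loop here, but I'm not sure whether this can be done more simply with vectorization.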

# Calculate mean total purchased for particular book.
SOR <- books %>%
  filter(!(is.na(total_purchased))) %>%
  filter(title == "Secrets Of R For Advanced Students") %>%
  pull(total_purchased) %>%
  mean

RFD <- books %>%
  filter(!(is.na(total_purchased))) %>%
  filter(title == "R For Dummies") %>%
  pull(total_purchased) %>%
  mean

FOR <- books %>%
  filter(!(is.na(total_purchased))) %>%
  filter(title == "Fundamentals of R For Beginners") %>%
  pull(total_purchased) %>%
  mean

RVP <- books %>%
  filter(!(is.na(total_purchased))) %>%
  filter(title == "R vs Python: An Essay") %>%
  pull(total_purchased) %>%
  mean

TTM <- books %>%
  filter(!(is.na(total_purchased))) %>%
  filter(title == "Top 10 Mistakes R Beginners Make") %>%
  pull(total_purchased) %>%
  mean

RME <- books %>%
  filter(!(is.na(total_purchased))) %>%
  filter(title == "R Made Easy") %>%
  pull(total_purchased) %>%
  mean

# Assign mean specific to book when total purchased value is na
books <- books %>%
  mutate(complete_purchased = case_when(
    is.na(total_purchased) & title == "Secrets Of R For Advanced Students" ~ SOR,
    is.na(total_purchased) & title == "R For Dummies" ~ RFD,
    is.na(total_purchased) & title == "Fundamentals of R For Beginners" ~ FOR,
    is.na(total_purchased) & title == "R vs Python: An Essay" ~ RVP,
    is.na(total_purchased) & title == "Top 10 Mistakes R Beginners Make" ~ TTM,
    is.na(total_purchased) & title == "R Made Easy" ~ RME,
    TRUE ~ total_purchased
  ))

【Question comments】:

    Tags: r dplyr


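    【Solution 1】:

    Here is an example using the tidyverse. I created some dummy data to demonstrate.

    You can use mutate on the grouped data to compute each group's mean, and then use ifelse to create the replacement column.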

    library(dplyr)
    
    set.seed(1)
    
    dat <- data.frame(id = sample(letters[1:3], 10, replace = TRUE),
                      y = sample(c(NA, 1:2), 10, replace = TRUE))
    
    dat %>%
        group_by(id) %>%
        mutate(y_mean = mean(y, na.rm = TRUE)) %>%
        mutate(y_replace = ifelse(is.na(y), y_mean, y))
    
     #   id        y y_mean y_replace
     #   <chr> <int>  <dbl>     <dbl>
     # 1 a         2    1.5       2  
     # 2 c        NA    1         1  
     # 3 a        NA    1.5       1.5
     # 4 b        NA    1.5       1.5
     # 5 a         1    1.5       1  
     # 6 c         1    1         1  
     # 7 c         1    1         1  
     # 8 b         1    1.5       1  
     # 9 b         2    1.5       2  
     #10 c        NA    1         1  
    

    A one-liner in base R using ave:

    ifelse(is.na(dat$y), ave(dat$y, dat$id, FUN = function(x) mean(x, na.rm = TRUE)), dat$y)
    # [1] 2.0 1.0 1.5 1.5 1.0 1.0 1.0 1.0 2.0 1.0
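
    To see why the one-liner works, it may help to look at what ave returns on its own: a vector the same length as its input, where each position holds the statistic for that element's group. A minimal sketch on a made-up data frame (toy, group and value are hypothetical names, not from the question):

```r
# Hypothetical toy data: two groups, one missing value in each
toy <- data.frame(group = c("a", "a", "a", "b", "b"),
                  value = c(2, NA, 4, 10, NA))

# ave() returns one value per row: the mean of that row's group
group_mean <- ave(toy$value, toy$group,
                  FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed values, fall back to the group mean where value is NA
toy$value_filled <- ifelse(is.na(toy$value), group_mean, toy$value)
toy
```

    Because group_mean already lines up row by row with toy$value, no explicit loop over the groups is needed.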
    

    【Comments】:

    • FYI, you can do this in the same mutate() call
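
    Following up on that comment, here is a sketch of what the single-call version might look like, assuming dplyr is installed and using a small made-up data frame (toy, group and value are hypothetical names). Since mean(value, na.rm = TRUE) is evaluated within each group, it can go straight into ifelse without an intermediate mean column:

```r
library(dplyr)

# Hypothetical toy data: two groups, one missing value in each
toy <- data.frame(group = c("a", "a", "a", "b", "b"),
                  value = c(2, NA, 4, 10, NA))

filled <- toy %>%
  group_by(group) %>%
  # the mean is computed per group and recycled across that group's rows
  mutate(value_filled = ifelse(is.na(value),
                               mean(value, na.rm = TRUE),
                               value)) %>%
  ungroup()
filled
```

    For the data in the question this would presumably be group_by(title), with total_purchased in place of value.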
    【Solution 2】:

    I'm not familiar with dplyr, which probably has a nice solution for this, but with a loop you can iterate over each unique title:

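    for (i in unique(books$title)) {
      # rows belonging to the current title
      title.index = books$title == i
      # rows of this title whose total_purchased is missing
      condition = is.na(books$total_purchased) & title.index
      # mean of the observed values; without na.rm = TRUE the mean would be NA
      title.mean = mean(books$total_purchased[title.index], na.rm = TRUE)
      books$total_purchased[condition] = title.mean
    }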
    

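    Note: you could write the whole thing in one line; I broke it up into variables to make it easier to follow.

    【Comments】:

    • Thanks. But wouldn't something like books$total_purchased[title.index] just throw an error, or return nothing? The total_purchased column doesn't contain the titles.
    • Post your data (paste the output of dput(your_data)) so I can check what the problem is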
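
    On the question in the first comment: the indexing should not throw an error, because title.index is a logical vector with one TRUE/FALSE per row of books, and subsetting total_purchased with it keeps the entries on the rows where it is TRUE. A small sketch with made-up rows standing in for books (the numbers are invented for illustration):

```r
# Hypothetical stand-in for the books tibble
books <- data.frame(
  title = c("R For Dummies", "R Made Easy", "R For Dummies", "R Made Easy"),
  total_purchased = c(4, 7, NA, 9)
)

# One TRUE/FALSE per row, so it lines up with every column of books
title.index <- books$title == "R For Dummies"
title.index

# Picks the total_purchased entries on the rows where title.index is TRUE
books$total_purchased[title.index]

# mean() returns NA for this subset unless missing values are dropped
mean(books$total_purchased[title.index], na.rm = TRUE)
```

    The titles never need to be stored in total_purchased itself; the two columns are matched purely by row position.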