Function to take averages between data frames in R: answers

【Question title】: Function to take averages between dataframes in R
【Posted】: 2016-05-25 20:50:46
【Question】:

I really don't know where to start with this, so I'm asking here. I have two data frames:

set.seed(21)
DF1 <- data.frame(year = c(seq(2000,2012,by=1)), 
              C1 = runif(13,0,1),
              C2 = runif(13,0,1),
              C3 = runif(13,0,1),
              C4 = runif(13,0,1),
              C5 = runif(13,0,1))

DF2 <- data.frame(column = c("C1", "C2", "C3", "C4", "C5"),
              start = c(2005,2001,2006,2005,2009),
              end = c(2012,2009,2011,2010,2012))

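I need to write a function that performs the following steps:

For each row in DF2: take the column of DF1 named in DF2$column and compute its mean over that row's year range.

For example: in DF1$C1, take the mean of the values between 2005 and 2012.
Report: DF2[1,1], DF2[1,2], DF2[1,3], mean 1

Values that fall outside the available data (e.g. 2002 - 5 = 1997, which is not in DF1) can be treated as NA.

Sample output: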

    > DF2.out
      column start  end        m1 
    1     C1  2005 2012 0.9186834 
    2     C2  2001 2009        NA

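Thanks in advance for your help!

【Comments on the question】:

This is 3 questions. I suggest asking about each step in a separate question, with sample output each time; that makes it easier and faster to help you.
I understand. What I'm really after is doing it all as one complete piece. I think I know how to write each part separately, but I assume the implementation would be different inside a function or a loop.
If you have all the parts, you can edit your question to include the sample scripts you have and ask for help turning them into a function, or for an alternative way to solve the problem.

Tags: r dataframe mean data-manipulation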

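【Solution 1】:

I assume your question is about summarizing one data frame using parameters held in another data frame. In that case, the code below will help with part 1.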

library(dplyr)

apply.by.colname <- function(data, col.name, year.start, year.end) {

    data %>% 
        filter(year >= year.start & year <= year.end) %>% 
        select(matches(col.name))
}

new.df <- apply.by.colname(DF1, "C1", 2005, 2012)
sapply(new.df, mean)

For a complete solution, you will probably need to call this function from another custom function or from an apply call.
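One way that last step could look is sketched below. It is written in base R rather than dplyr so that it runs without extra packages, and the helper name `mean_by_range` is hypothetical, not part of the answer above. It assumes the year range is inclusive at both ends and that DF2$column holds character strings.

```r
# Sample data from the question
set.seed(21)
DF1 <- data.frame(year = seq(2000, 2012, by = 1),
                  C1 = runif(13, 0, 1),
                  C2 = runif(13, 0, 1),
                  C3 = runif(13, 0, 1),
                  C4 = runif(13, 0, 1),
                  C5 = runif(13, 0, 1))

DF2 <- data.frame(column = c("C1", "C2", "C3", "C4", "C5"),
                  start = c(2005, 2001, 2006, 2005, 2009),
                  end = c(2012, 2009, 2011, 2010, 2012),
                  stringsAsFactors = FALSE)

# Hypothetical helper: mean of one column of `data` over an inclusive year range
mean_by_range <- function(data, col.name, year.start, year.end) {
  in.range <- data$year >= year.start & data$year <= year.end
  mean(data[in.range, col.name])
}

# Call the helper once per row of DF2; MoreArgs passes DF1 unchanged on every call
DF2.out <- DF2
DF2.out$m1 <- mapply(mean_by_range,
                     col.name = DF2$column,
                     year.start = DF2$start,
                     year.end = DF2$end,
                     MoreArgs = list(data = DF1),
                     USE.NAMES = FALSE)
DF2.out
```

Keeping the per-row logic in a small function means the same helper can be reused with lapply, Map, or a plain for loop if mapply is not to your taste.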

【Comments】:

【Solution 2】:

You can use mapply to wrap a loop over the rows of DF2:

library(data.table) # using for convenience 
DT <- data.table(DF1)
res <- mapply(function(c, start, end) {
         res <- DT[year >= start & year <= end, mean(get(c))]
         return (res)
      } , as.character(DF2$column), DF2$start, DF2$end)
res <- data.frame(res)
res$column <- rownames(res)
res <- merge(DF2, res)
res 

#  column start  end       res
#1     C1  2005 2012 0.5861268
#2     C2  2001 2009 0.3942018
#3     C3  2006 2011 0.5853924
#4     C4  2005 2010 0.4904493
#5     C5  2009 2012 0.6783216

【Comments】:

This returns: Error in get(c) : invalid first argument
Change DF2 to: DF2 <- data.frame(column = c("C1", "C2", "C3", "C4", "C5"), start = c(2005,2001,2006,2005,2009), end = c(2012,2009,2011,2010,2012), stringsAsFactors = F)
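The error reported above is most likely the result of get() receiving a factor instead of a character string. Before R 4.0.0, data.frame() converted strings to factors by default. The small sketch below reproduces that situation with a stand-in object; the names `DF2f`, `msg`, and `val` are illustrative only.

```r
# A factor column, as data.frame() produced by default before R 4.0.0
DF2f <- data.frame(column = c("C1", "C2"), stringsAsFactors = TRUE)
is.factor(DF2f$column)

# Stand-in object to look up by name
C1 <- c(0.2, 0.4)

# get() on a factor fails; capture the message instead of stopping the script
msg <- tryCatch(get(DF2f$column[1]), error = function(e) conditionMessage(e))
msg

# Converting to character first makes the lookup work
val <- get(as.character(DF2f$column[1]))
val
```

Either fix is enough on its own: build DF2 with stringsAsFactors = FALSE, or wrap the column in as.character() before passing it on.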

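【Solution 3】:

If I've interpreted your question correctly, and what you want is the mean of each DF1 column after subsetting to the year range given in DF2, then the example below should do what you need: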

# get the column names from DF2$column
c_list <- as.character(DF2$column)

# for each column name in c_list, store the start and end
# year, and find the mean of the column subset by year range
ml <- do.call(rbind, lapply(seq_along(c_list), function(x) {

  start <- DF2[x, "start"]
  end   <- DF2[x, "end"]

  mean(DF1[DF1$year >= start & DF1$year <= end, c_list[x]])

}))

# join the means with DF2
DF2.out <- cbind(DF2, ml)

> DF2.out
  column start  end        ml
1     C1  2005 2012 0.5861268
2     C2  2001 2009 0.3942018
3     C3  2006 2011 0.5853924
4     C4  2005 2010 0.4904493
5     C5  2009 2012 0.6783216

【Comments】:

【Solution 4】:

Another attempt using mapply, which should be fast because it is just some matrix indexing and selection:

column <- match(DF2$column, names(DF1) )
start  <- match(DF2$start, DF1$year)
end    <- match(DF2$end, DF1$year)

m1 <- mapply(
  function(r1,r2,co) mean(DF1[cbind(seq(r1,r2), co)]),
  start,
  end,
  column 
)

data.frame(
  column=names(DF1)[column], 
  start=DF1$year[start],
  end=DF1$year[end],
  m1
)

#  column start  end        m1
#1     C1  2005 2012 0.5861268
#2     C2  2001 2009 0.3942018
#3     C3  2006 2011 0.5853924
#4     C4  2005 2010 0.4904493
#5     C5  2009 2012 0.6783216
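The question also asks for NA when a requested year is not present in DF1. With the match()-based approach, a missing year makes match() return NA, and seq() then stops with an error rather than yielding NA. A minimal sketch of one way to handle that case follows. The `ranges` data frame and the helper `mean_or_na` are hypothetical, and the cut-down DF1 uses only two columns for brevity, so its values differ from the question's data.

```r
# Cut-down stand-in for DF1 (two columns only, for brevity)
set.seed(21)
DF1 <- data.frame(year = seq(2000, 2012, by = 1),
                  C1 = runif(13, 0, 1),
                  C2 = runif(13, 0, 1))

# Hypothetical ranges: the second one starts before the first year in DF1
ranges <- data.frame(column = c("C1", "C2"),
                     start = c(2005, 1997),
                     end = c(2012, 2009),
                     stringsAsFactors = FALSE)

# Hypothetical helper: NA when either endpoint is missing from data$year,
# otherwise the mean over the inclusive year range
mean_or_na <- function(data, col.name, year.start, year.end) {
  if (!all(c(year.start, year.end) %in% data$year)) return(NA_real_)
  mean(data[data$year >= year.start & data$year <= year.end, col.name])
}

ranges$m1 <- mapply(mean_or_na,
                    col.name = ranges$column,
                    year.start = ranges$start,
                    year.end = ranges$end,
                    MoreArgs = list(data = DF1),
                    USE.NAMES = FALSE)
ranges
```

Returning NA_real_ rather than plain NA keeps the result column numeric even if every row happens to be out of range.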

【Comments】: