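Interpolation by group with more than one grouping variable: answers

【Question title】: Interpolation by group with more than one grouping variable
【Posted】: 2020-11-16 18:07:17
【Question】:

I am trying to linearly interpolate tree growth values that were measured at irregular intervals over a year and a half.

I want a daily linear interpolation of wood volume, grouped by tree, block, and genotype. However, something in my code is wrong. I tried both `do` and `mutate`, but neither worked. Can anyone help?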

bio2 <- read.xlsx("Cres_biomassa.xlsx", h=T, sheetName = "Original") 
str(bio2)
bio2$Block <- as.factor(bio2$Block)
bio2$Tree <- as.factor(bio2$Tree)
dput(bio2[1:10, ])

# structure(list(Date = structure(c(17537, 17593, 17628, 17656, 
# 17695, 17730, 17761, 17782, 17817, 17836), class = "Date"), Block = structure(c(1L, 
# 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1", "2", "3"
# ), class = "factor"), Gen = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
# 1L, 1L, 1L, 1L), .Label = c("G1", "G10"), class = "factor"), 
#    Tree = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
#    ), .Label = c("1", "2"), class = "factor"), Volume = c(12.0152502828382, 
#    121.168369070794, 324.280007440298, 522.317155691492, 684.262691983242, 
#    742.921025749914, 775.35053835085, 804.747031488978, 996.719631625931, 
#    1358.37974592578)), row.names = c(NA, 10L), class = "data.frame")

library(lubridate)
#Dates for daily interpolation:
Dates <- seq.Date(ymd("2018-01-06"), ymd("2019-04-06"), by = 1)

test1 <- bio2 %>%
  group_by(Block, Gen, Tree) %>%
  mutate(ApproxFun <- approxfun(x = bio2$Date, y = bio2$Volume)
         LinearFit <- ApproxFun(Dates))

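This sample has two genotypes (Gen), two trees (Tree), and three blocks (Block).

【Comments】:

Could you make your question reproducible by posting some sample data? dput() is the best way to do that, because its output can be copied and pasted and it preserves class and structure information. For example, dput(bio2[1:10, ]) gives the first 10 rows. Choose a suitable subset, say 10 rows with a few missing values in each of 2 groups.
Thanks for your comment. I have added what you asked for.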

Tags: r dplyr interpolation linear-programming

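【Solution 1】:

The main problem in your code is that approxfun() returns a function, and you cannot store a function directly in an ordinary data frame column. There is a workaround, though: you can store the functions in a list column of the data frame.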
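To make that point concrete, here is a minimal sketch. The numbers and the names `f` and `tbl` are made up for illustration and are not part of the question's data. It shows that approxfun() hands back a function, and that such a function can be kept in a list column of a tibble and called later.

```r
library(tibble)

# approxfun() returns a function that interpolates between the given points
f <- approxfun(x = c(1, 3), y = c(10, 30))
is.function(f)
f(2)  # halfway between x = 1 and x = 3

# A function cannot be an atomic column, but it can sit inside a list column
tbl <- tibble(id = 1, fun = list(f))

# Pull the function back out of the list column and call it
tbl$fun[[1]](2)
```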

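(Also, inside mutate() you should use = rather than <-, you do not need to reference the bio2 object because grouped columns are available by name, and you need a comma between the two expressions in mutate().)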
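As a rough illustration of those three fixes, the sketch below uses a small made-up data frame (`toy`, with a single grouping column `Tree`) rather than bio2. It also sidesteps the storage problem by calling the function returned by approxfun() immediately, which is one possible shortcut and not necessarily what the original code intended.

```r
library(dplyr)

# Toy data (made-up values) with one missing Volume per tree
toy <- data.frame(
  Date   = as.Date(c("2018-01-01", "2018-01-03", "2018-01-05",
                     "2018-01-01", "2018-01-03", "2018-01-05")),
  Tree   = factor(c(1, 1, 1, 2, 2, 2)),
  Volume = c(10, NA, 30, 4, NA, 8)
)

filled <- toy %>%
  group_by(Tree) %>%
  # `=` instead of `<-`, bare column names instead of toy$Date / toy$Volume;
  # the fitted function is called right away, so only numbers are stored
  mutate(LinearFit = approxfun(x = as.numeric(Date),
                               y = Volume)(as.numeric(Date))) %>%
  ungroup()

filled$LinearFit
```

Because the columns are referenced by name inside the grouped mutate(), each tree is interpolated from its own rows only.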

You can use nesting to subset the data by group, and map() to return a list.

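library(dplyr)
library(tidyr)   # nest(), unnest()
library(purrr)   # map(), map2()

bio2 %>%
  group_by(Block, Gen, Tree) %>%
  nest(data = c(Date, Volume)) %>%                         # one row per group
  mutate(ApproxFun = map(data, approxfun),                 # list column of functions
         LinearFit = map2(ApproxFun, data, ~.x(.y$Date)))  # evaluate at each group's dates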

We can test the result by introducing some NA values:

bio2_na <- bio2
bio2_na[c(3,7),"Volume"] <- NA_real_


bio2_na %>%
  group_by(Block, Gen, Tree) %>%
  nest(data = c(Date, Volume)) %>%
  mutate(ApproxFun = map(data, approxfun),
         LinearFit = map2(ApproxFun, data, ~.x(.y$Date))) %>%
  unnest(c(data, LinearFit))

# A tibble: 10 x 7
# Groups:   Block, Gen, Tree [1]
#   Block Gen   Tree  Date       Volume ApproxFun LinearFit
#   <fct> <fct> <fct> <date>      <dbl> <list>        <dbl>
# 1 1     G1    1     2018-01-06   12.0 <fn>           12.0
# 2 1     G1    1     2018-03-03  121.  <fn>          121. 
# 3 1     G1    1     2018-04-07   NA   <fn>          344. 
# 4 1     G1    1     2018-05-05  522.  <fn>          522. 
# 5 1     G1    1     2018-06-13  684.  <fn>          684. 
# 6 1     G1    1     2018-07-18  743.  <fn>          743. 
# 7 1     G1    1     2018-08-18   NA   <fn>          780. 
# 8 1     G1    1     2018-09-08  805.  <fn>          805. 
# 9 1     G1    1     2018-10-13  997.  <fn>          997. 
#10 1     G1    1     2018-11-01 1358.  <fn>         1358.
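
The output above evaluates each fitted function only at the dates already present in the data. Since the question asks for daily values, one possible way to get them is sketched below. It uses made-up toy data and a short date range instead of bio2, and it swaps the nest()/map() approach for dplyr's group_modify() with approx(), so treat it as an assumption-laden sketch and not as the method used in this answer.

```r
library(dplyr)

# Toy data standing in for bio2 (values are made up for illustration)
toy <- data.frame(
  Date   = as.Date(c("2018-01-01", "2018-01-05", "2018-01-11",
                     "2018-01-01", "2018-01-03", "2018-01-11")),
  Block  = factor(c(1, 1, 1, 2, 2, 2)),
  Gen    = factor(rep("G1", 6)),
  Tree   = factor(rep(1, 6)),
  Volume = c(10, 18, 30, 5, 9, 25)
)

# Daily grid on which every group is evaluated
Dates <- seq(as.Date("2018-01-01"), as.Date("2018-01-11"), by = "day")

daily <- toy %>%
  filter(!is.na(Volume)) %>%
  group_by(Block, Gen, Tree) %>%
  # For each group, return one row per day with the interpolated volume
  group_modify(~ data.frame(
    Date      = Dates,
    LinearFit = approx(x = as.numeric(.x$Date), y = .x$Volume,
                       xout = as.numeric(Dates))$y
  )) %>%
  ungroup()

head(daily)
```

Note that with the default rule = 1, approx() returns NA for any requested date outside a group's observed date range, so a daily grid that starts before a group's first measurement or ends after its last one will contain NA values for that group.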

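【Comments】:

Unfortunately the code did not run correctly. I added the dates I want to interpolate to the data.frame with Volume == "NA", but the interpolation produced NA values, and some values decrease and then increase over time, which is not correct, because growth always increases over time.
I'm not sure. I believe my answer does what the mutate() call in your question was meant to do, but I don't know what is causing the incorrect results with the full dataset.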