[Posted]: 2021-06-17 01:47:00
[Question]:
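I have a specific data-wrangling task in R: I need to split a predictor variable into overlapping "neighborhoods" (ranges) and fit a simple (bivariate) linear model within each "neighborhood" to obtain the fitted value associated with the middle predictor value in that "neighborhood". I am currently doing this as follows: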
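- I create a dummy variable for each "neighborhood" (one per column).
- I apply the lm() function to a subset of the data, i.e. the rows where the dummy variable equals 1, excluding the rows where it equals 0.
- I extract the fitted value associated with the middle predictor value in each "neighborhood".
- I end up with a vector of fitted values whose length equals the number of "neighborhoods".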
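My approach works well when the number of overlapping neighborhoods is small. When the number is large, it becomes quite verbose. Here is a reproducible example (using mock data I created; in this case the number of neighborhoods is 7):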
library(dplyr)  # provides %>%, mutate(), between(), filter(); also re-exports tibble()
# Mock data
data <- tibble(y = as.integer(rnorm(10, mean = 100, sd = 20)), x = seq.int(0, 9))
# Create dummies
new_data <- data %>%
  mutate(neighborhood1 = ifelse(between(x, 0, 2.5), 1, 0),
         neighborhood2 = ifelse(between(x, 0.5, 3.5), 1, 0),
         neighborhood3 = ifelse(between(x, 1.5, 4.5), 1, 0),
         neighborhood4 = ifelse(between(x, 2.5, 5.5), 1, 0),
         neighborhood5 = ifelse(between(x, 3.5, 6.5), 1, 0),
         neighborhood6 = ifelse(between(x, 4.5, 7.5), 1, 0),
         neighborhood7 = ifelse(between(x, 5.5, 8.5), 1, 0))
# Run regression model on subsets of data
# Obtain fitted value Y at the middle X
# (in this example there are three obs per neighborhood and so we want the middle fitted value # 2)
Y_hat_1 <- lm(y ~ x,
              data = filter(.data = new_data,
                            neighborhood1 == 1))[["fitted.values"]][[2]]
Y_hat_2 <- lm(y ~ x,
              data = filter(.data = new_data,
                            neighborhood2 == 1))[["fitted.values"]][[2]]
Y_hat_3 <- lm(y ~ x,
              data = filter(.data = new_data,
                            neighborhood3 == 1))[["fitted.values"]][[2]]
Y_hat_4 <- lm(y ~ x,
              data = filter(.data = new_data,
                            neighborhood4 == 1))[["fitted.values"]][[2]]
Y_hat_5 <- lm(y ~ x,
              data = filter(.data = new_data,
                            neighborhood5 == 1))[["fitted.values"]][[2]]
Y_hat_6 <- lm(y ~ x,
              data = filter(.data = new_data,
                            neighborhood6 == 1))[["fitted.values"]][[2]]
Y_hat_7 <- lm(y ~ x,
              data = filter(.data = new_data,
                            neighborhood7 == 1))[["fitted.values"]][[2]]
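I have been wondering whether there is a more efficient way to handle this task (perhaps nested data frames, a loop, or some dplyr or data.table function would make it easier). Any suggestions would be very helpful. Apologies for the rather lengthy question; I was trying to be specific. Many thanks!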
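One possible way to avoid the seven near-identical calls is to store the neighborhood limits in a small table and loop over its rows. The following is only a minimal base R sketch under stated assumptions: `rng` and `fit_middle()` are hypothetical names introduced here, the seed is added for reproducibility, and for a neighborhood with an even number of rows the lower of the two middle observations is taken.

```r
# Mock data as in the question (set.seed() is an added assumption)
set.seed(1)
data <- data.frame(y = as.integer(rnorm(10, mean = 100, sd = 20)), x = 0:9)

# Hypothetical table of limits: one row per neighborhood, same ranges as above
rng <- data.frame(from = c(0, seq(0.5, 5.5, by = 1)),
                  to   = seq(2.5, 8.5, by = 1))

# Fit y ~ x inside one neighborhood and return the fitted value at the middle x
fit_middle <- function(from, to, df) {
  sub <- df[df$x >= from & df$x <= to, ]
  sub <- sub[order(sub$x), ]                   # make "middle" mean middle in x
  fit <- lm(y ~ x, data = sub)
  unname(fitted(fit)[ceiling(nrow(sub) / 2)])  # lower middle if nrow is even
}

# One fitted value per neighborhood, returned as a plain numeric vector
Y_hat <- mapply(fit_middle, rng$from, rng$to, MoreArgs = list(df = data))
```

With this layout, adding more neighborhoods only means adding rows to `rng`; the fitting code stays the same.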
[Comments]:
- Another option: if you put all the between limits in a data.table called "rng", with "from" and "to" columns, you can do a non-equi join with "data" and run the model for each match in i (by = .EACHI): setDT(data)[rng, on = .(x >= from, x <= to), lm(y ~ x.x, data = .SD)[["fitted.values"]][[2]], by = .EACHI]
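The comment's non-equi join idea could be spelled out roughly as follows. This is a sketch, not a tested reference solution: it assumes the data.table package is installed, builds the hypothetical `rng` table from the ranges used in the question, stores `x` as double (my assumption, so the join columns share a type with `from` and `to`), and picks the middle row by sorting on `x` instead of hard-coding index 2.

```r
library(data.table)

# Mock data as in the question (seed added; x stored as double by assumption)
set.seed(1)
data <- data.table(y = as.integer(rnorm(10, mean = 100, sd = 20)),
                   x = as.numeric(0:9))

# Limits table named "rng" with "from" and "to" columns, as the comment suggests
rng <- data.table(from = c(0, seq(0.5, 5.5, by = 1)),
                  to   = seq(2.5, 8.5, by = 1))

# Non-equi join: for each row of rng, j runs on the rows of data whose x
# falls in [from, to]; x.x refers to the original x values of data
res <- data[rng, on = .(x >= from, x <= to),
            {
              d   <- data.frame(y = y, xx = x.x)
              fit <- lm(y ~ xx, data = d)
              mid <- order(d$xx)[ceiling(nrow(d) / 2)]  # row holding the middle x
              .(Y_hat = fitted(fit)[[mid]])
            },
            by = .EACHI]

res$Y_hat  # one fitted value per neighborhood
```

In the result, the two join columns both carry the name `x` and hold the `from` and `to` limits, so the fitted values are easiest to pick out by the `Y_hat` column name.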
Tags: r dplyr statistics data.table data-wrangling