Linear regression on dynamic groups in R: answers

【Question title】: Linear regression on dynamic groups in R
【Posted】: 2022-01-16 07:43:51
【Question description】:

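I have a data.table, data_dt, on which I want to run a linear regression so that the user can choose the number of columns in groups G1 and G2 via the variable n_col. The code below works correctly, but it is slow because of the extra time spent creating the matrices. To improve its performance, is there a way to drop Steps 1, 2, and 3 altogether by adjusting the formula passed to lm, while still getting the same results?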

library(timeSeries)
library(data.table)
data_dt = as.data.table(LPP2005REC[, -1])
n_col = 3 # Choose a number from 1 to 3
######### Step 1 ######### Create the dependent variable (response)
xx <- as.matrix(data_dt[, "SPI"])
######### Step 2 ######### Create Group 1 of independent variables (predictors)
G1 <- as.matrix(data_dt[, .SD, .SDcols=c(1:n_col + 2)])
######### Step 3 ######### Create Group 2 of independent variables (predictors)
G2 <- as.matrix(data_dt[, .SD, .SDcols=c(1:n_col + 2 + n_col)])
lm(xx ~ G1 + G2)

Result:

summary(lm(xx ~ G1 + G2))
Call:
lm(formula = xx ~ G1 + G2)

Residuals:
       Min         1Q     Median         3Q        Max 
-3.763e-07 -4.130e-09  3.000e-09  9.840e-09  4.401e-07 

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -4.931e-09  3.038e-09 -1.623e+00   0.1054    
G1LMI       -5.000e-01  4.083e-06 -1.225e+05   <2e-16 ***
G1MPI       -2.000e+00  4.014e-06 -4.982e+05   <2e-16 ***
G1ALT       -1.500e+00  5.556e-06 -2.700e+05   <2e-16 ***
G2LPP25      3.071e-04  1.407e-04  2.184e+00   0.0296 *  
G2LPP40     -5.001e+00  2.360e-04 -2.119e+04   <2e-16 ***
G2LPP60      1.000e+01  8.704e-05  1.149e+05   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.762e-08 on 370 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.104e+12 on 6 and 370 DF,  p-value: < 2.2e-16

【Comments】:

Do you need the G1, G2 prefixes on the predictors?
No, the prefixes are not needed.

Tags: r data.table linear-regression

【Solution 1】:

It may be easier to build the formula with reformulate and pass the data.table to lm directly:

out <- lm(reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col + 2 + n_col)], 
     response = 'SPI'), data = data_dt)

Check:

> summary(out)

Call:
lm(formula = reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col + 
    2 + n_col)], response = "SPI"), data = data_dt)

Residuals:
       Min         1Q     Median         3Q        Max 
-3.763e-07 -4.130e-09  3.000e-09  9.840e-09  4.401e-07 

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -4.931e-09  3.038e-09 -1.623e+00   0.1054    
LMI         -5.000e-01  4.083e-06 -1.225e+05   <2e-16 ***
MPI         -2.000e+00  4.014e-06 -4.982e+05   <2e-16 ***
ALT         -1.500e+00  5.556e-06 -2.700e+05   <2e-16 ***
LPP25        3.071e-04  1.407e-04  2.184e+00   0.0296 *  
LPP40       -5.001e+00  2.360e-04 -2.119e+04   <2e-16 ***
LPP60        1.000e+01  8.704e-05  1.149e+05   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.762e-08 on 370 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.104e+12 on 6 and 370 DF,  p-value: < 2.2e-16
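
To make the mechanics of the answer concrete, here is a minimal sketch of what reformulate builds and how it relates to the matrix approach from the question. It is only an illustration: demo_df is simulated data (not LPP2005REC), its column names merely resemble those in the question, and a plain data.frame is used so the example runs without extra packages.

```r
# Simulated stand-in for data_dt (hypothetical data, NOT LPP2005REC).
set.seed(1)
n <- 200
demo_df <- as.data.frame(matrix(rnorm(n * 8), ncol = 8))
names(demo_df) <- c("SPI", "SII", "LMI", "MPI", "ALT", "LPP25", "LPP40", "LPP60")

n_col <- 2  # any value from 1 to 3

# Column positions of the two predictor groups, computed as in the question
idx <- c(1:n_col + 2, 1:n_col + 2 + n_col)

# reformulate() turns a character vector of term labels into a formula
f <- reformulate(names(demo_df)[idx], response = "SPI")
print(f)
# SPI ~ LMI + MPI + ALT + LPP25

fit_formula <- lm(f, data = demo_df)

# The matrix-based approach from the question, for comparison
xx <- as.matrix(demo_df[, "SPI", drop = FALSE])
G1 <- as.matrix(demo_df[, 1:n_col + 2, drop = FALSE])
G2 <- as.matrix(demo_df[, 1:n_col + 2 + n_col, drop = FALSE])
fit_matrix <- lm(xx ~ G1 + G2)

# Same estimates; only the coefficient names differ (no G1/G2 prefixes)
same_coefs <- all.equal(as.numeric(coef(fit_formula)),
                        as.numeric(coef(fit_matrix)))
```

Both fits use the same design matrix, so the estimates agree; the formula version simply skips the three as.matrix steps and labels the coefficients with the bare column names.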

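【Comments】:

Thanks, @akrun. Your solution improved performance by 50%.
@Saurabh You could use fastlm, or flm from collapse, to speed things up further.
Yes, I tried those, but I later need to use the lm result in a Wald test, which requires the variance-covariance matrix. Unfortunately, fastlm and flm do not produce a vcov matrix.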
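
One possible workaround for the missing vcov, offered here as an assumption rather than something proposed in the thread: when a lower-level fitter returns the QR decomposition, the variance-covariance matrix can be rebuilt by hand as sigma^2 * (X'X)^-1. The sketch below uses base R's lm.fit on simulated data (demo_df and its columns are hypothetical) and assumes a full-rank design, so no column pivoting occurs. Whether this is actually faster than lm for the data in the question has not been measured here.

```r
# Simulated data (hypothetical; stands in for the real data set)
set.seed(42)
n <- 150
demo_df <- data.frame(y = rnorm(n), a = rnorm(n), b = rnorm(n), c = rnorm(n))

# Design matrix with an intercept column, then a bare least-squares fit
X <- model.matrix(y ~ a + b + c, data = demo_df)
fit <- lm.fit(X, demo_df$y)

# (X'X)^-1 is recovered from the upper-triangular R factor of the QR
p <- fit$rank
R <- fit$qr$qr[seq_len(p), seq_len(p), drop = FALSE]
XtX_inv <- chol2inv(R)

# Residual variance estimate: RSS / residual degrees of freedom
sigma2 <- sum(fit$residuals^2) / fit$df.residual

vcov_manual <- sigma2 * XtX_inv
dimnames(vcov_manual) <- list(colnames(X), colnames(X))

# Cross-check against the matrix that lm() itself reports
vcov_lm <- vcov(lm(y ~ a + b + c, data = demo_df))
vcov_matches <- all.equal(vcov_manual, vcov_lm, check.attributes = FALSE)
```

The resulting vcov_manual, together with fit$coefficients, could then be handed to a Wald test routine that accepts a coefficient vector and a covariance matrix.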