Daily correlation in R: answers

[Question title]: Daily correlation in R
[Posted]: 2021-03-23 12:43:56
[Question description]:

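I would like to calculate daily correlation coefficients and p-values between several variables. The data has a resolution of 1 hour and covers 5 months of measurements. My data set looks like this: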

DateTime          Variable 1   Variable 2  Variable 3
[POSIXct]           [num]        [num]       [num]
2019-05-01 00:45             
2019-05-01 01:45
2019-05-01 02:45
2019-05-01 03:45
...

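I want to calculate the correlation between Variable 1 and Variable 2, and between Variable 1 and Variable 3. The results (correlation coefficient and p-value) should be stored in a data frame.

So far I have been able to calculate and store the correlation and p-value for the whole period (one correlation coefficient for my entire data set), using the following code: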

b <- as.data.frame(matrix(NA, ncol=3))
colnames(b) <- c("Variable", "estimate", "p.value")

for (i in 2:3) {
  a <- cor.test(df$`Variable 1`, df[, i], method = "kendall")
  b[i,] <- cbind(colnames(df)[i], a$estimate, a$p.value)
}

How can I calculate the correlation per day, so that I get a correlation value for every day in the data set? Any help would be much appreciated!

[Question discussion]:

Tags: r

[Solution 1]:

If you are open to using data.table, this is the data.table-speak equivalent of @bouncyball's excellent dplyr suggestion:

library(data.table)
DT <- as.data.table(mtcars)
DT[, c("est","pv") := cor.test(disp, hp)[c("estimate","p.value")], by =.(cyl)]

head(DT)
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb        est        pv
#    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>      <num>     <num>
# 1:  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4 -0.5136284 0.2383485
# 2:  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4 -0.5136284 0.2383485
# 3:  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1  0.4346051 0.1816262
# 4:  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1 -0.5136284 0.2383485
# 5:  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2  0.1182556 0.6872155
# 6:  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1 -0.5136284 0.2383485

With this technique (and bouncyball's), you need to handle the variable pairs by hand:

DT <- as.data.table(mtcars)
DT[, c("est1","pv1") := cor.test(disp, hp)[c("estimate","p.value")], by =.(cyl)]
DT[, c("est2","pv2") := cor.test(disp, mpg)[c("estimate","p.value")], by =.(cyl)]
DT[, c("est3","pv3") := cor.test(hp, mpg)[c("estimate","p.value")], by =.(cyl)]

head(DT)
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb       est1       pv1       est2         pv2       est3        pv3
#    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>      <num>     <num>      <num>       <num>      <num>      <num>
# 1:  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4 -0.5136284 0.2383485  0.1030827 0.825929685 -0.1270678 0.78602021
# 2:  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4 -0.5136284 0.2383485  0.1030827 0.825929685 -0.1270678 0.78602021
# 3:  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1  0.4346051 0.1816262 -0.8052361 0.002782827 -0.5235034 0.09839858
# 4:  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1 -0.5136284 0.2383485  0.1030827 0.825929685 -0.1270678 0.78602021
# 5:  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2  0.1182556 0.6872155 -0.5197670 0.056774876 -0.2836357 0.32575378
# 6:  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1 -0.5136284 0.2383485  0.1030827 0.825929685 -0.1270678 0.78602021
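
To connect this back to the hourly data in the question, here is a minimal sketch of the same grouped cor.test, grouped by calendar day instead of cyl. The data is synthetic, and the column names V1, V2, V3 and Day are assumptions made for illustration (the question's real columns are named "Variable 1" and so on). Looping over the partner columns and stacking the results is one way to avoid writing each pair out by hand.

```r
library(data.table)

# Synthetic stand-in for the asker's data (assumption): 3 days of hourly values.
set.seed(1)
n <- 72
dt <- data.table(
  DateTime = seq(as.POSIXct("2019-05-01 00:45", tz = "UTC"),
                 by = "1 hour", length.out = n),
  V1 = rnorm(n),
  V3 = rnorm(n)
)
dt[, V2 := V1 + rnorm(.N)]

# Derive the calendar day that each timestamp belongs to.
dt[, Day := as.Date(DateTime, tz = "UTC")]

# For each partner column, run one Kendall test per day against V1,
# then stack the per-column results into a single long table.
partners <- c("V2", "V3")
daily <- rbindlist(lapply(partners, function(v) {
  dt[, {
    ct <- cor.test(V1, .SD[[1]], method = "kendall")
    .(Variable = v, estimate = unname(ct$estimate), p.value = ct$p.value)
  }, by = Day, .SDcols = v]
}))

daily
```

The result has one row per day and partner variable, which is close to the "Variable / estimate / p.value" layout the question builds in its loop, with an extra Day column.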

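[Discussion]:

[Solution 2]:

Without a data sample it is hard to reproduce your exact problem, but here is an example that uses the tidyverse to group the mtcars data set and run the test per group.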

library(tidyverse)

mtcars %>%
    select(cyl, disp, hp) %>%
    group_by(cyl) %>% 
    summarise(disp_hp_cor_test = list(cor.test(disp, hp))) %>%
    mutate(disp_hp_cor = unlist(map(disp_hp_cor_test, "estimate")),
           disp_hp_pval = unlist(map(disp_hp_cor_test, "p.value")))

#     cyl disp_hp_cor_test disp_hp_cor disp_hp_pval
#   <dbl> <list>                 <dbl>        <dbl>
# 1     4 <htest>                0.435        0.182
# 2     6 <htest>               -0.514        0.238
# 3     8 <htest>                0.118        0.687

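The main idea is that you are trying to run a grouped correlation test, which can be done with group_by and summarise from dplyr. After that, a little work in mutate extracts the measures we care about (the correlation and the p-value).

Another, possibly cleaner, option is to use across. It also makes it easier to handle multiple cor.test results in the same data.frame: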

mtcars %>%
    select(cyl, disp, hp, wt) %>%
    group_by(cyl) %>% 
    summarise(disp_hp_cor_test = list(cor.test(disp, hp)),
              disp_wt_cor_test = list(cor.test(disp, wt))) %>%
    mutate(across(c(disp_hp_cor_test, disp_wt_cor_test), 
                  list("estimate" = ~unlist(map(.x, "estimate")),
                       "p.value" = ~unlist(map(.x, "p.value")))))

#     cyl disp_hp_cor_test disp_wt_cor_test disp_hp_cor_test_e~ disp_hp_cor_test_~ disp_wt_cor_test_~ disp_wt_cor_test~
#   <dbl> <list>           <list>                         <dbl>              <dbl>              <dbl>             <dbl>
# 1     4 <htest>          <htest>                        0.435              0.182              0.857          0.000761
# 2     6 <htest>          <htest>                       -0.514              0.238              0.473          0.284   
# 3     8 <htest>          <htest>                        0.118              0.687              0.755          0.00179
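
The grouping idea above carries over to the per-day case in the question once a day column is derived from the timestamp. The following is only a sketch on synthetic data: the column names V1, V2, V3 and Day are assumptions, and Kendall's method is used because the question's own code uses it.

```r
library(dplyr)

# Synthetic stand-in for the asker's data (assumption): 3 days of hourly values.
set.seed(42)
n <- 72
df <- data.frame(
  DateTime = seq(as.POSIXct("2019-05-01 00:45", tz = "UTC"),
                 by = "1 hour", length.out = n),
  V1 = rnorm(n)
)
df$V2 <- df$V1 + rnorm(n)
df$V3 <- rnorm(n)

daily <- df %>%
  mutate(Day = as.Date(DateTime, tz = "UTC")) %>%   # calendar day of each row
  group_by(Day) %>%
  summarise(
    test_v1_v2 = list(cor.test(V1, V2, method = "kendall")),
    test_v1_v3 = list(cor.test(V1, V3, method = "kendall"))
  ) %>%
  mutate(
    # Pull the two numbers of interest out of each stored htest object.
    est_v1_v2 = sapply(test_v1_v2, function(x) unname(x$estimate)),
    pv_v1_v2  = sapply(test_v1_v2, function(x) x$p.value),
    est_v1_v3 = sapply(test_v1_v3, function(x) unname(x$estimate)),
    pv_v1_v3  = sapply(test_v1_v3, function(x) x$p.value)
  ) %>%
  select(-starts_with("test_"))

daily
```

This yields one row per day with an estimate and a p-value for each variable pair.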

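[Discussion]:

bouncyball, I was working on something very similar. The list-column approach for cor.test is something I had momentarily forgotten, but is there no way for mutate/summarize to assign two values from a function? The data.table analog would be DT[, c("est","pv") := cor.test(disp, hp)[c("estimate","p.value")], by =.(cyl)].
I would probably use tidy from the broom package to extract multiple columns from cor.test at once.
To actually answer your question, I don't think there is a good way to do that: github.com/tidyverse/dplyr/issues/5494
Actually ... that works. I didn't believe it until I tried it. mtcars %>% group_by(cyl) %>% mutate(as.data.frame(cor.test(disp,hp)[c("estimate","p.value")])) works. I learned something new today, thanks!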
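
The trick in the last comment relies on dplyr unpacking an unnamed data frame result into separate columns. A small sketch of the same idea with summarise, so that each group gives one row, might look like this. The helper name cor_cols is hypothetical and introduced only for illustration; it assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical helper (not part of any package): run cor.test and return
# the estimate and p-value as a one-row data frame.
cor_cols <- function(x, y, ...) {
  ct <- cor.test(x, y, ...)
  data.frame(estimate = unname(ct$estimate), p.value = ct$p.value)
}

# Because the data frame is passed unnamed, summarise spreads its columns
# into the result: one row per cyl with estimate and p.value.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(cor_cols(disp, hp))

by_cyl
```

Extra arguments such as method = "kendall" pass through the dots to cor.test, so the same helper could serve the per-day grouping from the question.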