Answers to: R - Find correlation between multiple rows considering overlapping dates

【Question title】: R - Find correlation between multiple rows considering overlapping dates
【Posted】: 2020-03-02 14:30:04
【Question】:

I have a data table containing rainfall measurements from multiple rain gauges. Here is a sample of my data set:

library(data.table)
dat <- fread("https://www.dropbox.com/s/yub3db3739d80h2/dat.csv?dl=1")
> dat
         ID       date value
    1:  937 2000-01-01  14.2
    2:  937 2000-01-02  68.3
    3:  937 2000-01-03  28.4
    4:  937 2000-01-04  30.2
    5:  937 2000-01-05  12.8
   ---                      
33905: 1600 2017-06-12   0.1
33906: 1600 2017-06-13  36.1
33907: 1600 2017-06-14   0.3
33908: 1600 2017-06-15   0.0
33909: 1600 2017-06-16   0.0

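I also have a data table that lists, for each gauge ID, the IDs of its nearest gauges together with the date range over which their rainfall measurements overlap: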

neighbors <- fread("https://www.dropbox.com/s/phhskbhxsxmrxy1/neighbours.csv?dl=1")
> neighbors
      ID ID_nearest common_date_begin common_date_end diff_days
 1:    1       1117        2000-03-01      2006-12-03      2468
 2:    1        920        2000-03-01      2004-11-04      1709
 3: 1000         48        2000-03-01      2006-12-03      2468
 4: 1000       1600        2000-03-01      2017-06-16      6316
 5: 1000        937        2000-03-01      2017-01-22      6171
 6: 1001        352        2007-07-10      2017-06-16      3629
 7: 1001        324        2007-07-10      2017-06-16      3629
 8: 1002       1338        2006-01-01      2017-06-16      4184
 9: 1002        412        2006-01-01      2009-07-12      1288
10: 1002       1330        2006-01-01      2017-06-16      4184
11: 1002       1349        2006-01-01      2017-06-16      4184
12: 1009        801        2006-01-01      2017-01-22      4039

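For example, gauge ID 1 has two nearest neighbors: IDs 1117 and 920. The overlapping measurement period for stations 1 and 1117 runs from 2000-03-01 to 2006-12-03.

For each such combination in neighbors, I need to compute the correlation between the rainfall measurements of the main gauge and those of the surrounding gauge over the overlapping dates.

For example, the correlations for the first two pairs could be computed like this: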

cor(dat[ID==1 & date %between% c("2000-03-01", "2006-12-03")]$value,
    dat[ID==1117 & date %between% c("2000-03-01", "2006-12-03")]$value)

cor(dat[ID==1 & date %between% c("2000-03-01", "2004-11-04")]$value,
    dat[ID==920 & date %between% c("2000-03-01", "2004-11-04")]$value)

The expected output would look something like this:

  ID ID_nearest correlation    n
   1       1117        0.55 2468
   1        920        0.48 1709
1000         48        0.77 2468
1000       1600        0.52 6316
1000        937        0.84 6171

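And so on for every ID in neighbors.

But I am struggling to come up with a programmatic way to do this.

How can I do it? Thanks in advance.

【Comments】:

Why not use a loop?
@nigelhenry Because they are generally very slow and inefficient in R.
...but you accepted an answer that uses a simple loop
Well, it was the answer that actually worked, and I really need to finish this analysis to move on with my research.

Tags: r data.table correlation

【Solution 1】:

Try this: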


library(data.table)
dat <- fread("https://www.dropbox.com/s/yub3db3739d80h2/dat.csv?dl=1")
neighbors <- fread("https://www.dropbox.com/s/phhskbhxsxmrxy1/neighbours.csv?dl=1")

# Keep ID, ID_nearest and diff_days; the correlations are added at the end
results <- neighbors[, -c(3:4)]

# One correlation per row of neighbors
correlations <- matrix(NA, nrow = nrow(neighbors), ncol = 1)

ids <- unique(neighbors$ID)

# Running row index into correlations; this assumes that rows sharing
# the same ID are contiguous in neighbors
x <- 1

for (i in ids) {
  temp <- neighbors[ID == i]   # all neighbors of gauge i

  for (id in seq_len(nrow(temp))) {
    near_id  <- as.numeric(temp[id, 2])
    beg_date <- temp[id, 3]
    end_date <- temp[id, 4]

    # cor() pairs values by position, so both gauges need the same dates in the window
    correlations[x, 1] <- cor(dat[ID == i & date %between% c(beg_date, end_date)]$value,
                              dat[ID == near_id & date %between% c(beg_date, end_date)]$value)
    x <- x + 1
  }
}

results <- cbind(results[, 1], results[, 2], correlations, results[, 3])
colnames(results) <- c("ID", "ID_nearest", "correlation", "n")
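
A possible variation on the loop above, as a minimal sketch rather than a drop-in replacement: it joins the two gauges on date before calling cor(), so observations are paired by date instead of by position, and it reports the number of shared dates as n. The toy data and the helper name pair_cor are made up for illustration; the real dat and neighbors are assumed to have the same column names and a proper date type.

```r
library(data.table)

# Toy data (made up): gauge 1 observed on days 1-5, gauge 2 on days 3-7
dat <- data.table(
  ID    = rep(c(1L, 2L), each = 5L),
  date  = as.IDate("2000-01-01") + c(0:4, 2:6),
  value = c(1, 2, 3, 4, 5, 30, 40, 50, 10, 20)
)
neighbors <- data.table(
  ID = 1L, ID_nearest = 2L,
  common_date_begin = as.IDate("2000-01-03"),
  common_date_end   = as.IDate("2000-01-05")
)

# Hypothetical helper: correlation of two gauges over [begin, end],
# pairing observations by date rather than by position
pair_cor <- function(main_id, near_id, begin, end) {
  a <- dat[ID == main_id & date >= begin & date <= end, .(date, v1 = value)]
  b <- dat[ID == near_id & date >= begin & date <= end, .(date, v2 = value)]
  m <- merge(a, b, by = "date")   # inner join: keeps only dates present for both gauges
  list(correlation = cor(m$v1, m$v2), n = nrow(m))
}

# Call the helper once per row of neighbors (each group is a single pair)
results <- neighbors[,
  pair_cor(ID, ID_nearest, common_date_begin, common_date_end),
  by = .(ID, ID_nearest, common_date_begin, common_date_end)]
```

Because the merge drops dates that only one gauge has, n here counts shared observations and can be smaller than diff_days when either series has gaps.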

【Comments】:

Thanks for the suggestion. However, I get the following error during the loop: Error in .prepareFastSubset(isub = isub, x = x, enclos = parent.frame(), : RHS of == is length 5 which is not 1 or nrow (12). For robustness, no recycling is allowed (other than of length 1 RHS). Consider %in% instead. In addition: Warning messages: 1: In if (neighbors[loop, ] != i) { : the condition has length > 1 and only the first element will be used 2: In if (neighbors[loop, ] != i) { : the condition has length > 1 and only the first element will be used
The loop works for the subset of the data you provided here. I would need access to the full data to debug the error.
I disagree. I get the error message after loading the data with library(data.table) dat <- fread("https://www.dropbox.com/s/yub3db3739d80h2/dat.csv?dl=1") neighbors <- fread("https://www.dropbox.com/s/phhskbhxsxmrxy1/neighbours.csv?dl=1"). Maybe you changed the data variables in some way before running your code?
You are right, I have edited my answer. It should work now.
Thanks, it works now! It is a bit slow, but it definitely produces the output I expected.

【Solution 2】:

Here is one approach:

df <- do.call(rbind, lapply(unique(neighbors$ID), function(id) {
    # Rows of neighbors that belong to this main gauge
    d <- neighbors[neighbors$ID %in% id, ]
    # Values of the main gauge over the widest window among its neighbors
    main.vals <- dat %>%
        dplyr::filter(ID == id & (date >= min(d$common_date_begin) & date <= max(d$common_date_end))) %>%
        dplyr::select(value)
    main.vals <- main.vals$value
    # Values of each neighbor inside its own common window
    nearest.vals <- lapply(unique(d$ID_nearest), function(neigh.id) {
        r <- d[d$ID_nearest == neigh.id, ]
        vals <- dat[dat$ID == neigh.id & (dat$date >= r$common_date_begin & dat$date <= r$common_date_end), ]
        return(vals$value)
    })
    # cor() pairs values by position, so each neighbor must have as many
    # observations as the main gauge
    d <- d %>%
        dplyr::select(-c(common_date_begin, common_date_end)) %>%
        dplyr::mutate(correlation = sapply(nearest.vals, cor, y = main.vals),
                      n = diff_days)
    return(d)
}))
df
#   ID ID_nearest diff_days correlation    n
# 1  1       1117      2468    0.527024 2468
# 2  1        920      1709   -0.469635 1709

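We loop over each unique ID in the neighbors data, filter its values from the dat data.frame, then filter the values of each of its neighbors listed in the neighbors data.frame, and compute the correlation between the rainfall of the main ID and the rainfall of each neighbor ID.

I used the following data (modified so that the ID_nearest values also appear in dat):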

library(dplyr)
library(magrittr)

dat <- read.table(text = "
    1   2000-03-01  55.3
    1   2000-03-02  55.6
    1   2005-03-03  48.3
    920 2000-03-01  14.2
    920 2000-04-02  68.3
    920 2000-04-03  68.4
    1117 2003-03-01   0.1
    1117 2003-06-13  36.1
    1117 2003-06-14   0.3
", col.names = c("ID", "date", "value"))
dat$date <- as.POSIXct(dat$date)

neighbors <- read.table(text = "
  ID ID_nearest common_date_begin common_date_end diff_days
   1       1117        2000-03-01      2006-12-03      2468
   1        920        2000-03-01      2004-11-04      1709
", header = TRUE)
neighbors$common_date_begin <- as.POSIXct(neighbors$common_date_begin)
neighbors$common_date_end <- as.POSIXct(neighbors$common_date_end)

【Comments】:

【Solution 3】:

You could try this first:

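# Non-equi join: for each row of neighbors, collect the main gauge's
# observations that fall inside the common window
DT <- rnfl[neighbors, on=.(ID, date>=common_date_begin, date<=common_date_end),
    c(mget(paste0("i.", names(neighbors))), .(date=x.date, v1=x.value)),
    by=.EACHI][, (1L:3L) := NULL]
# Strip the "i." prefix from the columns taken from neighbors
setnames(DT, names(DT), gsub("i.", "", names(DT), fixed=TRUE))

# Update join: attach the neighbor's value for the same date
DT[rnfl, on=.(ID_nearest=ID, date), v2 := value]
# One correlation per row of neighbors
DT[, .(correlation=cor(v1, v2)), names(neighbors)]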

If that is too slow, we can try other approaches.

Data, also taken from the previous question:

library(data.table)
rnfl <- data.table(ID=c(1,1,1,1,1,2,2,2,2,2),
    date=Sys.Date() + c(0:4, 2:6),
    value=c(17.6, 5.6, 4.5, 8.3, 11.7, 10.7, 15.6, 11.6, 8.3, 2.3))
near <- data.table(ID=1, ID_nearest=2)

summ <- rnfl[, .(startdate=date[1L], enddate=date[.N]),
    .(ID, g=cumsum(c(0L, diff(date)!=1L)))]

setkey(summ, startdate, enddate)
olap <- unique(foverlaps(summ, summ)[ID!=i.ID, .(
    ID1=pmin(ID, i.ID),
    ID2=pmax(ID, i.ID),
    common_date_begin=pmax(startdate, i.startdate),
    common_date_end=pmin(enddate, i.enddate))])

near[, c("ID1", "ID2") := .(pmin(ID, ID_nearest), pmax(ID, ID_nearest))]

cols <- c("common_date_begin", "common_date_end")
neighbors <- near[olap, on=.(ID1, ID2), (cols) := mget(paste0("i.", cols))][,
    n := as.integer(common_date_end - common_date_begin)]
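
To make the join-based idea easier to follow, here is a minimal end-to-end sketch on a tiny made-up data set. It is a simplified illustration under stated assumptions, not the exact code above: the toy values and the explicit column list in the first join are assumptions, and rows where the neighbor has no observation on a given date are dropped before computing the correlation.

```r
library(data.table)

# Toy data (made up): gauge 1 observed on days 1-5, gauge 2 on days 3-7
rnfl <- data.table(
  ID    = rep(c(1L, 2L), each = 5L),
  date  = as.IDate("2000-01-01") + c(0:4, 2:6),
  value = c(1, 2, 3, 4, 5, 30, 40, 50, 10, 20)
)
neighbors <- data.table(
  ID = 1L, ID_nearest = 2L,
  common_date_begin = as.IDate("2000-01-03"),
  common_date_end   = as.IDate("2000-01-05")
)

# Step 1: non-equi join. For every row of neighbors, pull the main gauge's
# observations that fall inside the common window.
DT <- rnfl[neighbors,
           on = .(ID, date >= common_date_begin, date <= common_date_end),
           .(ID = i.ID, ID_nearest = i.ID_nearest, date = x.date, v1 = x.value)]

# Step 2: update join. Look up the neighbor's value on the same date.
DT[rnfl, on = .(ID_nearest = ID, date), v2 := i.value]

# Step 3: one correlation per pair, using only dates present for both gauges
res <- DT[!is.na(v2),
          .(correlation = cor(v1, v2), n = .N),
          by = .(ID, ID_nearest)]
```

On real data the first join can produce a very large intermediate table, which is the kind of allocation the allow.cartesian error message warns about.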

【Comments】:

Thanks again for your help! On the first line, using my real data, I get the same error as before: Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 9024561 rows; more than 8660131 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
I know! My data set is challenging even for data.table! If you have access to your hotmail account, I could definitely send you the files that way instead of via Dropbox...
This is a memory issue, and I don't think it is caused by data.table. I have added another approach.
Nada. I get the same error message after every step :(.
@thiagoveloso, your rnfl.csv has no value column. Based on the error message, I added by=.EACHI to the first attempt.