【Question title】: For each group, for each week, find the sum of the observations in the previous X weeks in R
【Posted】: 2019-06-12 11:14:00
【Question】:

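For each group (individual_id) and each week_id, I want to count how many times the individual appeared in each city during the previous X weeks.

I have tried dplyr to no avail. I also tried a loop, but it takes a very long time on the dataset I am working with (about 250,000 observations of more than 1,000 individuals across 20 cities), especially when I want to look up appearances over the previous two years (i.e. X = 104 weeks).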

theDates = as.Date(c('07/05/2017','07/05/2017', '07/05/2017', '14/05/2017', '14/05/2017',
                     '21/05/2017','21/05/2017','21/05/2017', '28/05/2017', '04/06/2017', '04/06/2017', '04/06/2017', '11/06/2017',
                     '18/06/2017', '18/06/2017'), format='%d/%m/%Y')


someData = data.frame(individual_id = c(1,2,3,2,3,1,2,3,3,1,2,3,3,2,3), week_end_date=theDates, 
                      city=c('Chicago','Chicago','Chicago','Washington', 'Washington', 'Chicago','Chicago', 'Chicago','Washington',
                             'Washington', 'Washington','Washington','Chicago','Washington', 'Washington'))



someData$nChicagoAppearancesInLastXweeks = NA
someData$nWashingtonAppearancesInLastXweeks = NA

X = 4 # this is the number of weeks for the window length

someData$start_of_period_date = someData$week_end_date - 7*X  # this is the start of the range of dates to count appearances over

for (i in seq_len(nrow(someData))) {
  # all dates in the window [week_end_date - 7*X, week_end_date - 1] for row i
  WEEK_IDS = seq(someData$start_of_period_date[i], someData$week_end_date[i]-1, by='days')
  INDIVIDUAL_ID = someData$individual_id[i]

  someData$nChicagoAppearancesInLastXweeks[i] = with(someData, sum(city=='Chicago' & individual_id == INDIVIDUAL_ID & week_end_date %in% WEEK_IDS))

  someData$nWashingtonAppearancesInLastXweeks[i] = with(someData, sum(city=='Washington' & individual_id == INDIVIDUAL_ID & week_end_date %in% WEEK_IDS))
}

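The expected output is two new columns giving the number of times each individual_id appeared in each city during the previous X weeks. The loop code does this, but it is clearly not the best approach.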
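One way to avoid scanning the whole table for every row is to work one individual at a time: within an individual, the day gaps between all pairs of appearances form a small matrix, and a single matrix product turns that into per-city counts. The sketch below uses base R only. `count_prior_appearances` is a hypothetical helper name, and it assumes the columns `individual_id`, `week_end_date` (a `Date`) and `city` from the sample data.

```r
# Hypothetical helper: counts, for every row, the individual's appearances
# in each city during the previous X weeks (1 to 7 * X days earlier).
count_prior_appearances <- function(df, X) {
  city <- as.character(df$city)
  cities <- sort(unique(city))
  out <- matrix(0, nrow = nrow(df), ncol = length(cities),
                dimnames = list(NULL, cities))

  # Row indices grouped by individual, so each matrix stays small.
  rows_by_id <- split(seq_len(nrow(df)), df$individual_id)

  for (rows in rows_by_id) {
    d <- as.numeric(df$week_end_date[rows])
    # gap[i, j] = days from appearance j to appearance i
    gap <- outer(d, d, "-")
    in_window <- (gap >= 1 & gap <= 7 * X) + 0
    # is_city[j, k] = 1 if appearance j was in cities[k]
    is_city <- outer(city[rows], cities, "==") + 0
    out[rows, ] <- in_window %*% is_city
  }
  out
}
```

With the sample data, `cbind(someData, count_prior_appearances(someData, X = 4))` would append one count column per city. The sketch has not been timed on the full dataset; memory use grows with the square of the number of rows per individual.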

【Comments】:

Tags: r dplyr aggregate


【Solution 1】:

Perform a left join for each added column:

library(sqldf)

X <- 4
sql <- "select sum(not b.city is null)
  from someData a
  left join someData b on 
    b.city == '$lev' and 
    a.[individual_id] = b.[individual_id] and
    b.[week_end_date] between a.[week_end_date] - 7 * $X and a.[week_end_date] - 1
  group by a.rowid"

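# city is a character column (not a factor) under R >= 4.0, so levels() would return NULL
for(lev in sort(unique(as.character(someData$city)))) someData[lev] <- fn$sqldf(sql)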

giving:

> someData
   individual_id week_end_date       city Chicago Washington
1              1    2017-05-07    Chicago       0          0
2              2    2017-05-07    Chicago       0          0
3              3    2017-05-07    Chicago       0          0
4              2    2017-05-14 Washington       1          0
5              3    2017-05-14 Washington       1          0
6              1    2017-05-21    Chicago       1          0
7              2    2017-05-21    Chicago       1          1
8              3    2017-05-21    Chicago       1          1
9              3    2017-05-28 Washington       2          1
10             1    2017-06-04 Washington       2          0
11             2    2017-06-04 Washington       2          1
12             3    2017-06-04 Washington       2          2
13             3    2017-06-11    Chicago       1          3
14             2    2017-06-18 Washington       1          1
15             3    2017-06-18 Washington       2          2
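
The `fn$` prefix comes from the gsubfn package, which sqldf loads; it fills `$lev` and `$X` into the query text before the query is sent to the database. As a rough illustration of the text that results for one city, the sketch below mimics that substitution with base R `gsub`. `fill_template` is a hypothetical stand-in written for this example, not a function from sqldf or gsubfn.

```r
# Hypothetical stand-in for the $name substitution that fn$ performs.
fill_template <- function(template, values) {
  for (name in names(values)) {
    # fixed = TRUE treats "$" literally instead of as a regex anchor
    template <- gsub(paste0("$", name), as.character(values[[name]]),
                     template, fixed = TRUE)
  }
  template
}

# The join conditions from the query that depend on the city and window length.
clause <- "b.city == '$lev' and b.[week_end_date] between a.[week_end_date] - 7 * $X and a.[week_end_date] - 1"

filled <- fill_template(clause, list(lev = "Chicago", X = 4))
cat(filled, "\n")
```

Each pass of the `for` loop therefore runs the same left join with a different city literal, and the window bounds are plain day arithmetic on `week_end_date`.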

【Comments】:
