[Title]: Rolling conditional count in R
[Posted]: 2018-04-04 14:44:08
[Question]:

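I want to create a rolling function that conditionally counts occurrences across two columns in the preceding rows.

For example, I have a dataset like the one below.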

# Generate data
set.seed(123)
test <- data.frame(
  Round = rep(1:5, times = 3),
  Team = rep(c("Team 1", "Team 2", "Team 3"), each = 5),
  Venue = sample(sample(c("Venue A", "Venue B"), 15, replace = T))
)

   Round   Team   Venue
1      1 Team 1 Venue B
2      2 Team 1 Venue B
3      3 Team 1 Venue A
4      4 Team 1 Venue A
5      5 Team 1 Venue B
6      1 Team 2 Venue B
7      2 Team 2 Venue B
8      3 Team 2 Venue A
9      4 Team 2 Venue A
10     5 Team 2 Venue A
11     1 Team 3 Venue B
12     2 Team 3 Venue A
13     3 Team 3 Venue B
14     4 Team 3 Venue B
15     5 Team 3 Venue B

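I would like a new column that shows, for each row, how many times that row's team has played at that row's venue in the last 3 rounds.

I can do this easily with a for loop.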

window <- 3

# Pre-allocate the result column
test$VenueCount <- NA_integer_

for (i in seq_len(nrow(test))) {
  # Rows to search: the current row plus up to `window` rows before it
  index <- max(i - window, 1):i

  # Count rows in the window that match the current row's team and venue
  test$VenueCount[i] <- sum(test$Team[i] == test$Team[index] & test$Venue[i] == test$Venue[index])
}

   Round   Team   Venue VenueCount
1      1 Team 1 Venue B          1
2      2 Team 1 Venue B          2
3      3 Team 1 Venue A          1
4      4 Team 1 Venue A          2
5      5 Team 1 Venue B          2
6      1 Team 2 Venue B          1
7      2 Team 2 Venue B          2
8      3 Team 2 Venue A          1
9      4 Team 2 Venue A          2
10     5 Team 2 Venue A          3
11     1 Team 3 Venue B          1
12     2 Team 3 Venue A          1
13     3 Team 3 Venue B          2
14     4 Team 3 Venue B          3
15     5 Team 3 Venue B          3

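However, I would like to avoid the for loop (mainly because my real dataset is fairly large, about 30k rows). I think this should be doable with one of zoo, dplyr, purrr, or apply, but I have not been able to work it out.

Thanks

[Comments]:

    Tags: r dplyr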


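    [Solution 1]:

    Trying a data.table solution here. I will remove it if you are only looking for a dplyr solution.

    You can roll with a window of size 4 and then count the entries in each window that match its most recent row.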

    library(data.table)
    library(zoo)
    setDT(test)
    winsize <- 4
    test[, .(Round, 
            Venue, 
            VenueCount=rollapplyr(c(rep("", winsize-1), Venue), winsize, 
                function(x) sum(x==last(x)))), 
        by=.(Team)]
    

    Result:

    #       Team Round   Venue VenueCount
    #  1: Team 1     1 Venue B          1
    #  2: Team 1     2 Venue B          2
    #  3: Team 1     3 Venue A          1
    #  4: Team 1     4 Venue A          2
    #  5: Team 1     5 Venue B          2
    #  6: Team 2     1 Venue B          1
    #  7: Team 2     2 Venue B          2
    #  8: Team 2     3 Venue A          1
    #  9: Team 2     4 Venue A          2
    # 10: Team 2     5 Venue A          3
    # 11: Team 3     1 Venue B          1
    # 12: Team 3     2 Venue A          1
    # 13: Team 3     3 Venue B          2
    # 14: Team 3     4 Venue B          3
    # 15: Team 3     5 Venue B          3
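
To see why the left padding in the call above matters, here is a minimal base R sketch of the same idea for a single team, without zoo or data.table. The venue vector is copied by hand from Team 1 in the example data, and the `vapply` loop is only a stand-in for `rollapplyr`, not the answer's actual method.

```r
# Venues for a single team in round order (Team 1 from the example data)
venue <- c("Venue B", "Venue B", "Venue A", "Venue A", "Venue B")
winsize <- 4

# Left-pad with empty strings so that every original row has a full window
padded <- c(rep("", winsize - 1), venue)

# For each original row, take its window and count entries equal to the newest one
venue_count <- vapply(seq_along(venue), function(i) {
  win <- padded[i:(i + winsize - 1)]
  sum(win == win[winsize])
}, integer(1))

venue_count
# 1 2 1 2 2
```

Because the padding value `""` never equals a real venue name, the first rows are counted only against the rounds that actually exist.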
    

    [Comments]:

    • Or without zoo: test[, n := .SD[.(Team = Team, Venue = Venue, r_dn = Round - 3L, r_up = Round), on=.(Team, Venue, Round >= r_dn, Round <= r_up), .N, by=.EACHI]$N]
    • Thanks! I am not very familiar with data.table, but I am happy to leave it here for others.
    [Solution 2]:

    I actually came up with an answer using rollify from the tibbletime package together with dplyr::mutate. Posting it here, but still open to other replies!

    library(dplyr)
    library(tibbletime)
    
    # Create data
    set.seed(123)
    test <- data.frame(
      Round = rep(1:5, times = 3),
      Team = rep(c("Team 1", "Team 2", "Team 3"), each = 5),
      Venue = sample(sample(c("Venue A", "Venue B"), 15, replace = T))
    )
    

    Create a custom function with rollify.

    last_n_games = 3
    count_games <- rollify(function(x) sum(last(x) == x), window = last_n_games)
    

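    Now use mutate to run the function. This returns NA for the first 2 rows of each team (i.e. last_n_games - 1). I can then use group_by and row_number to fill in the counts for those first occurrences.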

    test <- test %>%
      group_by(Team) %>%
      mutate(VenueCount = count_games(Venue)) %>%
      group_by(Team, Venue) %>%
      mutate(VenueCount = ifelse(is.na(VenueCount), row_number(Team), VenueCount))
    

    This returns the following:

    # A tibble: 15 x 4
    # Groups:   Team, Venue [6]
       Round Team   Venue   VenueCount
       <int> <fct>  <fct>        <int>
     1     1 Team 1 Venue B          1
     2     2 Team 1 Venue B          2
     3     3 Team 1 Venue A          1
     4     4 Team 1 Venue A          2
     5     5 Team 1 Venue B          1
     6     1 Team 2 Venue B          1
     7     2 Team 2 Venue B          2
     8     3 Team 2 Venue A          1
     9     4 Team 2 Venue A          2
    10     5 Team 2 Venue A          3
    11     1 Team 3 Venue B          1
    12     2 Team 3 Venue A          1
    13     3 Team 3 Venue B          2
    14     4 Team 3 Venue B          2
    15     5 Team 3 Venue B          3
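
Rows 5 and 14 in the output above differ from the expected output in the question (1 vs 2, and 2 vs 3). This appears to come from the window width: a window of 3 covers the current row plus two earlier rows, while the loop in the question covers the current row plus three earlier rows. The following base R sketch compares both widths. The table is rebuilt by hand from the printed data so it does not depend on the random seed, and `count_in_window` is a hypothetical helper, not part of tibbletime.

```r
# Rebuild the example table by hand (values copied from the printed data)
test <- data.frame(
  Round = rep(1:5, times = 3),
  Team  = rep(c("Team 1", "Team 2", "Team 3"), each = 5),
  Venue = paste("Venue", c("B", "B", "A", "A", "B",
                           "B", "B", "A", "A", "A",
                           "B", "A", "B", "B", "B")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each element, count how often its value occurs among
# the element itself and up to (width - 1) elements before it
count_in_window <- function(x, width) {
  vapply(seq_along(x), function(i) {
    idx <- max(i - width + 1, 1):i
    sum(x[idx] == x[i])
  }, integer(1))
}

# ave() runs the helper within each team; rows are assumed to be sorted by Round
rows <- seq_len(nrow(test))
count3 <- ave(rows, test$Team, FUN = function(r) count_in_window(test$Venue[r], 3))
count4 <- ave(rows, test$Team, FUN = function(r) count_in_window(test$Venue[r], 4))

which(count3 != count4)
# 5 14
```

With width 3 the counts match this answer's output; with width 4 they match the expected output in the question, so setting `last_n_games` to 4 would presumably reproduce it.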
    

    [Comments]:

      [Solution 3]:

      I like using data.table: it is fast and versatile.

      The idea is to join twice, using two lagged copies of the data (Round + 1 and Round + 2). Here is what I did.

      > test1<-test
      > test2<-test
      > test<-as.data.table(test)
      > test1<-as.data.table(test1)
      > test2<-as.data.table(test2)
      

      After making the copies, convert these data.frames to data.tables.

      > test1[,Round:=Round+1,]
      > test2[,Round:=Round+2,]
      

      Shift the rounds to create the lags, then join everything together like this:

      > test2[test1,on=c('Round','Team')][test,on=c('Round','Team')]
          Round   Team   Venue i.Venue i.Venue.1
       1:     1 Team 1      NA      NA   Venue B
       2:     2 Team 1      NA Venue B   Venue B
       3:     3 Team 1 Venue B Venue B   Venue A
       4:     4 Team 1 Venue B Venue A   Venue A
       5:     5 Team 1 Venue A Venue A   Venue B
       6:     1 Team 2      NA      NA   Venue B
       7:     2 Team 2      NA Venue B   Venue B
       8:     3 Team 2 Venue B Venue B   Venue A
       9:     4 Team 2 Venue B Venue A   Venue A
      10:     5 Team 2 Venue A Venue A   Venue A
      11:     1 Team 3      NA      NA   Venue B
      12:     2 Team 3      NA Venue B   Venue A
      13:     3 Team 3 Venue B Venue A   Venue B
      14:     4 Team 3 Venue A Venue B   Venue B
      15:     5 Team 3 Venue B Venue B   Venue B
      

      Because this result contains many NAs, here we use the function from R-Cookbook.com that ben mentioned in his answer:

        compareNA <- function(v1,v2) {
          # This function returns TRUE wherever elements are the same, including NA's,
          # and false everywhere else.
          same <- (v1 == v2)  |  (is.na(v1) & is.na(v2))
          same[is.na(same)] <- FALSE
          return(same)
         }
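
A small illustration of why the helper is needed: plain `==` returns NA whenever either side is NA, and NA would then propagate into the sum. The vectors below are made up for the example; the function body is the one shown above.

```r
compareNA <- function(v1, v2) {
  # TRUE wherever elements are the same, including NA vs NA; FALSE everywhere else
  same <- (v1 == v2) | (is.na(v1) & is.na(v2))
  same[is.na(same)] <- FALSE
  return(same)
}

# Illustrative vectors (not taken from the dataset)
v1 <- c("Venue B", NA, "Venue A", NA)
v2 <- c("Venue B", "Venue B", "Venue B", NA)

v1 == v2
# TRUE NA FALSE NA

compareNA(v1, v2)
# TRUE FALSE FALSE TRUE
```

Note that the helper treats NA vs NA as a match. In the join result the current venue (i.Venue.1) is never NA, so that case does not arise there.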
      

      We can then get the final result:

       > end <-
            test2[test1, on = c('Round', 'Team')][test, on = c('Round', 
            'Team')][, VenueCount :=
            (1 + compareNA(i.Venue.1, i.Venue) + compareNA(i.Venue.1, Venue)), ]
      

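      Explanation: test2 is right-joined to test1 on Round and Team, and that result is right-joined to test on Round and Team, so you get:

      i.Venue.1: the team's current venue; i.Venue: the team's previous venue; Venue: the team's venue two rounds ago.

      With the logic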

      (1 + compareNA(i.Venue.1, i.Venue) + compareNA(i.Venue.1, Venue))

      you can count how many times the team has played at this venue in the last 3 rounds.
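
The same arithmetic can be sketched in base R for a single team, using shifted copies of the venue vector instead of joins. This is only an illustration under the assumption that rows are sorted by Round; `lag1`, `lag2` and `same` are hypothetical names, and the venues are copied by hand from Team 1.

```r
# Venues for a single team in round order (Team 1 from the example data)
venue <- c("Venue B", "Venue B", "Venue A", "Venue A", "Venue B")

# Shifted copies: the venue one round ago and two rounds ago (NA where no such round exists)
lag1 <- c(NA, head(venue, -1))
lag2 <- c(NA, NA, head(venue, -2))

# NA-safe equality: a missing earlier round never counts as a match
same <- function(a, b) !is.na(a) & !is.na(b) & a == b

# 1 for the current game, plus one for each of the two earlier rounds at the same venue
venue_count <- 1L + same(venue, lag1) + same(venue, lag2)

venue_count
# 1 2 1 2 1
```

These values agree with the VenueCount column for Team 1 in this answer's output.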

      > end
          Round   Team   Venue i.Venue i.Venue.1 VenueCount
       1:     1 Team 1      NA      NA   Venue B          1
       2:     2 Team 1      NA Venue B   Venue B          2
       3:     3 Team 1 Venue B Venue B   Venue A          1
       4:     4 Team 1 Venue B Venue A   Venue A          2
       5:     5 Team 1 Venue A Venue A   Venue B          1
       6:     1 Team 2      NA      NA   Venue B          1
       7:     2 Team 2      NA Venue B   Venue B          2
       8:     3 Team 2 Venue B Venue B   Venue A          1
       9:     4 Team 2 Venue B Venue A   Venue A          2
      10:     5 Team 2 Venue A Venue A   Venue A          3
      11:     1 Team 3      NA      NA   Venue B          1
      12:     2 Team 3      NA Venue B   Venue A          1
      13:     3 Team 3 Venue B Venue A   Venue B          2
      14:     4 Team 3 Venue A Venue B   Venue B          2
      15:     5 Team 3 Venue B Venue B   Venue B          3
      

      Hope this helps.

      [Comments]:
