Get the latest month and year from separate columns in R: answers

【Question title】: get latest month and year from separate columns R
【Posted】: 2021-01-04 22:10:53
【Question description】:

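I have a data frame (in R) with separate columns for month and year. For each group in the data frame, I want to get the last 12 months of the series. If a group is missing data for the most recent months, I want to substitute the corresponding months from the previous year. For example, say I want the data for 2020 (January - December), but one group's latest data is for September; in that case I want to pull October through December from 2019. I just don't know how to do this.

Here is an example.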

a = expand.grid(1:2,2019,1:12)
b = expand.grid(1:2,2020,1:9)
dat = rbind(a,b)
names(dat) = c("group","year","month")
dat = dat[order(dat$group,dat$year,dat$month),]

So the data looks like this:

   group year month
1      1 2019     1
3      1 2019     2
5      1 2019     3
7      1 2019     4
9      1 2019     5
11     1 2019     6

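【Question comments】:

Does your data span more than two years? If so, how do you handle the case where both 2019 and 2020 are missing data? Do you first fill 2019 from 2018, and then fill 2020 from the updated 2019 data? Or do you only look back one year in the original data (in which case 2020 would show NA for months that are missing in both 2019 and 2020)?
I only need to look back one year, because the data for each month is complete. I just have cases where the data has not come in yet, so I need to adjust for the different groups.

Tags: r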

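【Solution 1】:

I'm not sure: would it be enough to always take the last 12 entries of the data per group? If so, the approach below will work.

This approach assumes:

Each month is one row/observation (no month is entered more than once).
If a month is NA, it will still be picked up.
You are always interested in the most recent 12 months (so not the 12 months counted back from a specified input date).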
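If the first assumption above does not hold (a month is missing in the middle of a series, or a month appears more than once), taking the last 12 rows no longer corresponds to the last 12 calendar months. A minimal sketch of one alternative, in base R, is to turn year and month into a running month count and keep everything within 12 months of each group's latest month. The helper name `last_n_months` and the sample data (group 1 complete through December 2020, group 2 only through September 2020) are hypothetical and not part of the original answer; the sketch also assumes `year` and `month` contain no NA.

```r
# Hypothetical helper (not from the original answer): keep the rows that fall
# within the `n` calendar months ending at each group's latest month.
last_n_months <- function(df, n = 12) {
  # Running month count, so that Dec 2019 and Jan 2020 differ by exactly 1
  idx <- df$year * 12 + df$month
  # Latest month count of each group, aligned row by row with df
  latest <- ave(idx, df$group, FUN = max)
  out <- df[idx > latest - n, ]
  out[order(out$group, out$year, out$month), ]
}

# Hypothetical sample data: group 1 has all of 2020, group 2 stops in September
dat2 <- rbind(
  expand.grid(group = 1:2, year = 2019, month = 1:12),
  expand.grid(group = 1,   year = 2020, month = 1:12),
  expand.grid(group = 2,   year = 2020, month = 1:9)
)

res <- last_n_months(dat2)
res
```

With this data, group 1 should keep January through December 2020, while group 2 should keep October 2019 through September 2020. Because the window is defined by calendar months rather than row positions, a group with gaps simply returns fewer than 12 rows instead of reaching further back in time.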

# your data
a = expand.grid(1:2,2019,1:12)
b = expand.grid(1:2,2020,1:9)
dat = rbind(a,b)
names(dat) = c("group","year","month")
dat = dat[order(dat$group,dat$year,dat$month),]

# using dplyr
library(dplyr)

dat %>% 
  group_by(group) %>% 
  slice_tail(n = 12)

#> # A tibble: 24 x 3
#> # Groups:   group [2]
#>    group  year month
#>    <int> <dbl> <int>
#>  1     1  2019    10
#>  2     1  2019    11
#>  3     1  2019    12
#>  4     1  2020     1
#>  5     1  2020     2
#>  6     1  2020     3
#>  7     1  2020     4
#>  8     1  2020     5
#>  9     1  2020     6
#> 10     1  2020     7
#> # … with 14 more rows

# using base R

do.call("rbind",
  lapply(split(dat, dat$group), function(x) {
  x[(nrow(x)-11):nrow(x), ]
  }))

#>      group year month
#> 1.19     1 2019    10
#> 1.21     1 2019    11
#> 1.23     1 2019    12
#> 1.25     1 2020     1
#> 1.27     1 2020     2
#> 1.29     1 2020     3
#> 1.31     1 2020     4
#> 1.33     1 2020     5
#> 1.35     1 2020     6
#> 1.37     1 2020     7
#> 1.39     1 2020     8
#> 1.41     1 2020     9
#> 2.20     2 2019    10
#> 2.22     2 2019    11
#> 2.24     2 2019    12
#> 2.26     2 2020     1
#> 2.28     2 2020     2
#> 2.30     2 2020     3
#> 2.32     2 2020     4
#> 2.34     2 2020     5
#> 2.36     2 2020     6
#> 2.38     2 2020     7
#> 2.40     2 2020     8
#> 2.42     2 2020     9

# using data.table (from @thelatemail's comment)
library(data.table)

setDT(dat)
setorder(dat, group, year, month)
dat[, .SD[(.N-11):.N], by = group]

#>     group year month
#>  1:     1 2019    10
#>  2:     1 2019    11
#>  3:     1 2019    12
#>  4:     1 2020     1
#>  5:     1 2020     2
#>  6:     1 2020     3
#>  7:     1 2020     4
#>  8:     1 2020     5
#>  9:     1 2020     6
#> 10:     1 2020     7
#> 11:     1 2020     8
#> 12:     1 2020     9
#> 13:     2 2019    10
#> 14:     2 2019    11
#> 15:     2 2019    12
#> 16:     2 2020     1
#> 17:     2 2020     2
#> 18:     2 2020     3
#> 19:     2 2020     4
#> 20:     2 2020     5
#> 21:     2 2020     6
#> 22:     2 2020     7
#> 23:     2 2020     8
#> 24:     2 2020     9
#>     group year month

Created on 2021-01-04 by the reprex package (v0.3.0)

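【Comments】:

I think the logic will work, but you need to do it for each group. Also, I think nrow(dat)-11 is correct, otherwise you go back too far.
@thelatemail: Thanks for pointing that out, I completely missed it ;) - updated the answer accordingly.
A fun data.table translation - setDT(dat); setorder(dat, group, year, month); dat[, .SD[(.N-11):.N], by=group]
@thelatemail: Thanks! I added it to the answer with a reference to your comment.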
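The off-by-one point raised in the comments can be checked directly: an index range from `n - 11` to `n` contains 12 elements, while `n - 12` to `n` contains 13. The sketch below also illustrates a caveat that the thread does not discuss and that is added here as an assumption about how the base R version behaves: when a group has fewer than 12 rows, the range runs from a negative number to a positive one and subsetting fails, whereas `tail()` just returns what is there. The data frames `x` and `short` are hypothetical.

```r
# 21 rows, like each group in the example data (12 months of 2019 + 9 of 2020)
x <- data.frame(month = c(1:12, 1:9))
n <- nrow(x)

len_ok  <- length((n - 11):n)   # 12 indices: the intended window
len_off <- length((n - 12):n)   # 13 indices: one row too many

# Hypothetical short group with only 5 rows: the index range becomes -6:5,
# which mixes negative and positive subscripts
short <- data.frame(month = 1:5)
attempt <- tryCatch(
  short[(nrow(short) - 11):nrow(short), , drop = FALSE],
  error = function(e) e
)
failed <- inherits(attempt, "error")

# tail() caps the window at the number of available rows
safe <- tail(short, 12)
```

If short groups can occur, replacing `x[(nrow(x)-11):nrow(x), ]` with `tail(x, 12)` inside the `lapply()` call is one way to keep the base R version from stopping with an error.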