【Question title】: Get initial month from list of dates
【Posted】: 2019-05-23 23:08:55
【Question】:

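I have a dataset with two variables: DATE and Years_service (reduced here to a small reproducible example). I need to get the month in which the person started working (1989-06 in this example). The solution should also work for many people, so the starting month may differ from person to person. Something like this: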

library(data.table)
dt <- structure(list(DATE = c("2009-01", "2009-02", "2009-03", "2009-04", 
                          "2009-05", "2009-06", "2009-07", "2009-08", "2009-09", "2009-10", 
                          "2009-11", "2009-12", "2010-01", "2010-02", "2010-03", "2010-04", 
                          "2010-05", "2010-06", "2010-07", "2010-08", "2010-09", "2010-10", 
                          "2010-11", "2010-12", "2011-01", "2011-02", "2011-03", "2011-04", 
                          "2011-05", "2011-06", "2011-07", "2011-08", "2011-09", "2011-10", 
                          "2011-11", "2011-12"), Years_service = c(19, 19, 19, 19, 19, 
                                                                   20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 
                                                                   21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22), 
                 INITIAL_MONTH = c("1989-06", "1989-06", "1989-06", "1989-06", 
                                   "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", 
                                   "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", 
                                   "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", 
                                   "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", 
                                   "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", 
                                   "1989-06", "1989-06")), .Names = c("DATE", "Years_service", 
                                                                      "INITIAL_MONTH"), class = c("data.table", "data.frame"), row.names = c(NA,-36L))

head(dt)
      DATE Years_service INITIAL_MONTH
1: 2009-01            19       1989-06
2: 2009-02            19       1989-06
3: 2009-03            19       1989-06
4: 2009-04            19       1989-06
5: 2009-05            19       1989-06
6: 2009-06            20       1989-06

How can I get this in R?

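【Comments】:

  • Is INITIAL_MONTH your expected output? How do you calculate it from DATE and Years_service? How do you get 1989-06 for every row?
  • Yes, that column is my expected output. I calculated it by subtracting the years of service from the date.
  • For the first row, assuming 2009-01 is a year and month, shouldn't you get 1990-01 if you subtract 19 years?
  • Yes, but as you can see in the Years_service column, the value changes in June. I need INITIAL_MONTH to be unique, which is why I repeat this value. It is the single date on which the person started working.

Tags: r data.table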


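【Solution 1】:

We can find the first change in the Years_service column, then subtract the Years_service value at that index from the corresponding DATE value.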

library(dplyr)
library(lubridate)

dt %>%
  mutate(inds = which.max(diff(Years_service) != 0) + 1, 
        init_month = format(as.Date(paste0(DATE[inds], "-01")) - 
                      years(Years_service[inds]), "%Y-%m")) %>%
  select(-inds)

#      DATE Years_service INITIAL_MONTH init_month
#1  2009-01            19       1989-06    1989-06
#2  2009-02            19       1989-06    1989-06
#3  2009-03            19       1989-06    1989-06
#4  2009-04            19       1989-06    1989-06
#....
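
To see why which.max(diff(Years_service) != 0) + 1 lands on the anniversary row, here is a minimal base R sketch on a made-up vector. The names yrs, changed, first_change and has_change are illustrative only and do not come from the answer. One caveat worth knowing: if the column never changes, which.max() on an all-FALSE vector returns 1, so the expression would silently point at row 2; the last line shows one possible guard.

```r
# Toy service-years vector; `yrs` is an illustrative name only
yrs <- c(19, 19, 20, 20, 21)

# TRUE wherever a value differs from the one before it
changed <- diff(yrs) != 0

# which.max() returns the position of the first TRUE;
# +1 converts from diff() positions back to positions in yrs
first_change <- which.max(changed) + 1

# Guard for the no-change case, where which.max() would return 1
has_change <- any(changed)
```

Here first_change indexes the first row on which the service count has increased, which is the row whose DATE is used for the subtraction.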

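You will probably want to do this for several people, in which case you can add a group_by clause (person below stands for whatever column identifies each individual):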

dt %>%
  group_by(person) %>%
  mutate(inds = which.max(diff(Years_service) != 0) + 1, 
         init_month = format(as.Date(paste0(DATE[inds], "-01")) - 
                       years(Years_service[inds]), "%Y-%m")) %>%
  select(-inds)

Edit

For the updated case, we may need to arrange by the dates first:

dt1 <- dt[order(-DATE)]

dt1 %>%
  mutate(dates = as.Date(paste0(DATE, "-01"))) %>%
  arrange(dates) %>%
  mutate(inds = which.max(diff(Years_service) != 0) + 1, 
     init_month = format(dates[inds] - years(Years_service[inds]), "%Y-%m")) %>%
  select(-inds)

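【Comments】:

  • Could you give a solution that does not depend on the order of the data?
  • @Israel What do you mean by the order of the data?
  • Yes, the order of the data.
  • Because if I change the order of the data, your solution's result changes too.
  • How so? Can you show me an example where it fails? It is hard to offer a solution without a reproducible example.
【Solution 2】:

Base R solution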

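Using seq to count back in time

  1. Use sprintf to build a new Date vector that includes a day (%d), to satisfy the as.Date function
dt$Date <- sprintf("%s-01", dt$DATE)
  2. Build a vector of strings of the form -X years to count back with in seq (Years_service is measured in years, so the step has to be years rather than months)
dt$Back_step <- sprintf("-%s years", dt$Years_service)
  3. Loop over the rows with a for loop and store the month X years earlier
for (i in seq_len(nrow(dt))) {
  dt$INITIAL_MONTH[i] <- format(seq(as.Date(dt$Date[i], format = "%Y-%m-%d"),
                                    length.out = 2, by = dt$Back_step[i])[2], "%Y-%m")
}

Note that [2] takes the second value of the sequence. The result is computed row by row (for example 1990-01 for the first row), so it is not yet the single starting month asked for in the question.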
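
As a minimal sketch of the seq step on its own, assuming the step is expressed in years because Years_service holds years: the date 2009-06-01 and the value 20 are taken from the row of the sample data where the service count first reads 20, and the names start and init_month are illustrative.

```r
# seq() on a Date accepts a step such as "-20 years";
# asking for two values returns the start date and the date 20 years earlier
start <- seq(as.Date("2009-06-01"), length.out = 2, by = "-20 years")[2]

# Keep only the year and month, matching the "YYYY-MM" format of DATE
init_month <- format(start, "%Y-%m")
```

Applied to the anniversary row only, rather than to every row, this yields the single starting month the question asks for.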

【Comments】:

【Solution 3】:

Adding a data.table solution as well.

    # Find the initial month
    dt1 <- dt[order(DATE)]
    dt1[, diff:=Years_service - shift(Years_service)]
    dt2 <- dt1[diff==1, head(.SD, 1)]
    # calculate the year
    dt2[, init_month:=paste0(as.numeric(substr(DATE, 1, 4))-Years_service, '-', substr(DATE, 6, 7))]
    # write back to the original data.table
    init_mon <- dt2$init_month[1]
    dt <- dt[, init_month:=init_mon]
    

If there are several people in the data:

    library(data.table)
    dt <- structure(list(PERSON = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
                                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 
                                    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
                                    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
                         DATE = c("2009-01", "2009-02", "2009-03", "2009-04", 
                              "2009-05", "2009-06", "2009-07", "2009-08", "2009-09", "2009-10", 
                              "2009-11", "2009-12", "2010-01", "2010-02", "2010-03", "2010-04", 
                              "2010-05", "2010-06", "2010-07", "2010-08", "2010-09", "2010-10", 
                              "2010-11", "2010-12", "2011-01", "2011-02", "2011-03", "2011-04", 
                              "2011-05", "2011-06", "2011-07", "2011-08", "2011-09", "2011-10", 
                              "2011-11", "2011-12", "2009-01", "2009-02", "2009-03", "2009-04", 
                              "2009-05", "2009-06", "2009-07", "2009-08", "2009-09", "2009-10", 
                              "2009-11", "2009-12", "2010-01", "2010-02", "2010-03", "2010-04", 
                              "2010-05", "2010-06", "2010-07", "2010-08", "2010-09", "2010-10", 
                              "2010-11", "2010-12", "2011-01", "2011-02", "2011-03", "2011-04", 
                              "2011-05", "2011-06", "2011-07", "2011-08", "2011-09", "2011-10", 
                              "2011-11", "2011-12"), Years_service = c(19, 19, 19, 19, 19, 
                                                                       20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 
                                                                       21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22, 19, 19, 19, 19, 19, 
                                                                       20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 
                                                                       21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22), 
                     INITIAL_MONTH = c("1989-06", "1989-06", "1989-06", "1989-06", 
                                       "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", 
                                       "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", 
                                       "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", 
                                       "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", 
                                       "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", 
                                       "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", 
                                       "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", 
                                       "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", 
                                       "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", 
                                       "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", 
                                       "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", "1989-06", 
                                       "1989-06", "1989-06")), .Names = c("PERSON", "DATE", "Years_service", 
                                                                          "INITIAL_MONTH"), class = c("data.table", "data.frame"), row.names = c(NA,-72L))
    
    
    head(dt)
    
    # PERSON    DATE    Years_service   INITIAL_MONTH
    # 1         2009-01 19              1989-06
    # 1         2009-02 19              1989-06
    # 1         2009-03 19              1989-06
    # 1         2009-04 19              1989-06
    # 1         2009-05 19              1989-06
    # 1         2009-06 20              1989-06
    

Add a group-by to the calculation:

    dt1 <- dt[order(PERSON, DATE)]
    dt1[, diff:=Years_service - shift(Years_service), by="PERSON"]
    dt2 <- dt1[diff==1, head(.SD, 1), by="PERSON"]
    dt2[, init_month:=paste0(as.numeric(substr(DATE, 1, 4))-Years_service, '-', substr(DATE, 6, 7))]
    # merge() takes `by`, not `on`, to name the join column
    dt <- merge(dt, dt2[, list(PERSON, init_month)], by="PERSON", all.x=TRUE)
    

【Comments】:

  • Your solution is good, but I would like one that does not depend on the order of the data.
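
One way to make the result independent of row order, offered only as a sketch and not taken from any of the answers: instead of looking for the first change along the rows, pick a Years_service value above the minimum and take the earliest DATE carrying that value. The helper initial_month is a hypothetical name. It assumes DATE is always a "YYYY-MM" string (so min() on the strings is chronological), that the months around at least one anniversary are present without gaps, and that Years_service takes at least two distinct values.

```r
# Hypothetical helper, not part of any answer above.
# DATE: character vector of "YYYY-MM"; Years_service: numeric vector.
initial_month <- function(DATE, Years_service) {
  # Any value above the minimum was first reached in an anniversary month
  v <- max(Years_service)
  if (v == min(Years_service)) return(NA_character_)  # no change observed

  # min() on "YYYY-MM" strings is chronological, so row order is irrelevant
  first <- min(DATE[Years_service == v])

  # Subtract the service years from the year part and keep the month part
  sprintf("%04d-%s",
          as.integer(substr(first, 1, 4)) - as.integer(v),
          substr(first, 6, 7))
}

# Shuffled toy data: the service count goes from 19 to 20 in 2009-06
d <- data.frame(DATE = c("2009-07", "2009-05", "2009-06", "2009-04"),
                Years_service = c(20, 19, 20, 19))
res <- initial_month(d$DATE, d$Years_service)
```

For several people, the same helper could be applied per group, for instance with data.table as dt[, init_month := initial_month(DATE, Years_service), by = PERSON].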