Converting vertical data to horizontal data + extracting first and last date: answers

【Question title】: Converting vertical data to horizontal data + extracting first and last date
【Posted】: 2022-01-01 00:56:41
【Question description】:

I am new to R. I have a database with 2 columns, as in the table below:

pt_id	Date
1222	20-01-2021
1222	18-11-2018
1222	17-02-2015
1222	21-04-2015
2555	18-01-2002
2555	03-04-2009
2555	25-12-2010

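I want to create a new data frame that merges the rows for each pt_id and has 2 columns holding only the first date and the last date. I would like it to look like the table below: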

pt_id	Date_first	Date_last
1222	17-02-2015	20-01-2021
2555	18-01-2002	25-12-2010

The table above is just a small example; the database I am working with is much larger. These are the packages I am currently using:

library(tidyverse)
library(haven)
library(tidyr)
library(dplyr)
library(date)
library(reshape2)
library(foreign)
library(data.table)
library(stringr)
library(plyr)
library(irr)
library(vcd)
library(vctrs)

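I hope this is possible. Thanks in advance.

【Question comments】:

Note that some of the packages you list are already loaded when you run library(tidyverse), so there is some redundancy.

Tags: r merge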

【Solution 1】:

You could do this:

mydf |>
  mutate(Date = lubridate::dmy(Date)) |> # Only use if the variable is currently set to character
  group_by(pt_id) |>
  filter(Date == min(Date) | Date == max(Date)) |>
  mutate(date_vars = if_else(Date == min(Date), "Date_first", "Date_last")) |> 
  ungroup() |> 
  pivot_wider(id_cols = pt_id, values_from = Date, names_from = date_vars) # name id_cols explicitly; recent tidyr versions do not accept it positionally

# A tibble: 2 x 3
  pt_id Date_last  Date_first
  <dbl> <date>     <date>    
1  1222 2021-01-20 2015-02-17
2  2555 2010-12-25 2002-01-18

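【Comments】:

Hi Phil, thanks for your answer! Sorry for the late reply, the hassles of working from home! I tried your script. Unfortunately it does not work yet:

  pat_id Date_first Date_last
1 122894
2 336419
3 467840
Warning message:
Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates

This means that other variables you have not mentioned share the same characteristics, so the pivot function cannot tell which value should be used for which case. The simplest thing to do is to create another variable that is unique for each individual case and refer to it in pivot_wider() instead of pt_id.

After some trial and error with your script, it finally worked! Thanks a lot, Phil!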
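
For readers who hit the same "Values are not uniquely identified" warning, a possible alternative is to skip the pivot and collapse each patient to one row with summarise(). The following is only a minimal sketch under some assumptions: it rebuilds the sample table from the question under the name mydf, and it assumes Date is stored as dd-mm-yyyy text. The dplyr:: prefixes are there because the question loads plyr after dplyr, and plyr exports its own mutate() and summarise() that mask the dplyr versions.

```r
library(dplyr)

# Sample data copied from the question (mydf is the name used in Solution 1)
mydf <- data.frame(
  pt_id = c(1222, 1222, 1222, 1222, 2555, 2555, 2555),
  Date  = c("20-01-2021", "18-11-2018", "17-02-2015", "21-04-2015",
            "18-01-2002", "03-04-2009", "25-12-2010")
)

result <- mydf |>
  # Assumption: Date is character in dd-mm-yyyy form
  dplyr::mutate(Date = as.Date(Date, format = "%d-%m-%Y")) |>
  dplyr::group_by(pt_id) |>
  # One row per patient; repeated dates within a patient cause no warning here
  dplyr::summarise(Date_first = min(Date), Date_last = max(Date))

print(result)
```

Because min() and max() each return a single value per group, duplicated first or last dates collapse naturally, and a patient with only one date gets the same value in both columns.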

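【Solution 2】:

I am also new to R and thought I would give it a try, which might inspire someone to correct me. This uses only base R and has a very convoluted for loop: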

#as.Date() needs an explicit format for dd-mm-yyyy strings
df1$Date <- as.Date(df1$Date, format = "%d-%m-%Y")
#Create a new df with a single entry for each patient
new_df <- unique(df1["pt_id"])

#make empty columns for the dates
new_df['date_first'] <- NA
new_df['date_last'] <- NA

#for each patient...
for (i in (1:nrow(new_df))){
  #make an empty list to store the list of dates...
  pt <- c()
  #and for each row in the original dataframe...
  for (j in (1:nrow(df1))){
    #make another empty list to store the single date for each record
    x = c()
    # and if the patient ID is in the row being read...
    if(df1[j,1]==new_df[i,1]){
      #append that date to list x and move on through the original df
      x<-append(x,df1[j,2])
    }
  #then append this list to the list pt...
  pt <- append(pt,x)
  }
#add the min and max values from the list pt to the new df for each entry in the new dataframe and move on to the next patient in the new dataframe
new_df[i,2] <- min(pt)
new_df[i,3] <- max(pt)
}

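#make the columns dates again (as they were converted to numeric in the max/min)
#origin is given explicitly because older R versions require it for numeric input
new_df$date_first <- as.Date(new_df$date_first, origin = "1970-01-01")
new_df$date_last <- as.Date(new_df$date_last, origin = "1970-01-01")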

print(new_df)
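
The nested loop above compares every patient with every row, so its cost grows with the product of the two counts. A loop-free base R version might look like the sketch below. It is an illustration rather than the answerer's method: it sorts by patient and date, then keeps the first and the last row of each patient. The sample data is copied from the question and the dd-mm-yyyy format is an assumption.

```r
# Sample data copied from the question (df1 is the name used in this answer)
df1 <- data.frame(
  pt_id = c(1222, 1222, 1222, 1222, 2555, 2555, 2555),
  Date  = c("20-01-2021", "18-11-2018", "17-02-2015", "21-04-2015",
            "18-01-2002", "03-04-2009", "25-12-2010")
)

# Assumption: Date is character in dd-mm-yyyy form
df1$Date <- as.Date(df1$Date, format = "%d-%m-%Y")

# Sort by patient, then by date within each patient
df1 <- df1[order(df1$pt_id, df1$Date), ]

# After sorting, the first row of each patient holds the earliest date
# and the last row of each patient holds the latest date
first_rows <- df1[!duplicated(df1$pt_id), ]
last_rows  <- df1[!duplicated(df1$pt_id, fromLast = TRUE), ]

# Both subsets are in the same pt_id order, so they can be combined directly
new_df <- data.frame(
  pt_id      = first_rows$pt_id,
  date_first = first_rows$Date,
  date_last  = last_rows$Date
)

print(new_df)
```

The date columns stay of class Date throughout, so no conversion back from numeric is needed.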

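【Comments】:

Hi HMCC, thanks for your answer. I tried your script, but it takes a very long time to run and most of the time it ends up stalling. I think this is because the dataset may be too large (> 200,000 pt_id and > 7 million rows).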
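
For a table of that size, a grouped aggregation in data.table (which the question already loads) may be worth trying. This is a minimal sketch on the sample data from the question, not something benchmarked on 7 million rows; the name dt and the dd-mm-yyyy format are assumptions.

```r
library(data.table)

# Sample data copied from the question; dt is a hypothetical name
dt <- data.table(
  pt_id = c(1222, 1222, 1222, 1222, 2555, 2555, 2555),
  Date  = c("20-01-2021", "18-11-2018", "17-02-2015", "21-04-2015",
            "18-01-2002", "03-04-2009", "25-12-2010")
)

# Assumption: Date is character in dd-mm-yyyy form; convert in place
dt[, Date := as.Date(Date, format = "%d-%m-%Y")]

# One pass over the data: earliest and latest date per patient,
# with keyby returning the result sorted by pt_id
result_dt <- dt[, .(Date_first = min(Date), Date_last = max(Date)), keyby = pt_id]

print(result_dt)
```

An existing data frame can be converted without copying via setDT(), after which the same two lines apply.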