Check for consecutive years of giving in R

【Question title】: Check for consecutive years of giving in R
【Posted】: 2015-01-05 19:51:08
【Question description】:

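I work at a nonprofit, and someone wants a list of people who have donated $100 or more for 5 years in a row. The span can fall anywhere, as long as they gave in 5 consecutive years. I have Python and R on my computer. R seems like the better fit for this, but I am not very familiar with it.

I have imported a csv file that contains every gift to the organization and who gave it.

Here is a sample row from the csv file.

Gf_Gift_ID: 1620192
Gf_Date: 1/31/2005
Gf_Amount: 25.00
Gf_CnBio_ID: 512994

I couldn't format it properly here. The first part of each line is the column header.

I need to be able to check whether donor 512994 gave 100 or more in, for example, 2014, 2013, 2012, 2011, and 2010 (five consecutive years).

So far I have this in my R script: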

gifts <- read.csv("---------")
donors <- gifts["Gf_CnBio_ID"]
donors <- unique(donors)

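I have been trying to figure out how to build a smaller data frame that is a subset of gifts, going through the donors one at a time, so that I can check how many consecutive years each person has given. The different approaches I have tried all keep throwing errors.

Thanks in advance. Most of my background is in Java, so this language is not what I am used to.

Addendum: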

> library(dplyr)
> library(lubridate)
> 
> set.seed(999)
> 
> gifts <- read.csv("---.CSV", header = TRUE, sep = ",", )
> donors <- gifts["Donor_ID"]
> donors <- unique(donors)
> 
> gifts %>%
+   mutate(gift_year = year(gifts["Gift_Date"])) %>% # extract year
+   group_by(gifts["Donor_ID"], gift_year) %>% 
+   summarise(year_gift = sum(gifts["Gift_Amount"])) %>% # total gift per donor/year
+   filter(year_gift >= 100) %>% 
+   group_by(bio_id) %>% 
+   mutate(diff = gift_year - lag(gift_year), rle = rep( rle(diff)$lengths, rle(diff)$lengths)) %>% 
+   filter(rle >= 5) %>% 
+   distinct(bio_id)
Error in as.POSIXlt.default(x, tz = tz(x)) : 
  do not know how to convert 'x' to class “POSIXlt”

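I keep getting that error output when I try to run the solution provided. I wrote a Python program to reformat the dates as yyyy-mm-dd 00:00:00, but I still get the error, so it is not coming from the date format. I don't know what is causing it. Here are the first 50 rows.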

> dput(shortExport)
structure(list(Gift_ID = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, 1620192L, 1630540L, 1661287L, 1670815L, 1702338L, 
1710859L, 1747572L, 1781100L, 1811188L, 1829753L, 1854499L, 1860830L, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, 1361280L, 1246941L, 1355077L, 1243748L, 1243748L, 1518414L
), Gift_Date = structure(c(2L, 23L, 32L, 10L, 17L, 19L, 33L, 
44L, 45L, 11L, 27L, 30L, 47L, 3L, 26L, 9L, 18L, 31L, 37L, 22L, 
28L, 16L, 25L, 34L, 39L, 21L, 42L, 1L, 29L, 35L, 43L, 6L, 13L, 
4L, 5L, 38L, 41L, 46L, 15L, 24L, 40L, 2L, 12L, 20L, 14L, 7L, 
8L, 8L, 36L), .Label = c("1/29/2010 0:00", "1/30/2014 0:00", 
"1/31/2005 0:00", "1/31/2012 0:00", "1/31/2013 0:00", "10/11/2011 0:00", 
"10/18/2000 0:00", "10/27/1998 0:00", "10/31/2005 0:00", "10/31/2011 0:00", 
"10/31/2012 0:00", "11/1/2011 0:00", "11/11/2011 0:00", "11/18/1998 0:00", 
"11/27/2013 0:00", "11/30/2007 0:00", "11/30/2011 0:00", "12/30/2005 0:00", 
"12/30/2011 0:00", "12/6/2000 0:00", "2/27/2009 0:00", "2/28/2007 0:00", 
"2/28/2011 0:00", "2/28/2014 0:00", "2/29/2008 0:00", "3/31/2005 0:00", 
"3/31/2013 0:00", "4/30/2007 0:00", "4/30/2010 0:00", "4/30/2013 0:00", 
"5/31/2006 0:00", "5/31/2011 0:00", "6/29/2012 0:00", "6/30/2008 0:00", 
"6/30/2011 0:00", "7/18/2003 0:00", "7/31/2006 0:00", "7/31/2013 0:00", 
"8/29/2008 0:00", "8/29/2014 0:00", "8/30/2013 0:00", "8/31/2009 0:00", 
"8/31/2011 0:00", "8/31/2012 0:00", "9/28/2012 0:00", "9/30/2013 0:00", 
"9/30/2014 0:00"), class = "factor"), Gift_Amount = c(25L, 25L, 
25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 
25L, 25L, 25L, 25L, 25L, 25L, 25L, 50L, 50L, 50L, 50L, 50L, 50L, 
50L, 10L, 10L, 100L, 100L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 100L, 250L, 50L, 30L, 25L, 50L, 50L, 50L), Donor_ID = c(677556L, 
521512L, 521512L, 521512L, 521512L, 521512L, 521512L, 521512L, 
521512L, 521512L, 521512L, 521512L, 521512L, 512994L, 512994L, 
512994L, 512994L, 512994L, 512994L, 512994L, 512994L, 512994L, 
512994L, 512994L, 512994L, 512994L, 512994L, 512994L, 512994L, 
512994L, 512994L, 512994L, 512994L, 512994L, 512994L, 512994L, 
512994L, 512994L, 512994L, 512994L, 512994L, 679277L, 406147L, 
331525L, 332110L, 332110L, 263700L, 263701L, 100196L)), .Names = c("Gift_ID", 
"Gift_Date", "Gift_Amount", "Donor_ID"), class = "data.frame", row.names = c(NA, 
-49L))

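【Question comments】:

Tags: r csv subset

【Solution 1】:

Accomplishing your goal involves chaining a number of operations (e.g., summarising by donor and year, keeping yearly totals of at least $100, and so on). The dplyr package provides nice functionality for this: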

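library(dplyr)
library(lubridate)
library(stringr)  # provides str_sub()

# gifts is the data frame defined in the question

gifts %>%
  mutate(
    # drop the trailing " 0:00" and parse the m/d/Y date
    gift_date = as.Date(str_sub(as.character(Gift_Date), end = -6), format = "%m/%d/%Y"),
    gift_year = year(gift_date)
  ) %>%
  group_by(Donor_ID, gift_year) %>%
  summarise(year_total = sum(Gift_Amount)) %>%  # total gift per donor/year
  filter(year_total >= 100) %>%
  group_by(Donor_ID) %>%
  mutate(
    # TRUE where a year does not directly follow the donor's previous qualifying year
    jump = !(gift_year == lag(gift_year) + 1 | row_number() == 1),
    donor_seq = cumsum(jump) + 1,  # id of each run of consecutive years
    rle = rep(rle(donor_seq)$lengths, rle(donor_seq)$lengths)  # length of that run
  ) %>%
  filter(rle >= 5) %>%
  distinct(Donor_ID)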
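
To make the jump/cumsum step easier to follow, here is a minimal base-R sketch of the same idea for a single donor. The years vector is made up for illustration, and ave() stands in for the rle()/rep() combination used in the pipeline.

```r
# Hypothetical qualifying years for one donor (years with a total >= 100), sorted.
years <- c(2003, 2005, 2006, 2007, 2008, 2009, 2012)

# TRUE where a year does not directly follow the previous one (start of a new run).
jump <- c(FALSE, diff(years) != 1)

# Label each run of consecutive years with an id.
run_id <- cumsum(jump) + 1

# Length of the run that each year belongs to.
run_len <- ave(years, run_id, FUN = length)

# The donor qualifies if any run covers at least 5 consecutive years.
qualifies <- any(run_len >= 5)
qualifies
```

Here 2005 through 2009 form one run of five years, so the donor would be kept by the filter(rle >= 5) step.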

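【Comments】:

Maybe use rle for the run lengths? E.g. filter(year_gift >= 100) %>% group_by(bio_id) %>% mutate(diff = gift_year - lag(gift_year), rle = rep( rle(diff)$lengths, rle(diff)$lengths)) %>% filter(rle >= 5) %>% distinct(bio_id)
When I try to run either of these, I get this error: Error in as.POSIXlt.default(x, tz = tz(x)) : do not know how to convert 'x' to class "POSIXlt". Does it have something to do with the source being a CSV? I don't know what this means.
gift_year = year(gifts["Gf_Date"]) is what causes that error... I think it is because the gift dates are stored as mm/dd/yyyy rather than the nicer yyyy/mm/dd. Is there a way to reformat these?
@nacnudus - good suggestion with rle(). I will revise my answer to use it.
@ABvsPred - please make your question reproducible by providing dput(gifts)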
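
A likely cause of the as.POSIXlt error, going by the code and dput output in the question: gifts["Gift_Date"] is a one-column data frame rather than a vector, and the column itself is a factor holding m/d/Y strings. The sketch below uses a small made-up gifts data frame (only the column names come from the question) and base R only, to show the difference and one way to get the year out.

```r
# Hypothetical stand-in for the imported CSV; Gift_Date is a factor, as in the dput output.
gifts <- data.frame(
  Gift_Date = factor(c("1/31/2005 0:00", "10/31/2011 0:00", "2/29/2008 0:00")),
  Gift_Amount = c(25, 100, 50),
  Donor_ID = c(512994L, 512994L, 512994L)
)

# Single-bracket indexing returns a data frame; $ returns the column vector itself.
class(gifts["Gift_Date"])
class(gifts$Gift_Date)

# Convert factor -> character -> Date; as.Date ignores the trailing " 0:00".
gift_date <- as.Date(as.character(gifts$Gift_Date), format = "%m/%d/%Y")

# Extract the calendar year as a number.
gift_year <- as.numeric(format(gift_date, "%Y"))
gift_year
```

Inside mutate(), referring to the bare column name (Gift_Date) instead of gifts["Gift_Date"] avoids the data-frame problem as well.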

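【Solution 2】:

Without an actual sample dataset to work with, I can't tell you how to extract the dates, but let's assume you have one column with donor IDs and another with the gift dates. Then loop over the donor ID values (or split your dataset with one tool or another) and use a little function of my own, seqle, available in the cgwtools package (github.com/cellocgw). Assuming you are sure a donor never donates twice in the same year, all you have to do is find a sequence longer than 4.

An example follows. For simplicity I used years running from 1 to about 14 and 3 donors.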

 donmat
      donor   donyear
 [1,] "bob"   "1"    
 [2,] "carol" "1"    
 [3,] "alice" "1"    
 [4,] "bob"   "2"    
 [5,] "carol" "2"    
 [6,] "alice" "3"    
 [7,] "bob"   "3"    
 [8,] "carol" "3"    
 [9,] "alice" "4"    
[10,] "bob"   "5"    
[11,] "carol" "4"    
[12,] "alice" "5"    
[13,] "bob"   "6"    
[14,] "carol" "5"    
[15,] "alice" "7"    
[16,] "bob"   "8"    
[17,] "carol" "7"    
[18,] "alice" "8"    
[19,] "bob"   "9"    
[20,] "carol" "8"    
[21,] "alice" "9"    
[22,] "bob"   "12"   
[23,] "carol" "9"    
[24,] "alice" "11"   
[25,] "bob"   "13"   
[26,] "carol" "9"    
[27,] "alice" "12"   
[28,] "bob"   "14"   
[29,] "carol" "10"   
[30,] "alice" "13"   
Rgames> donlen <- list()
Rgames> for(j in unique(donmat[,'donor'])) donlen[[j]] <- seqle(donmat[donmat[,'donor']==j,2])
Rgames> donlen
$bob
Run Length Encoding
  lengths: int [1:4] 3 2 2 3
  values : num [1:4] 1 5 8 12

$carol
Run Length Encoding
  lengths: int [1:3] 5 3 2
  values : num [1:3] 1 7 9

$alice
Run Length Encoding
  lengths: int [1:4] 1 3 3 3
  values : num [1:4] 1 3 7 11

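So, looking at the lengths, we see that "carol" has one five-year sequence. You will probably want to use lubridate to extract the year values from your date strings.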
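
If installing cgwtools is not an option, the same check can be approximated in base R. This is only a sketch and not the seqle implementation: longest_run is a hypothetical helper, and it drops repeated years before counting, so a donor who gives twice in one year is handled differently than in the output above. The years for bob and carol are copied from the example.

```r
# Years taken from the example above for two of the donors.
don <- data.frame(
  donor = rep(c("bob", "carol"), each = 10),
  donyear = c(1, 2, 3, 5, 6, 8, 9, 12, 13, 14,
              1, 2, 3, 4, 5, 7, 8, 9, 9, 10)
)

# Longest run of consecutive years in a numeric vector (hypothetical helper).
longest_run <- function(years) {
  years <- sort(unique(years))                  # drop repeated years
  run_id <- cumsum(c(TRUE, diff(years) != 1))   # new id whenever the sequence breaks
  max(rle(run_id)$lengths)                      # size of the largest run
}

# Apply per donor and keep those with at least five consecutive years.
runs <- sapply(split(don$donyear, don$donor), longest_run)
five_plus <- names(runs)[runs >= 5]
five_plus
```

With these values only carol reaches a run of five, which agrees with the seqle output.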

【Comments】: