【问题标题】:Get most frequent string from a data frame column从数据框列中获取最频繁的字符串
【发布时间】:2014-10-31 13:33:22
【问题描述】:

我需要使用多行数据框作为输入,返回字符串出现频率最高的 n 个。所有值都在名为“MissingDates”的同一列中

这里是示例数据,总共大约有 5000 行:

Symbol Count  MissingDates     
AD  27  1995-12-26, 1996-01-02, 1996-04-26, 1996-04-30, 1996-05-06, 1996-08-26, 1996-09-03, 1996-09-04, 1996-10-11, 1996-11-13, 1996-11-29, 1996-12-09, 1996-12-20, 1996-12-23, 1996-12-26, 1996-12-27, 1997-01-02, 1997-05-02, 1997-09-10, 1998-01-02, 1998-04-16, 1998-12-08, 1999-12-27, 1999-12-31, 2001-09-12, 2003-08-06, 2003-10-13
BP  14  1995-08-09, 1995-08-15, 1995-12-26, 1996-01-02, 1996-09-06, 1996-12-26, 1997-01-02, 1997-12-26, 1998-01-02, 1998-04-16, 2001-09-12, 2002-12-24, 2003-08-06, 2003-10-13
C   3   1999-12-31, 2001-12-24, 2002-12-24
CC  285 1994-05-18, 1994-05-19, 1994-05-20, 1994-05-23, 1994-05-24, 1994-05-25, 1994-05-26, 1994-05-27, 1994-05-31, 1994-06-01, 1994-06-02, 1994-06-03, 1994-06-06, 1994-06-07, 1994-06-08, 1994-06-09, 1994-06-10, 1994-06-13, 1994-06-14, 1994-06-15, 1994-06-16, 1994-06-17, 1994-06-20, 1994-06-21, 1994-06-23, 1994-06-24, 1994-06-27, 1994-06-28, 1994-06-29, 1994-06-30, 1994-07-01, 1994-07-06, 1994-07-14, 1994-07-15, 1994-07-18, 1994-07-19, 1994-07-21, 1994-07-25, 1994-07-27, 1994-07-28, 1994-08-03, 1994-08-04, 1994-08-08, 1994-08-09, 1994-08-10, 1994-08-11, 1994-08-12, 1994-08-15, 1994-08-17, 1994-08-18, 1994-08-19, 1994-08-22, 1994-08-23, 1994-08-24, 1994-08-25, 1994-08-29, 1994-08-31, 1994-09-01, 1994-09-02, 1994-09-06, 1994-09-07, 1994-09-08, 1994-09-09, 1994-09-12, 1994-09-13, 1994-09-15, 1994-09-16, 1994-09-19, 1994-09-20, 1994-09-21, 1994-09-22, 1994-09-23, 1994-09-27, 1994-09-28, 1994-09-29, 1994-09-30, 1994-10-03, 1994-10-04, 1994-10-06, 1994-10-14, 1994-10-18, 1994-10-19, 1994-10-25, 1994-10-26, 1994-10-27, 1994-10-28, 1994-10-31, 1994-11-01, 1994-11-09, 1994-11-10, 1994-11-11, 1994-11-16, 1994-11-17, 1994-11-25, 1994-11-28, 1994-12-01, 1994-12-02, 1994-12-06, 1994-12-07, 1994-12-08, 1994-12-09, 1994-12-12, 1994-12-13, 1994-12-14, 1994-12-15, 1994-12-16, 1994-12-23, 1994-12-27, 1994-12-29, 1994-12-30, 1995-01-03, 1995-01-05, 1995-01-09, 1995-01-11, 1995-01-13, 1995-01-16, 1995-01-17, 1995-01-18, 1995-01-19, 1995-01-20, 1995-01-24, 1995-01-25, 1995-02-13, 1995-02-17, 1995-05-01, 1995-07-03, 1995-11-24, 1995-12-26, 1996-01-08, 1996-01-09, 1996-07-05, 1996-11-29, 1996-12-26, 1997-11-28, 1997-12-26, 1998-01-02, 1998-11-27, 1999-06-17, 1999-06-18, 1999-06-21, 1999-06-22, 1999-06-23, 1999-06-24, 1999-06-25, 1999-06-28, 1999-06-29, 1999-06-30, 1999-07-01, 1999-07-02, 1999-07-06, 1999-07-07, 1999-07-08, 1999-07-09, 1999-07-12, 1999-07-13, 1999-07-14, 1999-07-15, 1999-07-16, 1999-07-19, 1999-07-20, 1999-07-21, 1999-07-22, 1999-07-23, 1999-07-26, 1999-07-27, 1999-07-28, 1999-07-29, 1999-07-30, 1999-08-02, 1999-08-03, 1999-08-04, 1999-08-05, 1999-08-06, 1999-08-09, 1999-08-10, 1999-08-11, 1999-08-12, 1999-08-13, 1999-08-16, 1999-08-17, 1999-08-18, 1999-08-19, 1999-08-20, 1999-08-23, 1999-08-24, 1999-08-25, 1999-08-26, 1999-08-27, 1999-08-30, 1999-08-31, 1999-09-01, 1999-09-02, 1999-09-03, 1999-09-07, 1999-09-08, 1999-09-09, 1999-09-10, 1999-09-13, 1999-09-14, 1999-09-15, 1999-09-16, 1999-09-17, 1999-09-20, 1999-09-21, 1999-09-22, 1999-09-23, 1999-09-24, 1999-09-27, 1999-09-28, 1999-09-29, 1999-09-30, 1999-10-01, 1999-10-04, 1999-10-05, 1999-10-06, 1999-10-07, 1999-10-08, 1999-10-11, 1999-10-12, 1999-10-13, 1999-10-14, 1999-10-15, 1999-10-18, 1999-10-19, 1999-10-20, 1999-10-21, 1999-10-22, 1999-10-25, 1999-10-26, 1999-10-27, 1999-10-28, 1999-10-29, 1999-11-01, 1999-11-02, 1999-11-03, 1999-11-04, 1999-11-05, 1999-11-08, 1999-11-09, 1999-11-10, 1999-11-11, 1999-11-12, 1999-11-15, 1999-11-16, 1999-11-17, 1999-11-18, 1999-11-19, 1999-11-22, 1999-11-23, 1999-11-24, 1999-11-26, 1999-11-29, 1999-11-30, 1999-12-01, 1999-12-02, 1999-12-03, 1999-12-06, 1999-12-07, 1999-12-08, 1999-12-09, 1999-12-10, 1999-12-13, 1999-12-31, 2000-07-03, 2000-11-24, 2001-09-13, 2001-09-14, 2001-11-23, 2001-12-24, 2001-12-26, 2001-12-31, 2002-07-05, 2002-11-29, 2002-12-26, 2003-02-18, 2003-11-28, 2004-06-11, 2004-11-26, 2004-12-31, 2005-11-25, 2006-11-24, 2007-01-02, 2007-11-23, 2007-12-24, 2011-01-03
CD  14  1995-08-09, 1995-12-26, 1996-01-02, 1996-06-11, 1996-06-20, 1996-09-09, 1996-09-11, 1996-12-26, 1997-01-02, 1997-12-26, 1998-01-02, 1998-04-16, 2001-01-02, 2001-09-12
CT  154 1995-11-24, 1996-01-08, 1996-07-05, 1996-11-29, 1996-12-24, 1997-11-28, 1997-12-26, 1998-11-27, 1999-11-26, 1999-12-31, 2000-07-03, 2000-11-24, 2001-09-11, 2001-09-12, 2001-09-13, 2001-09-14, 2001-11-12, 2001-11-23, 2001-12-24, 2001-12-31, 2002-05-21, 2002-05-22, 2002-05-23, 2002-05-24, 2002-05-28, 2002-05-29, 2002-05-30, 2002-05-31, 2002-06-03, 2002-06-04, 2002-06-05, 2002-06-06, 2002-06-07, 2002-06-10, 2002-06-11, 2002-06-12, 2002-06-13, 2002-06-14, 2002-06-17, 2002-06-18, 2002-06-19, 2002-06-20, 2002-06-21, 2002-06-24, 2002-06-25, 2002-06-26, 2002-06-27, 2002-06-28, 2002-07-01, 2002-07-02, 2002-07-03, 2002-07-05, 2002-07-08, 2002-07-09, 2002-07-10, 2002-07-11, 2002-07-12, 2002-07-15, 2002-07-16, 2002-07-17, 2002-07-18, 2002-07-19, 2002-07-22, 2002-07-23, 2002-07-24, 2002-07-25, 2002-07-26, 2002-07-29, 2002-07-30, 2002-07-31, 2002-08-01, 2002-08-02, 2002-08-05, 2002-08-06, 2002-08-07, 2002-08-08, 2002-08-09, 2002-08-12, 2002-08-13, 2002-08-14, 2002-08-15, 2002-08-16, 2002-08-19, 2002-08-20, 2002-08-21, 2002-08-22, 2002-08-23, 2002-08-26, 2002-08-27, 2002-08-28, 2002-08-29, 2002-08-30, 2002-09-03, 2002-09-04, 2002-09-05, 2002-09-06, 2002-09-09, 2002-09-10, 2002-09-11, 2002-09-12, 2002-09-13, 2002-09-16, 2002-09-17, 2002-09-18, 2002-09-19, 2002-09-20, 2002-09-23, 2002-09-24, 2002-09-25, 2002-09-26, 2002-09-27, 2002-09-30, 2002-10-01, 2002-10-02, 2002-10-03, 2002-10-04, 2002-10-07, 2002-10-08, 2002-10-09, 2002-10-10, 2002-10-11, 2002-10-14, 2002-10-15, 2002-10-16, 2002-10-17, 2002-10-18, 2002-10-21, 2002-10-22, 2002-10-23, 2002-10-24, 2002-10-25, 2002-10-28, 2002-10-29, 2002-10-30, 2002-10-31, 2002-11-01, 2002-11-04, 2002-11-05, 2002-11-06, 2002-11-07, 2002-11-29, 2002-12-24, 2003-02-18, 2003-11-28, 2003-12-26, 2004-01-02, 2004-06-11, 2004-11-26, 2004-12-31, 2005-11-25, 2006-11-24, 2007-01-02, 2007-11-23, 2007-12-24

因此,该函数将传递一个参数,该参数将从 data.frame 返回上述日期的 n 次最频繁出现。

我查看了 which.max,但无法弄清楚如何将其应用于多行(整个数据框列),或者给我多个日期 (n) 作为输出。

如果只有一个输出值的代码会简单得多,那作为我工作的起点是可以接受的。任何指针表示赞赏。

这是一个 pastebin,因为字符串的长度导致我遇到了问题: http://pastebin.com/B1YPicC8

> str(间隙) 'data.frame':5560 obs。 3个变量: $ 符号:因子 w/ 5560 级别 "@AD#","@BP#",..: 1 2 3 4 5 6 7 8 9 10 ... $计数:int 27 14 3 285 14 154 540 11 3 11 ... $ MissingDates:因子 w/ 3568 个级别“1995-12-26、1996-01-02、1996-04-26、1996-04-30、1996-05-06、1996-08-26、1996-09-03 , 1996-09-04, 1996-10-11, 1996-11-13, 1996-11"| __截断__,..:1 2 3 4 5 6 7 8 9 10 ...

【问题讨论】:

  • 如果它们都是日期,那么您可能只需使用 table(data) 并从中逐行获取最大值
  • 你的数据结构不清楚。可以提供dput(head(df$MissingDates))吗?
  • 它们是日期字符串,但不归类为日期。而且我不需要最大值(最高/最近),我需要最频繁的。
  • @DavidArenburg 我尝试了该命令,但输出为 20,000 行(head 不起作用)。我认为这是因为在某些情况下,一行可能有一千个或更多日期,并且它会换行。该列的格式是字符串“日期”逗号空格
  • str(df$MissingDates) 带给你什么?

标签: r string dataframe count


【解决方案1】:

看来你需要这样的东西:

功能

freqfunc <- function(x, n){
  tail(sort(table(unlist(strsplit(as.character(x), ", ")))), n)
}

测试您的数据集

freqfunc(gaps$MissingDates, 5) # Five most frequent dates

## 1996-12-26 1997-12-26 1998-01-02 1999-12-31 2001-09-12 
##          4          4          4          4          4 

【讨论】:

  • 确实很棒。
【解决方案2】:

这是另一种可能的解决方案。

设置一些数据,

set.seed(5)
ss1 <- sample(seq(s <- Sys.Date(), s+10, "day"), 20, TRUE)
ss2 <- sample(seq(s <- Sys.Date(), s+10, "day"), 20, TRUE)
ls1 <- list(ss1 = ss1, ss2 = ss2)

定义函数:

f <- function(x, n) sort(table(x), decreasing = TRUE)[1:n]

对数据应用函数:

lapply(ls1, f, n = 3)
# $ss1
# x
# 2014-09-08 2014-09-09 2014-09-07 
#          3          3          2 
# 
# $ss2
# x
# 2014-09-10 2014-09-06 2014-09-07 
#          4          3          2 

【讨论】:

  • 我怀疑它是否适用于他的具体情况。每行不是单独条目的向量。在他的数据上尝试你的功能
  • 我无法读取他的数据。缺失的日期是否在单个字符串中?
  • 是的,缺少的日期是一个逗号分隔的字符串
猜你喜欢
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2017-01-25
  • 2021-02-18
  • 1970-01-01
  • 1970-01-01
  • 2023-03-28
相关资源
最近更新 更多