How to aggregate tweets per minute: answers

【Question title】: How to aggregate tweets per minute
【Posted】: 2014-12-15 21:30:30
【Question description】:

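I did some Twitter mining for fun. I used Twitter's Streaming API to record tweets before, during, and after a football match. Now I want to make a ggplot2 graph that shows the tweet frequency over the course of the match.

In the raw data frame there is one row per tweet, and the variable "created_at" contains dates like this: Sat Dec 13 13:04:34 +0000 2014

Then I changed the time format like this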

tweets$format

and got this: 2014-12-13 14:04:34 CET. Since I do not need the date, I figured I could get rid of it

tweets$Uhrzeit

which left me with just the time: 14:04:34.
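The two conversion calls are cut off in the text above. As a minimal sketch of how such a conversion could look in base R (not necessarily what the asker ran), the following parses one value in the "created_at" layout. The zone name Europe/Berlin is an assumption, since the question only shows the abbreviation CET, and the variable names are made up for illustration.

```r
# Sample value in the created_at layout quoted in the question
created_at <- "Sat Dec 13 13:04:34 +0000 2014"

# %a and %b only match English day and month names in an English or C locale,
# so LC_TIME is switched temporarily (assumption: the session locale may differ)
old_locale <- Sys.getlocale("LC_TIME")
invisible(Sys.setlocale("LC_TIME", "C"))

# %z reads the "+0000" offset, so the result is an absolute point in time
parsed <- as.POSIXct(created_at, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")

invisible(Sys.setlocale("LC_TIME", old_locale))

# Same instant shown in Central European Time (assumed zone name)
local_string <- format(parsed, tz = "Europe/Berlin", usetz = TRUE)

# Time of day only; note that this is a plain character string
time_only <- format(parsed, "%H:%M:%S", tz = "Europe/Berlin")
```

Dropping the date turns the column into text, which date-time tools can no longer bin. Keeping the POSIXct value and formatting it only for display is usually easier to aggregate later.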

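My problem is that I want to analyze the tweet frequency at a resolution of tweets per minute. How do I aggregate the tweets per minute? As I said, each row is one tweet. I built a data frame with the time and a second variable that always contains "1". Now I want to count (aggregate, sum) the second variable per minute. I tried to find a solution by reading about the zoo and chron libraries, but they confused me.

I hope someone can help me.

Edit: reproducible data. The data frame is a subset of this: names(tweets)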

 [1] "X"                         "text"                      "retweet_count"            
 [4] "favorited"                 "truncated"                 "id_str"                   
 [7] "in_reply_to_screen_name"   "source"                    "retweeted"                
[10] "created_at"                "in_reply_to_status_id_str" "in_reply_to_user_id_str"  
[13] "lang"                      "listed_count"              "verified"                 
[16] "location"                  "user_id_str"               "description"              
[19] "geo_enabled"               "user_created_at"           "statuses_count"           
[22] "followers_count"           "favourites_count"          "protected"                
[25] "user_url"                  "name"                      "time_zone"                
[28] "user_lang"                 "utc_offset"                "friends_count"            
[31] "screen_name"               "country_code"              "country"                  
[34] "place_type"                "full_name"                 "place_name"               
[37] "place_id"                  "place_lat"                 "place_lon"                
[40] "lat"                       "lon"                       "expanded_url"             
[43] "url"                       "timeformat"

I converted the "created_at" variable into the "timeformat" variable like this:

tweets.df<-as.data.frame(cbind(c("2014-12-13 14:04:34 CET","2014-12-13 14:04:37 CET","2014-12-13 14:04:45 CET","2014-12-13 14:05:23 CET","2014-12-13 14:05:53 CET","2014-12-13 14:05:58 CET","2014-12-13 14:06:33 CET","2014-12-13 14:06:38 CET","2014-12-13 14:06:59 CET","2014-12-13 14:08:16 CET","2014-12-13 14:09:12 CET","2014-12-13 14:09:34 CET","2014-12-13 14:10:05 CET","2014-12-13 14:10:16 CET","2014-12-13 14:10:17 CET","2014-12-13 14:11:13 CET","2014-12-13 14:11:16 CET","2014-12-13 14:12:01 CET","2014-12-13 14:12:30 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:22 CET","2014-12-13 14:14:48 CET","2014-12-13 14:15:02 CET","2014-12-13 14:15:03 CET","2014-12-13 14:16:20 CET","2014-12-13 14:16:26 CET","2014-12-13 14:17:14 CET","2014-12-13 14:17:24 CET","2014-12-13 14:17:45 CET","2014-12-13 14:17:49 CET","2014-12-13 14:18:05 CET","2014-12-13 14:18:30 CET","2014-12-13 14:19:38 CET"),1))
colnames(tweets.df)<-c("time","freq")

I just plotted the data. With stat="bin", the default bin width is 1/30 of the data range. One bin per minute would be better.

ggplot(tweets,aes(x=timeformat)) + geom_line(stat="bin")
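For a date-time x variable, ggplot2 interprets binwidth in seconds, so one-minute bins can be requested directly. The sketch below is an illustration only: it uses three made-up timestamps and a hypothetical data frame name, and it only builds the plot when ggplot2 is installed.

```r
# Three made-up timestamps standing in for the timeformat column
times <- as.POSIXct(
  c("2014-12-13 14:04:34", "2014-12-13 14:04:37", "2014-12-13 14:06:33"),
  tz = "UTC"
)
tweets_demo <- data.frame(timeformat = times)  # hypothetical stand-in for tweets

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # binwidth = 60 means 60 seconds because x is a POSIXct column
  p <- ggplot(tweets_demo, aes(x = timeformat)) +
    geom_line(stat = "bin", binwidth = 60)
  # print(p) would draw the plot
}
```

The bins are 60 seconds wide but not necessarily aligned to whole minutes. The boundary argument of stat_bin can be used to shift them if exact alignment matters.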

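【Question discussion】:

This would be easier to answer if you created a reproducible example with sample input.
I have a solution, but I would like a sample data frame together with the output you want, to make sure we are on the same track. As a hint, my idea is to use dplyr and POSIXlt, which gives you access to $hour (easier than gsub), and then dplyr's group_by and summarise.
Sounds like a job for table(), but it is hard to tell without sample data.
Usually format.POSIXt is used to give you hour:minute categories. There is also round.POSIXt, which is often helpful.
I added sample data. Sorry I did not include it at first. Thanks for helping me!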
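The hints about format.POSIXt and table() can be sketched as follows. This is a minimal illustration on three made-up timestamps, not on the asker's data, and the variable names are hypothetical.

```r
# Three made-up timestamps; two fall in 14:04, none in 14:05, one in 14:06
times <- as.POSIXct(
  c("2014-12-13 14:04:34", "2014-12-13 14:04:37", "2014-12-13 14:06:33"),
  tz = "UTC"
)

# Hour:minute label per tweet, then one count per distinct label
minute_label <- format(times, "%H:%M")
per_minute <- table(minute_label)

# trunc() drops the seconds but keeps a date-time class;
# round() would instead move 14:04:34 up to 14:05:00
minute_start <- trunc(times, units = "mins")
```

A table built from formatted labels only contains minutes that actually occur, so the empty minute 14:05 is missing from per_minute.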

Tags: r datetime twitter zoo chron

【Solution 1】:

Given your sample data set:

tweets.df<-as.data.frame(cbind(c("2014-12-13 14:04:34 CET","2014-12-13 14:04:37 CET","2014-12-13 14:04:45 CET","2014-12-13 14:05:23 CET","2014-12-13 14:05:53 CET","2014-12-13 14:05:58 CET","2014-12-13 14:06:33 CET","2014-12-13 14:06:38 CET","2014-12-13 14:06:59 CET","2014-12-13 14:08:16 CET","2014-12-13 14:09:12 CET","2014-12-13 14:09:34 CET","2014-12-13 14:10:05 CET","2014-12-13 14:10:16 CET","2014-12-13 14:10:17 CET","2014-12-13 14:11:13 CET","2014-12-13 14:11:16 CET","2014-12-13 14:12:01 CET","2014-12-13 14:12:30 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:22 CET","2014-12-13 14:14:48 CET","2014-12-13 14:15:02 CET","2014-12-13 14:15:03 CET","2014-12-13 14:16:20 CET","2014-12-13 14:16:26 CET","2014-12-13 14:17:14 CET","2014-12-13 14:17:24 CET","2014-12-13 14:17:45 CET","2014-12-13 14:17:49 CET","2014-12-13 14:18:05 CET","2014-12-13 14:18:30 CET","2014-12-13 14:19:38 CET"),1), stringsAsFactors=FALSE)
colnames(tweets.df)<-c("time","freq")

First, your time column contains text strings, and you need POSIXct objects:

tweets.df$time <- as.POSIXct(tweets.df$time)
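Without a tz argument, as.POSIXct() interprets the strings in the session's local time zone, and the trailing "CET" is ignored during parsing. A small sketch that names the zone explicitly, assuming Europe/Berlin is the intended zone (the source only shows the abbreviation):

```r
# Two of the sample strings, including the trailing zone abbreviation
x <- c("2014-12-13 14:04:34 CET", "2014-12-13 14:19:38 CET")

# Default formats are tried in turn; text after the seconds is not parsed
parsed <- as.POSIXct(x, tz = "Europe/Berlin")

# Same result with the format spelled out
parsed_explicit <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "Europe/Berlin")
```

Naming the zone keeps the result the same on machines that are configured for a different local time zone.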

Then bin by minute using the function cut.POSIXt:

by.mins <- cut.POSIXt(tweets.df$time,"mins")
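cut.POSIXt is the method that the generic cut() dispatches to for date-time vectors, so calling cut() directly is the more common spelling. A short sketch on three made-up timestamps (hypothetical names), showing that the resulting factor also has a level for a minute without any tweets:

```r
# Two tweets in 14:04, none in 14:05, one in 14:06
times <- as.POSIXct(
  c("2014-12-13 14:04:34", "2014-12-13 14:04:37", "2014-12-13 14:06:33"),
  tz = "UTC"
)

# The generic dispatches to cut.POSIXt; "min" and "mins" are both accepted
by_min <- cut(times, breaks = "min")

# One level per minute between the first and last tweet, empty ones included
n_levels <- nlevels(by_min)
counts <- as.vector(table(by_min))
```

Because the empty minute is kept as a level, later counts contain an explicit zero for it, which matters when the result is drawn as a line.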

Then split your data frame by this factor and sum the freq column over each subset:

tweets.mins <- split(tweets.df, by.mins)
sapply(tweets.mins,function(x)sum(as.integer(x$freq)))
2014-12-13 14:04:00 2014-12-13 14:05:00 2014-12-13 14:06:00 2014-12-13 14:07:00 2014-12-13 14:08:00 
                  3                   3                   3                   0                   1 
2014-12-13 14:09:00 2014-12-13 14:10:00 2014-12-13 14:11:00 2014-12-13 14:12:00 2014-12-13 14:13:00 
                  2                   3                   2                   2                   0 
2014-12-13 14:14:00 2014-12-13 14:15:00 2014-12-13 14:16:00 2014-12-13 14:17:00 2014-12-13 14:18:00 
                 20                   2                   2                   4                   2 
2014-12-13 14:19:00 
                  1

In this case, since freq is always equal to 1, this is equivalent to using table(by.mins).
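For plotting, the per-minute table can be turned into a data frame with a proper date-time column. The following is a sketch under assumptions: the timestamps are made up, the column names minute and n are chosen for illustration, and UTC is used only to keep the example independent of the local zone.

```r
# Two tweets in 14:04, none in 14:05, one in 14:06
times <- as.POSIXct(
  c("2014-12-13 14:04:34", "2014-12-13 14:04:37", "2014-12-13 14:06:33"),
  tz = "UTC"
)

# Count per minute, then convert the table into a data frame
counts <- as.data.frame(
  table(minute = cut(times, breaks = "min")),
  responseName = "n",
  stringsAsFactors = FALSE
)

# The minute labels are text; parse them back into POSIXct for a time axis
counts$minute <- as.POSIXct(counts$minute, tz = "UTC")
```

A data frame in this shape has one row per minute, so it can be passed to a line plot with minute on the x axis and n on the y axis.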

【Discussion】: