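【Question title】: How to aggregate tweets per minute
【Posted】: 2014-12-15 21:30:30
【Question】:

I did some Twitter mining for fun. I used Twitter's Streaming API to record tweets before, during, and after a football match. Now I want to make a ggplot2 graph showing the tweet frequency over the course of the match.

In the original data frame there is one row per tweet, and the variable "created_at" contains dates like this: Sat Dec 13 13:04:34 +0000 2014

I then changed the time format like this: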

tweets$format

which gave me 2014-12-13 14:04:34 CET. Since I don't need the date, I thought I could get rid of it:

tweets$Uhrzeit

That left me with just the time: 14:04:34
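
The conversion code itself is cut off in the post, so the following is only a sketch of one way such a conversion might look, not the asker's actual code. It assumes the "created_at" strings use English day and month abbreviations and that the target zone is "Europe/Berlin" (inferred from the CET output); the variable names `created_at`, `parsed`, and `time_only` are made up for illustration.

```r
# Assumption: the strings use English abbreviations ("Sat", "Dec"), so parse
# them under the C locale and restore the previous locale afterwards.
old_locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")

created_at <- c("Sat Dec 13 13:04:34 +0000 2014")

# %z reads the "+0000" offset; the result is a POSIXct instant stored as UTC
parsed <- as.POSIXct(created_at, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")

Sys.setlocale("LC_TIME", old_locale)

# Display the same instant in the assumed local zone, time of day only
time_only <- format(parsed, "%H:%M:%S", tz = "Europe/Berlin")
print(time_only)
```

Keeping the full POSIXct value around (rather than only the time-of-day string) tends to make later binning easier, because date-time functions such as cut() and trunc() work on it directly.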

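My problem is that I want to analyze the tweet frequency at a resolution of one minute. How can I aggregate the tweets per minute? As I said, each row is one tweet. I made a data frame with the time and a second variable that contains "1". Now I want to aggregate (sum) that second variable per minute. I tried to find a solution and read about the zoo and chron libraries, but they confused me.

I hope someone can help me.


Edit: reproducible data. The data frame is a subset of this: names(tweets)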

 [1] "X"                         "text"                      "retweet_count"            
 [4] "favorited"                 "truncated"                 "id_str"                   
 [7] "in_reply_to_screen_name"   "source"                    "retweeted"                
[10] "created_at"                "in_reply_to_status_id_str" "in_reply_to_user_id_str"  
[13] "lang"                      "listed_count"              "verified"                 
[16] "location"                  "user_id_str"               "description"              
[19] "geo_enabled"               "user_created_at"           "statuses_count"           
[22] "followers_count"           "favourites_count"          "protected"                
[25] "user_url"                  "name"                      "time_zone"                
[28] "user_lang"                 "utc_offset"                "friends_count"            
[31] "screen_name"               "country_code"              "country"                  
[34] "place_type"                "full_name"                 "place_name"               
[37] "place_id"                  "place_lat"                 "place_lon"                
[40] "lat"                       "lon"                       "expanded_url"             
[43] "url"                       "timeformat" 

I converted the "created_at" variable into the "timeformat" variable; a sample of the resulting values looks like this:

tweets.df<-as.data.frame(cbind(c("2014-12-13 14:04:34 CET","2014-12-13 14:04:37 CET","2014-12-13 14:04:45 CET","2014-12-13 14:05:23 CET","2014-12-13 14:05:53 CET","2014-12-13 14:05:58 CET","2014-12-13 14:06:33 CET","2014-12-13 14:06:38 CET","2014-12-13 14:06:59 CET","2014-12-13 14:08:16 CET","2014-12-13 14:09:12 CET","2014-12-13 14:09:34 CET","2014-12-13 14:10:05 CET","2014-12-13 14:10:16 CET","2014-12-13 14:10:17 CET","2014-12-13 14:11:13 CET","2014-12-13 14:11:16 CET","2014-12-13 14:12:01 CET","2014-12-13 14:12:30 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:22 CET","2014-12-13 14:14:48 CET","2014-12-13 14:15:02 CET","2014-12-13 14:15:03 CET","2014-12-13 14:16:20 CET","2014-12-13 14:16:26 CET","2014-12-13 14:17:14 CET","2014-12-13 14:17:24 CET","2014-12-13 14:17:45 CET","2014-12-13 14:17:49 CET","2014-12-13 14:18:05 CET","2014-12-13 14:18:30 CET","2014-12-13 14:19:38 CET"),1))
colnames(tweets.df)<-c("time","freq")

I have just plotted the data. With stat="bin", the default bin width is 1/30 of the data range. One bin per minute would be better.

ggplot(tweets,aes(x=timeformat)) + geom_line(stat="bin")

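【Comments】:

  • This would be easier to answer if you created a reproducible example with sample input.
  • I have a solution, but I would like an example data frame with the output you expect, to make sure we are on the same track. As a hint, my idea is to use POSIXlt, which gives you access to $hour (easier than gsub), and then dplyr's group_by and summarise.
  • Sounds like a job for table(), but it is hard to tell without sample data.
  • One would usually use format.POSIXt to get hour:minute categories. There is also round.POSIXt, which is often helpful.
  • I added sample data. Sorry I did not include it at first. Thanks for helping me!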
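
The format.POSIXt and table() suggestions from the comments could be combined roughly as follows. This is a minimal sketch on four timestamps taken from the question's sample, parsed as UTC for simplicity; the names `times`, `minute_label`, `counts`, and `minute_start` are illustrative only.

```r
# A few timestamps from the question's sample data (time zone assumed to be UTC here)
times <- as.POSIXct(c("2014-12-13 14:04:34", "2014-12-13 14:04:37",
                      "2014-12-13 14:05:23", "2014-12-13 14:08:16"),
                    tz = "UTC")

# format() keeps only hour:minute, giving one category per observed minute
minute_label <- format(times, "%H:%M")

# table() counts how many tweets fall into each category
counts <- table(minute_label)
print(counts)

# trunc() floors each timestamp to the start of its minute and keeps the date;
# round() would instead move 14:04:34 up to 14:05:00
minute_start <- as.POSIXct(trunc(times, units = "mins"))
print(minute_start)
```

One limitation of this approach is that minutes without any tweets (14:06 and 14:07 in this sample) do not appear in the table at all, which matters if the counts are later drawn as a line.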

Tags: r datetime twitter zoo chron


【Solution 1】:

Given your example dataset:

tweets.df<-as.data.frame(cbind(c("2014-12-13 14:04:34 CET","2014-12-13 14:04:37 CET","2014-12-13 14:04:45 CET","2014-12-13 14:05:23 CET","2014-12-13 14:05:53 CET","2014-12-13 14:05:58 CET","2014-12-13 14:06:33 CET","2014-12-13 14:06:38 CET","2014-12-13 14:06:59 CET","2014-12-13 14:08:16 CET","2014-12-13 14:09:12 CET","2014-12-13 14:09:34 CET","2014-12-13 14:10:05 CET","2014-12-13 14:10:16 CET","2014-12-13 14:10:17 CET","2014-12-13 14:11:13 CET","2014-12-13 14:11:16 CET","2014-12-13 14:12:01 CET","2014-12-13 14:12:30 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:22 CET","2014-12-13 14:14:48 CET","2014-12-13 14:15:02 CET","2014-12-13 14:15:03 CET","2014-12-13 14:16:20 CET","2014-12-13 14:16:26 CET","2014-12-13 14:17:14 CET","2014-12-13 14:17:24 CET","2014-12-13 14:17:45 CET","2014-12-13 14:17:49 CET","2014-12-13 14:18:05 CET","2014-12-13 14:18:30 CET","2014-12-13 14:19:38 CET"),1), stringsAsFactors=FALSE)
colnames(tweets.df)<-c("time","freq")

First, your time column contains text strings, but you need POSIXct objects:

# The trailing "CET" in each string is ignored when parsing; the session's time zone is used
tweets.df$time <- as.POSIXct(tweets.df$time)

Then use the function cut.POSIXt to bin the times by minute:

# cut() dispatches to cut.POSIXt for date-time input
by.mins <- cut(tweets.df$time, breaks = "min")

Next, split your data frame by this factor and sum the freq column within each subset:

tweets.mins <- split(tweets.df, by.mins)
sapply(tweets.mins,function(x)sum(as.integer(x$freq)))
2014-12-13 14:04:00 2014-12-13 14:05:00 2014-12-13 14:06:00 2014-12-13 14:07:00 2014-12-13 14:08:00 
                  3                   3                   3                   0                   1 
2014-12-13 14:09:00 2014-12-13 14:10:00 2014-12-13 14:11:00 2014-12-13 14:12:00 2014-12-13 14:13:00 
                  2                   3                   2                   2                   0 
2014-12-13 14:14:00 2014-12-13 14:15:00 2014-12-13 14:16:00 2014-12-13 14:17:00 2014-12-13 14:18:00 
                 20                   2                   2                   4                   2 
2014-12-13 14:19:00 
                  1 

In this case, since freq is always equal to 1, this is equivalent to using table(by.mins).
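
For plotting, the per-minute counts are usually more convenient as a data frame with a real date-time column. The following is a sketch under the assumption that the timestamps are treated as UTC; the column names `minute` and `tweets` and the four sample timestamps are chosen here for illustration and are not part of the answer.

```r
# Small illustrative sample (time zone assumed to be UTC)
times <- as.POSIXct(c("2014-12-13 14:04:34", "2014-12-13 14:04:37",
                      "2014-12-13 14:05:23", "2014-12-13 14:07:16"),
                    tz = "UTC")

# Bin by minute; cut() creates a level for every minute in the range,
# including minutes that contain no tweets
by.mins <- cut(times, breaks = "min")

# Count tweets per bin and turn the table into a data frame
per_minute <- as.data.frame(table(by.mins), stringsAsFactors = FALSE)
names(per_minute) <- c("minute", "tweets")

# Convert the bin labels back to POSIXct so they behave as times on an axis
per_minute$minute <- as.POSIXct(per_minute$minute, tz = "UTC")
print(per_minute)
```

Because the empty minute (14:06 here) is kept with a count of 0, a line drawn from this data frame, for example with ggplot(per_minute, aes(minute, tweets)) + geom_line(), drops to zero there instead of interpolating across the gap.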
