Answers: Find time interval with maximum count in big data

【Question title】: Find time interval with maximum count in big data
【Posted】: 2017-01-19 13:39:13
【Question】:

I have a huge data frame containing millions of email addresses and their open times. Below is a subset of my data frame.

dput(droplevels(data))
structure(list(email_address_hash = structure(1:3, .Label = c("0004eca7b8bed22aaf4b320ad602505fe9fa9d26", 
"00198ee5364d73796e0e352f1d2576f8e8fa99db", "35c0ef2c2a804b44564fd4278a01ed25afd887f8"
), class = "factor"), open_times = c(" 04:39:24 10:39:43", " 21:12:04 07:05:23 06:31:24", 
" 09:57:20 19:00:09")), row.names = c(NA, -3L), .Names = c("email_address_hash", 
"open_times"), .internal.selfref = <pointer: 0x0000000007b60788>, class = c("data.table", 
"data.frame"))

The structure of my data frame is:

str(data)
Classes ‘data.table’ and 'data.frame':  3 obs. of  2 variables:
 $ email_address_hash: Factor w/ 36231 levels "00012aec4ca3fa6f2f96cf97fc2a3440eacad30e",..: 2 16 7632
 $ open_times        : chr  " 04:39:24 10:39:43" " 21:12:04 07:05:23 06:31:24" " 09:57:20 19:00:09"
 - attr(*, ".internal.selfref")=<externalptr>

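I want to achieve these two goals.

Goals:

1) Starting from 00:00:00, count the number of entries I get for each customer in every one-hour interval. Take the first row of open_times as an example: it contains 04:39:24 and 10:39:43, so it gets one count between 04:00:00 and 05:00:00, one count between 10:00:00 and 11:00:00, and a count of zero for every other interval, such as 00:00:00 to 01:00:00. I only want the top two intervals with the highest number of entries. In this case those are 04:00:00-05:00:00 and 10:00:00-11:00:00, with their respective counts in separate columns.

2) Is it possible to change the interval from 1 hour to 1.5 hours or 2 hours?

To explain further, I have attached an image of my desired output. Please suggest an efficient way to solve this, since my data is large. If anything is unclear, please let me know rather than downvoting my question.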

【Comments】:

@akrun could you help me?

Tags: r time data.table time-series

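【Solution 1】:

First, restructure the data into a long format that can be used for summarising. This example uses the dplyr and tidyr packages.

Look into how to handle times and dates properly if you want to make this more sophisticated. Here I simply split the components of each time out of the string.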

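library(dplyr)
library(tidyr)

# df is the data frame from the question (named data there)
norm <- df %>% mutate(times = trimws(open_times)) %>%
  # split into time stamps; assumes at most four per row, shorter rows are padded with NA
  separate(times, c('t1','t2','t3','t4'), sep = " ", fill = "right") %>%
  # long format: one row per open time
  gather(key, value, -email_address_hash, -open_times) %>%
  filter(!is.na(value)) %>%
  # split each time stamp into hour, minute and second strings
  separate(value, into = c('hr','min','sec'), sep = ":")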


norm %>%
  group_by(hr) %>% summarise(n = n())

Result:

# A tibble: 7 × 2
hr     n
<chr> <int>
04     1
06     1
07     1
09     1
10     1
19     1
21     1

You can compute the groups with a different interval like this:

interval <- 90  # interval width in minutes

norm %>%
  mutate(minutes = 60 * as.numeric(hr) + as.numeric(min),            # minutes since midnight
         group = (minutes - minutes %% interval) / interval) %>%     # 0-based interval index
  group_by(group) %>% summarise(n = n())

I compute the number of minutes since midnight and use that value to form groups of 90 minutes (1.5 hours).
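The group number on its own is hard to read, so here is a minimal sketch of turning it back into a readable interval label. The helper name `interval_label` is hypothetical and not part of the answer; it only restates the arithmetic above in base R, assuming `minutes` holds whole minutes since midnight.

```r
# Hypothetical helper (not from the answer): map minutes since midnight
# to the label of the interval that contains them.
interval_label <- function(minutes, interval = 90) {
  group <- minutes %/% interval   # same value as (minutes - minutes %% interval) / interval
  start <- group * interval       # first minute of the interval
  end   <- start + interval       # first minute of the next interval
  sprintf("%02d:%02d-%02d:%02d",
          start %/% 60, start %% 60, end %/% 60, end %% 60)
}

# 04:39 and 10:39 from the first row of the sample data
interval_label(c(4 * 60 + 39, 10 * 60 + 39), interval = 90)
```

If the interval width does not divide 1440 evenly, the last label of the day runs past 24:00, so widths such as 60, 90 or 120 minutes are the easy cases.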

This is the structure of the normalised data:

> str(norm)
'data.frame':   7 obs. of  6 variables:
 $ email_address_hash: Factor w/ 3 levels "0004eca7b8bed22aaf4b320ad602505fe9fa9d26",..: 1 2 3 1 2 3 2
 $ open_times        : chr  " 04:39:24 10:39:43" " 21:12:04 07:05:23 06:31:24" " 09:57:20 19:00:09" " 04:39:24 10:39:43" ...
 $ key               : chr  "t1" "t1" "t1" "t2" ...
 $ hr                : chr  "04" "21" "09" "10" ...
 $ min               : chr  "39" "12" "57" "39" ...
 $ sec               : chr  "24" "04" "20" "43" ...
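Because hr, min and sec are kept as character columns, any arithmetic needs an explicit conversion first. As one possible way to handle the times a little more carefully, the sketch below converts whole "HH:MM:SS" strings to seconds since midnight in base R. The function name `to_seconds` is an assumption made for illustration, and it expects well-formed input with exactly three colon-separated parts.

```r
# Hypothetical helper (not from the answer): "HH:MM:SS" -> seconds since midnight.
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts,
         function(p) sum(as.numeric(p) * c(3600, 60, 1)),  # hours, minutes, seconds
         numeric(1))
}

to_seconds(c("04:39:24", "10:39:43"))
```

Dividing the result by 60 gives the minutes since midnight used in the grouping step, with the seconds kept as a fraction.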

To produce the result you added in your example, you can use:

norm %>%
  mutate(minutes = 60 * as.numeric(hr) + as.numeric(min),   # minutes since midnight
         group = floor(minutes / 120)) %>%                  # index of the 2-hour interval
  mutate(label = paste0(group * 2, ":00-", group * 2 + 2, ":00")) %>%
  group_by(email_address_hash, label) %>% summarise(n = n()) %>%
  spread(label, n)                                          # one column per interval

Result:

email_address_hash `10:00-12:00` `18:00-20:00` `20:00-22:00` `4:00-6:00` `6:00-8:00` `8:00-10:00`
<fctr>                     <int>         <int>         <int>       <int>       <int>        <int>
0004eca7...                    1            NA            NA           1          NA           NA
00198ee5...                   NA            NA             1          NA           2           NA
35c0ef2c...                   NA             1            NA          NA          NA            1

This is not exactly like your example output, though. That is because I do not agree with the data structure you are looking for.
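The question also asks for only the two busiest intervals per address, which the wide table above does not give directly. Here is a rough sketch of that step using data.table, which the question is tagged with and which tends to cope well with millions of rows. It is an alternative to the dplyr pipeline, not the answerer's method. The sample values are copied from the question, while the names `long`, `counts`, `top2` and the tie-breaking rule (the earlier interval wins) are assumptions.

```r
library(data.table)

# Sample data copied from the question
data <- data.table(
  email_address_hash = c("0004eca7b8bed22aaf4b320ad602505fe9fa9d26",
                         "00198ee5364d73796e0e352f1d2576f8e8fa99db",
                         "35c0ef2c2a804b44564fd4278a01ed25afd887f8"),
  open_times = c(" 04:39:24 10:39:43", " 21:12:04 07:05:23 06:31:24",
                 " 09:57:20 19:00:09")
)

interval <- 120  # interval width in minutes (assumed; 60 or 90 work the same way)

# Long format: one row per (address, open time)
long <- data[, .(time = unlist(strsplit(trimws(open_times), " +"))),
             by = email_address_hash]

# Minutes since midnight, then the first minute of the containing interval
parts <- tstrsplit(long$time, ":", fixed = TRUE)
long[, minutes := 60L * as.integer(parts[[1]]) + as.integer(parts[[2]])]
long[, start := (minutes %/% interval) * interval]

# Count per address and interval, then keep the two largest counts per address
counts <- long[, .(n = .N), by = .(email_address_hash, start)]
top2 <- counts[order(email_address_hash, -n, start),
               head(.SD, 2), by = email_address_hash]

# Readable label such as "06:00-08:00"
top2[, label := sprintf("%02d:%02d-%02d:%02d",
                        start %/% 60, start %% 60,
                        (start + interval) %/% 60, (start + interval) %% 60)]
top2
```

Keeping the result in long form (one row per address and interval) avoids creating a column for every possible interval, and it can still be reshaped to wide with `dcast` if that layout is required.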

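【Comments】:

You split the time into hh:mm:ss. I want something else. Please correct me if I am wrong.
Maybe if you try this script and look at the structure of norm, you can easily adapt it to what you want. Otherwise, please add an example data.frame of your desired result.
Thanks for the reply. I have edited the question as you requested. Please let me know if it makes sense to you.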